
Evaluation and Improvement

Under Development

This section is under active development. Documentation may not be fully accurate. Please contact us if you have any questions.

You've created an Agent, and it's working well. But whenever you want to change its settings, a voice in your head might ask:

"After making changes, will questions that were answered correctly become wrong?"

This is where Test Suites come in: they let you record "how the Agent should answer," then verify with one click after each change that quality hasn't regressed.

Why Do You Need Test Cases?

Imagine this scenario:

  1. Your Agent correctly answers "What is the return policy?"
  2. You modify the Instruction to make answers more concise
  3. After modification, the Agent answers new questions better
  4. But you didn't notice that the "return policy" answer became incomplete

With test cases, you can:

  • Before modifications: Record important Q&As
  • After modifications: Run tests with one click and immediately know if there's any "regression"

Step 1: Discover Issues from Conversation History

Review Conversation History

In the Agent Editor or conversation history, look for responses that fall short:

  • Agent answers incorrectly or incompletely
  • Tone doesn't match brand image
  • Doesn't use the correct data source
  • Answer is too verbose or too brief

Analyze Issue Types

| Issue Type | Possible Cause | Improvement Direction |
| --- | --- | --- |
| Incorrect Answer | Instruction not clear enough | Add more specific rules |
| Incomplete Information | Missing data sources | Supplement knowledge base content |
| Wrong Tone | No tone guidelines defined | Add tone requirements to the Instruction |
| Off-topic Answer | No scope limitation set | Define the Agent's area of expertise |

Step 2: Create Test Cases

Record "how this question should be answered" as test cases to serve as quality standards.

Access Test Suite Page

  1. Enter workspace and select the Agent to test
  2. Click the "Test Suite" tab

Add Test Cases

Click "Add Case" and fill in:

| Field | Description | Example |
| --- | --- | --- |
| Input | A question users will ask | "What is the return policy?" |
| Ideal Response | The ideal answer content (optional) | "Returns accepted within 7 days of purchase, items must be kept intact..." |
| Standard | Criteria for scoring | "Must mention the 7-day period and the item-completeness requirement" |

Recommendation

Start by creating test cases for "the most important questions." You don't need to create many at once; 5-10 core cases can provide good protection.

Batch Import

If you have many cases, you can batch import with a CSV file:

question,expected_output
"What is the return policy?","Returns accepted within 7 days of purchase"
"What are the business hours?","Monday to Friday 9:00-18:00"

Click "Upload CSV" to upload the file.
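If you maintain test cases elsewhere, you can also generate the CSV programmatically. Here is a minimal Python sketch using only the standard library; the `question` and `expected_output` column names match the format shown above, while the file name and case list are illustrative:

```python
import csv

# Test cases to export: (question, expected answer) pairs
cases = [
    ("What is the return policy?", "Returns accepted within 7 days of purchase"),
    ("What are the business hours?", "Monday to Friday 9:00-18:00"),
]

with open("test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "expected_output"])  # header row expected by the importer
    writer.writerows(cases)
```

Using `csv.writer` (rather than joining strings by hand) ensures questions containing commas or quotes are escaped correctly.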

Step 3: Run Tests

Manual Execution

  1. Select test cases to run (or click "Run All" to execute all)
  2. Select validation rule (Validator)
  3. Click "Run" to start testing

View Results

After testing completes, you'll see:

| Item | Description |
| --- | --- |
| AI Response | The Agent's actual answer |
| Grade | How well the answer meets the standard (0-100%) |
| Explanation | Why this score was given |

Interpret Scores

  • 80% and above: Answer meets standards
  • 50-79%: Partially meets; may need fine-tuning
  • Below 50%: Clearly doesn't meet; review your settings
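If you export results for triage outside the platform, the bands above can be expressed as a small helper. This is a hypothetical sketch; the function name and return strings are illustrative, not part of Codeer.ai:

```python
def interpret_grade(grade: float) -> str:
    """Map a 0-100 grade to the interpretation bands described above."""
    if grade >= 80:
        return "meets standards"
    if grade >= 50:
        return "partially meets; may need fine-tuning"
    return "does not meet; review settings"
```

For example, `interpret_grade(85)` returns `"meets standards"`.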

Step 4: Adjust Settings

Based on test results, adjust Agent settings:

Improve Instruction

| Issue Found in Testing | Adjustment Method |
| --- | --- |
| Answer too verbose | Add "Please answer concisely, within 100 words" |
| Forgets to use the correct language | Emphasize at the beginning: "Please answer in Traditional Chinese" |
| Misses key information | Add "When answering, be sure to mention..." |
| Tone too formal or too casual | Define tone examples |

Adjust Knowledge Sources

If Agent cannot answer certain questions:

  1. Confirm the relevant data is connected to the workspace
  2. Check that the data sync has completed
  3. Confirm the data source is selected in the Agent's settings

Step 5: Retest and Publish

  1. Rerun tests: Confirm all cases pass
  2. Publish new version: After confirming no issues, publish Agent update
  3. Save version: Codeer.ai automatically records version history for easy rollback later

Advanced: Configure Validation Rules (Validator)

Validation rules define "what makes a good answer," making test results more consistent and aligned with your standards.

Built-in Validation Rules

Codeer.ai provides several common validation methods:

  • Keyword Check: Whether the answer contains specific keywords
  • Similarity Comparison: Similarity between answer and expected answer
  • AI Scoring: Use AI to judge answer quality
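To illustrate the idea behind a keyword check: score the answer by the fraction of required keywords it contains. This is a hypothetical sketch of the general technique, not Codeer.ai's actual implementation:

```python
def keyword_check(answer: str, required_keywords: list[str]) -> float:
    """Return a 0-100 score: the percentage of required keywords found in the answer."""
    if not required_keywords:
        return 100.0  # nothing required, trivially passes
    lowered = answer.lower()  # case-insensitive matching
    hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
    return 100.0 * hits / len(required_keywords)
```

For example, checking the answer "Returns accepted within 7 days of purchase" against the keywords `["7 days", "return"]` scores 100.0, since both appear.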

Custom Validation Rules

Click "Manage Validators" to create custom rules:

  1. Give the rule a name (e.g., "Customer Service Tone Check")
  2. Set scoring criteria (e.g., "Must use polite language, no typos")
  3. Save and then use it when testing

Best Practices

When Should You Create Test Cases?

  • ✅ When you find Agent answers are not ideal, create cases before fixing
  • ✅ Before important features go live, create test cases for core Q&As
  • ✅ When receiving user feedback, convert issues into test cases

How Many Test Cases Should You Create?

| Agent Complexity | Recommended Cases |
| --- | --- |
| Simple (single purpose) | 5-10 |
| Medium (multiple Q&A types) | 10-30 |
| Complex (multiple domains) | 30+ |

How Often Should You Run Tests?

  • After every Instruction modification: Must run
  • After updating knowledge sources: Recommended to run
  • Regular checks: Run once per week or month to ensure quality stability

Common Questions

What score is good?

This depends on your use case. It's recommended to run tests once to establish a "baseline," and then aim to "not fall below the baseline." If most cases are above 80%, it indicates the Agent is performing stably.

Do test cases affect the Agent's answers?

No. Test cases are only used for verification and don't affect the Agent's actual behavior.

Can I test different versions of the Agent?

Yes. When running tests, you can select which Agent version to test, making it easy to compare performance across versions.

How can the team maintain test cases together?

Invite team members to the workspace, and they can add and edit test cases together.

Next Steps

After tests pass, you can: