
Evaluation and Improvement

Under Development

This section is under active development. Documentation may not be fully accurate. Please contact us if you have any questions.

You've created an Agent, and it's working well. But whenever you want to change its settings, a voice in your head might ask:

"After making changes, will questions that were answered correctly become wrong?"

This is where Test Suites come in: they let you record "how the Agent should answer," then verify with one click after each change that quality hasn't regressed.

Why Do You Need Test Cases?

Imagine this scenario:

  1. Your Agent correctly answers "What is the return policy?"
  2. You modify the Instruction to make answers more concise
  3. After modification, the Agent answers new questions better
  4. But you didn't notice that the "return policy" answer became incomplete

With test cases, you can:

  • Before modifications: Record important Q&As
  • After modifications: Run tests with one click and immediately know if there's any "regression"

Step 1: Discover Issues from Conversation History

Review Conversation History

In the Agent Editor or conversation history, look for responses that fall short:

  • Agent answers incorrectly or incompletely
  • Tone doesn't match brand image
  • Doesn't use the correct data source
  • Answer is too verbose or too brief

Analyze Issue Types

| Issue Type | Possible Cause | Improvement Direction |
| --- | --- | --- |
| Incorrect Answer | Instruction not clear enough | Add more specific rules |
| Incomplete Information | Missing data sources | Supplement knowledge base content |
| Wrong Tone | No tone guidelines defined | Add tone requirements to the Instruction |
| Off-topic Answer | No scope limitation set | Define the Agent's area of expertise |

Step 2: Create Test Cases

Record "how this question should be answered" as test cases to serve as quality standards.

Access Test Suite Page

  1. Enter workspace and select the Agent to test
  2. Click the "Test Suite" tab

Add Test Cases

Click "Add Case" and fill in:

| Field | Description | Example |
| --- | --- | --- |
| Input | A question users will ask | "What is the return policy?" |
| Ideal Response | The ideal answer content (optional) | "Returns accepted within 7 days of purchase, items must be kept intact..." |
| Standard | Criteria for scoring | "Must mention the 7-day period and the item-completeness requirement" |

Recommendation

Start by creating test cases for "the most important questions." You don't need to create many at once; 5-10 core cases can provide good protection.

Batch Import

If you have many cases, you can batch import with a CSV file:

question,expected_output
"What is the return policy?","Returns accepted within 7 days of purchase"
"What are the business hours?","Monday to Friday 9:00-18:00"

Click "Upload CSV" to upload the file.
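If you maintain test cases elsewhere, you can also generate the CSV programmatically. Here is a minimal Python sketch using only the standard library; the `question` and `expected_output` column names match the format shown above, while the file name and case list are illustrative:

```python
import csv

# Test cases to export: (question, expected answer) pairs
cases = [
    ("What is the return policy?", "Returns accepted within 7 days of purchase"),
    ("What are the business hours?", "Monday to Friday 9:00-18:00"),
]

with open("test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "expected_output"])  # header row expected by the importer
    writer.writerows(cases)
```

Using `csv.writer` (rather than joining strings by hand) ensures questions containing commas or quotes are escaped correctly.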

Step 3: Run Tests

Manual Execution

  1. Select test cases to run (or click "Run All" to execute all)
  2. Select validation rule (Validator)
  3. Click "Run" to start testing

View Results

After testing completes, you'll see:

| Item | Description |
| --- | --- |
| AI Response | The Agent's actual answer |
| Grade | How well the answer meets the standard (0-100%) |
| Explanation | Why this score was given |

Interpret Scores

  • 80% and above: Answer meets standards
  • 50-79%: Partially meets; may need fine-tuning
  • Below 50%: Clearly doesn't meet; review your settings
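If you export results for triage outside the platform, the bands above can be expressed as a small helper. This is a hypothetical sketch; the function name and return strings are illustrative, not part of Codeer.ai:

```python
def interpret_grade(grade: float) -> str:
    """Map a 0-100 grade to the interpretation bands described above."""
    if grade >= 80:
        return "meets standards"
    if grade >= 50:
        return "partially meets; may need fine-tuning"
    return "does not meet; review settings"
```

For example, `interpret_grade(85)` returns `"meets standards"`.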

Step 4: Adjust Settings

Based on test results, adjust Agent settings:

Improve Instruction

| Issue Found in Testing | Adjustment Method |
| --- | --- |
| Answer too verbose | Add "Please answer concisely, within 100 words" |
| Forgets to use the correct language | Emphasize at the beginning: "Please answer in Traditional Chinese" |
| Misses key information | Add "When answering, be sure to mention..." |
| Tone too formal or too casual | Define tone examples |

Adjust Knowledge Sources

If Agent cannot answer certain questions:

  1. Confirm the relevant data is connected to the workspace
  2. Check that the data sync has completed
  3. Confirm the data source is selected in the Agent's settings

Step 5: Retest and Publish

  1. Rerun tests: Confirm all cases pass
  2. Publish new version: After confirming no issues, publish Agent update
  3. Save version: Codeer.ai automatically records version history for easy rollback later

Advanced: Configure Validation Rules (Validator)

Validation rules define "what makes a good answer," making test results more consistent and aligned with your standards.

Built-in Validation Rules

Codeer.ai provides several common validation methods:

  • Keyword Check: Whether the answer contains specific keywords
  • Similarity Comparison: Similarity between answer and expected answer
  • AI Scoring: Use AI to judge answer quality
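To illustrate the idea behind a keyword check: score the answer by the fraction of required keywords it contains. This is a hypothetical sketch of the general technique, not Codeer.ai's actual implementation:

```python
def keyword_check(answer: str, required_keywords: list[str]) -> float:
    """Return a 0-100 score: the percentage of required keywords found in the answer."""
    if not required_keywords:
        return 100.0  # nothing required, trivially passes
    lowered = answer.lower()  # case-insensitive matching
    hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
    return 100.0 * hits / len(required_keywords)
```

For example, checking the answer "Returns accepted within 7 days of purchase" against the keywords `["7 days", "return"]` scores 100.0, since both appear.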

Custom Validation Rules

Click "Manage Validators" to create custom rules:

  1. Give the rule a name (e.g., "Customer Service Tone Check")
  2. Set scoring criteria (e.g., "Must use polite language, no typos")
  3. Save and then use it when testing

Best Practices

When Should You Create Test Cases?

  • ✅ When you find Agent answers are not ideal, create cases before fixing
  • ✅ Before important features go live, create test cases for core Q&As
  • ✅ When receiving user feedback, convert issues into test cases

How Many Test Cases Should You Create?

| Agent Complexity | Recommended Cases |
| --- | --- |
| Simple (single purpose) | 5-10 |
| Medium (multiple Q&A types) | 10-30 |
| Complex (multiple domains) | 30+ |

How Often Should You Run Tests?

  • After every Instruction modification: Must run
  • After updating knowledge sources: Recommended to run
  • Regular checks: Run once per week or month to ensure quality stability

Common Questions

What score is good?

This depends on your use case. It's recommended to run tests once to establish a "baseline," and then aim to "not fall below the baseline." If most cases are above 80%, it indicates the Agent is performing stably.

Do test cases affect the Agent's answers?

No. Test cases are only used for verification and don't affect the Agent's actual behavior.

Can I test different versions of the Agent?

Yes. When running tests, you can select which Agent version to test, making it easy to compare performance across versions.

How can the team maintain test cases together?

Invite team members to the workspace, and they can add and edit test cases together.

Next Steps

After tests pass, you can: