Simple Ground Truth LLM Evaluation

Evaluate your prompt's performance by comparing its output against a provided ground truth dataset.

Ground Truth Evaluation Task
Define the prompt and upload ground truth data for evaluation. The results below are pre-populated with example data.
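A ground truth dataset is typically a list of input/expected-output pairs, often uploaded as JSONL (one JSON object per line). A minimal sketch of that format — the field names `input` and `expected` are assumptions, not a required schema:

```python
import json

# Hypothetical ground truth entries: each pairs a prompt input
# with the output the model is expected to produce.
ground_truth = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]

# Serialize to JSONL for upload: one compact JSON object per line.
jsonl = "\n".join(json.dumps(item) for item in ground_truth)
print(jsonl.splitlines()[0])
```

Each uploaded item then becomes one row in the per-item results table further down.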

Ground Truth Evaluation Results
Detailed breakdown of your prompt's performance against ground truth.

Overall Score: 88.20

Ground Truth JSON Parsing Rate: 99%
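The JSON parsing rate measures how often the model's raw output is valid JSON before any scoring happens; items that fail to parse cannot be compared against ground truth at all. A minimal sketch of how such a rate could be computed (the function name is illustrative):

```python
import json

def json_parse_rate(outputs):
    """Return the fraction of raw model outputs that parse as valid JSON."""
    parsed = 0
    for text in outputs:
        try:
            json.loads(text)
            parsed += 1
        except json.JSONDecodeError:
            pass  # malformed output; excluded from downstream scoring
    return parsed / len(outputs) if outputs else 0.0

outputs = ['{"answer": "4"}', '{"answer": "Paris"}', 'not json', '[1, 2]']
print(f"{json_parse_rate(outputs):.0%}")  # → 75%
```

A 99% rate like the one shown above means roughly 1 in 100 outputs was dropped before scoring.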

Scores per Dimension (chart)

Score Distribution by Frequency (chart)

Individual Example Scores

Ground Truth Item ID / Index    Score
Ground Truth Item 1             90
Ground Truth Item 2             85
Ground Truth Item 3             95
Ground Truth Item 4             80
Ground Truth Item 5             96
Scores for individual items compared against ground truth.
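The overall score is an aggregate of these per-item scores; a plain mean is one common choice. A sketch using the five example items above (note that the dashboard's overall score of 88.20 is presumably computed over the full dataset, not only the rows shown here):

```python
# Per-item scores from the five example rows above.
scores = [90, 85, 95, 80, 96]

# Simple mean as the aggregate; other aggregations (median,
# dimension-weighted averages) are equally possible.
overall = sum(scores) / len(scores)
print(f"Overall (these items only): {overall:.2f}")  # → Overall (these items only): 89.20
```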