Terraphim AI | Automata Evaluation

The evaluation framework measures how accurately the Aho-Corasick automata classify terms in real documents. Give it a JSON file of hand-labeled documents and it returns micro-averaged precision, recall, and F1, a per-term breakdown, and a list of terms that consistently produce false positives.

When to use it

After changing a thesaurus, to verify the automata still behave as expected.
To find patterns that match too broadly (systematic false positives).
To quantify recall gaps before adding new synonyms.

Ground-truth file format

[
  {
    "id": "doc1",
    "text": "tokio powers async rust applications",
    "expected_terms": [
      { "term": "tokio",  "category": null },
      { "term": "rust",   "category": null },
      { "term": "async",  "category": null }
    ]
  }
]

Each term must match the normalized term value (nterm) stored in the thesaurus.

Running an evaluation

use terraphim_automata::evaluation::{evaluate, load_ground_truth};

let docs = load_ground_truth(Path::new("ground_truth.json"))?;
let result = evaluate(&docs, thesaurus);

println!("F1: {:.2}", result.overall.f1);
for report in &result.per_term {
    println!("  {} precision={:.2} recall={:.2}",
        report.term, report.metrics.precision, report.metrics.recall);
}
for err in &result.systematic_errors {
    println!("  SYSTEMATIC FP: {} in {} documents", err.term, err.false_positive_count);
}

How metrics are computed

Metrics are micro-averaged: true positives, false positives, and false negatives are summed across all documents before dividing.

A term is flagged as a SystematicError when it appears as a false positive in 2 or more documents. Matching is case-insensitive; each term is counted at most once per document.

Source: crates/terraphim_automata/src/evaluation.rs