A CLI-first data curation engine for ML engineers. Curates your fine-tuning data so you train on signal, not noise — reducing dataset size by 50–80% without sacrificing accuracy.
Fine-tuning datasets are full of near-duplicate rows, low-quality examples, and conflicting instructions. Truva finds and removes them before they corrupt your model.
Identifies near-duplicate rows using embedding similarity and Union-Find clustering. Keeps only the most representative example from each group.
Rates every row on a 1–10 educational value scale using an LLM judge. Supports Ollama, OpenAI, and Anthropic — drop rows below your quality threshold.
Uses a local NLI model to discover rows teaching conflicting information within semantically similar clusters. No API key required.
Runs the complete pipeline — deduplication, scoring, and contradiction detection — in a single command and outputs a curated gold dataset with a detailed report.
Install with pip and run against any JSONL dataset.
| Command | Description |
|---|---|
| truva health | Quick health check — size, field stats, format validation |
| truva dedupe | Remove near-duplicate rows using embedding similarity |
| truva score | Rate rows 1–10 for educational value using an LLM judge |
| truva contradict | Detect rows with conflicting information via NLI model |
| truva audit | Full pipeline: dedupe + score + contradiction detection |
| truva inspect | Examine clusters or contradictions from a prior audit report |
| truva embed | Compute vector embeddings for dataset rows |
| truva export | Convert the curated gold dataset to JSONL or CSV |
| Requirement | Detail |
|---|---|
| Python | 3.10+ |
| Platform | macOS (Apple Silicon) · Linux |
| License | Apache 2.0 |
| LLM judge (score/audit) | Ollama (local) · OpenAI · Anthropic — optional |
| Contradiction detection | Local NLI model — no API key needed |
Common workflows for cleaning and curating fine-tuning datasets.
Run truva dedupe against your JSONL dataset. Truva embeds every row, clusters by cosine similarity, and keeps only the most representative example from each cluster.
--threshold (e.g. 0.85) to be more aggressive about removing similar-but-not-identical rows. Default is 0.95.Use truva score to have a language model rate each row's educational value on a 1–10 scale. Rows below your quality threshold are dropped automatically.
--interactive --sample 50 to preview scores on a sample before processing your full dataset. No rows are written in this mode.truva audit chains deduplication, quality scoring, and contradiction detection into a single command. The output directory contains a curated gold dataset and a full JSON report.
Inspect the results from the audit report:
./results/ directory contains gold.jsonl (your curated dataset) and report.json (full audit trail with cluster maps, scores, and contradiction pairs).Convert your gold dataset to any format for downstream training pipelines.
Common flags available across Truva commands.
| Flag | Description |
|---|---|
| --threshold | Cosine similarity threshold (default: 0.95) |
| --provider | Embedding provider: local or api |
| --model | Embedding model name |
| --text-field | Column to embed (default: auto-detected) |
| --output, -o | Output file path |
| --report | Path for JSON deduplication report |
| Flag | Description |
|---|---|
| --provider | Judge provider: ollama · openai · anthropic |
| --model | Judge model name |
| --mode | fast (default) or thorough |
| --min-quality | Drop rows below this score (1–10) |
| --calibration | Path to calibration YAML |
| --interactive | Dry-run preview mode — no rows written |
| --sample | Number of rows to sample in interactive mode |
| Flag | Description |
|---|---|
| --confidence | Min NLI confidence threshold (default: 0.8) |
| --nli-model | NLI model name |
| --dedupe-threshold | Clustering similarity threshold |
| --output, -o | Output path for contradiction report |
Deduplicate, score, and audit your fine-tuning datasets before they corrupt your model. Cut GPU costs by 50–80% — without sacrificing accuracy.
pip install truva