Published on PyPI

truva

A CLI-first data curation engine for ML engineers. Curates your fine-tuning data so you train on signal, not noise — reducing dataset size by 50–80% without sacrificing accuracy.

$ pip install truva

What Truva does

Fine-tuning datasets are full of near-duplicate rows, low-quality examples, and conflicting instructions. Truva finds and removes them before they corrupt your model.

🧹 Deduplicate

Identifies near-duplicate rows using embedding similarity and Union-Find clustering. Keeps only the most representative example from each group.
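The idea behind this step can be sketched with cosine similarity plus Union-Find over toy 2-D "embeddings" — an illustration of the technique, not Truva's actual implementation:

```python
# Illustrative sketch: group near-duplicates by cosine similarity,
# merging rows into clusters with Union-Find (path compression).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def dedupe_clusters(vectors, threshold=0.95):
    parent = list(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(parent, i)] = find(parent, j)  # union
    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(parent, i), []).append(i)
    return list(clusters.values())

vecs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
print(dedupe_clusters(vecs))  # -> [[0, 1], [2]]
```

Rows 0 and 1 point in nearly the same direction, so they merge into one cluster; a real pipeline would then keep one representative per cluster.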

Score

Rates every row on a 1–10 educational value scale using an LLM judge. Supports Ollama, OpenAI, and Anthropic — drop rows below your quality threshold.
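A minimal sketch of how a min-quality cutoff behaves — here `judge` is a stand-in for the LLM call, and the row fields are assumptions, not Truva's schema:

```python
# Sketch of threshold filtering after judging. `judge` stands in for an
# LLM judge (Truva supports Ollama, OpenAI, and Anthropic providers).

def filter_by_quality(rows, judge, min_quality=6):
    kept = []
    for row in rows:
        score = judge(row)  # 1-10 educational-value rating
        if score >= min_quality:
            kept.append({**row, "quality": score})
    return kept

# Stub judge for illustration only: longer completions score higher.
rows = [
    {"prompt": "p", "completion": "short"},
    {"prompt": "p", "completion": "a much longer, more detailed answer"},
]
stub_judge = lambda row: min(10, len(row["completion"]) // 4)
print(filter_by_quality(rows, stub_judge))  # keeps only the second row
```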

Detect Contradictions

Uses a local NLI model to discover rows teaching conflicting information within semantically similar clusters. No API key required.
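The pairwise scan inside a cluster can be sketched like this — `nli` is a stand-in for a local NLI model that returns a contradiction probability for a pair of texts, and the toy scorer below is purely illustrative:

```python
# Illustrative pairwise contradiction scan within one similarity cluster.

def find_contradictions(cluster, nli, confidence=0.8):
    flagged = []
    for i in range(len(cluster)):
        for j in range(i + 1, len(cluster)):
            p = nli(cluster[i], cluster[j])  # contradiction probability
            if p >= confidence:
                flagged.append((i, j, p))
    return flagged

# Toy "model": flags pairs whose statements have opposite polarity.
def toy_nli(a, b):
    return 0.9 if (" not " in a) != (" not " in b) else 0.1

cluster = [
    "Paris is the capital.",
    "Paris is not the capital.",
    "Paris is the capital of France.",
]
print(find_contradictions(cluster, toy_nli))  # -> [(0, 1, 0.9), (1, 2, 0.9)]
```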

🔍 Full Audit

Runs the complete pipeline — deduplication, scoring, and contradiction detection — in a single command and outputs a curated gold dataset with a detailed report.

Quick Start

Install with pip and run against any JSONL dataset.

# Install
pip install truva

# Run a quick health check on your dataset
truva health ./data.jsonl

# Deduplicate
truva dedupe ./data.jsonl --threshold 0.95

# Score quality (requires Ollama running locally)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Full pipeline audit
truva audit ./data.jsonl --provider openai --model gpt-4o-mini --output ./results/
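As a rough picture of what a dataset health check inspects — row count, parse errors, and per-field coverage — here is a sketch under stated assumptions (this is not Truva's implementation):

```python
# Sketch of a JSONL health check: count rows, catch malformed lines,
# and tally how often each field appears.
import json
from collections import Counter

def health(jsonl_text):
    rows, errors, fields = 0, 0, Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except ValueError:
            errors += 1
            continue
        rows += 1
        fields.update(obj.keys())
    return {"rows": rows, "parse_errors": errors, "fields": dict(fields)}

sample = '{"prompt": "a", "completion": "b"}\nnot json\n'
print(health(sample))
```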

Commands

| Command | Description |
| --- | --- |
| truva health | Quick health check — size, field stats, format validation |
| truva dedupe | Remove near-duplicate rows using embedding similarity |
| truva score | Rate rows 1–10 for educational value using an LLM judge |
| truva contradict | Detect rows with conflicting information via NLI model |
| truva audit | Full pipeline: dedupe + score + contradiction detection |
| truva inspect | Examine clusters or contradictions from a prior audit report |
| truva embed | Compute vector embeddings for dataset rows |
| truva export | Convert the curated gold dataset to JSONL or CSV |

Requirements

| Requirement | Detail |
| --- | --- |
| Python | 3.10+ |
| Platform | macOS (Apple Silicon) · Linux |
| License | Apache 2.0 |
| LLM judge (score/audit) | Ollama (local) · OpenAI · Anthropic — optional |
| Contradiction detection | Local NLI model — no API key needed |

Examples

Common workflows for cleaning and curating fine-tuning datasets.

Example 1 — Deduplicate

Removing near-duplicate training examples

Run truva dedupe against your JSONL dataset. Truva embeds every row, clusters by cosine similarity, and keeps only the most representative example from each cluster.

truva dedupe ./data.jsonl --threshold 0.95 --output ./deduped.jsonl
Lower the --threshold (e.g. 0.85) to be more aggressive about removing similar-but-not-identical rows. Default is 0.95.
Example 2 — Score

Filtering low-quality rows with an LLM judge

Use truva score to have a language model rate each row's educational value on a 1–10 scale. Rows below your quality threshold are dropped automatically.

# Local judge with Ollama (no API cost)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Preview scores before committing (dry run)
truva score ./data.jsonl --interactive --sample 50

# Using OpenAI
truva score ./data.jsonl --provider openai --model gpt-4o-mini --min-quality 7
Use --interactive --sample 50 to preview scores on a sample before processing your full dataset. No rows are written in this mode.
Example 3 — Full Audit

Running the complete curation pipeline

truva audit chains deduplication, quality scoring, and contradiction detection into a single command. The output directory contains a curated gold dataset and a full JSON report.

truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --dedupe-threshold 0.95 \
  --min-quality 6 \
  --detect-contradictions \
  --output ./results/

Inspect the results from the audit report:

# View deduplicated clusters
truva inspect ./results/report.json --what clusters --top 10

# View detected contradictions
truva inspect ./results/report.json --what contradictions --top 10
The ./results/ directory contains gold.jsonl (your curated dataset) and report.json (full audit trail with cluster maps, scores, and contradiction pairs).
Example 4 — Export

Exporting the curated dataset

Convert your gold dataset to JSONL or CSV for downstream training pipelines.

# Export as CSV
truva export ./results/gold.jsonl --format csv -o ./gold.csv

# Export as JSONL with selected columns
truva export ./results/gold.jsonl --format jsonl --columns prompt,completion -o ./gold.jsonl
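Conceptually, this export is a JSONL-to-CSV conversion with column selection. A minimal sketch (the column names here are assumptions about the dataset, not Truva's defaults):

```python
# Sketch of a JSONL -> CSV export that keeps only selected columns.
import csv
import io
import json

def jsonl_to_csv(jsonl_text, columns):
    out = io.StringIO()
    # extrasaction="ignore" drops fields not listed in `columns`
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()

gold = '{"prompt": "2+2?", "completion": "4", "quality": 9}\n'
print(jsonl_to_csv(gold, ["prompt", "completion"]))
```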

Flag Reference

Common flags available across Truva commands.

dedupe

| Flag | Description |
| --- | --- |
| --threshold | Cosine similarity threshold (default: 0.95) |
| --provider | Embedding provider: local or api |
| --model | Embedding model name |
| --text-field | Column to embed (default: auto-detected) |
| --output, -o | Output file path |
| --report | Path for JSON deduplication report |

score

| Flag | Description |
| --- | --- |
| --provider | Judge provider: ollama · openai · anthropic |
| --model | Judge model name |
| --mode | fast (default) or thorough |
| --min-quality | Drop rows below this score (1–10) |
| --calibration | Path to calibration YAML |
| --interactive | Dry-run preview mode — no rows written |
| --sample | Number of rows to sample in interactive mode |

contradict

| Flag | Description |
| --- | --- |
| --confidence | Min NLI confidence threshold (default: 0.8) |
| --nli-model | NLI model name |
| --dedupe-threshold | Clustering similarity threshold |
| --output, -o | Output path for contradiction report |

Train on signal, not noise.

Deduplicate, score, and audit your fine-tuning datasets before they corrupt your model. Cut dataset size by 50–80%, and GPU costs with it, without sacrificing accuracy.

pip install truva