Published on PyPI

truva

A CLI-first data curation engine for ML engineers. Curates your fine-tuning data so you train on signal, not noise — reducing dataset size by 50–80% without sacrificing accuracy.

$ pip install truva

What Truva does

Fine-tuning datasets are full of near-duplicate rows, low-quality examples, and conflicting instructions. Truva finds and removes them before they corrupt your model.

🧹 Deduplicate

Identifies near-duplicate rows using embedding similarity and Union-Find clustering. Keeps only the most representative example from each group.
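The idea behind this step can be sketched with cosine similarity plus Union-Find over toy 2-D "embeddings" — an illustration of the technique, not Truva's actual implementation:

```python
# Illustrative sketch: group near-duplicates by cosine similarity,
# merging rows into clusters with Union-Find (path compression).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def dedupe_clusters(vectors, threshold=0.95):
    parent = list(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(parent, i)] = find(parent, j)  # union
    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(parent, i), []).append(i)
    return list(clusters.values())

vecs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
print(dedupe_clusters(vecs))  # -> [[0, 1], [2]]
```

Rows 0 and 1 point in nearly the same direction, so they merge into one cluster; a real pipeline would then keep one representative per cluster.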

Score

Rates every row on a 1–10 educational value scale using an LLM judge. Supports Ollama, OpenAI, and Anthropic — drop rows below your quality threshold.
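A minimal sketch of how a min-quality cutoff behaves — here `judge` is a stand-in for the LLM call, and the row fields are assumptions, not Truva's schema:

```python
# Sketch of threshold filtering after judging. `judge` stands in for an
# LLM judge (Truva supports Ollama, OpenAI, and Anthropic providers).

def filter_by_quality(rows, judge, min_quality=6):
    kept = []
    for row in rows:
        score = judge(row)  # 1-10 educational-value rating
        if score >= min_quality:
            kept.append({**row, "quality": score})
    return kept

# Stub judge for illustration only: longer completions score higher.
rows = [
    {"prompt": "p", "completion": "short"},
    {"prompt": "p", "completion": "a much longer, more detailed answer"},
]
stub_judge = lambda row: min(10, len(row["completion"]) // 4)
print(filter_by_quality(rows, stub_judge))  # keeps only the second row
```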

Detect Contradictions

Uses a local NLI model to discover rows teaching conflicting information within semantically similar clusters. No API key required.
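The pairwise scan inside a cluster can be sketched like this — `nli` is a stand-in for a local NLI model that returns a contradiction probability for a pair of texts, and the toy scorer below is purely illustrative:

```python
# Illustrative pairwise contradiction scan within one similarity cluster.

def find_contradictions(cluster, nli, confidence=0.8):
    flagged = []
    for i in range(len(cluster)):
        for j in range(i + 1, len(cluster)):
            p = nli(cluster[i], cluster[j])  # contradiction probability
            if p >= confidence:
                flagged.append((i, j, p))
    return flagged

# Toy "model": flags pairs whose statements have opposite polarity.
def toy_nli(a, b):
    return 0.9 if (" not " in a) != (" not " in b) else 0.1

cluster = [
    "Paris is the capital.",
    "Paris is not the capital.",
    "Paris is the capital of France.",
]
print(find_contradictions(cluster, toy_nli))  # -> [(0, 1, 0.9), (1, 2, 0.9)]
```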

🔍 Full Audit

Runs the complete pipeline — deduplication, scoring, and contradiction detection — in a single command and outputs a curated gold dataset with a detailed report.

Quick Start

Install with pip and run against any JSONL dataset.

# Install
pip install truva

# Run a quick health check on your dataset
truva health ./data.jsonl

# Deduplicate
truva dedupe ./data.jsonl --threshold 0.95

# Score quality (requires Ollama running locally)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Full pipeline audit
truva audit ./data.jsonl --provider openai --model gpt-4o-mini --output ./results/
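As a rough picture of what a dataset health check inspects — row count, parse errors, and per-field coverage — here is a sketch under stated assumptions (this is not Truva's implementation):

```python
# Sketch of a JSONL health check: count rows, catch malformed lines,
# and tally how often each field appears.
import json
from collections import Counter

def health(jsonl_text):
    rows, errors, fields = 0, 0, Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except ValueError:
            errors += 1
            continue
        rows += 1
        fields.update(obj.keys())
    return {"rows": rows, "parse_errors": errors, "fields": dict(fields)}

sample = '{"prompt": "a", "completion": "b"}\nnot json\n'
print(health(sample))
```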

Commands

| Command | Description |
| --- | --- |
| truva health | Quick health check — size, field stats, format validation |
| truva dedupe | Remove near-duplicate rows using embedding similarity |
| truva score | Rate rows 1–10 for educational value using an LLM judge |
| truva contradict | Detect rows with conflicting information via NLI model |
| truva audit | Full pipeline: dedupe + score + contradiction detection |
| truva inspect | Examine clusters or contradictions from a prior audit report |
| truva embed | Compute vector embeddings for dataset rows |
| truva export | Convert the curated gold dataset to JSONL or CSV |

Requirements

| Requirement | Detail |
| --- | --- |
| Python | 3.10+ |
| Platform | macOS (Apple Silicon) · Linux |
| License | Apache 2.0 |
| LLM judge (score/audit) | Ollama (local) · OpenAI · Anthropic — optional |
| Contradiction detection | Local NLI model — no API key needed |

Examples

Common workflows for cleaning and curating fine-tuning datasets.

Example 1 — Deduplicate

Removing near-duplicate training examples

Run truva dedupe against your JSONL dataset. Truva embeds every row, clusters by cosine similarity, and keeps only the most representative example from each cluster.

truva dedupe ./data.jsonl --threshold 0.95 --output ./deduped.jsonl
Lower the --threshold (e.g. 0.85) to be more aggressive about removing similar-but-not-identical rows. Default is 0.95.
Example 2 — Score

Filtering low-quality rows with an LLM judge

Use truva score to have a language model rate each row's educational value on a 1–10 scale. Rows below your quality threshold are dropped automatically.

# Local judge with Ollama (no API cost)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Preview scores before committing (dry run)
truva score ./data.jsonl --interactive --sample 50

# Using OpenAI
truva score ./data.jsonl --provider openai --model gpt-4o-mini --min-quality 7
Use --interactive --sample 50 to preview scores on a sample before processing your full dataset. No rows are written in this mode.
Example 3 — Full Audit

Running the complete curation pipeline

truva audit chains deduplication, quality scoring, and contradiction detection into a single command. The output directory contains a curated gold dataset and a full JSON report.

truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --dedupe-threshold 0.95 \
  --min-quality 6 \
  --detect-contradictions \
  --output ./results/

Inspect the results from the audit report:

# View deduplicated clusters
truva inspect ./results/report.json --what clusters --top 10

# View detected contradictions
truva inspect ./results/report.json --what contradictions --top 10
The ./results/ directory contains gold.jsonl (your curated dataset) and report.json (full audit trail with cluster maps, scores, and contradiction pairs).
Example 4 — Export

Exporting the curated dataset

Convert your gold dataset to JSONL or CSV for downstream training pipelines.

# Export as CSV
truva export ./results/gold.jsonl --format csv -o ./gold.csv

# Export as JSONL with selected columns
truva export ./results/gold.jsonl --format jsonl --columns prompt,completion -o ./gold.jsonl
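Conceptually, this export is a JSONL-to-CSV conversion with column selection. A minimal sketch (the column names here are assumptions about the dataset, not Truva's defaults):

```python
# Sketch of a JSONL -> CSV export that keeps only selected columns.
import csv
import io
import json

def jsonl_to_csv(jsonl_text, columns):
    out = io.StringIO()
    # extrasaction="ignore" drops fields not listed in `columns`
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()

gold = '{"prompt": "2+2?", "completion": "4", "quality": 9}\n'
print(jsonl_to_csv(gold, ["prompt", "completion"]))
```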

Flag Reference

Common flags available across Truva commands.

dedupe

| Flag | Description |
| --- | --- |
| --threshold | Cosine similarity threshold (default: 0.95) |
| --provider | Embedding provider: local or api |
| --model | Embedding model name |
| --text-field | Column to embed (default: auto-detected) |
| --output, -o | Output file path |
| --report | Path for JSON deduplication report |

score

| Flag | Description |
| --- | --- |
| --provider | Judge provider: ollama · openai · anthropic |
| --model | Judge model name |
| --mode | fast (default) or thorough |
| --min-quality | Drop rows below this score (1–10) |
| --calibration | Path to calibration YAML |
| --interactive | Dry-run preview mode — no rows written |
| --sample | Number of rows to sample in interactive mode |

contradict

| Flag | Description |
| --- | --- |
| --confidence | Min NLI confidence threshold (default: 0.8) |
| --nli-model | NLI model name |
| --dedupe-threshold | Clustering similarity threshold |
| --output, -o | Output path for contradiction report |

Train on signal, not noise.

Deduplicate, score, and audit your fine-tuning datasets before they corrupt your model. Cut dataset size by 50–80%, and GPU costs with it, without sacrificing accuracy.

pip install truva