Free · Published on PyPI

feedloop

Collect human preference feedback on LLM outputs — directly from your Python code, with zero configuration. Submit pairs of model responses, pick the better one in a local browser UI, and export a training dataset ready for DPO fine-tuning.

Think of it as the SQLite of human feedback: local-first, pip-installable, and out of your way.

$ pip install feedloop

Requires Python 3.10 or higher. No other setup needed.

Who is this for?

feedloop is built for solo AI developers who want to:

  • Compare two models (e.g. GPT-4o vs Claude) and collect human judgments
  • Build a preference dataset for DPO fine-tuning without a cloud platform
  • Run fast evaluation loops locally during model development
  • Understand where their model falls short by seeing which outputs humans prefer

If you've ever wanted to collect preference data but felt LangSmith, Humanloop, or Argilla were too heavy for a one-person project — feedloop is for you.

What feedloop does

A complete local feedback loop — from raw model outputs to a fine-tuning-ready dataset — with no cloud accounts, no API keys, and no configuration files.

🏠 Local-first

Runs entirely on your machine. No data ever sent to a server. Results stored in a local SQLite database.

🔌 Model-agnostic

Works with OpenAI, Anthropic, Hugging Face, Ollama, or any model you can call from Python.

🖥 Browser review UI

Side-by-side comparison at localhost:7856. Keyboard shortcuts, progress bar, and auto-updating feed.

📦 DPO-ready export

One call to export JSONL compatible with TRL, OpenRLHF, and any custom fine-tuning pipeline.

Quick Start

The complete workflow from zero to a preference dataset in under 10 minutes.

1. Start feedloop

import feedloop

feedloop.start()

This does three things:

  • Starts a local web server in a background thread (your script keeps running)
  • Opens http://localhost:7856 in your browser
  • Creates a SQLite database at ~/.feedloop/feedloop.db to store results

feedloop: server running at http://localhost:7856

2. Submit comparisons

Call feedloop.compare() with a prompt and exactly two model outputs. It returns immediately — your script does not pause.

comparison_id = feedloop.compare(
    prompt="Explain recursion to a 10-year-old.",
    outputs=[
        "Recursion is when a function calls itself...",       # Output A
        "Imagine you're looking for a book in a library...",  # Output B
    ],
)

Each comparison appears in the browser UI instantly. You can submit as many as you like before anyone starts reviewing.

3. Review in the browser

Switch to the browser tab that opened. You will see the prompt and both outputs side by side. Click Left or Right to choose the better response, or Skip if neither is clearly better. The UI updates automatically as new comparisons come in — no page refresh needed.

Screenshot coming soon
4. Export your data

Once all comparisons are rated, export to a JSONL file:

feedloop.export("preferences.jsonl")

Or click Download preferences.jsonl in the browser UI. That's it.

The Review UI

The UI has three states:

Waiting

No comparisons yet. Shows a hint telling you to call feedloop.compare().

Reviewing

Side-by-side comparison with the prompt at the top, two response cards (Left / Right), keyboard shortcuts (1 = prefer Left, 2 = prefer Right, S = skip), and a progress bar showing how many comparisons remain.

Done

All comparisons are rated. Shows a summary (X rated, Y skipped), a Download button for preferences.jsonl, a Download button for train_dpo.py (a ready-to-run fine-tuning script), and a step-by-step guide for training your model.

Tip: The Left/Right positions are randomised for each comparison to reduce position bias. feedloop tracks which output was which internally and exports the correct chosen/rejected pair regardless of which side you clicked.
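The randomisation the tip describes can be pictured with a tiny sketch. This is illustrative only, not feedloop's actual code; `randomise_sides` and `resolve_choice` are hypothetical names:

```python
import random

def randomise_sides(output_a, output_b, rng=random):
    """Shuffle which output shows as Left vs Right, remembering the mapping."""
    if rng.random() < 0.5:
        return {"left": output_a, "right": output_b, "left_is_a": True}
    return {"left": output_b, "right": output_a, "left_is_a": False}

def resolve_choice(assignment, clicked):
    """Map a 'left'/'right' click back to the chosen/rejected outputs."""
    chosen = assignment[clicked]
    rejected = assignment["right" if clicked == "left" else "left"]
    return chosen, rejected
```

Whichever side the reviewer clicks, the mapping resolves back to the original outputs, so the exported chosen/rejected pair is always correct.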
Screenshot coming soon — Done screen with download buttons

Exporting Your Data

Export all human-rated comparisons with one call.

From Python

feedloop.export("preferences.jsonl")
# feedloop: exported 42 preferences to preferences.jsonl

From the browser

Click Download preferences.jsonl on the Done screen.

What the file looks like

Each line is a JSON object with three fields:

{"prompt": "Explain recursion to a 10-year-old.", "chosen": "Imagine you're looking for a book...", "rejected": "Recursion is when a function calls itself..."}
{"prompt": "Write a haiku about autumn.", "chosen": "Leaves fall silently...", "rejected": "Autumn brings cool air..."}

Compatible frameworks

Framework            Usage
TRL (Hugging Face)   DPOTrainer
OpenRLHF             Native support
Custom pipelines     Standard JSONL — easy to parse
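Because the export is plain JSONL, a custom pipeline can consume it with a few lines of standard-library Python. A sketch; `read_preferences` is an illustrative helper, not part of feedloop's API:

```python
import json

def read_preferences(path):
    """Load a feedloop export: one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# A single exported line has exactly these three fields:
record = json.loads(
    '{"prompt": "Write a haiku about autumn.", '
    '"chosen": "Leaves fall silently...", '
    '"rejected": "Autumn brings cool air..."}'
)
assert set(record) == {"prompt", "chosen", "rejected"}
```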

Full API Reference

All six functions in feedloop's public API.

feedloop.start()

Launches the feedback server. Safe to call multiple times — calling it twice reuses the running server.

feedloop.start(
    port=7856,                  # port to run the server on
    db_path=None,               # path to SQLite DB (uses ~/.feedloop/feedloop.db if not set)
    open_browser=True,          # auto-open browser on start
    uncertainty_threshold=0.0,  # see Uncertainty-Based Filtering below
)

feedloop.compare()

Submits a pair of outputs for human review. Non-blocking — returns immediately.

comparison_id = feedloop.compare(
    prompt="Your prompt here",
    outputs=["Response A", "Response B"],  # exactly 2 outputs required
    uncertainty=None,  # optional float 0.0–1.0, see Uncertainty section
    metadata=None,     # optional dict, stored with the record
)

Returns a comparison_id string you can use with feedloop.wait().

With metadata:

feedloop.compare(
    prompt="Summarise this article.",
    outputs=[response_gpt4, response_claude],
    metadata={"model_a": "gpt-4o", "model_b": "claude-3-5-sonnet", "temperature": 0.7},
)

feedloop.wait()

Blocks until feedback is received.

# Wait for a specific comparison
result = feedloop.wait(comparison_id)
# Returns: {"prompt": "...", "chosen": "...", "rejected": "...", "auto_skipped": False}

# Wait for ALL pending comparisons to be rated
result = feedloop.wait()
# Returns: {"completed": 42, "total": 42}

# With a timeout (returns None if timed out)
result = feedloop.wait(comparison_id, timeout=60)

For comparisons that were auto-skipped due to uncertainty filtering, wait() resolves immediately:

{"auto_skipped": True, "prompt": "..."}

feedloop.status()

Returns a live count of comparisons in the current session.

feedloop.status()
# {"pending": 3, "completed": 10, "skipped": 1, "auto_skipped": 5, "total": 19}
Field          Meaning
pending        Waiting for human review
completed      Human has rated
skipped        Human clicked Skip
auto_skipped   Filtered out by uncertainty threshold
total          All of the above combined
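Since the four non-total counts always sum to `total`, the status dict lends itself to a one-line progress report. A sketch; the dict shape is copied from the example above, and `progress_line` is a hypothetical helper, not a feedloop function:

```python
def progress_line(status):
    """Summarise a feedloop.status()-style dict as a progress string."""
    resolved = status["completed"] + status["skipped"] + status["auto_skipped"]
    return f"{resolved}/{status['total']} resolved, {status['pending']} awaiting review"

s = {"pending": 3, "completed": 10, "skipped": 1, "auto_skipped": 5, "total": 19}
print(progress_line(s))  # 16/19 resolved, 3 awaiting review
```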

feedloop.export()

Exports all human-rated comparisons (completed only — skipped and auto-skipped are excluded).

count = feedloop.export(
    path="preferences.jsonl",  # output file path
    format="dpo",              # only "dpo" supported in v1.5
)
# feedloop: exported 42 preferences to preferences.jsonl

feedloop.stop()

Shuts down the server. Called automatically when your script exits.

feedloop.stop()

Uncertainty-Based Filtering

In a typical evaluation loop, not every comparison is equally useful for training. If your model is very confident about an output, sending it to a human for review may waste their time. feedloop lets you filter comparisons by an uncertainty score — only surfacing the ones where the model is unsure.

How it works

You pass an uncertainty score (0.0 to 1.0) to feedloop.compare(). If the score is below your threshold, the comparison is automatically skipped — it never appears in the UI. If it is at or above the threshold, it goes to the human reviewer as normal.

feedloop.start(uncertainty_threshold=0.6)

# This will be shown to the human (uncertainty >= threshold)
feedloop.compare(prompt, outputs=[a, b], uncertainty=0.85)

# This will be auto-skipped (uncertainty < threshold)
feedloop.compare(prompt, outputs=[a, b], uncertainty=0.3)

Threshold guide

Threshold       Effect
0.0 (default)   All comparisons go to human — same as v1 behaviour
0.5             Only comparisons where the model is at least 50% uncertain are reviewed
0.9             Only the highest-uncertainty comparisons are reviewed — maximum efficiency
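As described, the filtering decision reduces to a one-line predicate. This is our reading of the rule sketched for clarity; `is_auto_skipped` is not a feedloop function, and the `None` handling reflects the "don't pass uncertainty" option rather than a quoted implementation:

```python
def is_auto_skipped(uncertainty, threshold):
    """True when a comparison is filtered out before reaching the reviewer."""
    if uncertainty is None:  # no score supplied: always show to the human
        return False
    return uncertainty < threshold

assert is_auto_skipped(0.3, 0.6)       # below threshold: filtered out
assert not is_auto_skipped(0.85, 0.6)  # at/above threshold: reviewed
assert not is_auto_skipped(None, 0.6)  # no score: always reviewed
```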

How to compute uncertainty

feedloop is model-agnostic — it accepts any float you provide.

Option 1 — OpenAI logprobs (most accurate)

import math

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,
    top_logprobs=5,
)

token_probs = [
    [math.exp(lp.logprob) for lp in token.top_logprobs]
    for token in response.choices[0].logprobs.content[:10]
]
avg_entropy = sum(
    -sum(p * math.log(p + 1e-9) for p in probs)
    for probs in token_probs
) / len(token_probs)

# Normalise to 0–1 (rough approximation)
uncertainty = min(avg_entropy / math.log(5), 1.0)

Option 2 — Consistency sampling (works with any API, including Claude)

# Ask the model the same prompt N times and measure disagreement
responses = [model.generate(prompt) for _ in range(3)]

# Simple proxy: if all 3 responses are identical, uncertainty is low
unique = len(set(responses))
uncertainty = (unique - 1) / 2  # 0.0 if all same, 1.0 if all different
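Exact string equality is a blunt instrument for LLM outputs, which rarely match character for character. A slightly more forgiving variant (our own tweak, not part of the library) compares whitespace- and case-normalised text and generalises to any sample count:

```python
def sampling_uncertainty(responses):
    """Fraction of effectively-distinct answers among N samples (0.0–1.0)."""
    normalised = {" ".join(r.lower().split()) for r in responses}
    n = len(responses)
    return (len(normalised) - 1) / max(n - 1, 1)

assert sampling_uncertainty(["Yes.", "yes.", " YES. "]) == 0.0  # all agree
assert sampling_uncertainty(["a", "b", "c"]) == 1.0             # all differ
```

For longer answers you would want a real similarity measure (token overlap, embeddings), but the same 0-to-1 scale plugs straight into feedloop.compare().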

Option 3 — Always send everything (skip uncertainty)

# Just don't pass uncertainty — feedloop sends every comparison to the human
feedloop.compare(prompt, outputs=[a, b])
Note: feedloop does not validate the meaning of the uncertainty score — it trusts whatever value you pass. The scale only matters relative to your own threshold.

Fine-Tuning Your Model

Once you have collected preferences, feedloop gives you a complete path to fine-tuning your model with DPO (Direct Preference Optimization).

1. Download the training script

When all comparisons are reviewed, the Done screen shows a Download train_dpo.py button. Click it. This gives you a ready-to-run Python script pre-configured for TRL's DPOTrainer.

2. Install training dependencies

pip install trl transformers datasets torch accelerate
3. Configure the script

Open train_dpo.py and set BASE_MODEL:

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # change this
PREFERENCES_FILE = "preferences.jsonl"    # your exported data
OUTPUT_DIR = "./dpo-finetuned"            # where to save the model
NUM_EPOCHS = 1
LEARNING_RATE = 1e-5
4. Run training

Make sure preferences.jsonl is in the same directory, then:

python train_dpo.py
5. Test the fine-tuned model

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./dpo-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./dpo-finetuned")

inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
6. Keep improving (the feedback loop)

Run your prompts through both the old and fine-tuned model. Collect comparisons. Export. Fine-tune again. Repeat. This is the core loop that makes RLHF work: generate → compare → fine-tune → repeat.

import feedloop

feedloop.start()

for prompt in your_test_prompts:
    old_response = old_model.generate(prompt)
    new_response = new_model.generate(prompt)
    feedloop.compare(prompt, outputs=[old_response, new_response])

feedloop.wait()
feedloop.export("preferences_v2.jsonl")

Real-World Examples

Four complete workflows ready to adapt to your own setup.

Example 1: Compare two OpenAI models

import feedloop
from openai import OpenAI

client = OpenAI()

prompts = [
    "Explain black holes to a high schooler.",
    "Write a Python function that checks if a string is a palindrome.",
    "What are the pros and cons of microservices architecture?",
]

feedloop.start(db_path="openai_comparison.db")

for prompt in prompts:
    response_mini = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    response_full = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    feedloop.compare(
        prompt=prompt,
        outputs=[response_mini, response_full],
        metadata={"model_a": "gpt-4o-mini", "model_b": "gpt-4o"},
    )

print(f"Review {len(prompts)} comparisons in the browser, then press Enter...")
input()
feedloop.export("gpt4o_vs_mini.jsonl")
Example 2: Use uncertainty to focus human attention

import math

import feedloop
from openai import OpenAI

client = OpenAI()

def get_response_and_uncertainty(prompt: str) -> tuple[str, float]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
        top_logprobs=5,
    )
    text = resp.choices[0].message.content
    token_logprobs = resp.choices[0].logprobs.content[:10]

    entropy_values = []
    for token in token_logprobs:
        probs = [math.exp(lp.logprob) for lp in token.top_logprobs]
        entropy = -sum(p * math.log(p + 1e-9) for p in probs)
        entropy_values.append(entropy)

    avg_entropy = sum(entropy_values) / len(entropy_values) if entropy_values else 0.0
    uncertainty = min(avg_entropy / math.log(5), 1.0)
    return text, uncertainty

prompts = [...]  # your list of prompts

# Only review comparisons where the model is uncertain (>= 0.6)
feedloop.start(uncertainty_threshold=0.6)

for prompt in prompts:
    response_a, uncertainty_a = get_response_and_uncertainty(prompt)
    response_b, uncertainty_b = get_response_and_uncertainty(prompt)
    avg_uncertainty = (uncertainty_a + uncertainty_b) / 2

    feedloop.compare(
        prompt=prompt,
        outputs=[response_a, response_b],
        uncertainty=avg_uncertainty,
    )

s = feedloop.status()
print(f"Sent {s['pending']} to human, auto-skipped {s['auto_skipped']}")

feedloop.wait()
feedloop.export("preferences_high_uncertainty.jsonl")
Example 3: Non-blocking pipeline with per-comparison wait

import feedloop

feedloop.start(open_browser=False)

comparison_ids = []
for prompt, a, b in my_dataset:
    cid = feedloop.compare(prompt, outputs=[a, b])
    comparison_ids.append(cid)

# Later, collect results one by one
results = []
for cid in comparison_ids:
    result = feedloop.wait(cid, timeout=300)  # 5 min timeout per comparison
    if result and not result.get("auto_skipped"):
        results.append(result)

print(f"Collected {len(results)} preferences")
Example 4: Jupyter notebook workflow

# Cell 1 — start feedloop
import feedloop
feedloop.start(db_path="./notebook_feedback.db")

# Cell 2 — submit comparisons
prompts = ["What is a transformer?", "Explain attention mechanisms."]
for prompt in prompts:
    a = model_a.generate(prompt)
    b = model_b.generate(prompt)
    feedloop.compare(prompt, outputs=[a, b])

# Cell 3 — wait for all feedback, then export
feedloop.wait()
feedloop.export("notebook_preferences.jsonl")

Advanced Usage

Persisting Data Across Sessions

By default, feedloop stores data in ~/.feedloop/feedloop.db. This file persists between runs — your data is never lost when the script exits.

To use a custom location (useful for separating experiments):

feedloop.start(db_path="./experiments/run_1.db")

To export data from a previous session in a new script:

feedloop.start(db_path="./experiments/run_1.db", open_browser=False)
feedloop.export("run_1_preferences.jsonl")
feedloop.stop()

Running feedloop as a Standalone Server

You can run feedloop from the terminal without embedding it in a Python script:

feedloop
# feedloop: server running at http://localhost:7856
# feedloop: Press Ctrl+C to stop

CLI options:

feedloop --port 8080         # run on a different port
feedloop --db ./feedback.db  # use a specific database file
feedloop --no-browser        # don't auto-open the browser

SSH port forwarding (remote machine):

# On your laptop — forward remote port 7856 to localhost
ssh -L 7856:localhost:7856 your-server

# On the server — start feedloop without opening a browser
feedloop --no-browser

# On your laptop — open http://localhost:7856 in your browser

FAQ

Do I need an internet connection?

No. feedloop runs entirely on your local machine. No data is ever sent to any server.

Can I use feedloop with any LLM?

Yes. feedloop is model-agnostic. It works with OpenAI, Anthropic, Hugging Face, Ollama, or any model you can call from Python.

What happens if I submit comparisons before anyone starts reviewing?

They queue up. The UI polls every second and shows new comparisons as they arrive. You can submit all comparisons first, then review at your own pace.

Can multiple people review at the same time?

Not recommended in v1.5 — feedloop is designed for solo developers. Concurrent reviewers would see the same comparison and could double-rate it.

What does "skipped" mean vs "auto-skipped"?

"skipped" — a human saw the comparison and clicked Skip. "auto_skipped" — the comparison was filtered out by uncertainty_threshold before it reached the human. Neither appears in the exported JSONL.

Can I use feedloop in a Jupyter notebook?

Yes. Call feedloop.start() in one cell and feedloop.compare() in subsequent cells. See Example 4 above.

How much data do I need before fine-tuning?

TRL's DPOTrainer can work with as few as 100–200 preference pairs for small models. For best results, aim for 500+ pairs with diverse prompts.

Does feedloop work on Windows?

Yes — it is pure Python with no platform-specific dependencies.

Troubleshooting

Browser opens but shows a blank page

You may have an older version installed. Upgrade:

pip install --upgrade feedloop

Then verify the version:

import feedloop
print(feedloop.__version__)  # should be 1.5.0 or higher

Port already in use

feedloop automatically scans ports 7856–7866. If all are taken, specify a different port:

feedloop.start(port=9000)

“Call feedloop.start() first” error

You called feedloop.compare() or another function before feedloop.start(). Always call feedloop.start() first.

UI shows data from a previous session

Recent versions scope the UI to the current session. If you are seeing old data, make sure you are running feedloop 1.5.0 or higher:

pip install --upgrade feedloop

feedloop.wait() never returns

The comparison is pending and no one has reviewed it. Either open the browser and rate it, or pass a timeout:

feedloop.wait(cid, timeout=60)

Start collecting feedback in minutes.

No cloud account, no config files, no friction. Just pip install and go.

pip install feedloop