Collect human preference feedback on LLM outputs — directly from your Python code, with zero configuration. Submit pairs of model responses, pick the better one in a local browser UI, and export a training dataset ready for DPO fine-tuning.
Think of it as the SQLite of human feedback: local-first, pip-installable, and out of your way.
Requires Python 3.10 or higher. No other setup needed.
feedloop is built for solo AI developers who want to:
If you've ever wanted to collect preference data but felt LangSmith, Humanloop, or Argilla were too heavy for a one-person project — feedloop is for you.
A complete local feedback loop — from raw model outputs to a fine-tuning-ready dataset — with no cloud accounts, no API keys, and no configuration files.
Runs entirely on your machine. No data ever sent to a server. Results stored in a local SQLite database.
Works with OpenAI, Anthropic, Hugging Face, Ollama, or any model you can call from Python.
Side-by-side comparison at localhost:7856. Keyboard shortcuts, progress bar, and auto-updating feed.
One call to export JSONL compatible with TRL, OpenRLHF, and any custom fine-tuning pipeline.
The complete workflow from zero to a preference dataset in under 10 minutes.
This does three things: starts the local feedback server, opens the review UI in a browser tab, and prints the server URL:
feedloop: server running at http://localhost:7856
Call feedloop.compare() with a prompt and exactly two model outputs. It returns immediately — your script does not pause.
Each comparison appears in the browser UI instantly. You can submit as many as you like before anyone starts reviewing.
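A minimal sketch of the submission loop. The positional argument order of `feedloop.compare()` is an assumption (check its signature), and the two model functions are placeholders for your own code:

```python
import feedloop

feedloop.start()  # launches the review UI at http://localhost:7856

def call_model_a(prompt):  # placeholder: swap in your first model
    return "response from model A"

def call_model_b(prompt):  # placeholder: swap in your second model
    return "response from model B"

for prompt in ["Explain DNS in one sentence.", "Summarize Hamlet in one sentence."]:
    # Non-blocking: returns immediately, comparison appears in the browser UI
    feedloop.compare(prompt, call_model_a(prompt), call_model_b(prompt))
```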
Switch to the browser tab that opened. You will see the prompt and both outputs side by side. Click Left or Right to choose the better response, or Skip if neither is clearly better. The UI updates automatically as new comparisons come in — no page refresh needed.
Once all comparisons are rated, export to a JSONL file:
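In Python this is a single call (the `export` function name follows the API reference below; treat it as an assumption until checked):

```python
import feedloop

feedloop.export("preferences.jsonl")  # writes one JSON object per rated comparison
```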
Or click Download preferences.jsonl in the browser UI. That's it.
The UI has three states:
No comparisons yet. Shows a prompt to call feedloop.compare().
Side-by-side comparison with the prompt at the top, two response cards (Left / Right), keyboard shortcuts (1 = prefer Left, 2 = prefer Right, S = skip), and a progress bar showing how many comparisons remain.
All comparisons have been rated. Shows a summary (X rated, Y skipped), a Download button for preferences.jsonl, a Download button for train_dpo.py (a ready-to-run fine-tuning script), and a step-by-step guide for training your model.
Export all human-rated comparisons with one call.
Click Download preferences.jsonl on the Done screen.
Each line is a JSON object with the three standard DPO fields: prompt, chosen, and rejected.
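For example, a single exported line, assuming TRL's standard DPO field names (prompt / chosen / rejected; verify against your own export — the values here are invented for illustration):

```python
import json

# One exported line, as it would appear in preferences.jsonl
line = (
    '{"prompt": "Explain DNS in one sentence.", '
    '"chosen": "DNS maps domain names to IP addresses.", '
    '"rejected": "DNS is a kind of firewall."}'
)
record = json.loads(line)
sorted(record)  # -> ['chosen', 'prompt', 'rejected']
```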
| Framework | Usage |
|---|---|
| TRL (Hugging Face) | DPOTrainer |
| OpenRLHF | Native support |
| Custom pipelines | Standard JSONL — easy to parse |
All six functions in feedloop's public API.
Launches the feedback server. Safe to call multiple times — calling it twice reuses the running server.
Submits a pair of outputs for human review. Non-blocking — returns immediately.
Returns a comparison_id string you can use with feedloop.wait().
With metadata:
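For instance (the `metadata` keyword and its keys are hypothetical, shown only to illustrate attaching provenance to a comparison):

```python
import feedloop

comparison_id = feedloop.compare(
    "Explain DNS in one sentence.",
    "DNS maps domain names to IP addresses.",
    "DNS is a kind of firewall.",
    # Hypothetical keyword and keys; check feedloop.compare's signature
    metadata={"model_left": "gpt-4o-mini", "model_right": "my-finetune-v2"},
)
```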
Blocks until feedback is received.
For comparisons that were auto-skipped due to uncertainty filtering, wait() resolves immediately.
Returns a live count of comparisons in the current session.
| Field | Meaning |
|---|---|
| pending | Waiting for human review |
| completed | Human has rated |
| skipped | Human clicked Skip |
| auto_skipped | Filtered out by uncertainty threshold |
| total | All of the above combined |
Exports all human-rated comparisons (completed only — skipped and auto-skipped are excluded).
Shuts down the server. Called automatically when your script exits.
In a typical evaluation loop, not every comparison is equally useful for training. If your model is very confident about an output, sending it to a human for review may waste their time. feedloop lets you filter comparisons by an uncertainty score — only surfacing the ones where the model is unsure.
You pass an uncertainty score (0.0 to 1.0) to feedloop.compare(). If the score is below your threshold, the comparison is automatically skipped — it never appears in the UI. If it is at or above the threshold, it goes to the human reviewer as normal.
| Threshold | Effect |
|---|---|
| 0.0 (default) | All comparisons go to human — same as v1 behaviour |
| 0.5 | Only comparisons with an uncertainty score of 0.5 or higher are reviewed |
| 0.9 | Only the highest-uncertainty comparisons are reviewed — maximum efficiency |
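The gating rule above reduces to a one-line predicate. This is a behavioural sketch of what the table describes, not feedloop's internals:

```python
def goes_to_human(uncertainty: float, threshold: float = 0.0) -> bool:
    # Below the threshold: auto-skipped, never shown in the UI.
    # At or above the threshold: queued for human review.
    return uncertainty >= threshold
```

With the default threshold of 0.0, every score passes, which matches the v1 behaviour.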
How you compute the score is up to you: feedloop is model-agnostic and accepts any float from 0.0 to 1.0.
Option 1 — OpenAI logprobs (most accurate)
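One common approach with the OpenAI API's `logprobs` option: average the per-token log-probabilities of a generation and invert, so confident generations score near 0.0. The exact mapping below is an illustrative assumption, not something feedloop prescribes:

```python
import math

def logprob_uncertainty(token_logprobs: list[float]) -> float:
    """Map mean token log-probability to a 0-1 uncertainty score."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return 1.0 - math.exp(mean_logprob)  # exp of 0.0 -> probability 1.0 -> uncertainty 0.0
```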
Option 2 — Consistency sampling (works with any API, including Claude)
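A sketch of consistency sampling: sample the same prompt several times and score uncertainty as the fraction of samples that disagree with the majority answer. This works with any API, Claude included, because it only needs the text outputs:

```python
from collections import Counter

def consistency_uncertainty(samples: list[str]) -> float:
    """Fraction of sampled answers that disagree with the most common one."""
    majority_count = Counter(samples).most_common(1)[0][1]
    return 1.0 - majority_count / len(samples)
```

The result can be passed straight to feedloop as the comparison's uncertainty score.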
Option 3 — Always send everything (skip uncertainty)
Once you have collected preferences, feedloop gives you a complete path to fine-tuning your model with DPO (Direct Preference Optimization).
When all comparisons are reviewed, the Done screen shows a Download train_dpo.py button. Click it. This gives you a ready-to-run Python script pre-configured for TRL's DPOTrainer.
Open train_dpo.py and set BASE_MODEL:
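For example (the model ID is a placeholder; pick any Hugging Face model your hardware can fine-tune):

```python
# train_dpo.py
BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; use your own base model
```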
Make sure preferences.jsonl is in the same directory, then:
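Assuming TRL and its dependencies are installed, run it with your usual interpreter:

```shell
python train_dpo.py
```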
Run your prompts through both the old and fine-tuned model. Collect comparisons. Export. Fine-tune again. Repeat. This is the core loop that makes RLHF work: generate → compare → fine-tune → repeat.
Four complete workflows ready to adapt to your own setup.
By default, feedloop stores data in ~/.feedloop/feedloop.db. This file persists between runs — your data is never lost when the script exits.
To use a custom location (useful for separating experiments):
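A sketch; the `db_path` keyword is an assumption, so check `feedloop.start`'s actual signature:

```python
import feedloop

# Hypothetical keyword: store this experiment's data in its own file
feedloop.start(db_path="./experiments/run42.db")
```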
To export data from a previous session in a new script:
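A sketch (the `export` function name follows the API reference above; the default database location is ~/.feedloop/feedloop.db):

```python
import feedloop

feedloop.start()                      # reopens the persistent default database
feedloop.export("preferences.jsonl")  # exports everything rated so far
```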
You can run feedloop from the terminal without embedding it in a Python script:
CLI options:
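The flags below are assumptions for illustration only; consult the CLI's own help output for the real list:

```shell
# Hypothetical flags — check the CLI help before relying on these
feedloop --port 7856 --db ~/.feedloop/feedloop.db
```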
SSH port forwarding (remote machine):
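Standard SSH local port forwarding makes a feedloop server on a remote machine reachable in your local browser (user and host are placeholders):

```shell
# Forward your local port 7856 to the feedloop server on the remote machine
ssh -L 7856:localhost:7856 user@remote-host
# then open http://localhost:7856 in your local browser
```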
Do I need an internet connection?
No. feedloop runs entirely on your local machine. No data is ever sent to any server.
Can I use feedloop with any LLM?
Yes. feedloop is model-agnostic. It works with OpenAI, Anthropic, Hugging Face, Ollama, or any model you can call from Python.
What happens if I submit comparisons before anyone starts reviewing?
They queue up. The UI polls every second and shows new comparisons as they arrive. You can submit all comparisons first, then review at your own pace.
Can multiple people review at the same time?
Not recommended in v1.5 — feedloop is designed for solo developers. Concurrent reviewers would see the same comparison and could double-rate it.
What does "skipped" mean vs "auto-skipped"?
"skipped" — a human saw the comparison and clicked Skip. "auto_skipped" — the comparison was filtered out by uncertainty_threshold before it reached the human. Neither appears in the exported JSONL.
Can I use feedloop in a Jupyter notebook?
Yes. Call feedloop.start() in one cell and feedloop.compare() in subsequent cells. See Example 4 above.
How much data do I need before fine-tuning?
TRL's DPOTrainer can work with as few as 100–200 preference pairs for small models. For best results, aim for 500+ pairs with diverse prompts.
Does feedloop work on Windows?
Yes — it is pure Python with no platform-specific dependencies.
You may have an older version installed. Upgrade:
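With pip:

```shell
pip install --upgrade feedloop
```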
Then verify the version:
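pip can report the installed version:

```shell
pip show feedloop   # the Version: line should read 1.5.0 or higher
```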
feedloop automatically scans ports 7856–7866. If all are taken, specify a different port:
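For example (the `port` keyword is an assumption; check `feedloop.start`'s signature):

```python
import feedloop

feedloop.start(port=8000)  # hypothetical keyword; bypasses the 7856-7866 scan
```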
You called feedloop.compare() or another function before feedloop.start(). Always call feedloop.start() first.
feedloop correctly scopes the UI to the current session. If you are seeing old data, make sure you are running feedloop 1.5.0 or higher:
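For example:

```shell
pip install --upgrade "feedloop>=1.5.0"
```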
The comparison is pending and no one has reviewed it. Either open the browser and rate it, or pass a timeout:
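For example (the `timeout` keyword, in seconds, is an assumption; check `feedloop.wait`'s signature):

```python
import feedloop

cid = feedloop.compare("Explain DNS in one sentence.", "output A", "output B")
result = feedloop.wait(cid, timeout=300)  # hypothetical keyword; gives up after 5 minutes
```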
No cloud account, no config files, no friction. Just pip install and go.
pip install feedloop