why ai2's dr tulu isn't on the index (yet)
ai2 dropped dr tulu last month — the first open recipe for training deep research models. 8 billion parameters. beats models 4x its size. matches openai deep research on some benchmarks.
so why isn't it on the index?
because you can't just call it. dr tulu is a training recipe, not an api. to use it, you need gpus, infrastructure, and a decent amount of setup. that's a different product category than gemini, openai, or perplexity — which is what the index tracks.
but dr tulu matters anyway, and it's worth understanding why.
what makes this interesting
it's the first end-to-end open recipe. previous open deep research models (search-r1, webthinker, webexplorer) trained on short-form qa and hoped it transferred to long-form research. dr tulu trains directly on long-form tasks.[^1]
the training method is novel. normal rl training has a problem: models learn to game whatever metric you're optimizing for. they find patterns in the evaluation and exploit them instead of actually getting better.
dr tulu's solution is called rler (reinforcement learning with evolving rubrics). the evaluation criteria change during training based on what the model is actually doing. gaming one rubric just causes a new rubric to appear that catches the gaming. it's an arms race where the test keeps adapting.[^2]
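to make the arms-race idea concrete, here's a toy sketch of that loop in python. the persistent / positive / negative rubric split follows the technical report's description, but the judge, the generator, and every function name below are placeholders invented for illustration, not ai2's code.

```python
# toy sketch of rler's evolving-rubric loop -- not ai2's implementation.
# the persistent / positive / negative rubric split follows the technical
# report; the judge and generator below are invented placeholders.
import random

def judge_score(report: str, rubrics: list[str], penalties: list[str]) -> float:
    # placeholder judge: +1 per satisfied rubric phrase, -1 per flagged
    # exploitation pattern. a real judge would be an llm grading the report
    # against natural-language rubrics.
    return float(sum(p in report for p in rubrics) - sum(p in report for p in penalties))

def find_new_strengths(reports: list[str]) -> list[str]:
    # positive rubrics: useful strategies spotted in this batch get rewarded.
    return ["cites a source"] if any("cites a source" in r for r in reports) else []

def find_exploits(reports: list[str]) -> list[str]:
    # negative rubrics: exploitation patterns spotted in this batch; once
    # added, reports that keep using them lose reward.
    return ["filler"] if any(len(r) > 200 for r in reports) else []

def train(generate, steps: int = 3) -> None:
    persistent = ["answers the question"]  # fixed rubrics from initial search
    positive: list[str] = []
    negative: list[str] = []
    for step in range(steps):
        reports = [generate() for _ in range(4)]  # sample rollouts
        positive += [r for r in find_new_strengths(reports) if r not in positive]
        negative += [p for p in find_exploits(reports) if p not in negative]
        rewards = [judge_score(r, persistent + positive, negative) for r in reports]
        # a real system would run a policy-gradient update on `rewards` here;
        # the point is that the reward target keeps moving as the model does.
        print(f"step {step}: {len(persistent + positive + negative)} rubrics, "
              f"mean reward {sum(rewards) / len(rewards):.2f}")

train(lambda: random.choice(["answers the question and cites a source", "filler " * 50]))
```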
it proves the gap is closable. an 8b open model competing with proprietary systems — using the same search apis anyone can access — suggests the moat around deep research apis is thinner than it looks.
what you actually get
- model weights (qwen3-8b based)
- full training code and data
- an mcp-based agent library for tool integration
- evaluation suite
to run it: 1-2 gpus, mcp server setup, api keys for serper (search), semantic scholar (papers), and jina (web browsing). not plug-and-play, but everything's documented.[^3]
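if you just want to poke at the bare model before committing to the full mcp setup, loading the released weights with huggingface transformers is the quickest sanity check. the checkpoint id and env var names below are assumptions on my part, so verify them against the repo and huggingface collection linked in the footnotes.

```python
# minimal sketch: load the released weights with huggingface transformers
# and stub in the api keys the agent stack expects. the checkpoint id and
# env var names are assumptions -- check the repo for the exact values.
# this exercises the bare model only; the deep-research behavior comes
# from the mcp agent loop driving the search, papers, and browsing tools.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ.setdefault("SERPER_API_KEY", "...")            # search
os.environ.setdefault("SEMANTIC_SCHOLAR_API_KEY", "...")  # papers
os.environ.setdefault("JINA_API_KEY", "...")              # web browsing

model_id = "rl-research/DR-Tulu-8B"  # hypothetical id; verify on huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "summarize recent open recipes for training deep research models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```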
the benchmark picture
here's where it gets nuanced.
dr tulu's headline number is scholarqa-csv2, a scientific literature benchmark. it scores 86.8 — crushing webexplorer-8b (42.5), webthinker-32b (32.9), and even openai deep research (79.6).[^4]
but scholarqa-csv2 is part of astabench, which ai2 developed. dr tulu was trained on similar scientific synthesis tasks. that's not cheating, but it is home field advantage.
on deepresearchbench, a general-domain benchmark covering tech, policy, and broader topics, the picture shifts:
| model | deepresearchbench |
|---|---|
| gpt-5 + search | 50.7 |
| gemini deep research | 48.8 |
| openai deep research | 46.9 |
| dr tulu-8b | 43.4 |
| tongyi dr-30b | 40.6 |
for 8b parameters against 30b+ systems, 43.4 is solid. but it's not the "matches proprietary" story.[^5]
the takeaway: dr tulu excels at scientific literature synthesis (its training domain) and is competitive on general research (where it wasn't specifically optimized).
cost structure
the $0.00008 per query number floating around is real — but it only counts api calls to external services, not the cost of running the model itself.
| | dr tulu | openai deep research |
|---|---|---|
| api costs per query | $0.00008 | ~$1.80 |
| infrastructure | self-hosted (2x gpus, mcp setup) | cloud-hosted |
if you're already running inference infrastructure, dr tulu is dramatically cheaper per query. if you're not, the comparison doesn't apply.[^6]
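a rough back-of-envelope makes that caveat concrete. only the two per-query figures below come from the report; the gpu rate and throughput are placeholder assumptions you'd swap for your own numbers.

```python
# back-of-envelope per-query cost. the $0.00008 api figure and the ~$1.80
# openai deep research figure come from the report; the gpu rate and
# throughput are placeholder assumptions, not reported numbers.
api_cost_per_query = 0.00008   # external api calls (serper, semantic scholar, jina)
gpu_hourly_rate = 2.00         # assumed: renting gpus just for this, $/hour
queries_per_hour = 20          # assumed: long-form research throughput

rented = api_cost_per_query + gpu_hourly_rate / queries_per_hour
sunk = api_cost_per_query      # gpus already running and paid for anyway

print(f"dr tulu, renting gpus for it: ~${rented:.2f}/query")
print(f"dr tulu, on existing infra:   ~${sunk:.5f}/query (api calls only)")
print(f"openai deep research:         ~$1.80/query")
```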
where open models might win
one result worth expanding on: ai2 tested dr tulu on geneticdiseasesqa, a clinical benchmark they built for assessing pathogenic gene variants. the task requires searching bioinformatics databases, papers, and case reports, then synthesizing sparse evidence into a coherent assessment.
dr tulu beat claude sonnet and gemini 3 pro on evidence synthesis — the ability to connect findings across multiple sources coherently. it trailed on final answer correctness (smaller model, less parametric knowledge), but the synthesis quality was stronger.[^7]
for specialized domains where you need fine-grained control over sources and retrieval, and you have the infrastructure to self-host, the open approach has real advantages.
what would change our mind
dr tulu becomes relevant to the index when:
- someone builds a hosted api on top of it
- cloud providers offer it as a managed service
- the setup becomes genuinely plug-and-play
until then, it's infrastructure for builders, not a product for users choosing between apis.
if you're evaluating deep research apis today, dr tulu doesn't change your decision — it's not in that category. if you're building your own system or considering self-hosted options, it's the most complete open starting point available.
if a hosted service launches using dr tulu or similar open models, it'll be evaluated and added to the index using the same criteria as existing providers. we're tracking this space closely.
[^1]: comparison of open deep research training approaches in technical report table 7 (page 18). dr tulu is the first to combine long-form training, multi-tool search, and citation output.

[^2]: rler mechanism detailed in section 3 of the technical report (pages 4-6). the key components: "persistent rubrics" from initial search, "positive rubrics" that reward new strategies discovered during training, and "negative rubrics" that penalize exploitation patterns. see also ai2 blog post for accessible explanation.

[^3]: model weights, training code, and agent library available at github.com/rlresearch/dr-tulu. huggingface collection at huggingface.co/collections/rl-research/dr-tulu.

[^4]: dr tulu technical report, table 2 (page 10). scholarqa-csv2 is part of astabench, developed by ai2.

[^5]: deepresearchbench scores from technical report table 2 (page 10). gemini deep research score (48.8) reported by original benchmark authors in table 3 (page 11).

[^6]: cost analysis from technical report section 5.2 (page 12) and table 4. openai deep research cost of ~$1.80 per scholarqa query from the same table.

[^7]: geneticdiseasesqa results in technical report figure 5 (page 13) and section 5.3. dataset derived from expert-curated data from the n=1 collaborative. evidence synthesis measures cross-source coherence; final answer measures whether expert-annotated key facts appear in the response.