why ai2's dr tulu isn't on the index (yet)
ai2 dropped dr tulu last month — the first open recipe for training deep research models. 8 billion parameters. beats models 4x its size. matches openai deep research on some benchmarks.
so why isn't it on the index?
because you can't just call it. dr tulu is a training recipe, not an api. to use it, you need gpus, infrastructure, and a decent amount of setup. that's a different product category than gemini, openai, or perplexity — which is what the index tracks.
but dr tulu matters anyway, and it's worth understanding why.
what makes this interesting
it's the first end-to-end open recipe. previous open deep research models (search-r1, webthinker, webexplorer) trained on short-form qa and hoped it transferred to long-form research. dr tulu trains directly on long-form tasks.[^1]
the training method is novel. normal rl training has a problem: models learn to game whatever metric you're optimizing for. they find patterns in the evaluation and exploit them instead of actually getting better.
dr tulu's solution is called rler (reinforcement learning with evolving rubrics). the evaluation criteria change during training based on what the model is actually doing. gaming one rubric just causes a new rubric to appear that catches the gaming. it's an arms race where the test keeps adapting.[^2]
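to make the arms-race idea concrete, here's a toy sketch of that loop in python. the persistent / positive / negative rubric split follows the technical report's description, but the judge, the generator, and every function name below are placeholders invented for illustration, not ai2's code.

```python
# toy sketch of rler's evolving-rubric loop -- not ai2's implementation.
# the persistent / positive / negative rubric split follows the technical
# report; the judge and generator below are invented placeholders.
import random

def judge_score(report: str, rubrics: list[str], penalties: list[str]) -> float:
    # placeholder judge: +1 per satisfied rubric phrase, -1 per flagged
    # exploitation pattern. a real judge would be an llm grading the report
    # against natural-language rubrics.
    return float(sum(p in report for p in rubrics) - sum(p in report for p in penalties))

def find_new_strengths(reports: list[str]) -> list[str]:
    # positive rubrics: useful strategies spotted in this batch get rewarded.
    return ["cites a source"] if any("cites a source" in r for r in reports) else []

def find_exploits(reports: list[str]) -> list[str]:
    # negative rubrics: exploitation patterns spotted in this batch; once
    # added, reports that keep using them lose reward.
    return ["filler"] if any(len(r) > 200 for r in reports) else []

def train(generate, steps: int = 3) -> None:
    persistent = ["answers the question"]  # fixed rubrics from initial search
    positive: list[str] = []
    negative: list[str] = []
    for step in range(steps):
        reports = [generate() for _ in range(4)]  # sample rollouts
        positive += [r for r in find_new_strengths(reports) if r not in positive]
        negative += [p for p in find_exploits(reports) if p not in negative]
        rewards = [judge_score(r, persistent + positive, negative) for r in reports]
        # a real system would run a policy-gradient update on `rewards` here;
        # the point is that the reward target keeps moving as the model does.
        print(f"step {step}: {len(persistent + positive + negative)} rubrics, "
              f"mean reward {sum(rewards) / len(rewards):.2f}")

train(lambda: random.choice(["answers the question and cites a source", "filler " * 50]))
```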
it proves the gap is closable. an 8b open model competing with proprietary systems — using the same search apis anyone can access — suggests the moat around deep research apis is thinner than it looks.
what you actually get
- model weights (qwen3-8b based)
- full training code and data
- an mcp-based agent library for tool integration
- evaluation suite
to run it: 1-2 gpus, mcp server setup, api keys for serper (search), semantic scholar (papers), and jina (web browsing). not plug-and-play, but everything's documented.[^3]
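if you just want to poke at the bare model before committing to the full mcp setup, loading the released weights with huggingface transformers is the quickest sanity check. the checkpoint id and env var names below are assumptions on my part, so verify them against the repo and huggingface collection linked in the footnotes.

```python
# minimal sketch: load the released weights with huggingface transformers
# and stub in the api keys the agent stack expects. the checkpoint id and
# env var names are assumptions -- check the repo for the exact values.
# this exercises the bare model only; the deep-research behavior comes
# from the mcp agent loop driving the search, papers, and browsing tools.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ.setdefault("SERPER_API_KEY", "...")            # search
os.environ.setdefault("SEMANTIC_SCHOLAR_API_KEY", "...")  # papers
os.environ.setdefault("JINA_API_KEY", "...")              # web browsing

model_id = "rl-research/DR-Tulu-8B"  # hypothetical id; verify on huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "summarize recent open recipes for training deep research models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```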
the benchmark picture
here's where it gets nuanced.
dr tulu's headline number is scholarqa-csv2, a scientific literature benchmark. it scores 86.8 — crushing webexplorer-8b (42.5), webthinker-32b (32.9), and even openai deep research (79.6).[^4]
but scholarqa-csv2 is part of astabench, which ai2 developed. dr tulu was trained on similar scientific synthesis tasks. that's not cheating, but it is home field advantage.
on deepresearchbench, a general-domain benchmark covering tech, policy, and broader topics, the picture shifts:
| model | deepresearchbench |
|---|---|
| gpt-5 + search | 50.7 |
| gemini deep research | 48.8 |
| openai deep research | 46.9 |
| dr tulu-8b | 43.4 |
| tongyi dr-30b | 40.6 |
for 8b parameters against 30b+ systems, 43.4 is solid. but it's not the "matches proprietary" story.[^5]
the takeaway: dr tulu excels at scientific literature synthesis (its training domain) and is competitive on general research (where it wasn't specifically optimized).
cost structure
the $0.00008 per query number floating around is real — but it only counts api calls to external services, not the cost of running the model itself.
| | dr tulu | openai deep research |
|---|---|---|
| api costs per query | $0.00008 | ~$1.80 |
| infrastructure | self-hosted (2x gpus, mcp setup) | cloud-hosted |
if you're already running inference infrastructure, dr tulu is dramatically cheaper per query. if you're not, the comparison doesn't apply.[^6]
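a rough back-of-envelope makes that caveat concrete. only the two per-query figures below come from the report; the gpu rate and throughput are placeholder assumptions you'd swap for your own numbers.

```python
# back-of-envelope per-query cost. the $0.00008 api figure and the ~$1.80
# openai deep research figure come from the report; the gpu rate and
# throughput are placeholder assumptions, not reported numbers.
api_cost_per_query = 0.00008   # external api calls (serper, semantic scholar, jina)
gpu_hourly_rate = 2.00         # assumed: renting gpus just for this, $/hour
queries_per_hour = 20          # assumed: long-form research throughput

rented = api_cost_per_query + gpu_hourly_rate / queries_per_hour
sunk = api_cost_per_query      # gpus already running and paid for anyway

print(f"dr tulu, renting gpus for it: ~${rented:.2f}/query")
print(f"dr tulu, on existing infra:   ~${sunk:.5f}/query (api calls only)")
print(f"openai deep research:         ~$1.80/query")
```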
where open models might win
one result worth expanding on: ai2 tested dr tulu on geneticdiseasesqa, a clinical benchmark they built for assessing pathogenic gene variants. the task requires searching bioinformatics databases, papers, and case reports, then synthesizing sparse evidence into a coherent assessment.
dr tulu beat claude sonnet and gemini 3 pro on evidence synthesis — the ability to connect findings across multiple sources coherently. it trailed on final answer correctness (smaller model, less parametric knowledge), but the synthesis quality was stronger.[^7]
for specialized domains where you need fine-grained control over sources and retrieval, and you have the infrastructure to self-host, the open approach has real advantages.
what would change our mind
dr tulu becomes relevant to the index when:
- someone builds a hosted api on top of it
- cloud providers offer it as a managed service
- the setup becomes genuinely plug-and-play
until then, it's infrastructure for builders, not a product for users choosing between apis.
if you're evaluating deep research apis today, dr tulu doesn't change your decision — it's not in that category. if you're building your own system or considering self-hosted options, it's the most complete open starting point available.
if a hosted service launches using dr tulu or similar open models, it'll be evaluated and added to the index using the same criteria as existing providers. we're tracking this space closely.
[^1]: comparison of open deep research training approaches in technical report table 7 (page 18). dr tulu is the first to combine long-form training, multi-tool search, and citation output.

[^2]: rler mechanism detailed in section 3 of the technical report (pages 4-6). the key components: "persistent rubrics" from initial search, "positive rubrics" that reward new strategies discovered during training, and "negative rubrics" that penalize exploitation patterns. see also ai2 blog post for accessible explanation.

[^3]: model weights, training code, and agent library available at github.com/rlresearch/dr-tulu. huggingface collection at huggingface.co/collections/rl-research/dr-tulu.

[^4]: dr tulu technical report, table 2 (page 10). scholarqa-csv2 is part of astabench, developed by ai2.

[^5]: deepresearchbench scores from technical report table 2 (page 10). gemini deep research score (48.8) reported by original benchmark authors in table 3 (page 11).

[^6]: cost analysis from technical report section 5.2 (page 12) and table 4. openai deep research cost of ~$1.80 per scholarqa query from the same table.

[^7]: geneticdiseasesqa results in technical report figure 5 (page 13) and section 5.3. dataset derived from expert-curated data from the n=1 collaborative. evidence synthesis measures cross-source coherence; final answer measures whether expert-annotated key facts appear in the response.