Research the current state of claim-source verification (also called citation faithfulness or attribution evaluation) in LLM-based research and RAG systems. I am building a deep-research skill whose differentiator is an inline engine that verifies each claim against the full text of its cited source before shipping it. I need the existing landscape and the open problems.

Cover, with sources for each:

1. The conceptual distinction between citation CORRECTNESS (does the cited doc support the statement) and citation FAITHFULNESS (does the model's reliance on the source actually drive the claim, vs. post-hoc rationalization). Who has formalized this and how.
2. Documented failure modes in deployed deep-research agents (OpenAI, Perplexity, Gemini, Grok): citation hallucination (fabricated references) vs. statement/claim hallucination (real source, unsupported claim) vs. misattribution. Measured rates where available.
3. Existing verification methods and where each breaks: deterministic string/span matching, NLI/entailment-based checking, LLM-as-a-judge, retrieval-augmented validation. Strengths and known weaknesses of each.
4. Existing benchmarks and metrics for faithfulness/attribution, what gold labels they use, and their critiques.
5. The unsolved problems specifically: what every existing approach still fails at, and where the consensus says the hard part actually is.

For each major claim, give the source title and link. Flag where sources disagree. Prioritize 2025–2026 work.

| metric | | |
|---|---|---|
| format | prose | prose |
| word count | 3,276 | 613 |
| sources | 50 | 40 |
| processing time | 0s | 132s |
| has images | no | no |
| has tables | no | no |
| citation style | — | — |