ACL 2026

MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference

Jeonghyun Park1*,  Ingeol Baek1*,  Seunghyun Yoon2,  Haeun Jang1,  Aparna Garimella2,  Akriti Jain2,  Nedim Lipka2,  Hwanhee Lee1†
1Chung-Ang University   2Adobe Research

Abstract

Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain.

Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity. We introduce MARCH, a benchmark targeting the intersection of ambiguity and multi-hop reasoning, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. To address this challenge, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning.

Motivation

Multi-hop QA requires constructing logical chains across multiple documents. Introducing ambiguity makes uncertainty compound across hops: ambiguity can emerge at any step and often remains latent until prior steps are resolved.

Figure 1. Multi-hop ambiguity example. The second-hop ambiguity ("pickup") is latent—it only surfaces if the first hop ("Mustang" as guitar) is preserved. Current LLMs commit early to the car reading and prune the valid guitar → magnetic-pickup branch.
Semantic (Interpret): Homonyms or entity-name collisions yield disjoint evidence trails; resolving to the wrong entity invalidates downstream hops. Example: "Mustang" → Ford (car) vs. Fender (guitar).

Syntactic (Resolve): Multiple valid parses induce different inter-hop dependencies, changing which intermediate evidence is needed. Example: instrumental vs. attributive reading of the telescope question.

Constraint (Generalize): An over-specific modifier causes a valid chain to be pruned early; relaxing the constraint recovers the path. Example: "highest mountain in Europe" → Mont Blanc vs. Elbrus.
Real-world prevalence: Analysis of lmsys-chat-1m reveals 48.4% of questions are ambiguous, 17.7% involve multi-hop reasoning, and 13.3% overlap—yet no benchmark specifically targets this intersection.

MARCH Benchmark

MARCH (Multi-hop Ambiguity Reasoning CHain) contains 2,209 multi-hop ambiguous questions derived from MuSiQue, each paired with clarified questions, per-interpretation answers, evidence passages, and synthesized long answers.
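The pieces listed above can be pictured as one record per ambiguous question. A purely illustrative layout is sketched below; the field names are hypothetical and are not the released schema:

```python
# Illustrative MARCH-style record; field names are hypothetical,
# not the released dataset schema.
example_item = {
    "ambiguous_question": "Who founded the company that makes the Mustang?",
    "ambiguity_type": "Semantic",
    "interpretations": [
        {
            "clarified_question": "Who founded the company that makes the Mustang car?",
            "short_answer": "Henry Ford",
            "evidence_passages": ["..."],  # supporting Wikipedia passages
        },
        {
            "clarified_question": "Who founded the company that makes the Mustang guitar?",
            "short_answer": "Leo Fender",
            "evidence_passages": ["..."],
        },
    ],
    # One synthesized long answer covering every interpretation.
    "long_answer": "If the question refers to the Ford car, Henry Ford; "
                   "if it refers to the Fender guitar, Leo Fender.",
}
```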

Figure 2. Four-stage MARCH construction pipeline.
  1. Detection: 4 LLMs vote on ambiguity under a full-agreement rule.
  2. Collection: sub-questions are retrieved from Wikipedia per interpretation.
  3. Generation: per-interpretation short answers plus a synthesized long answer.
  4. Filtering: 3-LLM verification; only unanimously verified items are kept.

Human Validation (5 annotators): Detection κ = 0.92 · Clarification κ = 0.89 · Answer κ = 0.95 · Long-answer validity > 90%

CLARION: Agentic Framework

CLARION (CLarifying Ambiguity with a Reasoning and InstructiON) decouples ambiguity planning from evidence retrieval.

Figure 3. CLARION framework: a Planning Agent resolves ambiguity, an Acting Agent executes a ReAct loop.

🗺 Planning Agent

  1. Ambiguity Detection — is the question ambiguous?
  2. Type Classification — Semantic / Syntactic / Constraint
  3. Question Clarification — rewrite into clarified variants

⚡ Acting Agent (ReAct)

  • Search — retrieve documents per interpretation
  • Planning — self-correct when inconsistencies arise
  • Answer — synthesize output covering all interpretations
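The two stages can be sketched as a pair of functions: the Planning Agent expands one question into clarified variants, and the Acting Agent runs a ReAct-style search loop per variant before synthesizing one answer. This is a minimal sketch, not the authors' implementation; `llm`, `search`, and all prompt templates are hypothetical stand-ins:

```python
# Minimal sketch of CLARION's two-stage flow; `llm`, `search`, and the
# prompts are hypothetical stand-ins, not the authors' code.

def planning_agent(llm, question):
    """Stage 1: detect ambiguity, classify its type, emit clarified variants."""
    if llm(f"Is this question ambiguous? Yes/No: {question}").strip() == "No":
        return [question]
    amb_type = llm(f"Classify the ambiguity (Semantic/Syntactic/Constraint): {question}")
    variants = llm(f"Rewrite into one clarified question per {amb_type} "
                   f"interpretation, one per line: {question}")
    return [v for v in variants.splitlines() if v.strip()]

def acting_agent(llm, search, clarified_questions, max_steps=5):
    """Stage 2: ReAct loop per interpretation, then synthesize one answer."""
    per_interpretation = []
    for q in clarified_questions:
        context = []
        for _ in range(max_steps):
            thought = llm(f"Question: {q}\nEvidence: {context}\nNext action?")
            if thought.startswith("ANSWER:"):
                per_interpretation.append(thought[len("ANSWER:"):].strip())
                break
            context.append(search(thought))  # retrieve documents for this hop
        else:  # self-correct fallback when no answer emerged in max_steps
            per_interpretation.append(llm(f"Best-effort answer for {q}: {context}"))
    return llm("Synthesize one long answer covering all interpretations: "
               + " | ".join(per_interpretation))
```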

Experiments

Setup: backbone models Qwen3-235B, Gemini-2.5-Flash, and DeepSeek-v3.1 · Qwen3-Embedding-8B retriever · metrics: STR-EM, Disambig-F1, LLM-as-a-Judge
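STR-EM, adopted from the ASQA evaluation, credits a long answer for each interpretation whose gold short answer it contains. A minimal sketch under that reading (normalization and aliasing simplified; not the official scorer):

```python
def str_em(prediction, answers_per_interpretation):
    """String exact-match: percentage of interpretations whose gold short
    answer (any alias) appears as a substring of the predicted long answer.
    Simplified sketch of the ASQA-style metric; text normalization omitted.
    """
    pred = prediction.lower()
    hits = sum(
        any(alias.lower() in pred for alias in aliases)
        for aliases in answers_per_interpretation
    )
    return 100 * hits / len(answers_per_interpretation)
```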

| Model | Method | STR-EM↑ | Disambig-F1↑ | Avg↑ | LLM-Judge↑ |
|---|---|---|---|---|---|
| Qwen3-235B | No Retrieval | 20.98 | 21.19 | 21.09 | 3.083 |
| | CoT | 21.51 | 22.32 | 21.91 | 2.897 |
| | NaiveRAG | 25.10 | 26.20 | 25.65 | 2.752 |
| | CoT w/ RAG | 25.63 | 26.61 | 26.12 | 2.947 |
| | DIVA | 28.82 | 22.73 | 25.78 | 3.015 |
| | ReAct | 20.98 | 21.00 | 20.99 | 2.832 |
| | CLARION (Ours) | **38.73** | **28.38** | **33.56** | **3.474** |
| Gemini-2.5-Flash | No Retrieval | 15.59 | 20.10 | 17.85 | 2.307 |
| | NaiveRAG | 22.16 | **28.63** | 25.40 | 2.297 |
| | CoT w/ RAG | 23.15 | 27.31 | 25.23 | 2.373 |
| | DIVA | 18.82 | 20.29 | 19.56 | 2.303 |
| | ReAct | 21.32 | 22.37 | 21.84 | 2.428 |
| | CLARION (Ours) | **29.12** | 26.30 | **27.71** | **2.752** |
| DeepSeek-v3.1 | No Retrieval | 17.75 | 18.72 | 18.24 | 2.683 |
| | NaiveRAG | 20.20 | 25.03 | 22.62 | 2.084 |
| | CoT w/ RAG | 21.33 | 23.18 | 22.25 | 2.632 |
| | DIVA | 18.82 | 20.66 | 19.74 | 2.636 |
| | ReAct | 23.17 | 24.78 | 23.97 | 2.723 |
| | CLARION (Ours) | **31.47** | **27.03** | **29.25** | **3.042** |
| Human Performance | | 73.00 | 62.00 | 67.50 | — |

Table 1. MARCH results. Best score per model and metric in bold; CLARION is ours.
The Human Gap: Humans with unrestricted search achieved STR-EM=73.0, while CLARION's best is 38.73—confirming MARCH poses a significant, unsolved challenge for current systems.

Why is MARCH Hard?

Figure 4. Performance drops under ambiguity and multi-hop reasoning. The combination creates a compounding challenge that exceeds either difficulty in isolation.
Path-dependent failure: Committing to an early interpretation conditions downstream retrieval. A misresolved ambiguity at hop 1 fixes an incorrect bridge entity, steering hop 2 toward an irrelevant trajectory. RAG baselines over-focus on the dominant interpretation.

BibTeX

@article{park2025mirage,
  title={MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions},
  author={Park, Jeonghyun and Baek, Ingeol and Yoon, Seunghyun and Jang, Haeun and Garimella, Aparna and Jain, Akriti and Lipka, Nedim and Lee, Hwanhee},
  journal={arXiv preprint arXiv:2509.22750},
  year={2025}
}