ACL 2026

MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference

Jeonghyun Park1*,  Ingeol Baek1*,  Seunghyun Yoon2,  Haeun Jang1,  Aparna Garimella2,  Akriti Jain2,  Nedim Lipka2,  Hwanhee Lee1†
1Chung-Ang University   2Adobe Research

Abstract

Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain.

Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity. We introduce MARCH, a benchmark targeting the intersection of ambiguity and multi-hop reasoning, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. To address this challenge, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning.

Motivation

Multi-hop QA requires constructing logical chains across multiple documents. Introducing ambiguity makes uncertainty compound across hops: ambiguity can emerge at any step and often remains latent until prior steps are resolved.

Figure 1. Multi-hop ambiguity example. The second-hop ambiguity ("pickup") is latent—it only surfaces if the first hop ("Mustang" as guitar) is preserved. Current LLMs commit early to the car reading and prune the valid guitar → magnetic-pickup branch.
Semantic (Interpret): Homonyms or entity-name collisions yield disjoint evidence trails; resolving to the wrong entity invalidates downstream hops. Example: "Mustang" → Ford (car) vs. Fender (guitar).

Syntactic (Resolve): Multiple valid parses induce different inter-hop dependencies, changing which intermediate evidence is needed. Example: instrumental vs. attributive reading of the telescope question.

Constraint (Generalize): An over-specific modifier causes a valid chain to be pruned early; relaxing the constraint recovers the path. Example: "highest mountain in Europe" → Mont Blanc vs. Elbrus.
Real-world prevalence: Analysis of lmsys-chat-1m reveals 48.4% of questions are ambiguous, 17.7% involve multi-hop reasoning, and 13.3% overlap—yet no benchmark specifically targets this intersection.

MARCH Benchmark

MARCH (Multi-hop Ambiguity Reasoning CHain) contains 2,209 multi-hop ambiguous questions derived from MuSiQue, each paired with clarified questions, per-interpretation answers, evidence passages, and synthesized long answers.
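The pieces listed above can be pictured as one record per ambiguous question. A purely illustrative layout is sketched below; the field names are hypothetical and are not the released schema:

```python
# Illustrative MARCH-style record; field names are hypothetical,
# not the released dataset schema.
example_item = {
    "ambiguous_question": "Who founded the company that makes the Mustang?",
    "ambiguity_type": "Semantic",
    "interpretations": [
        {
            "clarified_question": "Who founded the company that makes the Mustang car?",
            "short_answer": "Henry Ford",
            "evidence_passages": ["..."],  # supporting Wikipedia passages
        },
        {
            "clarified_question": "Who founded the company that makes the Mustang guitar?",
            "short_answer": "Leo Fender",
            "evidence_passages": ["..."],
        },
    ],
    # One synthesized long answer covering every interpretation.
    "long_answer": "If the question refers to the Ford car, Henry Ford; "
                   "if it refers to the Fender guitar, Leo Fender.",
}
```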

Figure 2. Four-stage MARCH construction pipeline.
  1. Detection: 4 LLMs vote on ambiguity under a full-agreement rule.
  2. Collection: sub-questions are retrieved from Wikipedia per interpretation.
  3. Generation: per-interpretation short answers plus a synthesized long answer.
  4. Filtering: 3-LLM verification; only unanimously verified items are kept.

Human Validation (5 annotators): Detection κ = 0.92 · Clarification κ = 0.89 · Answer κ = 0.95 · Long-answer validity > 90%

CLARION: Agentic Framework

CLARION (CLarifying Ambiguity with a Reasoning and InstructiON) decouples ambiguity planning from evidence retrieval.

Figure 3. CLARION framework: a Planning Agent resolves ambiguity, an Acting Agent executes a ReAct loop.

🗺 Planning Agent

  1. Ambiguity Detection — is the question ambiguous?
  2. Type Classification — Semantic / Syntactic / Constraint
  3. Question Clarification — rewrite into clarified variants

⚡ Acting Agent (ReAct)

  • Search — retrieve documents per interpretation
  • Planning — self-correct when inconsistencies arise
  • Answer — synthesize output covering all interpretations
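The two stages can be sketched as a pair of functions: the Planning Agent expands one question into clarified variants, and the Acting Agent runs a ReAct-style search loop per variant before synthesizing one answer. This is a minimal sketch, not the authors' implementation; `llm`, `search`, and all prompt templates are hypothetical stand-ins:

```python
# Minimal sketch of CLARION's two-stage flow; `llm`, `search`, and the
# prompts are hypothetical stand-ins, not the authors' code.

def planning_agent(llm, question):
    """Stage 1: detect ambiguity, classify its type, emit clarified variants."""
    if llm(f"Is this question ambiguous? Yes/No: {question}").strip() == "No":
        return [question]
    amb_type = llm(f"Classify the ambiguity (Semantic/Syntactic/Constraint): {question}")
    variants = llm(f"Rewrite into one clarified question per {amb_type} "
                   f"interpretation, one per line: {question}")
    return [v for v in variants.splitlines() if v.strip()]

def acting_agent(llm, search, clarified_questions, max_steps=5):
    """Stage 2: ReAct loop per interpretation, then synthesize one answer."""
    per_interpretation = []
    for q in clarified_questions:
        context = []
        for _ in range(max_steps):
            thought = llm(f"Question: {q}\nEvidence: {context}\nNext action?")
            if thought.startswith("ANSWER:"):
                per_interpretation.append(thought[len("ANSWER:"):].strip())
                break
            context.append(search(thought))  # retrieve documents for this hop
        else:  # self-correct fallback when no answer emerged in max_steps
            per_interpretation.append(llm(f"Best-effort answer for {q}: {context}"))
    return llm("Synthesize one long answer covering all interpretations: "
               + " | ".join(per_interpretation))
```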

Experiments

Setup: backbone models Qwen3-235B, Gemini-2.5-Flash, and DeepSeek-v3.1 · Qwen3-Embedding-8B retriever · metrics: STR-EM, Disambig-F1, LLM-as-a-Judge
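STR-EM, adopted from the ASQA evaluation, credits a long answer for each interpretation whose gold short answer it contains. A minimal sketch under that reading (normalization and aliasing simplified; not the official scorer):

```python
def str_em(prediction, answers_per_interpretation):
    """String exact-match: percentage of interpretations whose gold short
    answer (any alias) appears as a substring of the predicted long answer.
    Simplified sketch of the ASQA-style metric; text normalization omitted.
    """
    pred = prediction.lower()
    hits = sum(
        any(alias.lower() in pred for alias in aliases)
        for aliases in answers_per_interpretation
    )
    return 100 * hits / len(answers_per_interpretation)
```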

| Model | Method | STR-EM↑ | Disambig-F1↑ | Avg↑ | LLM-Judge↑ |
|---|---|---|---|---|---|
| Qwen3-235B | No Retrieval | 20.98 | 21.19 | 21.09 | 3.083 |
| | CoT | 21.51 | 22.32 | 21.91 | 2.897 |
| | NaiveRAG | 25.10 | 26.20 | 25.65 | 2.752 |
| | CoT w/ RAG | 25.63 | 26.61 | 26.12 | 2.947 |
| | DIVA | 28.82 | 22.73 | 25.78 | 3.015 |
| | ReAct | 20.98 | 21.00 | 20.99 | 2.832 |
| | CLARION (Ours) | **38.73** | **28.38** | **33.56** | **3.474** |
| Gemini-2.5-Flash | No Retrieval | 15.59 | 20.10 | 17.85 | 2.307 |
| | NaiveRAG | 22.16 | **28.63** | 25.40 | 2.297 |
| | CoT w/ RAG | 23.15 | 27.31 | 25.23 | 2.373 |
| | DIVA | 18.82 | 20.29 | 19.56 | 2.303 |
| | ReAct | 21.32 | 22.37 | 21.84 | 2.428 |
| | CLARION (Ours) | **29.12** | 26.30 | **27.71** | **2.752** |
| DeepSeek-v3.1 | No Retrieval | 17.75 | 18.72 | 18.24 | 2.683 |
| | NaiveRAG | 20.20 | 25.03 | 22.62 | 2.084 |
| | CoT w/ RAG | 21.33 | 23.18 | 22.25 | 2.632 |
| | DIVA | 18.82 | 20.66 | 19.74 | 2.636 |
| | ReAct | 23.17 | 24.78 | 23.97 | 2.723 |
| | CLARION (Ours) | **31.47** | **27.03** | **29.25** | **3.042** |
| Human Performance | | 73.00 | 62.00 | 67.50 | — |

Table 1. MARCH results. Best score per model and metric in bold; CLARION is ours.
The Human Gap: Humans with unrestricted search achieved STR-EM=73.0, while CLARION's best is 38.73—confirming MARCH poses a significant, unsolved challenge for current systems.

Why is MARCH Hard?

Figure 4. Performance drops under ambiguity and multi-hop reasoning. The combination creates a compounding challenge that exceeds either difficulty in isolation.
Path-dependent failure: Committing to an early interpretation conditions downstream retrieval. A misresolved ambiguity at hop 1 fixes an incorrect bridge entity, steering hop 2 toward an irrelevant trajectory. RAG baselines over-focus on the dominant interpretation.

BibTeX

@article{park2025mirage,
  title={MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions},
  author={Park, Jeonghyun and Baek, Ingeol and Yoon, Seunghyun and Jang, Haeun and Garimella, Aparna and Jain, Akriti and Lipka, Nedim and Lee, Hwanhee},
  journal={arXiv preprint arXiv:2509.22750},
  year={2025}
}