ACL 2026

Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion

Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee
Chung-Ang University

Abstract

Multilingual Retrieval-Augmented Generation (mRAG) systems often exhibit a perceived preference for high-resource languages—particularly English—resulting in the widespread adoption of English pivoting. While prior studies attribute this advantage to the superior English-centric capabilities of LLMs, we find that such measurements are significantly distorted by structural priors inherent in evaluation benchmarks.

We identify exposure bias, a gold availability prior, and cultural priors as factors that hinder accurate assessment. To address these, we propose DeLP, a calibrated metric revealing that retrievers fundamentally favor monolingual alignment. Building on this, we introduce DELTA, a lightweight mRAG framework that consistently outperforms English pivoting and mRAG baselines across diverse languages.

The Myth of English Preference

We challenge the assumption that English pivoting works due to English-centric LLM capabilities, showing the gains stem from retrieval-side structural biases.

Motivation
Figure 1. Common causes of language preference in mRAG: gold-availability prior and cultural prior.
📊

Exposure Bias

High-resource corpora dominate top retrieval results regardless of the encoder's linguistic intent—a "popularity bias" that artificially inflates English performance.

📌

Gold Availability Prior

For most queries, English Wikipedia is the sole repository of ground-truth evidence. Retrieval is forced into English because local-language gold passages simply do not exist.

🌏

Cultural Prior

Locale-tied queries contain native surface forms that act as retrieval anchors. A language appears "preferred" due to topic locality, not model tendency.

Key Finding: On MKQA, over 70% of gold passages exist only in English Wikipedia. English pivoting works because the evidence only exists in English.
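The gold-availability prior above is a simple corpus statistic: the share of queries whose gold passage exists only in the English Wikipedia. A minimal sketch with made-up data (the query IDs and language sets are illustrative, not from MKQA):

```python
# Toy illustration: fraction of queries whose gold passage exists only in
# English Wikipedia, i.e. the gold-availability prior. Data is made up.
gold_langs = {
    "q1": {"en"},            # gold only in English
    "q2": {"en", "ko"},      # gold in English and Korean
    "q3": {"en"},
    "q4": {"en", "de", "ja"},
}

# Count queries whose only gold-bearing Wikipedia is the English one.
english_only = sum(1 for langs in gold_langs.values() if langs == {"en"})
share = english_only / len(gold_langs)
print(f"{share:.0%} of queries have English-only gold")  # 50% in this toy set
```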

DeLP: Debiased Language Preference

DeLP regresses out structural confounds via ridge regression. The residual is the true debiased preference.

DeLP
Figure 2. DeLP measures intrinsic language preference by regressing out exposure, gold-availability, and cultural priors.

Prior Feature Vector

For each language pair \((L_q, L_d)\), we stack the exposure, gold-availability, and cultural prior covariates into a feature vector \(\phi(L_q, L_d)\), fit ridge coefficients \(\hat\beta_e\) against the raw preference scores \(s_e\), and take the residual (recentered at the mean score \(\mu_e\)) as the debiased preference:

\[ \mathrm{DeLP}_e(L_q, L_d) = s_e(L_q, L_d) - \phi(L_q, L_d)^\top \hat\beta_e + \mu_e \]
Monolingual Alignment Emerges: After calibration, the strongest signal moves to the diagonal (\(L_q = L_d\)). The dominant English preference disappears—retrievers fundamentally favor monolingual alignment.
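The debiasing step can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: three scalar prior covariates per language pair, synthetic scores, and a closed-form ridge fit; the exact feature construction and regularization strength in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 64 (L_q, L_d) pairs, each with three prior covariates
# (exposure, gold-availability, cultural). Scores s are synthetic.
phi = rng.random((64, 3))                                   # stacked phi(L_q, L_d)
s = phi @ np.array([0.8, 0.5, 0.3]) + 0.05 * rng.standard_normal(64)

# Center features and scores so the intercept is the mean score mu_e.
mu = s.mean()
phi_c = phi - phi.mean(axis=0)
s_c = s - mu

# Closed-form ridge fit: beta_hat = (Phi^T Phi + lam I)^{-1} Phi^T s
lam = 1.0
beta_hat = np.linalg.solve(phi_c.T @ phi_c + lam * np.eye(3), phi_c.T @ s_c)

# DeLP residual: raw score minus the prior-explained part, recentered at mu.
delp = s_c - phi_c @ beta_hat + mu
```

Whatever preference survives this regression cannot be explained by the three structural priors, which is what licenses reading the residual as intrinsic preference.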

DELTA: Preference-Aligned Query Fusion

DELTA fuses global and local cues into a single preference-aligned query—with no corpus or retriever modifications.

DELTA
Figure 3. DELTA fuses global and local query segments via repetition-based weighting.

Fused Query Structure

[GLOB]: English pivot for broad coverage
[LOCAL]: Native-language query for monolingual alignment
[TITLE_BRIDGE]: Paired titles for cross-lingual entity mapping
[ALIASES]: Global/local aliases as retrieval anchors
[LOCALE_HINT]: Region / disambiguation hint
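Concretely, the fused query is a single string concatenating the tagged segments above. A hedged sketch, where the helper name and all segment contents are illustrative rather than the paper's exact prompt:

```python
# Hypothetical assembly of DELTA's fused query from tagged segments.
# Segment tags follow the structure above; contents are illustrative.
def build_fused_query(glob, local, title_bridge, aliases, locale_hint):
    parts = [
        f"[GLOB] {glob}",
        f"[LOCAL] {local}",
        f"[TITLE_BRIDGE] {title_bridge}",
        f"[ALIASES] {aliases}",
        f"[LOCALE_HINT] {locale_hint}",
    ]
    return " ".join(parts)

query = build_fused_query(
    glob="capital of South Korea",
    local="대한민국의 수도",
    title_bridge="Seoul / 서울",
    aliases="Seoul; 서울특별시",
    locale_hint="kr",
)
```

Because the fusion happens entirely in the query string, it needs no changes to the corpus, the retriever, or its index.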

Repetition-Based Weighting
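Figure 3 describes weighting segments by repetition: repeating a segment's text increases its lexical and embedding-level influence on retrieval similarity. A minimal sketch, assuming integer repetition counts derived from per-segment weights (the weights and rounding rule here are illustrative, not the paper's values):

```python
# Hedged sketch of repetition-based weighting: higher-weight segments are
# repeated so they exert more influence on the retriever's similarity score.
def weight_by_repetition(segments, weights):
    out = []
    for (tag, text), w in zip(segments, weights):
        reps = max(1, round(w))  # repeat count derived from the weight
        out.append(f"[{tag}] " + " ".join([text] * reps))
    return " ".join(out)

fused = weight_by_repetition(
    [("GLOB", "capital of South Korea"), ("LOCAL", "대한민국의 수도")],
    weights=[1.0, 2.0],  # e.g. emphasize the local segment
)
# The local segment appears twice, tilting retrieval toward monolingual alignment.
```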

Experiments

Setup: MKQA · BGE-m3 retriever and re-ranker · retrieve Top-50 → re-rank → keep Top-5 · metric: character 3-gram recall · generators: Qwen3-235B, Gemini-2.5-Flash, DeepSeek-v3.1
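Character 3-gram recall scores the overlap between generated and gold answers at the character level, which is robust across scripts without language-specific tokenization. A minimal sketch; the exact normalization (casing, whitespace, multi-reference handling) is an assumption, not necessarily the paper's protocol:

```python
# Hedged sketch of character 3-gram recall: the fraction of the gold answer's
# character trigrams that also appear in the model's answer.
def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def char_3gram_recall(prediction, gold):
    gold_grams = char_ngrams(gold)
    if not gold_grams:
        return 0.0
    pred_grams = char_ngrams(prediction)
    return len(gold_grams & pred_grams) / len(gold_grams)

score = char_3gram_recall("The capital is Seoul.", "Seoul")  # 1.0: all trigrams covered
```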

Qwen3-235B

| Method | en | ar | es | zh | ja | de | ko | th | AVG↑ | Lat.↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *Document Level* |  |  |  |  |  |  |  |  |  |  |
| MultiRAG | 70.05 | 47.79 | 63.76 | 37.52 | 46.60 | 63.81 | 40.14 | 40.73 | 51.30 | 1.38 |
| CrossRAG | 68.21 | 43.95 | 61.14 | 37.81 | 44.75 | 60.16 | 38.13 | 42.87 | 49.63 | 1.29 |
| DKM-RAG | 69.13 | 42.69 | 62.12 | 35.13 | 43.90 | 61.13 | 39.49 | 38.88 | 49.06 | 3.80 |
| QTT-RAG | 70.11 | 46.44 | 63.02 | 37.68 | 46.94 | 62.79 | 44.13 | 42.12 | 51.65 | 1.80 |
| *Query Level* |  |  |  |  |  |  |  |  |  |  |
| Eng. Translation | – | 55.14 | 61.94 | 59.53 | 59.29 | 60.72 | 54.57 | 60.46 | 58.81 | 1.17 |
| DELTA (Ours) | 63.85 | 62.55 | 63.03 | 62.59 | 62.38 | 62.86 | 63.26 | 62.51 | 62.88 | 1.13 |

Gemini-2.5-Flash

| Method | en | ar | es | zh | ja | de | ko | th | AVG↑ | Lat.↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *Document Level* |  |  |  |  |  |  |  |  |  |  |
| MultiRAG | 58.26 | 40.79 | 55.11 | 30.81 | 44.26 | 53.52 | 35.97 | 31.65 | 43.80 | 1.53 |
| CrossRAG | 63.40 | 41.87 | 57.24 | 29.74 | 44.14 | 56.80 | 36.09 | 32.49 | 45.22 | 2.60 |
| DKM-RAG | 64.21 | 39.41 | 59.26 | 31.34 | 43.45 | 57.74 | 37.26 | 33.64 | 45.79 | 5.63 |
| QTT-RAG | 65.32 | 42.64 | 57.81 | 31.56 | 45.18 | 56.27 | 40.65 | 35.97 | 46.93 | 5.55 |
| *Query Level* |  |  |  |  |  |  |  |  |  |  |
| Eng. Translation | – | 48.44 | 55.84 | 53.59 | 53.68 | 55.17 | 47.67 | 54.86 | 52.75 | 1.55 |
| DELTA (Ours) | 56.97 | 56.45 | 55.95 | 55.83 | 56.18 | 55.98 | 56.44 | 56.45 | 56.28 | 1.48 |

DeepSeek-v3.1

| Method | en | ar | es | zh | ja | de | ko | th | AVG↑ | Lat.↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *Document Level* |  |  |  |  |  |  |  |  |  |  |
| MultiRAG | 60.77 | 43.64 | 56.22 | 33.14 | 44.72 | 54.16 | 34.21 | 36.80 | 45.46 | 2.56 |
| CrossRAG | 67.83 | 48.34 | 62.24 | 39.05 | 49.27 | 61.33 | 39.85 | 45.70 | 51.70 | 2.64 |
| DKM-RAG | 67.84 | 44.07 | 62.49 | 37.63 | 45.66 | 61.65 | 40.30 | 40.38 | 50.00 | 2.39 |
| QTT-RAG | 68.28 | 46.13 | 61.81 | 37.24 | 47.36 | 60.48 | 41.06 | 41.29 | 50.46 | 1.93 |
| *Query Level* |  |  |  |  |  |  |  |  |  |  |
| Eng. Translation | – | 50.97 | 58.32 | 56.11 | 56.52 | 56.92 | 50.49 | 57.52 | 55.26 | 2.05 |
| DELTA (Ours) | 59.85 | 59.46 | 58.61 | 59.67 | 59.02 | 59.25 | 53.51 | 56.45 | 58.23 | 1.13 |

Table 1. End-to-end mRAG performance (char 3-gram recall). "–" marks the English column for English translation, where it does not apply.
Key Results: DELTA achieves the best average across all three generators, outperforming document-level methods. Gains are largest on non-Latin scripts (ar, zh, ja, ko, th), and DELTA is the fastest method.

BibTeX

@article{park2026enhancing,
  title={Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion},
  author={Park, Jeonghyun and Kim, Byeongjeong and Hwang, Seojin and Lee, Hwanhee},
  journal={arXiv preprint arXiv:2601.02956},
  year={2026}
}