ACL 2025

Investigating Language Preference of Multilingual RAG Systems

Jeonghyun Park, Hwanhee Lee
Chung-Ang University

Abstract

Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information. However, mRAG systems struggle with retrieving relevant information due to linguistic variations and generate inconsistent responses when multilingual sources conflict.

We systematically investigate language preferences in both retrieval and generation of mRAG. Our analysis indicates that retrievers prefer high-resource and query languages, yet this preference does not consistently improve generation. Generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose DKM-RAG, a framework that fuses translated multilingual passages with complementary model knowledge, mitigating language preference and enhancing performance across diverse linguistic settings.

Motivation

Language preference is a critical issue in mRAG systems that leads to inaccurate outputs. The retriever may prioritize particular languages—especially high-resource or query-language documents—at the expense of truly relevant information. Even when relevant documents are retrieved, the generator might favor passages in Latin scripts, ignoring essential evidence in other languages.

Figure 1. Failure cases of multilingual RAG: the retriever prioritizes certain languages over relevant content (Case 1), and the generator ignores evidence in non-preferred languages (Case 2).
RQ1 (§4)

Retriever Preference

Which languages does the retriever prefer, and how do query-document language relationships affect ranking?

RQ2 (§5)

Generator Preference

Which languages does the generator prefer, and how do these preferences correlate with mRAG performance?

RQ3 (§6)

Mitigation

How can we mitigate language preference to improve overall mRAG system performance?

MLRS: MultiLingualRankShift

We introduce MLRS, a metric that quantifies language preference at the retriever level by measuring the shift in document rankings when non-query-language passages are translated into the query language. A high MLRS score indicates strong language preference.

Figure 2. Overall framework of calculating MLRS: retrieve multilingual documents, translate non-query-language documents, re-rank, and measure rank improvements.
\[ \text{MLRS}_q = \frac{\Delta r_q}{\Delta r_q^{\max}} \times 100, \quad \text{where } \Delta r_q = \sum_{d} \Delta r_d \text{ and } \Delta r_d = \max(r_d^{\text{init}} - r_d^{\text{re-rank}}, 0) \]
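The metric can be sketched in a few lines. This is a minimal illustration, assuming the per-query score sums per-document rank improvements and normalizes by the largest possible total shift (every document jumping to rank 1); the paper's exact normalizer may differ.

```python
def mlrs(init_ranks, rerank_ranks):
    """Per-query MLRS sketch: normalized upward rank shift after
    translating non-query-language documents into the query language.

    init_ranks / rerank_ranks: dict mapping doc id -> rank (1 = best)
    over the same candidate set, before and after translation.
    """
    # Per-document improvement: only upward moves count.
    delta = {d: max(init_ranks[d] - rerank_ranks[d], 0) for d in init_ranks}
    total = sum(delta.values())
    # Assumed upper bound: every document rises to rank 1.
    max_total = sum(r - 1 for r in init_ranks.values())
    return 100.0 * total / max_total if max_total else 0.0
```

A score of 0 means translation never improved a document's rank; 100 means every document jumped to the top after translation.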

Retriever Preference Findings

Monolingual Alignment

When query and document languages match (\(L_q = L_d\)), the retriever shows the highest preference. Direct linguistic alignment avoids cross-lingual mapping complexities.

English Dominance

When \(L_d\) is English, the retriever exhibits nearly the highest preference—often outperforming even monolingual configurations—due to the abundance of English data in pre-training.

Language Family Effect

Romance languages (fr, it, pt, es) maintain relatively high cross-lingual preference due to lexical and structural similarities, narrowing the gap with monolingual setups.

Document Resources Matter

The resource level of \(L_d\) has a pronounced impact. High-resource language documents consistently achieve the highest MLRS, regardless of query language.

Generator Language Preference

We measure multilingual answer consistency across eight languages by generating responses from the same retrieved documents and computing embedding similarity between each pair of answers.
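The consistency score reduces to mean pairwise cosine similarity between answer embeddings. A minimal sketch, assuming each answer has already been embedded by some multilingual sentence encoder (the vectors below are placeholders, not model outputs):

```python
import itertools

import numpy as np

def pairwise_consistency(embeddings):
    """Mean cosine similarity over all pairs of answer embeddings.

    embeddings: dict mapping language code -> 1-D answer embedding,
    e.g. from a multilingual sentence encoder (hypothetical here).
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cos(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)
```

Higher scores mean the generator gives semantically similar answers regardless of which language the passages are in.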

Figure 3. Average generator language preference across three query languages (en, zh, ko). Latin-script languages show consistently higher consistency.
Latin Script Preference: Generators produce more consistent responses in Latin-script languages (en, fr, it, pt, es) compared to non-Latin languages (ko, zh, ja), suggesting structural advantages in token alignment.
Weak Correlation for Non-English: While generators generally prefer English passages, they achieve optimal performance when passages directly match the query language. Translating everything to English does not always help—linguistic compatibility matters more.

DKM-RAG: Dual Knowledge Multilingual RAG

DKM-RAG combines externally retrieved, translated passages with internally rewritten passages enriched by the model's knowledge. This dual approach filters inaccuracies from retrieval while leveraging the LLM's parametric knowledge.

Figure 4. DKM-RAG pipeline: retrieve → translate to \(L_q\) → rewrite with LLM knowledge → concatenate both for final generation.
1. Retrieve & Re-rank — Top-50 documents from multilingual sources
2. Translate — All passages translated into the query language \(L_q\)
3. Rewrite — The LLM refines passages with its parametric knowledge
4. Generate — Concatenate \(P_{\text{translated}} + P_{\text{refined}}\) for the final answer

Key Insight: Translation alone can carry irrelevant content from high-resource languages. The rewriting step leverages the LLM's internal knowledge to filter inaccuracies and enrich passages with missing but relevant information (e.g., "the executive branch" appearing only in \(P_\text{refined}\)).
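The four-step pipeline can be sketched as a thin wrapper around pluggable components. All callables below (`retriever`, `translate`, `rewrite`, `generate`) are hypothetical stand-ins for whatever retriever, MT system, and LLM one plugs in; this is not the paper's released code.

```python
def dkm_rag(query, query_lang, retriever, translate, rewrite, generate, k=5):
    """Sketch of the DKM-RAG pipeline with hypothetical components.

    1. Retrieve & re-rank multilingual candidates, keep top-k.
    2. Translate every passage into the query language L_q.
    3. Rewrite each translated passage with the LLM's own knowledge.
    4. Generate from the concatenation of both passage sets.
    """
    passages = retriever(query, top_k=k)                              # step 1
    translated = [translate(p, target=query_lang) for p in passages]  # step 2
    refined = [rewrite(query, p) for p in translated]                 # step 3
    context = "\n\n".join(translated + refined)                       # step 4
    return generate(query, context)
```

Keeping both the translated and the refined passages in the prompt is the point of the dual-knowledge design: the translated set grounds the answer in retrieval, while the refined set can supply facts the retrieved text missed.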

Experiments

Setup: MKQA benchmark (2.7K samples) · BGE-m3 retriever/re-ranker · Top-50 → Top-5 · Character 3-gram recall · Generators: aya-expanse-8B, Phi-4, Qwen2.5-7B, Llama3.1-8B
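Character 3-gram recall, the evaluation metric, can be computed as the fraction of the reference answer's character 3-grams that appear in the prediction. This is one common formulation, sketched under the assumption of multiset counting over raw characters; the paper's exact variant (e.g., whitespace or casing handling) may differ.

```python
from collections import Counter

def char_ngram_recall(reference, prediction, n=3):
    """Fraction of the reference's character n-grams (as a multiset)
    that also occur in the prediction."""
    def grams(s):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    ref, pred = grams(reference), grams(prediction)
    if not ref:
        return 0.0
    overlap = sum(min(count, pred[g]) for g, count in ref.items())
    return overlap / sum(ref.values())
```

A recall-oriented character metric is a reasonable choice for multilingual QA because it avoids penalizing tokenization differences across scripts.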

| Generator | all | en | zh | ko | fr | ja | it | pt | es | DKM-RAG |
|---|---|---|---|---|---|---|---|---|---|---|
| **\(L_q\) = en** | | | | | | | | | | |
| aya-expanse-8B | 80.09 | 79.34 | 63.08 | 64.46 | 76.13 | 61.20 | 75.47 | 75.65 | 76.32 | **82.60** |
| Phi-4 | 79.69 | 78.89 | 63.06 | 52.30 | 74.43 | 48.86 | 74.02 | 74.39 | 75.32 | **82.59** |
| Qwen2.5-7B | 80.15 | 79.11 | 50.31 | 64.90 | 76.28 | 62.62 | 75.47 | 75.97 | 76.54 | **82.60** |
| Llama3.1-8B | 80.25 | 79.28 | 61.99 | 65.81 | 76.40 | 62.58 | 75.89 | 76.09 | 76.47 | **82.57** |
| **\(L_q\) = zh** | | | | | | | | | | |
| aya-expanse-8B | 32.55 | 25.62 | 38.31 | 26.64 | 24.00 | 25.27 | 23.63 | 23.63 | 23.79 | **44.57** |
| Phi-4 | 16.75 | 17.57 | 36.76 | 17.50 | 18.15 | 17.56 | 18.19 | 17.89 | 18.44 | **44.56** |
| Qwen2.5-7B | 34.28 | 27.33 | 38.31 | 27.91 | 25.15 | 27.78 | 25.90 | 25.37 | 25.30 | **44.70** |
| Llama3.1-8B | 28.50 | 24.36 | 38.48 | 23.84 | 22.48 | 23.78 | 23.18 | 23.32 | 23.02 | **44.51** |
| **\(L_q\) = ko** | | | | | | | | | | |
| aya-expanse-8B | 40.60 | 38.08 | 26.01 | 49.66 | 25.37 | 26.82 | 24.98 | 25.26 | 25.51 | **55.01** |
| Phi-4 | 26.80 | 20.24 | 17.54 | 49.25 | 19.03 | 17.91 | 18.93 | 19.19 | 19.19 | **54.82** |
| Qwen2.5-7B | 36.50 | 22.87 | 20.08 | 49.44 | 21.79 | 20.94 | 21.65 | 21.44 | 21.52 | **54.85** |
| Llama3.1-8B | 37.18 | 26.48 | 22.88 | 49.87 | 24.46 | 24.86 | 25.23 | 24.87 | 25.22 | **54.99** |
Table 1. Character 3-gram recall across passage languages for each query language \(L_q\). The column matching \(L_q\) is the monolingual setting; bold marks the best result per row. DKM-RAG outperforms all single-language and multilingual settings.
Results: DKM-RAG consistently outperforms all baselines across every generator and query language. For non-English queries, it bridges the language gap by +6 to +8 points over the best single-language setting. Even for English queries, it surpasses the mixed-language "all" baseline by ~2.5 points.

BibTeX

@inproceedings{park2025investigating,
  title     = {Investigating Language Preference of Multilingual {RAG} Systems},
  author    = {Park, Jeonghyun and Lee, Hwanhee},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2025}
}