Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information. However, mRAG systems struggle with retrieving relevant information due to linguistic variations and generate inconsistent responses when multilingual sources conflict.
We systematically investigate language preferences in both retrieval and generation of mRAG. Our analysis indicates that retrievers prefer high-resource and query languages, yet this preference does not consistently improve generation. Generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose DKM-RAG, a framework that fuses translated multilingual passages with complementary model knowledge, mitigating language preference and enhancing performance across diverse linguistic settings.
Language preference is a critical issue in mRAG systems that leads to inaccurate outputs. The retriever may prioritize particular languages—especially high-resource or query-language documents—at the expense of truly relevant information. Even when relevant documents are retrieved, the generator might favor passages in Latin scripts, ignoring essential evidence in other languages.

Which languages does the retriever prefer, and how do query-document language relationships affect ranking?
Which languages does the generator prefer, and how do these preferences correlate with mRAG performance?
How can we mitigate language preference to improve overall mRAG system performance?
We introduce MLRS, a metric that quantifies language preference at the retriever level by measuring the shift in document rankings when non-query-language passages are translated into the query language. A high MLRS score indicates strong language preference.
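The ranking-shift idea behind MLRS can be sketched as follows. This is a minimal stand-in, not the paper's formal definition: the function name and the simple mean-shift normalization are assumptions.

```python
def mean_rank_shift(ranks_original, ranks_translated):
    """Average rank improvement for non-query-language passages after
    translating them into the query language.

    Larger positive values indicate stronger retriever preference for
    the query language (a simplified stand-in for the MLRS metric).
    Ranks are 1-based positions in the retrieved list (lower = better).
    """
    assert len(ranks_original) == len(ranks_translated)
    shifts = [orig - trans for orig, trans in zip(ranks_original, ranks_translated)]
    return sum(shifts) / len(shifts)

# Toy example: three passages move from ranks 10/7/5 to 2/3/4 after
# translation, i.e. an average shift of (8 + 4 + 1) / 3.
print(mean_rank_shift([10, 7, 5], [2, 3, 4]))
```

If translation into the query language leaves ranks unchanged, the shift is zero, indicating no language preference at the retriever level.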

When query and document languages match (\(L_q = L_d\)), the retriever shows the highest preference. Direct linguistic alignment avoids cross-lingual mapping complexities.
When \(L_d\) is English, the retriever exhibits nearly the highest preference—often outperforming even monolingual configurations—due to the abundance of English in pre-training data.
Romance languages (fr, it, pt, es) maintain relatively high cross-lingual preference due to lexical and structural similarities, narrowing the gap with monolingual setups.
The resource level of \(L_d\) has a pronounced impact. High-resource language documents consistently achieve the highest MLRS, regardless of query language.
We measure multilingual answer consistency across eight languages by generating responses from the same retrieved documents and computing embedding similarity between each pair of answers.
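The consistency measure above can be sketched as mean pairwise cosine similarity over the answer embeddings. The embeddings are assumed to come from any multilingual sentence encoder; toy vectors are used here for illustration.

```python
import itertools
import math

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_consistency(embeddings):
    """Mean pairwise cosine similarity over answers generated in
    different languages from the same retrieved documents."""
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Two identical answers and one divergent answer (toy 2-d embeddings).
score = answer_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

A perfectly consistent system scores 1.0; divergent answers across languages pull the score down.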

DKM-RAG combines externally retrieved, translated passages with internally rewritten passages enriched by the model's knowledge. This dual approach filters inaccuracies from retrieval while leveraging the LLM's parametric knowledge.

1. Retrieve the top-50 documents from multilingual sources.
2. Translate all passages into the query language \(L_q\).
3. The LLM refines the passages with its parametric knowledge.
4. Concatenate \(P_{\text{translated}} + P_{\text{refined}}\) as context for the final answer.
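The pipeline steps can be sketched as below. `translate`, `refine_with_llm`, and the passage-joining scheme are hypothetical stand-ins for the MT model and generator LLM; only the control flow reflects the DKM-RAG design.

```python
def translate(passage: str, target_lang: str) -> str:
    # Hypothetical stand-in for a translation model mapping into L_q.
    return f"[{target_lang}] {passage}"

def refine_with_llm(passage: str) -> str:
    # Hypothetical stand-in: the LLM rewrites the passage using its
    # parametric knowledge, filtering retrieval inaccuracies.
    return f"refined: {passage}"

def dkm_rag_context(retrieved: list[str], query_lang: str) -> str:
    """Build the generator's context: translated passages followed by
    their knowledge-refined counterparts."""
    translated = [translate(p, query_lang) for p in retrieved]
    refined = [refine_with_llm(p) for p in translated]
    return "\n".join(translated + refined)

context = dkm_rag_context(["doc one", "doc two"], "en")
```

The final answer is then generated from this concatenated context in the query language.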
Setup: MKQA benchmark (2.7K samples) · BGE-m3 retriever/re-ranker · Top-50 → Top-5 · Character 3-gram recall · Generators: aya-expanse-8B, Phi-4, Qwen2.5-7B, Llama3.1-8B
| Generator | all | en | zh | ko | fr | ja | it | pt | es | DKM-RAG |
|---|---|---|---|---|---|---|---|---|---|---|
| **Lq = en** | | | | | | | | | | |
| aya-expanse-8B | 80.09 | 79.34 | 63.08 | 64.46 | 76.13 | 61.20 | 75.47 | 75.65 | 76.32 | 82.60 |
| Phi-4 | 79.69 | 78.89 | 63.06 | 52.30 | 74.43 | 48.86 | 74.02 | 74.39 | 75.32 | 82.59 |
| Qwen2.5-7B | 80.15 | 79.11 | 50.31 | 64.90 | 76.28 | 62.62 | 75.47 | 75.97 | 76.54 | 82.60 |
| Llama3.1-8B | 80.25 | 79.28 | 61.99 | 65.81 | 76.40 | 62.58 | 75.89 | 76.09 | 76.47 | 82.57 |
| **Lq = zh** | | | | | | | | | | |
| aya-expanse-8B | 32.55 | 25.62 | 38.31 | 26.64 | 24.00 | 25.27 | 23.63 | 23.63 | 23.79 | 44.57 |
| Phi-4 | 16.75 | 17.57 | 36.76 | 17.50 | 18.15 | 17.56 | 18.19 | 17.89 | 18.44 | 44.56 |
| Qwen2.5-7B | 34.28 | 27.33 | 38.31 | 27.91 | 25.15 | 27.78 | 25.90 | 25.37 | 25.30 | 44.70 |
| Llama3.1-8B | 28.50 | 24.36 | 38.48 | 23.84 | 22.48 | 23.78 | 23.18 | 23.32 | 23.02 | 44.51 |
| **Lq = ko** | | | | | | | | | | |
| aya-expanse-8B | 40.60 | 38.08 | 26.01 | 49.66 | 25.37 | 26.82 | 24.98 | 25.26 | 25.51 | 55.01 |
| Phi-4 | 26.80 | 20.24 | 17.54 | 49.25 | 19.03 | 17.91 | 18.93 | 19.19 | 19.19 | 54.82 |
| Qwen2.5-7B | 36.50 | 22.87 | 20.08 | 49.44 | 21.79 | 20.94 | 21.65 | 21.44 | 21.52 | 54.85 |
| Llama3.1-8B | 37.18 | 26.48 | 22.88 | 49.87 | 24.46 | 24.86 | 25.23 | 24.87 | 25.22 | 54.99 |
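The character 3-gram recall used in the table above can be implemented as follows. This is one plausible set-based variant; the exact tokenization and normalization used in the experiments are assumptions.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    # Lowercase and collect all overlapping character n-grams.
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def char_3gram_recall(prediction: str, reference: str) -> float:
    """Fraction of the reference answer's character 3-grams that
    appear in the generated answer (set-based recall)."""
    ref = char_ngrams(reference)
    if not ref:
        return 0.0
    pred = char_ngrams(prediction)
    return len(ref & pred) / len(ref)

score = char_3gram_recall("paris is nice", "paris")
```

Character n-grams make the metric robust to tokenization differences across scripts, which matters when answers span languages like Chinese, Korean, and Japanese.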
@inproceedings{park2025investigating,
title = {Investigating Language Preference of Multilingual
{RAG} Systems},
author = {Park, Jeonghyun and Lee, Hwanhee},
booktitle = {Proceedings of the 63rd Annual Meeting of the
Association for Computational Linguistics (ACL)},
year = {2025}
}