April 22, 2026

Qwen3 Embedding

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Embedding

The hidden state corresponding to the [EOS] token is used as the embedding:

{Instruction} {Query}<|endoftext|>
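
Taking the hidden state at the last token can be sketched as follows. This is a minimal sketch, assuming right-padded batches and final-layer hidden states already available as a NumPy array; the function name `last_token_pool` and the array shapes are illustrative and not taken from the paper's code.

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden_states:  (batch, seq_len, dim) final-layer outputs.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for right padding (assumed).
    """
    # Index of the last real token; with the template above this is the end-of-text position.
    last_idx = attention_mask.sum(axis=1) - 1
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    # L2-normalize so that dot products between embeddings equal cosine similarities.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

With left padding the last column could be taken directly; the mask-based index above is only needed for right-padded batches.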

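Trained with an improved InfoNCE loss

  • Numerator S = cosine similarity between the query and its positive
  • Denominator Z = S + cosine similarity between the query and its hard negatives + cosine similarity between the in-batch positive and the other queries + cosine similarity between the query and the other in-batch documents
  • Temperature coefficient τ: each similarity s enters the loss as exp(s/τ)
  • False-negative mitigation: a candidate whose similarity exceeds S + 0.1 is masked out of Z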
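
The loss described in the list above can be sketched in NumPy. This is a simplified sketch, not the paper's implementation: it includes only the three negative terms listed in these notes, the temperature value `tau=0.05` is an assumption, and the function name `masked_info_nce` is hypothetical. Only the 0.1 margin comes from the notes.

```python
import numpy as np

def masked_info_nce(q, d_pos, d_neg, tau=0.05, margin=0.1):
    """Simplified InfoNCE with false-negative masking.

    q:     (n, dim) L2-normalized query embeddings.
    d_pos: (n, dim) L2-normalized positives; d_pos[i] pairs with q[i].
    d_neg: (n, k, dim) L2-normalized hard negatives per query.
    tau is an assumed temperature; margin=0.1 follows the notes above.
    """
    n = q.shape[0]
    s_pos = np.sum(q * d_pos, axis=1)            # (n,)   query vs. its positive
    s_hard = np.einsum("nd,nkd->nk", q, d_neg)   # (n, k) query vs. its hard negatives
    s_qd = q @ d_pos.T                           # (n, n) query i vs. in-batch document j
    s_dq = s_qd.T                                # (n, n) positive i vs. in-batch query j

    off_diag = ~np.eye(n, dtype=bool)            # exclude the i == j entries
    threshold = (s_pos + margin)[:, None]
    # Candidates scoring above s_pos + margin are treated as false negatives and masked.
    m_hard = s_hard <= threshold
    m_qd = off_diag & (s_qd <= threshold)
    m_dq = off_diag & (s_dq <= threshold)

    pos_term = np.exp(s_pos / tau)
    z = (pos_term
         + np.sum(m_hard * np.exp(s_hard / tau), axis=1)
         + np.sum(m_dq * np.exp(s_dq / tau), axis=1)
         + np.sum(m_qd * np.exp(s_qd / tau), axis=1))
    return float(np.mean(-np.log(pos_term / z)))
```

Masking removes a term from Z rather than relabeling it as a positive, so a suspected false negative simply stops contributing gradient.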

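Weakly supervised training on large-scale synthetic data

  • Task types: retrieval, bitext mining, semantic textual similarity (STS), classification
  • Query rewriting: the top-5 characters retrieved from the persona library are placed in the prompt; the prompt specifies query type, query length, difficulty, language, and so on
  • 150 million pairs for pre-training, 12 million high-quality pairs for SFT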

Rewriting prompt

Given a **Character**, **Passage**, and **Requirement**, generate a query from the **Character**’s perspective that satisfies the **Requirement** and can be used to retrieve the **Passage**. Please return the result in JSON format.

Here is an example:
<example>

Now, generate the **output** based on the **Character**, **Passage** and **Requirement** from user, the **Passage** will be in {corpus_language} language, the **Character** and **Requirement** will be in English. Ensure to generate only the JSON output, with the key in English and the value in {queries_language} language.

**Character**
{character}

**Passage**
{passage}

**Requirement**
– Type: {type};
– Difficulty: {difficulty};
– Length: the length of the generated sentences should be {length} words;
– Language: the language in which the results are generated should be {language} language;
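
The placeholder fields in the prompt can be filled programmatically. The following is a minimal sketch using Python string formatting; `build_requirement_block` and the sample values in the usage note are hypothetical, and only the Character/Passage/Requirement part of the prompt is covered.

```python
REQUIREMENT_TEMPLATE = (
    "**Character**\n{character}\n\n"
    "**Passage**\n{passage}\n\n"
    "**Requirement**\n"
    "– Type: {type};\n"
    "– Difficulty: {difficulty};\n"
    "– Length: the length of the generated sentences should be {length} words;\n"
    "– Language: the language in which the results are generated "
    "should be {language} language;\n"
)

def build_requirement_block(character, passage, query_type, difficulty, length, language):
    """Fill the per-sample fields of the rewriting prompt (illustrative helper)."""
    return REQUIREMENT_TEMPLATE.format(
        character=character,
        passage=passage,
        type=query_type,
        difficulty=difficulty,
        length=length,
        language=language,
    )
```

For example, `build_requirement_block("a high-school teacher", "Some passage.", "keywords", "high_school", 10, "English")` yields one prompt body; calling it once per candidate character gives one prompt per character.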

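Model merging with Slerp (spherical linear interpolation): improves the model's robustness and generalization across different data distributions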
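
Spherical linear interpolation between two weight tensors can be sketched as below. This is a generic slerp under stated assumptions: it interpolates a single pair of tensors with one factor `t`, and falls back to linear interpolation when the tensors are nearly parallel. How checkpoints and interpolation factors are chosen for Qwen3 Embedding is not covered in these notes.

```python
import numpy as np

def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two same-shaped weight tensors."""
    w0 = np.asarray(w0, dtype=float)
    w1 = np.asarray(w1, dtype=float)
    # Angle between the two tensors, measured on their flattened, normalized versions.
    u0 = w0.ravel() / (np.linalg.norm(w0) + eps)
    u1 = w1.ravel() / (np.linalg.norm(w1) + eps)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or anti-parallel): slerp is ill-conditioned, use plain lerp.
        return (1.0 - t) * w0 + t * w1
    return (np.sin((1.0 - t) * omega) * w0 + np.sin(t * omega) * w1) / sin_omega
```

Applied to a full model, the same function would be called per parameter tensor, for example over the matching entries of two checkpoints' state dictionaries.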

Ablation study

Performance

Table 2: Performance on MTEB Multilingual (2025.06.04)
Table 3: Performance on MTEB English, MTEB Chinese, MTEB Code
Table 7: Results on MTEB(eng, v2)
Table 8: Results on C-MTEB

Reranking

The input is the Query and the Document; the hidden state of the last token is passed through the LM head, and the output is the probability of "Yes"
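
Turning the LM-head output into a relevance score can be sketched as follows. This sketch assumes the score is normalized over the "yes" and "no" tokens only, and that their vocabulary ids are known; `yes_probability`, `yes_id`, and `no_id` are illustrative names.

```python
import numpy as np

def yes_probability(logits, yes_id, no_id):
    """Relevance score from the LM-head logits at the final position.

    logits: (vocab_size,) logits for the next token.
    yes_id / no_id: vocabulary ids of the "yes" and "no" tokens (assumed known).
    """
    pair = np.array([logits[yes_id], logits[no_id]], dtype=float)
    pair -= pair.max()                      # subtract the max for numerical stability
    probs = np.exp(pair) / np.exp(pair).sum()
    return float(probs[0])                  # probability mass assigned to "yes"
```

Restricting the softmax to two tokens keeps the score in (0, 1) regardless of how much probability the model assigns to the rest of the vocabulary.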

Training method: SFT + model merging

Performance:

Table 4: Evaluation results for reranking models.
The embedding model first retrieves the top-100 candidates, which are then reranked to produce the final results
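
The two-stage evaluation setup can be sketched as a retrieve-then-rerank function. This is a minimal sketch assuming L2-normalized embeddings and a caller-supplied scoring function; `retrieve_then_rerank` and `rerank_score` are hypothetical names, with `rerank_score` standing in for the reranking model.

```python
import numpy as np

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, top_k=100):
    """Retrieve top_k documents by embedding similarity, then reorder them.

    query_vec:    (dim,) L2-normalized query embedding.
    doc_vecs:     (n_docs, dim) L2-normalized document embeddings.
    rerank_score: callable mapping a document index to a relevance score
                  (placeholder for the reranker).
    """
    sims = doc_vecs @ query_vec                  # cosine similarity for normalized inputs
    k = min(top_k, len(sims))
    candidates = np.argsort(-sims)[:k]           # stage 1: embedding retrieval
    scores = np.array([rerank_score(int(i)) for i in candidates])
    order = np.argsort(-scores, kind="stable")   # stage 2: reorder by reranker score
    return [int(candidates[j]) for j in order]
```

A document that misses the first-stage cutoff can never be recovered by the reranker, so the final quality is bounded by the recall of the embedding model at top_k.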

Related work

Compared Methods. We compare our models with the most prominent open-source text embedding models and commercial API services. The open-source models include the GTE (Li et al., 2023; Zhang et al., 2024b), E5 (Wang et al., 2022), and BGE (Xiao et al., 2024) series, as well as NV-Embed-v2 (Lee et al., 2025a) and GritLM-7B (Muennighoff et al., 2025). The commercial APIs evaluated are text-embedding-3-large from OpenAI, Gemini-embedding from Google, and Cohere-embed-multilingual-v3.0. For reranking, we compare with the rerankers of Jina, mGTE (Zhang et al., 2024b), and BGE-m3 (Chen et al., 2024).

Improving general text embedding model: Tackling task conflict and data imbalance through model merging.

NV-Embed: Improved techniques for training LLMs as generalist embedding models.

MTEB: Massive text embedding benchmark.

RankVicuna: Zero-shot listwise document reranking with open-source large language models.

Gemini embedding: Generalizable embeddings from Gemini.
