Qwen3 Embedding

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Embedding

The hidden state at the [EOS] token is used as the embedding.

{Instruction} {Query}<|endoftext|>
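A minimal sketch of the input format and last-token pooling described above, in plain Python. The function names, the right-padding assumption, and the L2 normalization step are illustrative assumptions, not taken from the paper; in practice the hidden states would come from the model's final layer.

```python
import math

EOS = "<|endoftext|>"

def build_input(instruction, query):
    """Format the text as '{Instruction} {Query}<|endoftext|>'."""
    return f"{instruction} {query}{EOS}"

def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state at the last real token of each sequence.

    hidden_states: [batch][seq_len][dim] nested lists
    attention_mask: [batch][seq_len] of 0/1, assuming right padding,
    so the last real token is the appended EOS token.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1
        pooled.append(states[last_index])
    return pooled

def l2_normalize(vec):
    """Scale to unit length so a dot product equals cosine similarity (assumed step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy batch: the first sequence has one padding position at the end.
hidden = [
    [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]],
    [[0.0, 2.0], [5.0, 5.0], [0.0, 1.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
embeddings = [l2_normalize(v) for v in last_token_pool(hidden, mask)]
```

With left padding the last position of every row would be the EOS token, so the index computation would differ.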

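Training uses an improved InfoNCE loss.

Numerator S: exp(s(query, positive) / τ), where s is cosine similarity
Denominator Z: S plus the same exponentiated term for query vs. hard negatives, in-batch positive vs. other queries, and in-batch query vs. other documents
τ: temperature coefficient
False-negative mitigation: a negative is masked out of Z when its similarity exceeds the query-positive similarity by more than 0.1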
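The loss above can be sketched in plain Python as follows. This follows the terms as listed in these notes; the function name `masked_info_nce`, the default temperature, and the list-based data layout are assumptions for illustration, not the authors' implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def masked_info_nce(queries, positives, hard_negatives, tau=0.05, margin=0.1):
    """Mean InfoNCE loss with false-negative masking.

    queries[i] and positives[i] are vectors; hard_negatives[i] is a list
    of vectors. tau=0.05 is an illustrative default, not a value from the
    paper; margin=0.1 is the threshold quoted in the notes.
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        s_pos = cosine(queries[i], positives[i])
        # Negative terms: hard negatives plus in-batch terms.
        candidates = [cosine(queries[i], neg) for neg in hard_negatives[i]]
        for j in range(n):
            if j == i:
                continue
            candidates.append(cosine(positives[i], queries[j]))  # positive vs. other queries
            candidates.append(cosine(queries[i], positives[j]))  # query vs. other documents
        # Drop likely false negatives: similarity above s_pos + margin.
        kept = [s for s in candidates if s <= s_pos + margin]
        numerator = math.exp(s_pos / tau)
        z = numerator + sum(math.exp(s / tau) for s in kept)
        total += -math.log(numerator / z)
    return total / n

# One query, one positive, one orthogonal hard negative, tau = 1.
loss = masked_info_nce([[1.0, 0.0]], [[1.0, 0.0]], [[[0.0, 1.0]]], tau=1.0)
```

Masking only removes terms from Z, so a masked negative contributes nothing to the gradient instead of pushing a possibly relevant document away from the query.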

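Weakly supervised training on large-scale synthetic data

Task types: retrieval, bitext mining, semantic textual similarity (STS), classification
Query rewriting: the top-5 characters from a character (persona) library are placed in the prompt; the prompt specifies query type, query length, difficulty, language, and so on
150 million pairs are used for pre-training and 12 million high-quality pairs for SFT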

Rewriting prompt
Given a **Character**, **Passage**, and **Requirement**, generate a query from the **Character**’s perspective that satisfies the **Requirement** and can be used to retrieve the **Passage**. Please return the result in JSON format.

Here is an example:
<example>

Now, generate the **output** based on the **Character**, **Passage** and **Requirement** from user, the **Passage** will be in {corpus_language} language, the **Character** and **Requirement** will be in English. Ensure to generate only the JSON output, with the key in English and the value in {queries_language} language.

**Character**
{character}

**Passage**
{passage}

**Requirement**
– Type: {type};
– Difficulty: {difficulty};
– Length: the length of the generated sentences should be {length} words;
– Language: the language in which the results are generated should be {language} language;

Model merging (slerp): improves the model's robustness and generalization across different data distributions
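For reference, spherical linear interpolation (slerp) between two weight vectors can be sketched as below. Treating each checkpoint as a flat list of floats and the fallback to linear interpolation for nearly parallel or nearly opposite vectors are assumptions made for this illustration; the notes do not say which checkpoints are merged or with what interpolation weight.

```python
import math

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flat weight vectors.

    t=0 returns w0, t=1 returns w1. Falls back to plain linear
    interpolation when sin(angle) is close to zero, where the slerp
    formula is numerically unstable.
    """
    dot = sum(a * b for a, b in zip(w0, w1))
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        return [(1.0 - t) * a + t * b for a, b in zip(w0, w1)]
    c0 = math.sin((1.0 - t) * omega) / sin_omega
    c1 = math.sin(t * omega) / sin_omega
    return [c0 * a + c1 * b for a, b in zip(w0, w1)]

# Halfway between two orthogonal unit vectors stays on the unit circle.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a plain average, which would give a vector of norm about 0.707 in this example, slerp follows the arc between the two vectors and so preserves their norm when both have the same length.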

Performance

Table 2: Performance on MTEB Multilingual (2025.06.04)

Table 3: Performance on MTEB English, MTEB Chinese, MTEB Code

Reranking

The input is a query and a document; the hidden state of the last token is passed through the LM head, and the output is the probability of "Yes".
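One way to turn the LM-head output into a relevance score is sketched below. Normalizing over only the "yes" and "no" logits is an assumption made for this example (the notes only say the output is the probability of "Yes"), and `yes_probability` is a hypothetical helper name.

```python
import math

def yes_probability(yes_logit, no_logit):
    """Softmax over the two candidate answer tokens, returning P(yes).

    The max logit is subtracted first for numerical stability.
    """
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# Logits read from the LM head at the last token position (toy values).
score = yes_probability(2.0, 0.0)
```

The resulting score lies in (0, 1), so documents can be sorted by it directly.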

Training method: SFT + model merging

Performance:

Table 4: Evaluation results for reranking models.
The embedding model first retrieves the top-100 candidates, which are then reranked to produce the final results.
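The two-stage evaluation setup can be sketched as follows. Here `rerank_score` is a hypothetical callable standing in for the reranking model (it maps a document index to a relevance score), and cosine similarity over precomputed vectors stands in for the embedding retrieval stage.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, top_k=100):
    """Stage 1: keep the top_k documents by embedding similarity.
    Stage 2: reorder only those candidates by the reranker score.
    Returns document indices, best first.
    """
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    return sorted(candidates, key=rerank_score, reverse=True)

# Toy example with top_k=2: document 2 never reaches the reranker.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_scores = {0: 0.2, 1: 0.9, 2: 1.0}
ranking = retrieve_then_rerank([1.0, 0.0], docs, toy_scores.get, top_k=2)
```

Because the reranker only sees the retrieved candidates, its final quality is bounded by the recall of the embedding stage.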

Related literature

Compared Methods: We compare our models with the most prominent open-source text embedding models and commercial API services. The open-source models include the GTE (Li et al., 2023; Zhang et al., 2024b), E5 (Wang et al., 2022), and BGE (Xiao et al., 2024) series, as well as NV-Embed-v2 (Lee et al., 2025a) and GritLM-7B (Muennighoff et al., 2025). The commercial APIs evaluated are text-embedding-3-large from OpenAI, Gemini-embedding from Google, and Cohere-embed-multilingual-v3.0. For reranking, we compare with the rerankers of jina, mGTE (Zhang et al., 2024b) and BGE-m3 (Chen et al., 2024).

Improving general text embedding model: Tackling task conflict and data imbalance through model merging.

NV-embed: Improved techniques for training LLMs as generalist embedding models.

MTEB: Massive text embedding benchmark.

Rankvicuna: Zero-shot listwise document reranking with open-source large language models.

Gemini embedding: Generalizable embeddings from Gemini.
