finetune_llm

2024-10-13

Author: fmh (1522009317@qq.com)

If your résumé says you fine-tuned a model on private-domain data, you need to be able to explain the how and why behind it.

# bge-embedding finetuning

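Why fine-tune, and what problem does it solve? To adapt the model to the downstream task and improve its performance there.
What do pre-training and fine-tuning each mean, and what is the main difference? Pre-training is like undergraduate study; fine-tuning is like specialized graduate research.
What is SBERT, and which techniques improve the quality of the embeddings it produces? SBERT is a toolkit that builds on pre-trained models such as BERT to produce sentence-level embedding vectors.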

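## Background

Note that BGE uses the [CLS] representation as the representation of the whole sentence. Using the wrong pooling method (such as mean pooling) makes the results much worse.

### Using the BGE-Embedding model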

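```python
from sentence_transformers import SentenceTransformer

# Two small batches of sample sentences (Chinese, to match the zh model).
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# normalize_embeddings=True returns unit-length vectors.
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

# Pairwise similarity matrix of shape (len(sentences_1), len(sentences_2)).
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```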
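Because `normalize_embeddings=True` scales every vector to unit length, the matrix product in the snippet above is a cosine-similarity matrix. A minimal pure-Python sketch of that identity, using made-up 2-D vectors in place of real embeddings (the helper names `l2_normalize`, `dot`, and `cosine` are illustrative, not part of any library):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for sentence embeddings (assumed values).
u = [3.0, 4.0]
v = [4.0, 3.0]

sim_normalized = dot(l2_normalize(u), l2_normalize(v))
sim_cosine = cosine(u, v)
print(sim_normalized, sim_cosine)  # both are approximately 0.96
```

If the embeddings were not normalized, the plain dot product would also depend on vector length, so scores across sentences would be harder to compare.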

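The original BERT model (see the BERT paper reading notes) produces token-level embeddings. When it is used directly for sentence-level tasks, the usual fallback is to take the [CLS] token representation as the embedding of the whole sentence, which may fail to capture the sentence's global semantics. (BERT produces contextualized token embeddings.)

SBERT is designed specifically for sentence representations. After encoding a sentence with BERT, SBERT typically mean-pools the output embeddings of all tokens (CLS pooling can also be used) to produce a fixed-length, sentence-level vector.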
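The two pooling strategies mentioned above can be sketched in plain Python. This is only an illustration on a toy encoder output (4 token positions, 2 dimensions, values made up); the function names are hypothetical, and real models operate on tensors rather than lists:

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy encoder output; the last position is padding.
token_embeddings = [
    [1.0, 0.0],  # [CLS]
    [0.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding, must not affect the mean
]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(token_embeddings))                   # [1.0, 0.0]
print(mean_pooling(token_embeddings, attention_mask))  # [1.0, 2.0]
```

The two strategies give different vectors for the same input, which is why a model trained with one pooling mode should be used with that same mode at inference time.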

Structure of Sentence Transformer Models

A Sentence Transformer model consists of a collection of modules (docs) that are executed sequentially. The most common architecture is a combination of a Transformer module, a Pooling module, and optionally, a Dense module and/or a Normalize module.

- Transformer: This module is responsible for processing the input text and generating contextualized embeddings.
- Pooling: This module reduces the dimensionality of the output from the Transformer module by aggregating the embeddings. Common pooling strategies include mean pooling and CLS pooling.
- Dense: This module contains a linear layer that post-processes the embedding output from the Pooling module.
- Normalize: This module normalizes the embedding from the previous layer.

### Using a general SBERT model

For example, the popular all-MiniLM-L6-v2 model can also be loaded by initializing the 3 specific modules that make up that model:

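```python
from sentence_transformers import models, SentenceTransformer

# Transformer module: produces contextualized token embeddings.
transformer = models.Transformer("sentence-transformers/all-MiniLM-L6-v2", max_seq_length=256)
# Pooling module: mean-pools the token embeddings into one sentence vector.
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
# Normalize module: scales the sentence vector to unit length.
normalize = models.Normalize()

# The modules run in the order given.
model = SentenceTransformer(modules=[transformer, pooling, normalize])
```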

## Data format

{"query": str, "pos": List[str], "neg": List[str]}

From the docs: if you have no negative samples, you can randomly draw samples from the corpus to use as negatives.
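A minimal sketch of that random-negative fallback, assuming the training file stores one JSON object per line. The function name `add_random_negatives`, the toy corpus, and the default of two negatives per query are illustrative choices, not taken from the source:

```python
import json
import random

def add_random_negatives(records, corpus, num_neg=2, seed=0):
    """Fill in `neg` for each record by sampling corpus texts that are
    neither the query nor one of its positives."""
    rng = random.Random(seed)
    out = []
    for rec in records:
        excluded = set(rec["pos"]) | {rec["query"]}
        candidates = [text for text in corpus if text not in excluded]
        neg = rng.sample(candidates, min(num_neg, len(candidates)))
        out.append({"query": rec["query"], "pos": rec["pos"], "neg": neg})
    return out

# Toy corpus and query/positive pairs (made up for illustration).
corpus = ["doc a", "doc b", "doc c", "doc d", "doc e"]
records = [
    {"query": "q1", "pos": ["doc a"]},
    {"query": "q2", "pos": ["doc b", "doc c"]},
]

train_records = add_random_negatives(records, corpus)
# One JSON object per line, matching the {"query", "pos", "neg"} schema.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in train_records)
print(jsonl)
```

`ensure_ascii=False` keeps Chinese text readable in the output file instead of escaping it.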

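Fine-tuning on private-domain data is needed because the chosen open-source model is a general-purpose base model: its representations of the task domain's data are not strong enough (they are not discriminative enough for the task).

How the positive and negative samples were built in practice. Preprocessing: group the samples by issue category (the category selected when the report was filed), for example industrial pollution, drainage facility problems, household waste, and so on. Because the quality of the reports varies, samples that did not match their category were filtered out by hand.

Positive samples: samples from the same issue category (sentences with similar meaning).
Negative samples: samples from other issue categories (sentences with unrelated meaning).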
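The category-based construction described above could look roughly like the sketch below. The function name, the English sample texts, and the per-query counts (one positive, two negatives) are hypothetical; the source only specifies that positives come from the same category and negatives from other categories:

```python
import random

def build_category_records(samples_by_category, num_pos=1, num_neg=2, seed=0):
    """Build {"query", "pos", "neg"} records from category-grouped samples."""
    rng = random.Random(seed)
    records = []
    for category, samples in samples_by_category.items():
        # Every sample filed under a different category is a negative candidate.
        others = [s for c, group in samples_by_category.items()
                  if c != category for s in group]
        for query in samples:
            same = [s for s in samples if s != query]
            if not same or not others:
                continue  # need at least one positive and one negative
            records.append({
                "query": query,
                "pos": rng.sample(same, min(num_pos, len(same))),
                "neg": rng.sample(others, min(num_neg, len(others))),
            })
    return records

# Hypothetical reports, assumed to be already manually filtered.
samples_by_category = {
    "industrial pollution": ["factory discharges dark wastewater",
                             "chemical plant emits pungent smoke"],
    "drainage facilities": ["manhole cover is missing",
                            "storm drain is blocked"],
    "household waste": ["garbage piled up by the river",
                        "trash bins are overflowing"],
}

records = build_category_records(samples_by_category)
print(len(records))  # 6
```

Each sample serves once as a query, so the number of records equals the number of samples in categories that have at least two members.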

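## Hard negatives

A hard negative is a sentence that looks very similar to the input sentence on the surface but actually means something quite different. Because the model struggles to tell such sentences apart, training on these hard-to-distinguish negatives pushes it to learn finer semantic differences.

Reducing overfitting and improving generalization: by contending with hard negatives, the model has to learn to capture finer-grained semantic information, which improves its performance in real applications, especially its ability to distinguish between similar sentences.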
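One way to see why hard negatives carry more training signal is to compare loss values. The sketch below assumes an InfoNCE-style contrastive loss with a temperature of 0.05; the source does not say which loss was used, and the similarity scores are made up:

```python
import math

def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """Negative log softmax probability of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

pos_score = 0.80  # assumed query-positive similarity
loss_easy = info_nce_loss(pos_score, neg_scores=[0.10])  # clearly unrelated negative
loss_hard = info_nce_loss(pos_score, neg_scores=[0.75])  # near-duplicate negative
print(loss_easy, loss_hard)
```

Under these assumptions the easy negative yields a loss close to zero, so it barely changes the model, while the hard negative yields a much larger loss and therefore a larger update.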

How do you obtain hard negatives? Select them from FAISS retrieval results.
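A minimal sketch of mining hard negatives from retrieval results. To stay self-contained it uses a brute-force inner-product search as a stand-in for a FAISS index lookup; the function names, toy vectors, and the values of `k` and `num_neg` are assumptions for illustration:

```python
def top_k(query_vec, doc_vecs, k):
    """Brute-force inner-product search; a stand-in for a FAISS index lookup."""
    scores = [(sum(q * d for q, d in zip(query_vec, vec)), idx)
              for idx, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores[:k]]

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=4, num_neg=2):
    """Keep the highest-ranked retrieved documents that are not labeled positive."""
    retrieved = top_k(query_vec, doc_vecs, k)
    return [idx for idx in retrieved if idx not in positive_ids][:num_neg]

# Toy document embeddings (made up); index 0 is the labeled positive.
doc_vecs = [
    [1.0, 0.0],   # 0: labeled positive
    [0.9, 0.1],   # 1: close to the query but not a positive
    [0.8, 0.3],   # 2: also close
    [0.0, 1.0],   # 3: unrelated
    [-1.0, 0.0],  # 4: opposite direction
]
query_vec = [1.0, 0.0]

hard_negs = mine_hard_negatives(query_vec, doc_vecs, positive_ids={0})
print(hard_negs)  # [1, 2]
```

The very top hits can include unlabeled true positives, so a common precaution is to sample negatives from a lower rank range rather than always taking the closest ones.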