A Guide to Using the MTEB Evaluation Benchmark
Introduction
Text embeddings are usually evaluated on a small number of datasets from a single task, which do not cover the other tasks they might be applied to. It is unclear whether state-of-the-art embeddings for semantic textual similarity (STS) work equally well for other tasks such as clustering or reranking. This makes progress in the field hard to track, as new models are constantly being proposed without proper evaluation.

To address this, the Hugging Face team introduced the Massive Text Embedding Benchmark (MTEB). MTEB covers 8 embedding tasks with a total of 58 datasets and 112 languages, making it the most comprehensive text embedding benchmark to date.
MTEB source code: https://github.com/embeddings-benchmark/mteb
MTEB paper: https://arxiv.org/abs/2210.07316
MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
C-MTEB is currently the most comprehensive benchmark for Chinese semantic embeddings, covering 6 categories of evaluation tasks (retrieval, reranking, sentence similarity, inference, classification, clustering) across 35 (31) related datasets.

C-MTEB evaluation code: https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB
C-MTEB paper: https://arxiv.org/abs/2309.07597
Evaluation Data
For well-known reasons, the Hugging Face website cannot be accessed directly, so this article provides a fairly friendly mirror-based approach to downloading the datasets.

Because mteb switched to ISO language codes in version 1.12.4, the task_langs parameter no longer works well there, so for now we pin version 1.1.1.

```shell
pip install mteb==1.1.1
pip install C_MTEB
```
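Since the pinned version matters here, note that version strings cannot be compared as plain strings: lexicographically, "1.12.4" sorts before "1.2.0". A minimal stdlib sketch for a safe comparison, assuming the version is plain dotted integers:

```python
def version_tuple(v: str) -> tuple:
    """Turn a dotted version string like '1.12.4' into a comparable int tuple."""
    return tuple(int(part) for part in v.split('.'))

# String comparison gets versions wrong:
assert '1.12.4' < '1.2.0'  # True as strings, but wrong as versions

# Integer tuples compare correctly:
assert version_tuple('1.12.4') > version_tuple('1.2.0')
assert version_tuple('1.12.4') > version_tuple('1.1.1')
```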
```python
# -*- coding: utf-8 -*-
# Author : liyanpeng
# Email : yanpeng.li@cumt.edu.cn
# Datetime: 2024/5/28 18:23
# Filename: download_data.py
import os
import subprocess

from mteb import MTEB

# Route Hugging Face downloads through the hf-mirror.com mirror
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

data_path = '/root/data3/liyanpeng/hf_data'


def show_dataset():
    """List every Chinese-language task known to mteb, one line per dataset."""
    evaluation = MTEB(task_langs=["zh", "zh-CN"])
    dataset_list = []
    for task in evaluation.tasks:
        if task.description.get('name') not in dataset_list:
            dataset_list.append(task.description.get('name'))
            desc = 'name: {}\t\thf_name: {}\t\ttype: {}\t\tcategory: {}'.format(
                task.description.get('name'), task.description.get('hf_hub_name'),
                task.description.get('type'), task.description.get('category'),
            )
            print(desc)
    print(len(dataset_list))


def download_dataset():
    """Download the dataset behind every Chinese-language task via huggingface-cli."""
    evaluation = MTEB(task_langs=["zh", "zh-CN"])
    err_list = []
    for task in evaluation.tasks:
        # task.load_data()
        # https://huggingface.co/datasets/
        task_name = task.description.get('hf_hub_name')
        print(task_name)
        cmd = ['huggingface-cli', 'download', '--repo-type', 'dataset', '--resume-download',
               '--local-dir-use-symlinks', 'False', task_name,
               '--local-dir', os.path.join(data_path, task_name)]
        try:
            subprocess.run(cmd, check=True)
        except subprocess.CalledProcessError:
            err_list.append(task_name)
            print('{} download failed'.format(task_name))
    if err_list:
        print('download failed: \n', '\n'.join(err_list))
    else:
        print('download success.')


if __name__ == '__main__':
    download_dataset()
    show_dataset()
```
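For reference, each subprocess call above is equivalent to running huggingface-cli directly in a shell. A small sketch that prints the exact command line the script would execute for one dataset (the repo id `C-MTEB/TNews` is used here purely as an illustration; the real names come from `task.description`):

```python
import os
import shlex

data_path = '/root/data3/liyanpeng/hf_data'
task_name = 'C-MTEB/TNews'  # illustrative repo id, not taken from the script's task list

cmd = ['huggingface-cli', 'download', '--repo-type', 'dataset', '--resume-download',
       '--local-dir-use-symlinks', 'False', task_name,
       '--local-dir', os.path.join(data_path, task_name)]

# shlex.join renders the argv list as a copy-pasteable shell command
print(shlex.join(cmd))
```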
There are 31 datasets in total:
Evaluating a Model

Use C-MTEB to evaluate a model:
```python
# -*- coding: utf-8 -*-
# Author : liyanpeng
# Email : yanpeng.li@cumt.edu.cn
# Datetime: 2024/7/14 13:04
# Filename: cmteb.py
import os

from mteb import MTEB
from C_MTEB import ChineseTaskList, load_retrieval_data
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from prettytable import PrettyTable


def print_table(task_names, scores):
    tb = PrettyTable()
    tb.field_names = task_names
    tb.add_row(scores)
    print(tb)


# os.environ['HF_DATASETS_CACHE'] = data_path
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

if __name__ == '__main__':
    data_path = '/root/data3/liyanpeng/hf_data'
    model_name = "/root/data3/emb_models/sensenova/piccolo-base-zh"
    model = SentenceTransformer(model_name)

    evaluation = MTEB(tasks=ChineseTaskList)
    for task in evaluation.tasks:
        if task.description.get('type') == 'Retrieval':
            # Retrieval tasks need corpus, queries and qrels loaded separately
            task.corpus, task.queries, task.relevant_docs = load_retrieval_data(
                os.path.join(data_path, task.description["hf_hub_name"]),
                task.description['eval_splits'])
        else:
            dataset = load_dataset(
                path=os.path.join(data_path, task.description["hf_hub_name"]),
                revision=task.description.get("revision", None),
                # task_eval_splits=task.description.get("eval_splits", [])
            )
            task.dataset = dataset
        # Mark the task as pre-loaded so evaluation.run() does not re-download it
        task.data_loaded = True

    results = evaluation.run(model, output_folder="zh_results/piccolo-base-zh")
    print(results)
```
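evaluation.run writes one JSON file per task into output_folder, in addition to returning the results. A hedged sketch for collecting one metric per task from those files afterwards: the layout assumed below (a top-level key per split such as "test" holding a "main_score" field) matches what mteb 1.1.1 typically emits, but treat it as an assumption and adapt the keys to your actual files. The demo runs on a synthetic file with a made-up number:

```python
import json
import os
import tempfile


def collect_scores(results_dir, split='test', metric='main_score'):
    """Read every per-task JSON in results_dir and pull out one metric per task.

    Assumes each file holds a dict with a per-split sub-dict containing
    the metric; adjust the keys to what your mteb version actually writes.
    """
    scores = {}
    for fname in sorted(os.listdir(results_dir)):
        if not fname.endswith('.json'):
            continue
        with open(os.path.join(results_dir, fname), encoding='utf-8') as f:
            data = json.load(f)
        split_scores = data.get(split, {})
        if metric in split_scores:
            scores[fname[:-len('.json')]] = split_scores[metric]
    return scores


# Demo on a synthetic result file (the score value is made up):
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'TNews.json'), 'w', encoding='utf-8') as f:
        json.dump({'mteb_version': '1.1.1', 'test': {'main_score': 0.49}}, f)
    print(collect_scores(d))  # {'TNews': 0.49}
```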
Author: 夏小悠