Table of Contents

  • Introduction
  • Evaluation Data
  • Evaluating the Model

    Introduction

      Text embeddings are usually evaluated on a small number of datasets from a single task, which says little about how they behave elsewhere: it is unclear whether embeddings that are state of the art on semantic textual similarity (STS) also work well for tasks such as clustering or reranking. This makes progress in the field hard to track, since new models keep appearing without a proper cross-task evaluation.
      To address this, the Hugging Face team introduced the Massive Text Embedding Benchmark (MTEB). MTEB covers 8 embedding task types, 58 datasets, and 112 languages, making it the most comprehensive text-embedding benchmark to date.
      MTEB source code: https://github.com/embeddings-benchmark/mteb
      MTEB paper: https://arxiv.org/abs/2210.07316
      MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
      C-MTEB is currently the most comprehensive benchmark for Chinese embeddings, covering 6 task categories (retrieval, reranking, sentence similarity, pair classification, classification, clustering) across 35 datasets, 31 of which are used in this article.
      C-MTEB evaluation code: https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB
      C-MTEB paper: https://arxiv.org/abs/2309.07597

    Evaluation Data

      For well-known reasons, the Hugging Face website cannot be reached directly from some regions, so this article downloads the datasets through a friendlier mirror.

      mteb 1.12.4 switched to ISO language codes, which breaks the task_langs usage below, so we pin version 1.1.1 for now.
      pip install mteb==1.1.1
      pip install C_MTEB
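Both scripts below depend on the HF_ENDPOINT environment variable being set before any Hugging Face library is imported, since huggingface_hub reads the endpoint at import time. A minimal sketch of the mirror setup (the cache path is a placeholder):

```python
import os

# Point all Hugging Face downloads at the mirror; this must happen
# before importing huggingface_hub / datasets in the same process.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Optional: move the download cache onto a larger disk (placeholder path).
os.environ['HF_HOME'] = '/root/data3/hf_cache'

print(os.environ['HF_ENDPOINT'])
```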

    # -*- coding: utf-8 -*-
    # Author  : liyanpeng
    # Email   : yanpeng.li@cumt.edu.cn
    # Datetime: 2024/5/28 18:23
    # Filename: download_data.py
    from mteb import MTEB
    
    import os
    import subprocess
    
    os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'  # must be set before any download
    data_path = '/root/data3/liyanpeng/hf_data'  # adjust to your environment
    
    def show_dataset():
        """Print name/hf_hub_name/type/category for every Chinese task in mteb."""
        evaluation = MTEB(task_langs=["zh", "zh-CN"])
        dataset_list = []
        for task in evaluation.tasks:
            if task.description.get('name') not in dataset_list:
                dataset_list.append(task.description.get('name'))
                desc = 'name: {}\t\thf_name: {}\t\ttype: {}\t\tcategory: {}'.format(
                    task.description.get('name'), task.description.get('hf_hub_name'),
                    task.description.get('type'), task.description.get('category'),
                )
                print(desc)
        print(len(dataset_list))
    
    def download_dataset():
        """Download each Chinese task's dataset repo via huggingface-cli."""
        evaluation = MTEB(task_langs=["zh", "zh-CN"])
        err_list = []
        for task in evaluation.tasks:
            # task.load_data()  # alternative: let mteb fetch data itself
            # hf_hub_name is the dataset's repo id under https://huggingface.co/datasets/
            task_name = task.description.get('hf_hub_name')
            print(task_name)
            cmd = ['huggingface-cli', 'download', '--repo-type', 'dataset', '--resume-download',
                   '--local-dir-use-symlinks', 'False', task_name, '--local-dir', os.path.join(data_path, task_name)]
            try:
                subprocess.run(cmd, check=True)
            except subprocess.CalledProcessError:
                err_list.append(task_name)
                print('failed to download {}'.format(task_name))
    
        if err_list:
            print('download failed: \n', '\n'.join(err_list))
        else:
            print('download success.')
    
    if __name__ == '__main__':
        download_dataset()
        show_dataset()
    

      There are 31 datasets in total.
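To confirm that everything landed, you can scan the download directory. A small sketch, assuming the layout produced by download_dataset above (each dataset stored in a subfolder named after its repo id, i.e. org/name):

```python
import os

def list_downloaded(data_path: str):
    """Return the repo ids (org/name) of dataset folders present under data_path."""
    found = []
    for org in sorted(os.listdir(data_path)):
        org_dir = os.path.join(data_path, org)
        if not os.path.isdir(org_dir):
            continue
        for name in sorted(os.listdir(org_dir)):
            if os.path.isdir(os.path.join(org_dir, name)):
                found.append('{}/{}'.format(org, name))
    return found

if __name__ == '__main__':
    path = '/root/data3/liyanpeng/hf_data'  # adjust to your data_path
    if os.path.isdir(path):
        datasets = list_downloaded(path)
        print('{} datasets downloaded:'.format(len(datasets)))
        for name in datasets:
            print(' ', name)
```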


    Evaluating the Model

      Use C-MTEB to evaluate a model:

    # -*- coding: utf-8 -*-
    # Author  : liyanpeng
    # Email   : yanpeng.li@cumt.edu.cn
    # Datetime: 2024/7/14 13:04
    # Filename: cmteb.py
    from mteb import MTEB
    from C_MTEB import ChineseTaskList, load_retrieval_data
    from sentence_transformers import SentenceTransformer
    from datasets import load_dataset
    import os
    from prettytable import PrettyTable
    
    
    def print_table(task_names, scores):
        tb = PrettyTable()
        tb.field_names = task_names
        tb.add_row(scores)
        print(tb)
    
    
    # os.environ['HF_DATASETS_CACHE'] = data_path
    # os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
    
    
    if __name__ == '__main__': 
        data_path = '/root/data3/liyanpeng/hf_data'
        model_name = "/root/data3/emb_models/sensenova/piccolo-base-zh"
        
        model = SentenceTransformer(model_name)
        evaluation = MTEB(tasks=ChineseTaskList)
        for task in evaluation.tasks:
            if task.description.get('type') == 'Retrieval':
                task.corpus, task.queries, task.relevant_docs = load_retrieval_data(os.path.join(data_path, task.description["hf_hub_name"]),
                                                                                    task.description['eval_splits'])
            else:
                dataset = load_dataset(
                    path=os.path.join(data_path, task.description["hf_hub_name"]),
                    revision=task.description.get("revision", None),
                    # task_eval_splits=task.description.get("eval_splits", [])
                )
                task.dataset = dataset
            task.data_loaded = True
        results = evaluation.run(model, output_folder='zh_results/piccolo-base-zh')
        print(results)
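
evaluation.run also writes one JSON file per task into the output folder, which is convenient for summarizing a finished run later. A sketch of collecting headline scores from those files, assuming each file maps a split name such as "test" to a flat dict of metric values (nested per-language results are simply skipped):

```python
import glob
import json
import os

def collect_scores(result_dir: str, split: str = 'test'):
    """Gather one headline metric per task JSON written by evaluation.run.

    Picks 'main_score' when present, otherwise the first numeric metric
    in alphabetical order. Non-numeric (nested) entries are ignored.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(result_dir, '*.json'))):
        with open(path) as f:
            data = json.load(f)
        metrics = data.get(split, {})
        numeric = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        if not numeric:
            continue
        metric = 'main_score' if 'main_score' in numeric else sorted(numeric)[0]
        task = os.path.splitext(os.path.basename(path))[0]
        rows.append((task, metric, numeric[metric]))
    return rows

if __name__ == '__main__':
    for task, metric, score in collect_scores('zh_results/piccolo-base-zh'):
        print('{:<30} {:<12} {:.4f}'.format(task, metric, score))
```

The same rows could be fed to the print_table helper defined in the script above instead of plain formatting.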
    

    Author: 夏小悠

    Source: 物联沃-IOTWORD物联网 » MTEB评估基准使用指北
