A Guide to Efficiently Downloading Hugging Face Models and Datasets

Download Hugging Face model repositories and datasets quickly and conveniently.

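  • Method 1: A CLI tool for downloading Hugging Face models and datasets with aria2/wget + git
  • Features
  • Usage
  • Method 2: Model download (personal usage notes)
  • Preserving the directory structure
  • Dataset download
  • Limitations

    Method 1: A CLI tool for downloading Hugging Face models and datasets with aria2/wget + git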

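    Source: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f

    How to use it: copy hfd.sh to your machine, then follow the reference commands below to download a dataset or a model.

    🤗 Hugging Face model downloader

    Because the official huggingface-cli lacks multi-threaded download support and hf_transfer has inadequate error handling, this command-line tool uses wget or aria2 to download LFS files and git clone to fetch everything else.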

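    Features

  • ⏯️ Resumable: you can press Ctrl+C at any time and re-run the command to continue from where it stopped.
  • 🚀 Multi-threaded download: uses multiple threads to speed up the download.
  • 🚫 File exclusion: use --exclude or --include to skip or select files, which saves time for models that ship the same weights in several formats (e.g. *.bin or *.safetensors).
  • 🔐 Authentication support: for gated models that require a Hugging Face login, authenticate with --hf_username and --hf_token.
  • 🪞 Mirror site support: set the HF_ENDPOINT environment variable.
  • 🌍 Proxy support: set the HTTPS_PROXY environment variable.
  • 📦 Simple: depends only on git and aria2c/wget.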
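The --include and --exclude options take wildcard patterns that are matched against each file path in the repository. As a rough illustration of that filtering (a sketch in Python rather than the script's own bash; the function name `select_files` and the sample file list are made up for this example), the selection could look like this:

```python
from fnmatch import fnmatchcase


def select_files(files, include=None, exclude=None):
    """Return the files that would still be downloaded under the given patterns."""
    selected = []
    for name in files:
        if include and not fnmatchcase(name, include):
            continue  # skipped: does not match the include pattern
        if exclude and fnmatchcase(name, exclude):
            continue  # skipped: matches the exclude pattern
        selected.append(name)
    return selected


# Hypothetical file list, for illustration only
files = ["flax_model.msgpack", "model.safetensors", "onnx/decoder_model.onnx"]
print(select_files(files, exclude="*.safetensors"))
print(select_files(files, include="onnx/*"))
```

As in the script, the pattern is tested against the whole relative path, so `onnx/*` selects files inside the onnx folder.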
    Usage

    First, download hfd.sh or clone the repository, then make the script executable.

    chmod a+x hfd.sh
    

    For convenience, you can create an alias:

    alias hfd="$PWD/hfd.sh"
    

    Help output:

    $ ./hfd.sh -h
    Usage:
      hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
    
    Description:
      Downloads a model or dataset from Hugging Face using the provided repo ID.
    
    Parameters:
      repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
      --include       (Optional) Flag to specify a string pattern to include files for downloading.
      --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
      include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
      --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
      --hf_token      (Optional) Hugging Face token for authentication.
      --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
      -x              (Optional) Number of download threads for aria2c. Defaults to 4.
      --dataset       (Optional) Flag to indicate downloading a dataset.
      --local-dir     (Optional) Local directory path where the model or dataset will be stored.
    
    Example:
      hfd bigscience/bloom-560m --exclude *.safetensors
      hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
      hfd lavita/medical-qa-shared-task-v1-toy --dataset
    

    Download a model:

    hfd bigscience/bloom-560m
    

    Download a model that requires login

    Get a Hugging Face token from https://huggingface.co/settings/tokens, then run:

    hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
    

    Download a model while excluding certain files (e.g. .safetensors):

    hfd bigscience/bloom-560m --exclude *.safetensors
    

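    Download with aria2c and multiple threads (aria2c with 4 threads is the default; pass -x to change the thread count):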

    hfd bigscience/bloom-560m
    

    Output
    During the download, the file URLs are displayed:

    $ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
    ...
    Start Downloading lfs files, bash script:
    
    wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
    # wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
    wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
    ...
    
    # Install packages
    apt update
    apt-get install aria2
    apt-get install iftop
    apt-get install git-lfs
    # Reference command
    bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet
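
The download URLs printed by the tool follow a fixed layout: endpoint, repository ID (prefixed with datasets/ for a dataset), then resolve/main/ and the file path. The helper below is only a sketch of that layout; the function name `resolve_url` and its keyword arguments are assumptions made for this example, and it does not URL-encode special characters in file names.

```python
def resolve_url(repo_id, filename, endpoint="https://huggingface.co", dataset=False):
    """Build the direct download URL for one file on the main branch."""
    prefix = "datasets/" if dataset else ""
    return f"{endpoint}/{prefix}{repo_id}/resolve/main/{filename}"


print(resolve_url("bigscience/bloom-560m", "flax_model.msgpack"))
# The file name below is hypothetical, used only to show the dataset prefix
print(resolve_url("lavita/medical-qa-shared-task-v1-toy", "data/train.parquet", dataset=True))
```

Passing a different `endpoint` plays the same role as setting the HF_ENDPOINT environment variable for the script.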
    

    hfd.sh

    #!/usr/bin/env bash
    # Color definitions
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    NC='\033[0m' # No Color
    
    trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT
    
    display_help() {
        cat << EOF
    Usage:
      hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]    
    
    Description:
      Downloads a model or dataset from Hugging Face using the provided repo ID.
    
    Parameters:
      repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
      --include       (Optional) Flag to specify a string pattern to include files for downloading.
      --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
      include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
      --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
      --hf_token      (Optional) Hugging Face token for authentication.
      --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
      -x              (Optional) Number of download threads for aria2c. Defaults to 4.
      --dataset       (Optional) Flag to indicate downloading a dataset.
      --local-dir     (Optional) Local directory path where the model or dataset will be stored.
    
    Example:
      hfd bigscience/bloom-560m --exclude *.safetensors
      hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
      hfd lavita/medical-qa-shared-task-v1-toy --dataset
    EOF
        exit 1
    }
    
    MODEL_ID=$1
    shift
    
    # Default values
    TOOL="aria2c"
    THREADS=4
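    # Note: this copy of the script defaults to the hf-mirror.com mirror; export HF_ENDPOINT=https://huggingface.co to use the official site
    HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}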
    
    while [[ $# -gt 0 ]]; do
        case $1 in
            --include) INCLUDE_PATTERN="$2"; shift 2 ;;
            --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
            --hf_username) HF_USERNAME="$2"; shift 2 ;;
            --hf_token) HF_TOKEN="$2"; shift 2 ;;
            --tool) TOOL="$2"; shift 2 ;;
            -x) THREADS="$2"; shift 2 ;;
            --dataset) DATASET=1; shift ;;
            --local-dir) LOCAL_DIR="$2"; shift 2 ;;
            *) shift ;;
        esac
    done
    
    # Check if aria2, wget, curl, git, and git-lfs are installed
    check_command() {
        if ! command -v $1 &>/dev/null; then
            echo -e "${RED}$1 is not installed. Please install it first.${NC}"
            exit 1
        fi
    }
    
    # Mark current repo safe when using shared file system like samba or nfs
    ensure_ownership() {
        if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
            git config --global --add safe.directory "${PWD}"
            printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}" 
        fi
    }
    
    [[ "$TOOL" == "aria2c" ]] && check_command aria2c
    [[ "$TOOL" == "wget" ]] && check_command wget
    check_command curl; check_command git; check_command git-lfs
    
    [[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help
    
    if [[ -z "$LOCAL_DIR" ]]; then
        LOCAL_DIR="${MODEL_ID#*/}"
    fi
    
    if [[ "$DATASET" == 1 ]]; then
        MODEL_ID="datasets/$MODEL_ID"
    fi
    echo "Downloading to $LOCAL_DIR"
    
    if [ -d "$LOCAL_DIR/.git" ]; then
        printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
        cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
    else
        REPO_URL="$HF_ENDPOINT/$MODEL_ID"
        GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
        echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
        response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
        if [ "$response" == "401" ] || [ "$response" == "403" ]; then
            if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
                printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
                exit 1
            fi
            REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
        elif [ "$response" != "200" ]; then
            printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
            printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
            curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
        fi
        echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
    
        GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" "$LOCAL_DIR" && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
    
        ensure_ownership
    
        while IFS= read -r file; do
            truncate -s 0 "$file"
        done <<< "$(git lfs ls-files | cut -d ' ' -f 3-)"
    fi
    
    printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
    files=$(git lfs ls-files | cut -d ' ' -f 3-)
    declare -a urls
    
    while IFS= read -r file; do
        url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
        file_dir=$(dirname "$file")
        mkdir -p "$file_dir"
        if [[ "$TOOL" == "wget" ]]; then
            download_cmd="wget -c \"$url\" -O \"$file\""
            [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
        else
            download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
            [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        fi
        [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
        [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
        printf "%s\n" "$download_cmd"
        urls+=("$url|$file")
    done <<< "$files"
    
    for url_file in "${urls[@]}"; do
        IFS='|' read -r url file <<< "$url_file"
        printf "${YELLOW}Start downloading ${file}.\n${NC}" 
        file_dir=$(dirname "$file")
        # Build the optional auth header once, so a failed authenticated download is not silently retried without the token
        auth_header=()
        [[ -n "$HF_TOKEN" ]] && auth_header=(--header="Authorization: Bearer ${HF_TOKEN}")
        if [[ "$TOOL" == "wget" ]]; then
            wget "${auth_header[@]}" -c "$url" -O "$file"
        else
            aria2c "${auth_header[@]}" --console-log-level=error --file-allocation=none -x "$THREADS" -s "$THREADS" -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
        fi
        [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
    done
    
    printf "${GREEN}Download completed successfully.\n${NC}"
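
Before cloning, the script probes the repository's git refs URL with curl and branches on the HTTP status code: 200 means the clone can proceed, 401 or 403 means credentials are required, and anything else is treated as an error. That decision can be restated as a small pure function. This is a sketch only; the function name `next_step` and the returned labels are invented for this example and do not appear in hfd.sh.

```python
def next_step(status, has_credentials):
    """Map the probe's HTTP status code to the action the script takes next."""
    if status == 200:
        return "clone"  # public repository, clone anonymously
    if status in (401, 403):
        # gated or private repository: username and token are both required
        return "clone_with_auth" if has_credentials else "need_credentials"
    return "error"  # unexpected status: print debug output and stop


for status, creds in [(200, False), (401, False), (403, True), (404, False)]:
    print(status, creds, "->", next_step(status, creds))
```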
    
    

    Method 2: Model download (personal usage notes)

    This code does not preserve the repository's directory structure; see the improved version below.

    import datetime
    import os
    import threading

    from huggingface_hub import hf_hub_url
    from huggingface_hub.hf_api import HfApi
    from huggingface_hub.utils import filter_repo_objects

    # Run a shell command and log its start and end times
    def execCmd(cmd):
        print("Command %s started at %s" % (cmd, datetime.datetime.now()))
        os.system(cmd)
        print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


    if __name__ == '__main__':
        # Name of the Hugging Face repository to download
        repo_id = "Salesforce/blip2-opt-2.7b"
        # Local storage path
        save_path = './blip2-opt-2.7b'

        # Fetch the repository information
        _api = HfApi()
        repo_info = _api.repo_info(
            repo_id=repo_id,
            repo_type="model",
            revision='main',
            token=None,
        )

        # Collect the list of files in the repository
        filtered_repo_files = list(
            filter_repo_objects(
                items=[f.rfilename for f in repo_info.siblings],
                allow_patterns=None,
                ignore_patterns=None,
            )
        )

        cmds = []
        threads = []

        # Build the list of commands to run
        for file in filtered_repo_files:
            # Download URL of the file
            url = hf_hub_url(repo_id=repo_id, filename=file)
            # Resumable download command (every file lands directly in save_path)
            cmds.append(f'wget -c {url} -P {save_path}')
        print(cmds)

        print("Program started at %s" % datetime.datetime.now())
        # One thread per file
        for cmd in cmds:
            th = threading.Thread(target=execCmd, args=(cmd,))
            th.start()
            threads.append(th)
        for th in threads:
            th.join()
        print("Program finished at %s" % datetime.datetime.now())
    

    Preserving the directory structure

    import datetime
    import os
    import threading
    from pathlib import Path

    from huggingface_hub import hf_hub_url
    from huggingface_hub.hf_api import HfApi
    from huggingface_hub.utils import filter_repo_objects

    # Run a shell command and log its start and end times
    def execCmd(cmd):
        print("Command %s started at %s" % (cmd, datetime.datetime.now()))
        os.system(cmd)
        print("Command %s finished at %s" % (cmd, datetime.datetime.now()))

    if __name__ == '__main__':
        # Name of the Hugging Face repository to download
        repo_id = "Salesforce/blip2-opt-2.7b"
        # Local storage path
        save_path = './blip2-opt-2.7b'

        # Create the local save directory
        Path(save_path).mkdir(parents=True, exist_ok=True)

        # Fetch the repository information
        _api = HfApi()
        repo_info = _api.repo_info(
            repo_id=repo_id,
            repo_type="model",
            revision='main',
            token=None,
        )

        # Collect the list of files in the repository
        filtered_repo_files = list(
            filter_repo_objects(
                items=[f.rfilename for f in repo_info.siblings],
                allow_patterns=None,
                ignore_patterns=None,
            )
        )

        cmds = []
        threads = []

        # Build the list of commands to run
        for file in filtered_repo_files:
            # Download URL of the file
            url = hf_hub_url(repo_id=repo_id, filename=file)
            # Create the matching subdirectory locally
            local_file = os.path.join(save_path, file)
            local_dir = os.path.dirname(local_file)
            Path(local_dir).mkdir(parents=True, exist_ok=True)
            # Resumable download command
            cmds.append(f'wget -c {url} -P {local_dir}')
        print(cmds)

        print("Program started at %s" % datetime.datetime.now())
        # One thread per file
        for cmd in cmds:
            th = threading.Thread(target=execCmd, args=(cmd,))
            th.start()
            threads.append(th)
        for th in threads:
            th.join()
        print("Program finished at %s" % datetime.datetime.now())
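
The difference between the two versions comes down to which directory each file is handed to wget with. The sketch below isolates that step; the function name `local_dir_for` and the sample file names are hypothetical, and it uses posixpath because repository file names always use forward slashes.

```python
import posixpath


def local_dir_for(save_path, repo_file):
    """Directory a repository file should be saved into, keeping its subfolders."""
    return posixpath.dirname(posixpath.join(save_path, repo_file))


# Hypothetical file list: two files share a base name but live in different folders
files = ["a/config.json", "b/config.json", "README.md"]

flat = {posixpath.basename(f) for f in files}          # what a single target directory keeps
nested = {local_dir_for("./model", f) for f in files}  # one target directory per subfolder

print(sorted(flat))
print(sorted(nested))
```

With a single target directory, the two config.json files would end up competing for the same local name; with one directory per subfolder they stay separate.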
    

    Dataset download

    import datetime
    import os
    import threading
    from pathlib import Path

    from huggingface_hub import HfApi, hf_hub_url
    from huggingface_hub.utils import filter_repo_objects

    # Run a shell command and log its start and end times
    def execCmd(cmd):
        print("Command %s started at %s" % (cmd, datetime.datetime.now()))
        os.system(cmd)
        print("Command %s finished at %s" % (cmd, datetime.datetime.now()))

    if __name__ == '__main__':
        # ID of the dataset to download
        dataset_id = "openai/webtext"
        # Local storage path
        save_path = './webtext'

        # Create the local save directory
        Path(save_path).mkdir(parents=True, exist_ok=True)

        # Fetch the dataset information (the keyword argument is repo_id)
        _api = HfApi()
        dataset_info = _api.dataset_info(
            repo_id=dataset_id,
            revision='main',
            token=None,
        )

        # Collect the list of files in the dataset repository
        filtered_dataset_files = list(
            filter_repo_objects(
                items=[f.rfilename for f in dataset_info.siblings],
                allow_patterns=None,
                ignore_patterns=None,
            )
        )

        cmds = []
        threads = []

        # Build the list of commands to run
        for file in filtered_dataset_files:
            # Download URL of the file; repo_type="dataset" selects the datasets/ URL prefix
            url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")
            # Create the matching subdirectory locally
            local_file = os.path.join(save_path, file)
            local_dir = os.path.dirname(local_file)
            Path(local_dir).mkdir(parents=True, exist_ok=True)
            # Resumable download command
            cmds.append(f'wget -c {url} -P {local_dir}')
        print(cmds)

        print("Program started at %s" % datetime.datetime.now())
        # One thread per file
        for cmd in cmds:
            th = threading.Thread(target=execCmd, args=(cmd,))
            th.start()
            threads.append(th)
        for th in threads:
            th.join()
        print("Program finished at %s" % datetime.datetime.now())
    

    Limitations

    Repositories that require authorization are not supported.
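
One possible way around this, borrowed from how hfd.sh calls wget, is to send the token in an Authorization: Bearer header. The snippet below is an untested-against-the-Hub sketch that only builds the command string; the function name `wget_command` and the placeholder token are assumptions for this example.

```python
import shlex


def wget_command(url, local_dir, token=None):
    """Build a resumable wget command, adding a bearer token header when one is given."""
    argv = ["wget", "-c", url, "-P", local_dir]
    if token:
        argv.insert(1, f"--header=Authorization: Bearer {token}")
    # shlex.join quotes each argument so the header survives the shell intact
    return shlex.join(argv)


url = "https://huggingface.co/meta-llama/Llama-2-7b/resolve/main/config.json"
print(wget_command(url, "./Llama-2-7b"))
print(wget_command(url, "./Llama-2-7b", token="YOUR_HF_TOKEN"))  # placeholder token
```

The returned string can be appended to the command list in place of the plain f-string. Fetching the file list for a gated repository would also need the token passed to the `token` argument that the scripts currently leave as None.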

    A repository with many files may start a very large number of threads, because one thread is created per file.
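
A common way to cap the number of simultaneous downloads is a fixed-size thread pool from the standard library. The following is a sketch rather than a drop-in patch; the function name `run_all`, the default of 4 workers, and the stand-in runner are choices made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, run, max_workers=4):
    """Run every command through `run`, with at most `max_workers` running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the results in the same order as the commands
        return list(pool.map(run, commands))


# Stand-in runner so the example has no side effects; os.system would take its place
results = run_all(["cmd-1", "cmd-2", "cmd-3"], lambda cmd: f"ran {cmd}", max_workers=2)
print(results)
```

In the scripts above, the thread-per-file loop could be swapped for a call such as `run_all(cmds, execCmd, max_workers=4)`.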



    Original article; please notify the author before reposting.






    Author: 旋转的油纸伞
