代码收藏家 Technical Tutorials 2025-02-20

Scraping Bilibili Video Danmaku with Python

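I. Where the Danmaku Data Comes From

Bilibili serves danmaku (bullet comments) per cid. The cid identifies one part (page) of a video rather than the danmaku themselves: a single-part video has one cid, and a multi-part video has one cid per part. The danmaku for a cid are returned as an XML document, so once you know the cid you can fetch the corresponding danmaku through the API.

URL for fetching danmaku: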

https://api.bilibili.com/x/v1/dm/list.so?oid=<cid>
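
As a small illustration, the URL can be assembled with the standard library instead of by hand. This is only a sketch: the helper name `build_danmaku_url` is made up for this example, and the cid value is a placeholder rather than a real video.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL shown above.
DANMAKU_API = "https://api.bilibili.com/x/v1/dm/list.so"


def build_danmaku_url(cid):
    """Return the danmaku URL for a given cid (hypothetical helper)."""
    return f"{DANMAKU_API}?{urlencode({'oid': cid})}"


# 123456789 is a placeholder cid used only for illustration.
print(build_danmaku_url(123456789))
```

Using `urlencode` keeps the query string valid whether the cid arrives as an int or as a string.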

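II. Complete Code for Scraping Danmaku

The Python script below is built up in steps: it gets the cid from a video link, downloads the danmaku data, and parses it.

1. Getting the video's `cid`

The cid is the key to fetching danmaku. One way to obtain it is to download the video page's HTML and extract the cid field from it.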

import re

import requests


def get_cid(video_url):
    """
    Return the cid of a video (as a string), extracted from its page HTML.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    # A timeout keeps the script from hanging if the server does not respond
    response = requests.get(video_url, headers=headers, timeout=10)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch video page: {response.status_code}")

    # Extract the first "cid":<digits> occurrence with a regular expression
    match = re.search(r'"cid":(\d+)', response.text)
    if match:
        return match.group(1)
    else:
        raise Exception("Failed to find cid in video page")
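
To see what the regular expression does without touching the network, the extraction step can be tried on a hand-written snippet. The sketch below assumes the page embeds its state as JSON containing `"cid":<digits>`; the sample HTML and the helper name `extract_cid` are invented for illustration and are not real Bilibili output.

```python
import re

CID_PATTERN = re.compile(r'"cid":(\d+)')


def extract_cid(html):
    """Return the first cid found in the page source, or None."""
    match = CID_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up fragment resembling embedded page state; not real Bilibili output.
sample_html = (
    '<script>window.__STATE__={"bvid":"BV1xx","cid":123456789,'
    '"pages":[{"cid":123456789},{"cid":987654321}]}</script>'
)

print(extract_cid(sample_html))      # first match only
print(extract_cid("<html></html>"))  # no cid present
```

Because `search` stops at the first match, this approach returns only the first cid that appears in the page. For a multi-part video, `CID_PATTERN.findall(html)` would list every occurrence instead.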

2. Fetching the danmaku XML data

With the cid in hand, the video's danmaku can be fetched through Bilibili's danmaku API.

import requests


def fetch_danmaku(cid):
    """
    Fetch the raw danmaku XML (as bytes) for the given cid.
    """
    url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
    headers = {"User-Agent": "Mozilla/5.0"}
    # A timeout keeps the script from hanging if the server does not respond
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch danmaku: {response.status_code}")
    return response.content

3. Parsing the danmaku data

Bilibili returns danmaku as XML, with the content and attributes of each danmaku held in a <d> tag. Python's xml.etree.ElementTree module can parse this data.

import xml.etree.ElementTree as ET

def parse_danmaku(xml_content):
    """
    Parse danmaku XML into a list of dicts.
    """
    danmaku_list = []
    root = ET.fromstring(xml_content)

    # Each danmaku is stored in a <d> element
    for d in root.findall("d"):
        p = d.attrib.get("p", "")
        params = p.split(",")
        if len(params) < 7:
            continue  # skip entries whose p attribute is missing or incomplete
        danmaku_list.append({
            "time": float(params[0]),  # time the danmaku appears (seconds)
            "mode": int(params[1]),   # display mode
            "font_size": int(params[2]),  # font size
            "color": int(params[3]),  # color (decimal RGB)
            "timestamp": int(params[4]),  # time the danmaku was sent
            "pool": int(params[5]),  # danmaku pool
            "user_hash": params[6],  # sender hash
            "content": d.text or ""  # danmaku text (empty if the element has none)
        })
    return danmaku_list
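
The parsing step can also be tried offline on a small hand-written document. The sketch below is self-contained and mirrors the approach above; the XML sample is made up, and the names `FIELD_TYPES` and `parse_p` are assumptions of this example rather than part of the original script.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the same shape as the danmaku XML; not real data.
SAMPLE_XML = """<i>
    <d p="60.456,1,25,16777215,1687245372,0,abc123">first comment</d>
    <d p="75.0,4,25,65280,1687245380,0,xyz456">second comment</d>
    <d>no p attribute, skipped</d>
</i>"""

# Field names and the type each value is converted to, in order.
FIELD_TYPES = [
    ("time", float), ("mode", int), ("font_size", int), ("color", int),
    ("timestamp", int), ("pool", int), ("user_hash", str),
]


def parse_p(p):
    """Convert the comma-separated p attribute into a dict of typed fields."""
    values = p.split(",")
    # zip stops at the shorter sequence, so extra trailing fields are ignored.
    return {name: cast(value) for (name, cast), value in zip(FIELD_TYPES, values)}


root = ET.fromstring(SAMPLE_XML)
parsed = []
for d in root.findall("d"):
    p = d.attrib.get("p")
    if not p:
        continue  # skip entries without parameters
    entry = parse_p(p)
    entry["content"] = d.text or ""
    parsed.append(entry)

for entry in parsed:
    print(entry)
```

Driving the conversion from a table of names and types keeps the field order in one place, which makes it easier to adjust if the attribute turns out to carry more fields than the seven handled here.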

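4. Putting it together and running

The entry point below ties the three functions together. Place it after the definitions above in the same file: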

if __name__ == "__main__":
    video_url = input("Enter a Bilibili video link: ")
    cid = get_cid(video_url)
    print(f"Video cid: {cid}")

    xml_content = fetch_danmaku(cid)
    danmaku_list = parse_danmaku(xml_content)

    # Print a sample of the danmaku
    for danmaku in danmaku_list[:10]:  # first 10 danmaku
        print(danmaku)

When the script runs, enter a Bilibili video link such as https://www.bilibili.com/video/BV1Xx411C7mA, and it prints the content and attributes of the video's first 10 danmaku.

III. The Danmaku Data Format in Detail

Bilibili danmaku data is delivered as an XML document with the following basic structure:

<i>
    <d p="60.456,1,25,16777215,1687245372,0,user_hash">danmaku text</d>
    ...
</i>

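Parsing the `p` attribute of the `<d>` tag:

The p attribute holds several parameters of the danmaku, separated by commas. The fields are:

time: when the danmaku appears in the video, in seconds.
mode: display mode:
    1: scrolling danmaku.
    4: bottom danmaku.
    5: top danmaku.
font_size: font size of the danmaku.
color: danmaku color as a decimal RGB value (for example, 16777215 is 0xFFFFFF, white).
timestamp: time the danmaku was sent, as a Unix timestamp.
pool: danmaku pool; 0 is the normal pool, 1 is the subtitle pool.
user_hash: anonymized hash of the sender.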
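
The raw numbers become easier to read once they are decoded. The following is a minimal sketch under a few assumptions: the helper names (`color_to_hex`, `format_offset`, `describe`) are invented for this example, only the three modes listed above are named, and the timestamp is treated as Unix seconds.

```python
from datetime import datetime, timezone

# Mode values listed above; any other value is reported as "other".
MODE_NAMES = {1: "scrolling", 4: "bottom", 5: "top"}


def color_to_hex(color):
    """Convert a decimal RGB value such as 16777215 to '#FFFFFF'."""
    return f"#{color:06X}"


def format_offset(seconds):
    """Format a playback offset in seconds as MM:SS (fraction dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def describe(danmaku):
    """Return a human-readable view of one parsed danmaku dict."""
    # Assumption: the timestamp field is in Unix seconds.
    sent = datetime.fromtimestamp(danmaku["timestamp"], tz=timezone.utc)
    return {
        "at": format_offset(danmaku["time"]),
        "mode": MODE_NAMES.get(danmaku["mode"], f"other ({danmaku['mode']})"),
        "color": color_to_hex(danmaku["color"]),
        "sent_utc": sent.isoformat(),
    }


example = {"time": 60.456, "mode": 1, "color": 16777215, "timestamp": 1687245372}
print(describe(example))
```

The send time is converted with an explicit UTC timezone so the result does not depend on the machine's local settings.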

Example parse result

Suppose the danmaku being parsed is:

<d p="60.456,1,25,16777215,1687245372,0,abc123">Hahaha, so funny!</d>

The resulting Python data structure is:

{
    "time": 60.456,
    "mode": 1,
    "font_size": 25,
    "color": 16777215,
    "timestamp": 1687245372,
    "pool": 0,
    "user_hash": "abc123",
    "content": "Hahaha, so funny!"
}

IV. Sample Output

After you enter a Bilibili video link, the script prints the cid followed by the first 10 danmaku, for example:

Video cid: 123456789
{'time': 12.34, 'mode': 1, 'font_size': 25, 'color': 16777215, 'timestamp': 1687245372, 'pool': 0, 'user_hash': 'abc123', 'content': 'Hahaha, so funny!'}
{'time': 45.67, 'mode': 4, 'font_size': 25, 'color': 65280, 'timestamp': 1687245380, 'pool': 0, 'user_hash': 'xyz456', 'content': 'What is this?'}
...