# A Guide to Batch-Downloading Ximalaya Audiobook MP3s: If You Can Play It, You Can Scrape It

The web player lets you listen to an audiobook just fine, but the moment you try to download it insists you install the desktop client. I'd rather not be strong-armed into installing anything, so let's grab the audio ourselves.
I. Crawler Analysis
Pick an audiobook you like, hit play, open the browser's developer tools, click "next chapter", and capture the requests. Under the Media tab you'll find a request whose response is the MP3 audio itself; double-clicking it or opening the URL directly plays the MP3 normally. So if we can work out where that URL comes from, the analysis is more than half done. (The request carries a payload but uses GET; searching for each likely keyword such as `sign` turns up nothing relevant, which raises the suspicion that the URL itself is encrypted.)

Digging further, there is a request of the form `https://www.ximalaya.com/mobile-playpage/track/v3/baseInfo/*******` whose response contains a `playUrlList` with a `url` field, and this request fires on every "next chapter" click. That supports the encrypted-URL guess.

So the focus of the work shifts to reverse-engineering and reproducing this decryption algorithm.
II. JS Reverse Engineering
Do a global search for the keyword `decrypt` and set breakpoints at the suspicious hits (set several if needed and delete the false ones once one triggers; just skip anything inside CSS files, since those only set styles).

With the breakpoints in place, click "next chapter" and, sure enough, execution pauses.

Print `e` in the console and compare its length with the ciphertext the API returned; they match, so this is the encrypted URL from the interface. Step into the function and analyze the decryption algorithm.

It doesn't look like a standard algorithm, so fold the `getSoundCryptLink` function, copy it, and paste it into a local JS file (feel free to give the function a new name).

Pass in the ciphertext and print the result. It throws an error, of course: the copied code references things that don't exist locally. Whatever is missing, copy it over too. The missing pieces pulled from the site's bundle look like this:
```js
// Lookup tables and helpers copied verbatim from the site's bundle
const r = new Uint8Array([188, 174, 178, 234, 171, 147, 70, 82, 76, 72, 192, 132, 60, 17, 30, 127, 184, 233, 48, 105, 38, 232, 240, 21, 47, 252, 41, 229, 209, 213, 71, 40, 63, 152, 156, 88, 51, 141, 139, 145, 133, 2, 160, 191, 11, 100, 10, 78, 253, 151, 42, 166, 92, 22, 185, 140, 164, 91, 194, 175, 239, 217, 177, 75, 19, 225, 94, 107, 125, 138, 242, 31, 182, 150, 15, 24, 226, 29, 80, 116, 168, 118, 28, 1, 186, 220, 158, 79, 59, 244, 119, 9, 189, 161, 74, 130, 221, 56, 216, 241, 212, 26, 218, 170, 85, 165, 153, 69, 238, 93, 255, 142, 3, 159, 215, 67, 33, 249, 53, 176, 77, 254, 222, 25, 115, 101, 148, 16, 13, 237, 197, 5, 58, 157, 135, 248, 223, 61, 198, 211, 110, 44, 54, 111, 52, 227, 4, 46, 205, 7, 219, 136, 14, 87, 114, 64, 104, 50, 39, 203, 81, 196, 43, 163, 173, 109, 108, 187, 102, 195, 37, 235, 65, 190, 113, 149, 143, 8, 27, 155, 207, 134, 123, 224, 129, 245, 62, 66, 172, 122, 126, 12, 162, 214, 90, 247, 251, 124, 201, 236, 117, 183, 73, 95, 89, 246, 181, 179, 83, 228, 193, 99, 6, 45, 112, 32, 154, 128, 230, 131, 206, 243, 57, 84, 146, 0, 35, 96, 250, 137, 36, 208, 103, 34, 68, 204, 231, 144, 120, 98, 202, 49, 210, 23, 200, 18, 86, 55, 121, 20, 199, 97, 167, 180, 169, 106])
    , n = new Uint8Array([20, 234, 159, 167, 230, 233, 58, 255, 158, 36, 210, 254, 133, 166, 59, 63, 209, 177, 184, 155, 85, 235, 94, 1, 242, 87, 228, 232, 191, 3, 69, 178])
    , o = new Uint8Array([183, 174, 108, 16, 131, 159, 250, 5, 239, 110, 193, 202, 153, 137, 251, 176, 119, 150, 47, 204, 97, 237, 1, 71, 177, 42, 88, 218, 166, 82, 87, 94, 14, 195, 69, 127, 215, 240, 225, 197, 238, 142, 123, 44, 219, 50, 190, 29, 181, 186, 169, 98, 139, 185, 152, 13, 141, 76, 6, 157, 200, 132, 182, 49, 20, 116, 136, 43, 155, 194, 101, 231, 162, 242, 151, 213, 53, 60, 26, 134, 211, 56, 28, 223, 107, 161, 199, 15, 229, 61, 96, 41, 66, 158, 254, 21, 165, 253, 103, 89, 3, 168, 40, 246, 81, 95, 58, 31, 172, 78, 99, 45, 148, 187, 222, 124, 55, 203, 235, 64, 68, 149, 180, 35, 113, 207, 118, 111, 91, 38, 247, 214, 7, 212, 209, 189, 241, 18, 115, 173, 25, 236, 121, 249, 75, 57, 216, 10, 175, 112, 234, 164, 70, 206, 198, 255, 140, 230, 12, 32, 83, 46, 245, 0, 62, 227, 72, 191, 156, 138, 248, 114, 220, 90, 84, 170, 128, 19, 24, 122, 146, 80, 39, 37, 8, 34, 22, 11, 93, 130, 63, 154, 244, 160, 144, 79, 23, 133, 92, 54, 102, 210, 65, 67, 27, 196, 201, 106, 143, 52, 74, 100, 217, 179, 48, 233, 126, 117, 184, 226, 85, 171, 167, 86, 2, 147, 17, 135, 228, 252, 105, 30, 192, 129, 178, 120, 36, 145, 51, 163, 77, 205, 73, 4, 188, 125, 232, 33, 243, 109, 224, 104, 208, 221, 59, 9])
    , a = new Uint8Array([204, 53, 135, 197, 39, 73, 58, 160, 79, 24, 12, 83, 180, 250, 101, 60, 206, 30, 10, 227, 36, 95, 161, 16, 135, 150, 235, 116, 242, 116, 165, 171])
    , i = "function" == typeof atob
    , u = "function" == typeof e;

// Base64-decode after stripping any non-Base64 characters
const f = e => atob(e.replace(/[^A-Za-z0-9\+\/]/g, ""))

// XOR e[t..] in place with the key r
function p(e, t, r) {
    let n = Math.min(e.length - t, r.length);
    for (let o = 0; o < n; o++)
        e[o + t] = e[o + t] ^ r[o]
}
```
Wrap all of this up in a helper function of your own so it's easy to call.
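If you'd rather not run a JS engine at all, note that the core helper `p` above is just an in-place XOR against a key table, which ports to Python directly. A minimal sketch (the name `xor_into` is my own):

```python
def xor_into(buf: bytearray, offset: int, key: bytes) -> None:
    """XOR buf[offset:] in place with key, mirroring the JS helper p(e, t, r)."""
    n = min(len(buf) - offset, len(key))
    for i in range(n):
        buf[offset + i] ^= key[i]
```

Porting the rest of `getSoundCryptLink` is then a matter of replaying its calls to this helper with the `r`/`n`/`o`/`a` tables above.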
III. Python Crawler Program
1. Use the `book_id` to get the `trackId` and title of every available chapter.
```python
import json

import requests
from bs4 import BeautifulSoup


def start_html(book_id):
    """Fetch the album page and return [trackId, title] for every chapter."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "_xmLog=h5&ea89cae5-39a5-4c99-9312-25f16d5c2434&process.env.sdkVersion; wfp=ACMyMjIyZjVmYTNlZWQyMWI5qw-dVY_w4L54bXdlYl93d3c; xm-page-viewid=ximalaya-web; impl=www.ximalaya.com.login; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1720775899; HMACCOUNT=E700A2F1CF62483E; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1720776335",
        "Referer": "https://www.baidu.com/",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"'
    }
    url = f"https://www.ximalaya.com/album/{book_id}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The chapter list is embedded as "window.__INITIAL_STATE__ = {...};" in a script tag
    state = ''
    for tag in soup.find_all('script'):
        if 'window.__INITIAL_STATE__' in tag.text:
            state = tag.text.replace('window.__INITIAL_STATE__ = ', '')[:-1]  # drop trailing ';'
            break
    tracks = json.loads(state)['store']['AlbumDetailTrackListV2']['tracksInfo']['tracks']
    result = []
    for track in tracks:
        trackId = track['trackId']
        # strip promo text the uploader appends to titles
        title = track['title'].replace('(每日更新,欢迎订阅)', '').split('⭐️')[0]
        result.append([trackId, title])
    return result
```
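Instead of looping over every script tag and slicing off the trailing semicolon, the state blob can also be pulled out with a single regex. A sketch (the helper name `extract_initial_state` is mine):

```python
import json
import re


def extract_initial_state(html: str) -> dict:
    """Grab the JSON object assigned to window.__INITIAL_STATE__ in the page source."""
    m = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});?\s*</script>', html, re.S)
    if m is None:
        raise ValueError('window.__INITIAL_STATE__ not found')
    return json.loads(m.group(1))
```

This assumes the assignment is the last statement inside its script tag, which is how the page is currently rendered.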
2. Use the `trackId` to request the encrypted URL.
```python
def get_video_url(trackId):
    """Request baseInfo for a track and return the encrypted play URL."""
    headers = {
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Cookie": "_xmLog=h5&ea89cae5-39a5-4c99-9312-25f16d5c2434&process.env.sdkVersion; wfp=ACMyMjIyZjVmYTNlZWQyMWI5qw-dVY_w4L54bXdlYl93d3c; xm-page-viewid=ximalaya-web; impl=www.ximalaya.com.login; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1720775899; HMACCOUNT=E700A2F1CF62483E; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1720776147",
        "Referer": "https://www.ximalaya.com/album/52187120",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"'
    }
    # the trailing number in the path is taken from the captured request
    url = "https://www.ximalaya.com/mobile-playpage/track/v3/baseInfo/1720776292235"
    params = {
        "device": "www2",
        "trackId": trackId,
        "trackQualityLevel": "1"
    }
    response = requests.get(url, headers=headers, params=params)
    return response.json()['trackInfo']['playUrlList'][0]['url']
```
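The trailing `1720776292235` in the `baseInfo` path looks like a millisecond epoch timestamp (it is close to the `Hm_lpvt` timestamps in the captured cookies). Assuming that guess is right, the URL could be built dynamically instead of hard-coded; a sketch:

```python
import time


def base_info_url() -> str:
    # assumption: the final path segment is the current time in milliseconds
    return f"https://www.ximalaya.com/mobile-playpage/track/v3/baseInfo/{int(time.time() * 1000)}"
```

If the server ignores the segment entirely (the hard-coded value keeps working), this is purely cosmetic.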
3. Call the JS to decrypt it (the algorithm could also be rewritten in Python).
```python
import execjs  # pip install PyExecJS


def decrypt_url(encrypted_url):
    """Run the decryption function we extracted into a local JS file."""
    with open('url解密.js', 'r', encoding='utf-8') as f:
        js_code = f.read()
    ctx = execjs.compile(js_code)
    return ctx.call('decrypt', encrypted_url)
```
4. Download the MP3.
```python
import os


def download_video(url, title):
    os.makedirs('mp3', exist_ok=True)  # make sure the output folder exists
    response = requests.get(url, timeout=10)
    with open(f'mp3/{title}.mp3', 'wb') as f:
        f.write(response.content)
```
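One caveat: chapter titles scraped from the page can contain characters that are illegal in Windows filenames (`:`, `?`, `|`, ...), which would make the `open()` above fail. A small sanitizer sketch (the helper name `safe_filename` is mine):

```python
import re


def safe_filename(title: str) -> str:
    """Replace characters Windows forbids in filenames with underscores."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()
```

Call it on the title before building the `mp3/{title}.mp3` path.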
That wraps up this little case study. Thanks for reading; if you found it useful, a follow is appreciated, and more fun cases are on the way!
Author: 北愚