# A Guide to Batch-Downloading Ximalaya Audiobook MP3s: If You Can Play It, You Can Scrape It

The web player lets you listen to an audiobook just fine, but the moment you try to download it insists you install the desktop client. I'd rather not be strong-armed into installing anything, so let's grab the audio ourselves.
I. Crawler Analysis
Pick an audiobook you like, hit play, open the browser's developer tools, click "next chapter", and capture the requests. Under the Media tab you'll find a request whose response is the MP3 audio itself; double-clicking it or opening the URL directly plays the MP3 normally. So if we can work out where that URL comes from, the analysis is more than half done. (The request carries a payload but uses GET; searching for each likely keyword such as `sign` turns up nothing relevant, which raises the suspicion that the URL itself is encrypted.)

Digging further, there is a request of the form `https://www.ximalaya.com/mobile-playpage/track/v3/baseInfo/*******` whose response contains a `playUrlList` with a `url` field, and this request fires on every "next chapter" click. That supports the encrypted-URL guess.

So the focus of the work shifts to reverse-engineering and reproducing this decryption algorithm.
II. JS Reverse Engineering
Do a global search for the keyword `decrypt` and set breakpoints at the suspicious hits (set several if needed and delete the false ones once one triggers; just skip anything inside CSS files, since those only set styles).

With the breakpoints in place, click "next chapter" and, sure enough, execution pauses.

Print `e` in the console and compare its length with the ciphertext the API returned; they match, so this is the encrypted URL from the interface. Step into the function and analyze the decryption algorithm.

It doesn't look like a standard algorithm, so fold the `getSoundCryptLink` function, copy it, and paste it into a local JS file (feel free to give the function a new name).

Pass in the ciphertext and print the result. It throws an error, of course: the copied code references things that don't exist locally. Whatever is missing, copy it over too. The missing pieces pulled from the site's bundle look like this:
```js
// Lookup tables and helpers copied verbatim from the site's bundle
const r = new Uint8Array([188, 174, 178, 234, 171, 147, 70, 82, 76, 72, 192, 132, 60, 17, 30, 127, 184, 233, 48, 105, 38, 232, 240, 21, 47, 252, 41, 229, 209, 213, 71, 40, 63, 152, 156, 88, 51, 141, 139, 145, 133, 2, 160, 191, 11, 100, 10, 78, 253, 151, 42, 166, 92, 22, 185, 140, 164, 91, 194, 175, 239, 217, 177, 75, 19, 225, 94, 107, 125, 138, 242, 31, 182, 150, 15, 24, 226, 29, 80, 116, 168, 118, 28, 1, 186, 220, 158, 79, 59, 244, 119, 9, 189, 161, 74, 130, 221, 56, 216, 241, 212, 26, 218, 170, 85, 165, 153, 69, 238, 93, 255, 142, 3, 159, 215, 67, 33, 249, 53, 176, 77, 254, 222, 25, 115, 101, 148, 16, 13, 237, 197, 5, 58, 157, 135, 248, 223, 61, 198, 211, 110, 44, 54, 111, 52, 227, 4, 46, 205, 7, 219, 136, 14, 87, 114, 64, 104, 50, 39, 203, 81, 196, 43, 163, 173, 109, 108, 187, 102, 195, 37, 235, 65, 190, 113, 149, 143, 8, 27, 155, 207, 134, 123, 224, 129, 245, 62, 66, 172, 122, 126, 12, 162, 214, 90, 247, 251, 124, 201, 236, 117, 183, 73, 95, 89, 246, 181, 179, 83, 228, 193, 99, 6, 45, 112, 32, 154, 128, 230, 131, 206, 243, 57, 84, 146, 0, 35, 96, 250, 137, 36, 208, 103, 34, 68, 204, 231, 144, 120, 98, 202, 49, 210, 23, 200, 18, 86, 55, 121, 20, 199, 97, 167, 180, 169, 106])
    , n = new Uint8Array([20, 234, 159, 167, 230, 233, 58, 255, 158, 36, 210, 254, 133, 166, 59, 63, 209, 177, 184, 155, 85, 235, 94, 1, 242, 87, 228, 232, 191, 3, 69, 178])
    , o = new Uint8Array([183, 174, 108, 16, 131, 159, 250, 5, 239, 110, 193, 202, 153, 137, 251, 176, 119, 150, 47, 204, 97, 237, 1, 71, 177, 42, 88, 218, 166, 82, 87, 94, 14, 195, 69, 127, 215, 240, 225, 197, 238, 142, 123, 44, 219, 50, 190, 29, 181, 186, 169, 98, 139, 185, 152, 13, 141, 76, 6, 157, 200, 132, 182, 49, 20, 116, 136, 43, 155, 194, 101, 231, 162, 242, 151, 213, 53, 60, 26, 134, 211, 56, 28, 223, 107, 161, 199, 15, 229, 61, 96, 41, 66, 158, 254, 21, 165, 253, 103, 89, 3, 168, 40, 246, 81, 95, 58, 31, 172, 78, 99, 45, 148, 187, 222, 124, 55, 203, 235, 64, 68, 149, 180, 35, 113, 207, 118, 111, 91, 38, 247, 214, 7, 212, 209, 189, 241, 18, 115, 173, 25, 236, 121, 249, 75, 57, 216, 10, 175, 112, 234, 164, 70, 206, 198, 255, 140, 230, 12, 32, 83, 46, 245, 0, 62, 227, 72, 191, 156, 138, 248, 114, 220, 90, 84, 170, 128, 19, 24, 122, 146, 80, 39, 37, 8, 34, 22, 11, 93, 130, 63, 154, 244, 160, 144, 79, 23, 133, 92, 54, 102, 210, 65, 67, 27, 196, 201, 106, 143, 52, 74, 100, 217, 179, 48, 233, 126, 117, 184, 226, 85, 171, 167, 86, 2, 147, 17, 135, 228, 252, 105, 30, 192, 129, 178, 120, 36, 145, 51, 163, 77, 205, 73, 4, 188, 125, 232, 33, 243, 109, 224, 104, 208, 221, 59, 9])
    , a = new Uint8Array([204, 53, 135, 197, 39, 73, 58, 160, 79, 24, 12, 83, 180, 250, 101, 60, 206, 30, 10, 227, 36, 95, 161, 16, 135, 150, 235, 116, 242, 116, 165, 171])
    , i = "function" == typeof atob
    , u = "function" == typeof e;

// Base64-decode after stripping any non-Base64 characters
const f = e => atob(e.replace(/[^A-Za-z0-9\+\/]/g, ""))

// XOR e[t..] in place with the key r
function p(e, t, r) {
    let n = Math.min(e.length - t, r.length);
    for (let o = 0; o < n; o++)
        e[o + t] = e[o + t] ^ r[o]
}
```
Wrap all of this up in a helper function of your own so it's easy to call.
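If you'd rather not run a JS engine at all, note that the core helper `p` above is just an in-place XOR against a key table, which ports to Python directly. A minimal sketch (the name `xor_into` is my own):

```python
def xor_into(buf: bytearray, offset: int, key: bytes) -> None:
    """XOR buf[offset:] in place with key, mirroring the JS helper p(e, t, r)."""
    n = min(len(buf) - offset, len(key))
    for i in range(n):
        buf[offset + i] ^= key[i]
```

Porting the rest of `getSoundCryptLink` is then a matter of replaying its calls to this helper with the `r`/`n`/`o`/`a` tables above.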
III. Python Crawler Program
1. Use the `book_id` to get the `trackId` and title of every available chapter.
```python
import json

import requests
from bs4 import BeautifulSoup


def start_html(book_id):
    """Fetch the album page and return [trackId, title] for every chapter."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "_xmLog=h5&ea89cae5-39a5-4c99-9312-25f16d5c2434&process.env.sdkVersion; wfp=ACMyMjIyZjVmYTNlZWQyMWI5qw-dVY_w4L54bXdlYl93d3c; xm-page-viewid=ximalaya-web; impl=www.ximalaya.com.login; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1720775899; HMACCOUNT=E700A2F1CF62483E; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1720776335",
        "Referer": "https://www.baidu.com/",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"'
    }
    url = f"https://www.ximalaya.com/album/{book_id}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The chapter list is embedded as "window.__INITIAL_STATE__ = {...};" in a script tag
    state = ''
    for tag in soup.find_all('script'):
        if 'window.__INITIAL_STATE__' in tag.text:
            state = tag.text.replace('window.__INITIAL_STATE__ = ', '')[:-1]  # drop trailing ';'
            break
    tracks = json.loads(state)['store']['AlbumDetailTrackListV2']['tracksInfo']['tracks']
    result = []
    for track in tracks:
        trackId = track['trackId']
        # strip promo text the uploader appends to titles
        title = track['title'].replace('(每日更新,欢迎订阅)', '').split('⭐️')[0]
        result.append([trackId, title])
    return result
```
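Instead of looping over every script tag and slicing off the trailing semicolon, the state blob can also be pulled out with a single regex. A sketch (the helper name `extract_initial_state` is mine):

```python
import json
import re


def extract_initial_state(html: str) -> dict:
    """Grab the JSON object assigned to window.__INITIAL_STATE__ in the page source."""
    m = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});?\s*</script>', html, re.S)
    if m is None:
        raise ValueError('window.__INITIAL_STATE__ not found')
    return json.loads(m.group(1))
```

This assumes the assignment is the last statement inside its script tag, which is how the page is currently rendered.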
2. Use the `trackId` to request the encrypted URL.
```python
def get_video_url(trackId):
    """Request baseInfo for a track and return the encrypted play URL."""
    headers = {
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Cookie": "_xmLog=h5&ea89cae5-39a5-4c99-9312-25f16d5c2434&process.env.sdkVersion; wfp=ACMyMjIyZjVmYTNlZWQyMWI5qw-dVY_w4L54bXdlYl93d3c; xm-page-viewid=ximalaya-web; impl=www.ximalaya.com.login; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1720775899; HMACCOUNT=E700A2F1CF62483E; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1720776147",
        "Referer": "https://www.ximalaya.com/album/52187120",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"'
    }
    # the trailing number in the path is taken from the captured request
    url = "https://www.ximalaya.com/mobile-playpage/track/v3/baseInfo/1720776292235"
    params = {
        "device": "www2",
        "trackId": trackId,
        "trackQualityLevel": "1"
    }
    response = requests.get(url, headers=headers, params=params)
    return response.json()['trackInfo']['playUrlList'][0]['url']
```
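The trailing `1720776292235` in the `baseInfo` path looks like a millisecond epoch timestamp (it is close to the `Hm_lpvt` timestamps in the captured cookies). Assuming that guess is right, the URL could be built dynamically instead of hard-coded; a sketch:

```python
import time


def base_info_url() -> str:
    # assumption: the final path segment is the current time in milliseconds
    return f"https://www.ximalaya.com/mobile-playpage/track/v3/baseInfo/{int(time.time() * 1000)}"
```

If the server ignores the segment entirely (the hard-coded value keeps working), this is purely cosmetic.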
3. Call the JS to decrypt it (the algorithm could also be rewritten in Python).
```python
import execjs  # pip install PyExecJS


def decrypt_url(encrypted_url):
    """Run the decryption function we extracted into a local JS file."""
    with open('url解密.js', 'r', encoding='utf-8') as f:
        js_code = f.read()
    ctx = execjs.compile(js_code)
    return ctx.call('decrypt', encrypted_url)
```
4. Download the MP3.
```python
import os


def download_video(url, title):
    os.makedirs('mp3', exist_ok=True)  # make sure the output folder exists
    response = requests.get(url, timeout=10)
    with open(f'mp3/{title}.mp3', 'wb') as f:
        f.write(response.content)
```
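One caveat: chapter titles scraped from the page can contain characters that are illegal in Windows filenames (`:`, `?`, `|`, ...), which would make the `open()` above fail. A small sanitizer sketch (the helper name `safe_filename` is mine):

```python
import re


def safe_filename(title: str) -> str:
    """Replace characters Windows forbids in filenames with underscores."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()
```

Call it on the title before building the `mp3/{title}.mp3` path.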
That wraps up this little case study. Thanks for reading; if you found it useful, a follow is appreciated, and more fun cases are on the way!
Author: 北愚