Continuing the analysis, I noticed that the image URL of every PDF page follows a pattern! Isn't that a nice workaround? Good on XJTU~ The guide below is split into two parts, one for each type of library resource.
Let's start with the simple one.
Downloading resources from the 爱教材 platform (foreign-language textbooks)
Pick any book, say Environmental Science, and click Online to read it. You'll land on a PDF viewer; press F12 to open the browser's developer tools.
Select Network -> XHR.
Then refresh the page.
You'll see a request whose name ends in the .cdf extension. Right-click it and open it in a new tab; a download prompt appears. After downloading, rename the file to Environmental Science.pdf and it opens just fine! Good on XJTU for not encrypting it.
Now on to the second type of book resource, which is the focus of this post.
Downloading course textbook resources
Let's take the textbook 数字逻辑与数字系统 (Digital Logic and Digital Systems) as an example. Click 本地全文 (local full text) to enter the online reader, press F12 as before to open the developer tools, select Network -> Img, and open the table-of-contents page. As the arrows in the screenshots show, the TOC occupies 3 pages. Now look at the developer console: the pattern is obvious. (What is jid? No idea, but it doesn't matter for what follows.)
jid of the first TOC page: /!00001.jpg; of the second: /!00002.jpg
The content pages follow the same scheme; the body occupies 276 pages.
jid of the first content page: /000001.jpg; of the last: /000276.jpg
The pattern couldn't be clearer: scrape every image, merge them, and we're done.
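The naming scheme above fits in a couple of lines (a sketch; the zero-padding widths are taken from the URLs observed in the developer tools):

```python
def toc_jid(i):
    # TOC pages: /!00001.jpg, /!00002.jpg, ... ('!' prefix, 5 digits)
    return "/!" + str(i).zfill(5) + ".jpg"

def content_jid(i):
    # Content pages: /000001.jpg ... /000276.jpg (6 digits)
    return "/" + str(i).zfill(6) + ".jpg"

print(toc_jid(1))        # /!00001.jpg
print(content_jid(276))  # /000276.jpg
```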
Things aren't quite that simple, though. A closer look shows these resources have some anti-scraping measures: the server checks the Referer header and requires a valid cookie.
You can verify this yourself: open one of the images in a new tab, then refresh. The image no longer loads, because after the refresh the Referer is empty; a correct Referer is therefore mandatory.
So what do we do?
Luckily the developer tools save us a lot of work: right-click the request and choose Copy as cURL (bash), then paste the result into any text editor for reference while writing the code.
Code: Select all
curl 'https://webvpn.xjtu.edu.cn/http-98/77726476706e69737468656265737421a2a713d276613f1e2c5cc7fdcd00/png/png.dll?did=a174&pid=8E834940300BF19F5B9F7E2475B56098D25D0668CE2D1262D67432162954342C0857E97A51CE1F1C63CEB0845CAD1D652DA00B035284115135EABBEB78B4769A34958DF3A607EDAE5219886934CA87884BA474DA4835458DCF87DCDF0EDB5D045095BD9B3607F0B8ED2E8CB53AF0E907363B&jid=/000276.jpg&zoom=0' \
-H 'Connection: keep-alive' \
-H 'sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36' \
-H 'Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Sec-Fetch-Mode: no-cors' \
-H 'Sec-Fetch-Dest: image' \
-H 'Referer: https://webvpn.xjtu.edu.cn/http-9088/77726476706e69737468656265737421a2a713d276613f1e2c5cc7fdcd00/jpath/reader/reader.shtml?channel=100&code=2a7e157e1526d60cc9999bcbdb450b7c&cpage=1&epage=276&ipinside=0&netuser=0&spage=1&ssno=13878943' \
-H 'Accept-Language: en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-HK;q=0.6,en-HK;q=0.5,zh-TW;q=0.4' \
-H 'Cookie: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' \
--compressed
What exactly is in the payload?
Code: Select all
https://webvpn.xjtu.edu.cn/http-98/77726476706e69737468656265737421a2a713d276613f1e2c5cc7fdcd00/png/png.dll?did=a174&pid=8E834940300BF19F5B9F7E2475B56098D25D0668CE2D1262D67432162954342C0857E97A51CE1F1C63CEB0845CAD1D652DA00B035284115135EABBEB78B4769A34958DF3A607EDAE5219886934CA87884BA474DA4835458DCF87DCDF0EDB5D045095BD9B3607F0B8ED2E8CB53AF0E907363B&jid=/000276.jpg&zoom=0
Broken out into its fields:
Code: Select all
"did": a174
"pid": 8E834940300BF19F5B9F7E2475B56098D25D0668CE2D1262D67432162954342C0857E97A51CE1F1C63CEB0845CAD1D652DA00B035284115135EABBEB78B4769A34958DF3A607EDAE5219886934CA87884BA474DA4835458DCF87DCDF0EDB5D045095BD9B3607F0B8ED2E8CB53AF0E907363B
"jid": /000276.jpg
"zoom": 0
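These fields can be pulled out of any captured image URL with the standard library (a sketch; the long pid and webvpn path segment are shortened to placeholders here):

```python
from urllib.parse import urlparse, parse_qs

# A captured image URL, with the pid and path segment shortened to placeholders
url = ("https://webvpn.xjtu.edu.cn/http-98/xxxx/png/png.dll"
       "?did=a174&pid=8E83xxxx363B&jid=/000276.jpg&zoom=0")

# parse_qs returns lists of values; each field appears once, so take item 0
params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
print(params["did"])  # a174
print(params["jid"])  # /000276.jpg
```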
Scrape the cover, TOC, and content pages, then merge them into a PDF; the pattern is all we need. On to the code. I wrote two versions: one uses requests + multiprocessing (multiple processes), the other uses aiohttp + asyncio coroutines.
Spoiler on results and efficiency: the latter wins by a wide margin.
Python 3 is required.
Third-party libraries used; install them with pip on Windows, pip3 on macOS/Linux:
Code: Select all
requests
img2pdf
aiohttp
pdf.py
Code: Select all
import img2pdf
import os
# from pikepdf import _cpphelpers

def pdf_creator(dirname, bookname):
    with open(dirname + bookname + ".pdf", "wb") as f:
        imgs = []
        # Sort by length first, then lexically, so the cover ("1.jpg"),
        # TOC ("01.jpg", ...) and content ("001.jpg", ...) come out in order
        for fname in sorted(os.listdir(dirname), key=lambda n: (len(n), n)):
            if not fname.endswith(".jpg"):
                continue
            path = os.path.join(dirname, fname)
            if os.path.isdir(path):
                continue
            imgs.append(path)
        try:
            f.write(img2pdf.convert(imgs))
        except Exception:
            print("Merge failed. Check that the images in the book folder downloaded correctly,\n or try re-downloading.")
clear.py
Code: Select all
import os

def clear_file(dirname):
    for fname in sorted(os.listdir(dirname)):
        if fname.endswith(".jpg"):
            path = os.path.join(dirname, fname)
            os.remove(path)
First version: requests + multiprocessing
main.py Code: Select all
from pdf import pdf_creator
from downloader import start_download
import os
import json
import clear
import time

def main():
    start_time = time.time()
    print("《" + bookname + "》 download started!")
    start_download(url, headers, pid, toc_num, content_num, dirname)
    print("《" + bookname + "》 image resources downloaded!")
    print("Merging the images into a PDF...")
    pdf_creator(dirname, bookname)
    print("Done. Find the document in *** " + dirname + " ***")
    print("Total time: %.5f seconds" % (time.time() - start_time))
    if need_clear:
        clear.clear_file(dirname)
        print("Image files removed")
    # os.system("pause")

if __name__ == '__main__':
    # Read the configuration from config.json
    with open("config.json", "r", encoding='utf-8') as f:
        data = json.load(f)
    bookname = data["bookname"]
    dirname = data["dirname"]
    toc_num = data["toc_num"]
    content_num = data["content_num"]
    pid = data["pid"]
    url = data["url"]
    referer = data["referer"]
    cookie = data["cookie"]
    need_clear = data["clear"]
    # Directory where books are saved
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    # Create the per-book resource directory
    dirname = dirname + bookname + "\\"
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    headers = {
        "Connection": r'keep-alive',
        "sec-ch-ua": r'"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
        "sec-ch-ua-mobile": '?0',
        "User-Agent": r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
        "Accept": r'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
        "Sec-Fetch-Site": r'same-origin',
        "Sec-Fetch-Mode": r'no-cors',
        "Sec-Fetch-Dest": r'image',
        "Referer": referer,
        "Accept-Language": r'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-HK;q=0.6,en-HK;q=0.5,zh-TW;q=0.4',
        "Cookie": cookie
    }
    main()
downloader.py
Code: Select all
import requests
import multiprocessing

# sort_code gives the files structured names so the later PDF merge sorts correctly
def download(url, headers, payload, i, dirname, sort_code):
    with open(dirname + sort_code + str(i) + ".jpg", "wb") as f:
        r = requests.get(url, headers=headers, params=payload)
        f.write(r.content)

def start_download(url, headers, pid, toc_num, content_num, dirname):
    p = multiprocessing.Pool(multiprocessing.cpu_count())
    # Download the cover
    payload = {'pid': pid, 'zoom': '0', 'jid': '/bok001.jpg', 'did': 'a174'}
    with open(dirname + "1.jpg", "wb") as f:
        r = requests.get(url, headers=headers, params=payload)
        f.write(r.content)
    # Download the table of contents (jid looks like /!00001.jpg)
    for i in range(1, toc_num + 1):
        payload = {'pid': pid, 'zoom': '0',
                   'jid': '/!' + str(i).zfill(5) + '.jpg', 'did': 'a174'}
        p.apply_async(download, args=(url, headers, payload, i, dirname, '0'))
    p.close()
    p.join()
    # Download the content pages (jid looks like /000001.jpg;
    # zfill(6) covers page numbers of any width up to six digits)
    p = multiprocessing.Pool(multiprocessing.cpu_count())
    for i in range(1, content_num + 1):
        payload = {'pid': pid, 'zoom': '0',
                   'jid': '/' + str(i).zfill(6) + '.jpg', 'did': 'a174'}
        p.apply_async(download, args=(url, headers, payload, i, dirname, '00'))
    p.close()
    p.join()
Second version: aiohttp + asyncio
main.py Code: Select all
from pdf import pdf_creator
from aioscraper import start_download
import os
import json
import clear
import time

def main():
    start_time = time.time()
    print("《" + bookname + "》 download started!")
    start_download(url, headers, pid, toc_num, content_num, dirname)
    print("《" + bookname + "》 image resources downloaded!")
    print("Merging the images into a PDF...")
    pdf_creator(dirname, bookname)
    print("Done. Find the document in *** " + dirname + " ***")
    print("Total time: %.5f seconds" % (time.time() - start_time))
    if need_clear:
        clear.clear_file(dirname)
        print("Image files removed")
    # os.system("pause")

if __name__ == '__main__':
    # Read the configuration from config.json
    with open("config.json", "r", encoding='utf-8') as f:
        data = json.load(f)
    bookname = data["bookname"]
    dirname = data["dirname"]
    toc_num = data["toc_num"]
    content_num = data["content_num"]
    pid = data["pid"]
    url = data["url"]
    referer = data["referer"]
    cookie = data["cookie"]
    need_clear = data["clear"]
    # Directory where books are saved
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    # Create the per-book resource directory
    dirname = dirname + bookname + "\\"
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    headers = {
        "Connection": r'keep-alive',
        "sec-ch-ua": r'"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
        "sec-ch-ua-mobile": '?0',
        "User-Agent": r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
        "Accept": r'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
        "Sec-Fetch-Site": r'same-origin',
        "Sec-Fetch-Mode": r'no-cors',
        "Sec-Fetch-Dest": r'image',
        "Referer": referer,
        "Accept-Language": r'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-HK;q=0.6,en-HK;q=0.5,zh-TW;q=0.4',
        "Cookie": cookie
    }
    main()
aioscraper.py
Code: Select all
import requests
import asyncio
import aiohttp

async def download(url, headers, payload, i, dirname, sort_code):
    with open(dirname + sort_code + str(i) + ".jpg", "wb") as f:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers=headers, params=payload) as resp:
                f.write(await resp.read())

def start_download(url, headers, pid, toc_num, content_num, dirname):
    # Windows-specific event-loop policy; delete this line on macOS/Linux
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    loop = asyncio.get_event_loop()
    tasks = []
    # Download the cover synchronously
    payload = {'pid': pid, 'zoom': '0', 'jid': '/bok001.jpg', 'did': 'a174'}
    with open(dirname + "1.jpg", "wb") as f:
        r = requests.get(url, headers=headers, params=payload)
        f.write(r.content)
    # Queue the table-of-contents pages (jid looks like /!00001.jpg)
    for i in range(1, toc_num + 1):
        payload = {'pid': pid, 'zoom': '0',
                   'jid': '/!' + str(i).zfill(5) + '.jpg', 'did': 'a174'}
        tasks.append(download(url, headers, payload, i, dirname, '0'))
    # Queue the content pages (jid looks like /000001.jpg)
    for i in range(1, content_num + 1):
        payload = {'pid': pid, 'zoom': '0',
                   'jid': '/' + str(i).zfill(6) + '.jpg', 'did': 'a174'}
        tasks.append(download(url, headers, payload, i, dirname, '00'))
    # gather accepts bare coroutines; asyncio.wait no longer does on newer Python
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()
The code was written on Windows. Windows and macOS/Linux use different path separators: the former uses \ and the latter /.
Since "\" also serves as the escape character, paths in the code are written with a doubled backslash, "\\". To run the code on macOS/Linux, first change the dirname entry in config.json to a Unix-style path, for example:
Code: Select all
/home/bobmaster/Downloads/ebooks/
Then, in main.py, change the line that builds the per-book directory to use a forward slash:
Code: Select all
dirname = dirname + bookname + "/"
Finally, delete this Windows-only line from aioscraper.py:
Code: Select all
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
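The separator issue can also be sidestepped entirely by building paths with pathlib instead of string concatenation; a sketch, not part of the original scripts (the book title and directories are just examples):

```python
from pathlib import PurePosixPath, PureWindowsPath

# pathlib picks the right separator for the running OS at join time;
# the Pure* variants are used here only to demonstrate both styles.
posix_dir = PurePosixPath("/home/bobmaster/Downloads/ebooks") / "工程力学"
win_dir = PureWindowsPath(r"D:\Downloads\e-books") / "工程力学"

print(posix_dir)  # /home/bobmaster/Downloads/ebooks/工程力学
print(win_dir)    # D:\Downloads\e-books\工程力学
```

With `pathlib.Path` (rather than the Pure* demonstration classes), the same `/` operator works unchanged on both platforms, so no per-OS edit is needed.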
Usage notes
macOS/Linux users should first install the third-party libraries (requests, img2pdf, aiohttp). Windows users who prefer not to run from source can skip installing them and use the Windows binary executable instead. With the dependencies in place, you still need to fill in the config.json configuration file: it records the book title, page counts, referer, cookie, and other information the scraper needs to run.
An example template:
Code: Select all
{
"bookname": "工程力学",
"dirname": "D:\\Downloads\\e-books\\",
"toc_num": 5,
"content_num": 434,
"pid": "628F37F90E06067DEDE2B1187A4CB019746FCAD9343447B12D58458287A1DE03DF5784DDAF55127EF4ECF5A05BE2BD06410BBBA31EAB7C8D0A279BE4E851CD3D9F7D08AD547CC778CD8C5B7D8AE074A7776F5F095B4CABD3FE17D22A63350132958EC56E66CE8B680B9B2FD7883962D0996B",
"url": "https://webvpn.xjtu.edu.cn/http-98/77726476706e69737468656265737421a2a713d276613f1e2c5cc7fdcd00/png/png.dll",
"referer": "https://webvpn.xjtu.edu.cn/http-9088/77726476706e69737468656265737421a2a713d276613f1e2c5cc7fdcd00/jpath/reader/reader.shtml?channel=100&code=3bda91b19cfbfdd67f1c04cf532a03a9&cpage=1&epage=434&ipinside=0&netuser=0&spage=1&ssno=12186194",
"cookie": "show_vpn=1; wengine_vpn_ticketwebvpn_xjtu_edu_cn=xxxxxxxxxxxxxxx; refresh=1",
"clear": 0
}
Parameter | Description |
---|---|
bookname | The book's title (anything you like) |
dirname | Where to save the book; just a path. Remember the Windows vs macOS/Linux difference |
toc_num | Number of pages the table of contents occupies |
content_num | Number of pages the body text occupies |
pid | Get it from the browser developer tools, or extract it yourself from an image URL |
url | The URL with the payload stripped, i.e. everything before the ? (after a few days of observation, this doesn't seem to change) |
referer | From the browser developer tools, or from the Copy as cURL (bash) output |
cookie | From the Copy as cURL (bash) output; it seems to stay valid for roughly 12~24 h, after which it must be refreshed |
clear | Whether to delete the images after the PDF is merged, to save space: 0 for no, 1 for yes |
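Since a missing key makes main.py crash with a bare KeyError, it may be worth sanity-checking the file first. A minimal sketch; `check_config` and `REQUIRED_KEYS` are hypothetical helpers, not part of the original scripts:

```python
import json

# Every key that main.py reads from config.json
REQUIRED_KEYS = ("bookname", "dirname", "toc_num", "content_num",
                 "pid", "url", "referer", "cookie", "clear")

def check_config(path="config.json"):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise KeyError("config.json is missing keys: " + ", ".join(missing))
    return data
```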
If you run from source, make sure config.json is in the source directory. Once it's filled in, launch the program from a terminal:
Code: Select all
python3 main.py
Questions and notes (FAQ)
1. Do not use this program for illegal purposes. It is provided for learning and exchange only, and I accept no responsibility for any legal issues arising from its use.
2. Why are the image sizes in my PDF wrong?
A: This happens only occasionally. I didn't find a page-size parameter when using img2pdf as a module, so please merge manually in that case.
I don't know how to use wildcards on Windows (consider installing a Linux subsystem). On macOS/Linux, open a terminal, cd into the book's image directory, and the following command does the job:
Code: Select all
img2pdf --output bookname.pdf -S A4 --auto-orient *.jpg
3. Which books are supported? Any book carrying the 本地全文 (local full text) badge can be downloaded. Books under 电子全文 (electronic full text) come from the Chaoxing (超星) library and use a dynamically generated payload; I can't yet analyze that JavaScript logic, so those books are not supported for now.