Document Download Tutorial
A downloader for several kinds of documents
https://github.com/rty813/doc_downloader
The simple method
Download docDownloader.zip (https://github.com/rty813/doc_downloader/releases/) and unzip it.
Run docDownloader.exe.
Enter the document's URL to start the download. The downloaded document is saved in the output subfolder.
The advanced method
Download all the files of doc_downloader-master (for example with the GitZip for github Chrome extension) and unzip them.
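Install Python or Anaconda. Taking Anaconda as the example: open the Start menu, find Anaconda3 (64-bit), and run Anaconda Powershell Prompt (anaconda3) as administrator to open a terminal. Then change to the unzipped folder. Here the archive was downloaded and unzipped to D:\Download\doc_downloader-master, so enter in the terminal:
D: (press Enter)
cd D:\Download\doc_downloader-master\doc_downloader-master (press Enter)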
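In the terminal, enter pip install -r requirements.txt (press Enter) to install the required packages. Note: if the program reports an error when used, first check whether the chromedriver version is compatible with the installed Chrome version. If it is not, replace chromedriver.exe in the folder with a compatible version. See the [chromedriver download page](https://chromedriver.chromium.org/downloads).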
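The compatibility check described above can be partly automated. The following is a minimal sketch, not part of doc_downloader: it assumes that "compatible" can be approximated by equal major version numbers, that chromedriver.exe answers `--version` with a string such as `ChromeDriver 114.0.5735.90 (...)`, and that the Chrome version is copied by hand from chrome://version. The function names are hypothetical.

```python
import re
import subprocess
from typing import Optional


def major_version(version_text: str) -> Optional[int]:
    """Return the major number of the first dotted version found in the text."""
    match = re.search(r"(\d+)\.\d+\.\d+(?:\.\d+)?", version_text)
    return int(match.group(1)) if match else None


def versions_compatible(chrome_text: str, driver_text: str) -> bool:
    """Heuristic (assumption): treat equal major versions as compatible."""
    chrome_major = major_version(chrome_text)
    driver_major = major_version(driver_text)
    return chrome_major is not None and chrome_major == driver_major


def chromedriver_version(driver_path: str = "chromedriver.exe") -> str:
    """Ask the driver binary for its version string via `--version`."""
    result = subprocess.run(
        [driver_path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

For example, from inside the doc_downloader-master folder, `versions_compatible("114.0.5735.199", chromedriver_version())` reports whether the bundled driver shares Chrome's major version. A mismatch suggests replacing chromedriver.exe as described above.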
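In the terminal, enter python docDownloader.py (press Enter), then enter the document's URL to start the download. The downloaded document is saved in the output subfolder.
The PDFs downloaded with the methods above store each page as an image. To make the text selectable and copyable, run OCR (optical character recognition) on the PDF.
Installing OCRmyPDF on Windows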
https://ocrmypdf.readthedocs.io/en/latest/installation.html#native-windows
You must install the following for Windows:
Python 3.8 (64-bit) or later
Tesseract 4.1.1 or later
Ghostscript 9.50 or later
Using the Chocolatey (https://chocolatey.org/) package manager, install the following when running in an Administrator command prompt:
choco install python3
choco install --pre tesseract
choco install ghostscript
choco install pngquant (optional)
The commands above will install Python 3.x (latest version), Tesseract, Ghostscript and pngquant. Chocolatey may also need to install the Windows Visual C++ Runtime DLLs or other Windows patches, and may require a reboot.
You may then use pip to install ocrmypdf. (This can be performed by a user or Administrator.):
pip install ocrmypdf
Chocolatey automatically selects appropriate versions of these applications. If you are installing them manually, please install 64-bit versions of all applications for 64-bit Windows, or 32-bit versions of all applications for 32-bit Windows. Mixing the “bitness” of these programs will lead to errors.
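The Python side of these requirements (version 3.8 or later, and a bitness that matches the other tools) can be checked from the interpreter itself. This is a small sketch using only the standard library; the helper names are hypothetical, and it says nothing about the bitness of Tesseract or Ghostscript.

```python
import struct
import sys


def python_bitness() -> int:
    """Pointer size in bits: 64 for a 64-bit interpreter, 32 for a 32-bit one."""
    return struct.calcsize("P") * 8


def meets_python_requirement(version=None, minimum=(3, 8)) -> bool:
    """True if the interpreter version is at least the required minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


print(f"Python {sys.version_info.major}.{sys.version_info.minor}, {python_bitness()}-bit")
```

If the printed bitness is 32 on 64-bit Windows, the manually installed tools would need to be 32-bit as well, or Python should be reinstalled as 64-bit.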
OCRmyPDF will check the Windows Registry and standard locations in your Program Files for third party software it needs (specifically, Tesseract and Ghostscript). To override the versions OCRmyPDF selects, you can modify the PATH environment variable. Follow these directions to change the PATH.
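To see which copies of these tools the PATH currently points at, a short lookup with the standard library can help. This is only a sketch: the executable names are assumptions (`gswin64c` and `gswin32c` are the usual Ghostscript console binaries on Windows), and a tool missing from PATH may still be found by OCRmyPDF through the Registry or Program Files.

```python
import shutil
from typing import Dict, Iterable, Optional

# Executable names are assumptions: Tesseract usually installs `tesseract`,
# Ghostscript on Windows installs `gswin64c` (64-bit) or `gswin32c` (32-bit).
DEFAULT_TOOLS = ("tesseract", "gswin64c", "gswin32c", "pngquant")


def find_on_path(tools: Iterable[str] = DEFAULT_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to the full path found on PATH, or None if absent."""
    return {name: shutil.which(name) for name in tools}


for name, location in find_on_path().items():
    print(f"{name}: {location or 'not found on PATH'}")
```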
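Open the Anaconda terminal and enter
cd D:\Download\docDownloader\docDownloader\output (press Enter)
Name the PDF to be OCRed pic.pdf and the output file text.pdf. For a Chinese-language document, enter
ocrmypdf --force-ocr -l chi_sim pic.pdf text.pdf
to start OCR. The output text.pdf is written to the same folder.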
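When the output folder holds several downloaded PDFs, the same command can be repeated from a script. The sketch below is one possible approach, not part of doc_downloader or OCRmyPDF: it only wraps the command line shown above, and the `_text.pdf` naming rule and the function names are hypothetical choices made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ocr_command(src: Path, dst: Path, language: str = "chi_sim") -> List[str]:
    """Build the same ocrmypdf command line used above, as an argument list."""
    return ["ocrmypdf", "--force-ocr", "-l", language, str(src), str(dst)]


def output_name(src: Path) -> Path:
    """Hypothetical naming rule: report.pdf -> report_text.pdf, same folder."""
    return src.with_name(src.stem + "_text.pdf")


def ocr_folder(folder: Path, language: str = "chi_sim") -> List[Path]:
    """Run OCR on every PDF in the folder that is not already an OCR output."""
    written = []
    for src in sorted(folder.glob("*.pdf")):
        if src.stem.endswith("_text"):
            continue  # skip files produced by an earlier run
        dst = output_name(src)
        subprocess.run(build_ocr_command(src, dst, language), check=True)
        written.append(dst)
    return written
```

Calling `ocr_folder(Path(r"D:\Download\docDownloader\docDownloader\output"))` would then process each PDF in turn. Because `check=True` is set, the loop stops at the first file for which ocrmypdf exits with an error.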