
28. Scrapy Framework: Crawling JS-Generated Dynamic Pages

2020-07-04 09:58 Author: 自學Python的小姐姐呀

Problem

On some pages, large parts of the content are generated by JS. This is a serious problem for a Scrapy crawler: Scrapy has no JS engine, so it only fetches the static page, and anything that JS generates dynamically never reaches the spider.
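To make the problem concrete, here is a minimal sketch using only the Python standard library. The HTML is a made-up page (not taken from any real site): its list is filled in by a script, so a parser that does not execute JS sees an empty container. The names `STATIC_HTML`, `TagCounter`, and `count_tags` are illustrative assumptions.

```python
from html.parser import HTMLParser

# Hypothetical page: the <ul> is empty in the HTML the server sends;
# the <li> items would only exist after a browser runs the script.
STATIC_HTML = """
<html><body>
  <ul id="movies"></ul>
  <script>
    var list = document.getElementById("movies");
    list.innerHTML = "<li>Movie A</li><li>Movie B</li>";
  </script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts start tags the way a static (non-JS) parser sees them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_tags(STATIC_HTML)
print(counts.get("ul", 0))  # the empty list container is present
print(counts.get("li", 0))  # no items: the script was never executed
```

A plain Scrapy response is in the same position as this parser: it receives the script as text, not the elements the script would create.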

Splash documentation: http://splash.readthedocs.io/en/stable/

Solution

Use third-party middleware that provides a JS rendering service, such as scrapy-splash
Use WebKit or a WebKit-based library

Splash is a JavaScript rendering service. It is a lightweight browser with an HTTP API, implemented in Python on top of Twisted and QT. The Twisted QT reactor makes the service asynchronous, so it can take advantage of WebKit's concurrency.
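Because Splash exposes an HTTP API, it can be called without Scrapy at all. The sketch below builds a request to Splash's `render.html` endpoint with its `url` and `wait` parameters. The helper names `build_render_url` and `fetch_rendered_html`, and the `localhost:8050` address, are assumptions for illustration. The fetch function only works once a Splash instance is running.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_base, page_url, wait=0.5):
    """Build a GET URL for Splash's render.html endpoint."""
    query = urlencode({"url": page_url, "wait": wait})
    return splash_base.rstrip("/") + "/render.html?" + query


def fetch_rendered_html(splash_base, page_url, wait=0.5, timeout=30):
    """Ask Splash to load page_url, run its JS, and return the resulting HTML."""
    target = build_render_url(splash_base, page_url, wait)
    with urlopen(target, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Assumed address: a Splash container published on localhost:8050.
print(build_render_url("http://localhost:8050/", "https://example.com/"))
```

scrapy-splash wraps this kind of call so that a spider can keep yielding ordinary-looking requests.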

Installation

Install the scrapy-splash library with pip:

pip install scrapy-splash

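scrapy-splash talks to the Splash HTTP API, so it needs a running Splash instance. Splash is usually run in Docker, so Docker has to be installed first.
Install Docker and start it once the installation finishes.
Pull the image: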

docker pull scrapinghub/splash

Run scrapinghub/splash with Docker:

docker run -p 8050:8050 scrapinghub/splash
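Before pointing Scrapy at the container, it can help to confirm that the published port accepts connections. This is a minimal sketch using a plain TCP connect. The function name `splash_port_open` and the `localhost` host are assumptions, and an open port only shows that something is listening, not that Splash is healthy.

```python
import socket


def splash_port_open(host="localhost", port=8050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed host; substitute the address of the machine running Docker.
print(splash_port_open("localhost", 8050))
```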

Configure the Splash service (all of the following goes in settings.py):

To render pages with Splash, set the address of the Splash server:

SPLASH_URL = 'http://192.168.99.100:8050/'

Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Enable SplashDeduplicateArgsMiddleware:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

This middleware is needed to support the cache_args feature. It saves disk space by not storing duplicate Splash arguments multiple times in the on-disk request queue. With Splash 2.1+, it also saves network traffic by not sending those duplicate arguments to the Splash server more than once.

Set the duplicate filter class used for the request queue:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Set the Splash-aware storage class for Scrapy's HTTP cache:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
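The SPLASH_URL setting has to match wherever the container is actually reachable. The address 192.168.99.100 looks like a Docker VM address, while a native Docker install usually publishes the port on localhost. One possible way to avoid hard-coding it in settings.py is sketched below. The environment variable name `SPLASH_URL`, the helper `resolve_splash_url`, and the localhost fallback are all assumptions, not part of scrapy-splash.

```python
import os


def resolve_splash_url(environ=os.environ, default="http://localhost:8050/"):
    """Pick the Splash address from the environment, with a local fallback."""
    url = environ.get("SPLASH_URL", default)
    # Normalise to a trailing slash so both spellings behave the same.
    return url if url.endswith("/") else url + "/"


# In settings.py this would replace the hard-coded assignment.
SPLASH_URL = resolve_splash_url()
```

With this in place, the same project can run against a local container or a remote Splash host by changing one environment variable.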

Example

import scrapy
from scrapy_splash import SplashRequest


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']

    def start_requests(self):
        # Send the request through Splash and wait 0.5 s so the page's JS can run
        yield SplashRequest(
            'https://movie.douban.com/typerank?type_name=劇情&type=11&interval_id=100:90',
            args={'wait': 0.5},
        )

    def parse(self, response):
        # response.text is the HTML after Splash has rendered the page
        print(response.text)
