Python Web Scraping Crash Course in 8 Days (Complete Edition): Hands-On Cases for Scraping Data from Various Sites
2023-04-09 14:15 Author: 半醒著的陽(yáng)光

Notes from the individual practice exercises

p10: the KFC store-locator assignment, typed straight into PyCharm:
import requests

if __name__ == '__main__':
    keyword = input('Enter a location: ')
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    params = {
        'cname': '',
        'pid': '',
        'keyword': keyword,
        'pageIndex': '1',
        'pageSize': '10',
    }
    header = {
        # fill in your own User-Agent; I won't paste mine here
    }
    response = requests.post(url=url, params=params, headers=header)
    list_data = response.json()
    # Look at the JSON carefully: it consists of two key-value pairs. The
    # first gives the number of matching records; the detailed rows live in
    # the second, which is why we iterate over 'Table1'.
    for item in list_data['Table1']:
        print(item['provinceName'] + item['cityName'] + item['addressDetail'])
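The request above only pulls the first page of ten stores. A minimal pagination sketch, assuming every page of the response keeps the same 'Table1' layout seen above (fetch_all_stores is a hypothetical helper, not part of the course code):

# Pagination sketch; the stop condition assumes a short page means the end.
import requests

def fetch_all_stores(keyword, headers, page_size=10):
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    stores = []
    page = 1
    while True:
        params = {
            'cname': '', 'pid': '', 'keyword': keyword,
            'pageIndex': str(page), 'pageSize': str(page_size),
        }
        rows = requests.post(url=url, params=params, headers=headers).json().get('Table1', [])
        stores.extend(rows)
        if len(rows) < page_size:  # fewer rows than a full page: we hit the end
            break
        page += 1
    return stores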
p13: the site used in the video is no longer reachable, so I practiced on Douban's Top 250 instead, scraping the poster images.
import os
import time

import requests
from bs4 import BeautifulSoup

t1 = time.time()
header = {}  # fill in your own User-Agent
img_list = []
if not os.path.exists('./img'):  # create the output folder if it is missing
    os.mkdir('./img')
for start_num in range(0, 26, 25):  # only the first 50 posters (two pages)
    url = f"https://movie.douban.com/top250?start={start_num}&file="
    response = requests.get(url=url, headers=header).text
    soup = BeautifulSoup(response, "html.parser")
    for img in soup.findAll('img'):
        img_list.append(img['src'])
    # each page ends with an unwanted QR-code image (two in total), drop it
    img_list.pop()
count = 0
for img_address in img_list:
    count += 1
    res_img = requests.get(url=img_address, headers=header).content
    with open(f'./img/{count}.jpg', 'wb') as file:
        file.write(res_img)
print('Total time:', time.time() - t1)
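Since the script times itself, the natural next experiment is downloading the posters in parallel instead of one by one. A minimal sketch with the standard-library thread pool, reusing the img_list and header variables built above (this is my addition, not from the course):

# Parallel-download sketch; img_list and header are assumed to be the
# variables produced by the script above.
from concurrent.futures import ThreadPoolExecutor

import requests

def download(numbered_url):
    index, img_url = numbered_url
    data = requests.get(url=img_url, headers=header, timeout=10).content
    with open(f'./img/{index}.jpg', 'wb') as file:
        file.write(data)

with ThreadPoolExecutor(max_workers=8) as pool:
    # enumerate keeps the same 1.jpg, 2.jpg, ... naming as the serial loop;
    # list() forces the lazy map iterator so exceptions surface here
    list(pool.map(download, enumerate(img_list, start=1)))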
p24: switching .text to .content (raw bytes) is still the right fix; with the output file then opened with encoding='utf-8', the mojibake goes away. Some useless data is still mixed into the text, though, and needs a closer look.
import time

import requests
from bs4 import BeautifulSoup

header = {}  # fill in your own User-Agent
t1 = time.time()
page_text = requests.get(url='https://www.shicimingju.com/book/sanguoyanyi.html',
                         headers=header).content
soup = BeautifulSoup(page_text, 'lxml')
li_list = soup.select('.book-mulu > ul > li')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for li in li_list:
    title = li.a.string
    detail_url = 'https://www.shicimingju.com' + li.a['href']
    detail_page_text = requests.get(url=detail_url, headers=header).content
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    div_tag = detail_soup.find('div', class_='chapter_content')
    content = div_tag.text
    fp.write(title + ':' + content + '\n')
    print(title, 'scraped')
fp.close()
print('Total time:', time.time() - t1)
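For reference, the mojibake happens because requests guesses the page charset from the HTTP headers and can guess wrong; handing raw bytes to BeautifulSoup (which is what .content does above) sidesteps that. Two equivalent fixes, sketched here for illustration, with header assumed to be the dict defined above:

# Encoding sketch, not part of the original notes.
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.shicimingju.com/book/sanguoyanyi.html',
                    headers=header)

# Fix 1: let requests re-detect the charset from the body itself
resp.encoding = resp.apparent_encoding
soup = BeautifulSoup(resp.text, 'lxml')

# Fix 2: pass the raw bytes and let BeautifulSoup sniff the encoding
# (this is exactly what switching .text to .content achieves)
soup = BeautifulSoup(resp.content, 'lxml')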
p25: scraping second-hand housing listing titles from 58.com (58同城)
import time

import requests
from lxml import etree

header = {}  # fill in your own User-Agent
t1 = time.time()
response = requests.get(url='https://www.58.com/ershoufang/', headers=header,
                        timeout=30).text
tree = etree.HTML(response)
x_tree = tree.xpath('//div[@class="cb"]//tr/td[@class="t"]/a/text()')
fp = open('./二手房標(biāo)題.txt', 'w', encoding='utf-8')
for item in x_tree:
    fp.write(item + '\n')
fp.close()
print(x_tree)
print('Total scraping time:', time.time() - t1)
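If you also want the price next to each title, lxml supports relative XPath (paths starting with ./) on each row element. A sketch reusing the tree built above; the 'pricebox' class name is my assumption about 58's markup and should be checked in the browser inspector before relying on it:

# Relative-XPath sketch; 'pricebox' is an assumed class name.
rows = tree.xpath('//div[@class="cb"]//tr')
for row in rows:
    title = row.xpath('./td[@class="t"]/a/text()')
    price = row.xpath('./td[@class="pricebox"]//text()')
    if title:
        print(title[0].strip(), ''.join(p.strip() for p in price))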