How a Certain Scientific Web Crawler Works

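To keep Bilibili from rejecting this article, I'll leave you to guess which image site is being crawled.
Main text
Most tutorials for this site online are in Python, so I'll go over the technical problems you run into when implementing it in C#.
Let's take it one step at a time. When writing a crawler, I think the first step is: know yourself and know your enemy, and you'll never lose a battle.
First, analyze and look for patterns.
The site going under the knife this time has very regular URLs.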
www.pixiv.net/search.php?s_mode=s_tag&word=miku [keyword mode]
www.pixiv.net/search.php?s_mode=s_tag_full&word=miku [tag mode]
The only difference is the s_mode value.
Now look at the page number:
www.pixiv.net/search.php?word=miku&order=date_d&p=2 [keyword mode]
www.pixiv.net/search.php?word=miku&s_mode=s_tag_full&order=date_d&p=2 [tag mode]
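
The pattern above is regular enough to build the search URLs in code. Here is a minimal sketch, not the crawler's actual code: the class name SearchUrl, the method Build and the https:// prefix are my own assumptions (the URLs above are written without a scheme). It only reproduces the two paged URL shapes listed above.

```csharp
using System;

public static class SearchUrl
{
    // Hypothetical helper: builds a paged search URL in the shape shown above.
    // tagMode = true adds s_mode=s_tag_full (the "tag mode" URL);
    // tagMode = false leaves s_mode out, as in the paged "keyword mode" URL.
    public static string Build(string word, int page, bool tagMode)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));

        // Percent-encode the search word so non-ASCII tags survive in the query string
        string url = "https://www.pixiv.net/search.php?word=" + Uri.EscapeDataString(word);
        if (tagMode) url += "&s_mode=s_tag_full";
        return url + "&order=date_d&p=" + page;
    }
}
```

Calling SearchUrl.Build("miku", 2, true) gives the tag-mode URL for page 2, so looping over page numbers is enough to walk through the result pages.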

To access this site, you need to set a User-Agent.
Thanks to cucmberium-san
[http://cucmberium.hatenablog.com/entry/2016/06/20/214109]
I tested the method from that blog and it works.

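Image download code:
                // Requires: using System; using System.IO; using System.Net;
                string url = "URL"; // placeholder: the image address
                string time = DateTime.Now.ToString("yyyyMMddHHmm"); // timestamp to the minute, used as the file name
                // Save to <current directory>\Cover\<timestamp><extension>
                string dir = Path.Combine(Environment.CurrentDirectory, "Cover");
                Directory.CreateDirectory(dir); // DownloadFile fails if the folder does not exist
                string downloadpath = Path.Combine(dir, time + Path.GetExtension(url));
                using (WebClient mywebclient = new WebClient())
                {
                    mywebclient.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36");
                    mywebclient.Headers.Add("Referer", url);
                    mywebclient.DownloadFile(url, downloadpath);
                }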

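Core crawler code
    // Requires: using System.IO; using System.Net; using HtmlAgilityPack;
    // (HtmlDocument is assumed to be the HtmlAgilityPack class)
    string remoteUri = "URL"; // placeholder: the search page address
    HtmlDocument doc = new HtmlDocument();
    using (WebClient myWebClient = new WebClient())
    {
        myWebClient.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36");
        // Read the response stream and parse it into the document
        using (Stream stream = myWebClient.OpenRead(remoteUri))
        {
            doc.Load(stream);
        }
    }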

The part above is the hardest to get working. There's one more thing to consider below.
https://i.pximg.net/img-original/img/2018/12/13/19/56/00/72095984_p0.png [original image]
https://i.pximg.net/c/240x240/img-master/img/2018/12/13/19/56/00/72095984_p0_master1200.jpg [thumbnail]

https://i.pximg.net/img-master/img/2018/12/13/14/47/52/72092738_p0_master1200.jpg [original image]
https://i.pximg.net/c/240x240/img-master/img/2018/12/13/14/47/52/72092738_p0_master1200.jpg [thumbnail]

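Sigh. These are the two forms I've seen so far.
f*ck, I had wanted to derive the original image link from the thumbnail link,
but it turns out the original link comes in two (possibly more) forms.
So a check is needed: test whether the URL can actually be opened.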

// C#: check whether a URL is reachable or the link is broken
// Requires: using System.Net;
        public bool CheckUrlVisit(string url)
        {
            try
            {
                using (WebClient mywebclient = new WebClient())
                {
                    mywebclient.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36");
                    mywebclient.Headers.Add("Referer", url);
                    string msg = mywebclient.DownloadString(url); // fetch the response body as a string
                    return msg != "";
                }
            }
            catch (WebException)
            {
                // The request failed (for example a 404 or 403): treat the URL as unreachable
                return false;
            }
        }
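
Before a link can be checked, the candidate links have to be produced from the thumbnail link. A rough sketch of that step follows, covering only the two forms listed above. The class OriginalUrlGuesser and its Candidates method are names I made up for illustration, and the .png extension for the img-original form is taken from the single example above, so other extensions are not handled.

```csharp
using System;
using System.Collections.Generic;

public static class OriginalUrlGuesser
{
    // Hypothetical helper: returns candidate full-size URLs for a thumbnail URL,
    // limited to the two forms seen so far. An empty list means the input
    // does not look like a 240x240 thumbnail link.
    public static List<string> Candidates(string thumbnailUrl)
    {
        const string thumbPrefix = "/c/240x240/img-master/";
        const string suffix = "_master1200.jpg";
        var result = new List<string>();

        int i = thumbnailUrl.IndexOf(thumbPrefix, StringComparison.Ordinal);
        if (i < 0 || !thumbnailUrl.EndsWith(suffix, StringComparison.Ordinal))
            return result;

        string host = thumbnailUrl.Substring(0, i);                    // https://i.pximg.net
        string rest = thumbnailUrl.Substring(i + thumbPrefix.Length);  // img/2018/.../<id>_p0_master1200.jpg
        string stem = rest.Substring(0, rest.Length - suffix.Length);  // img/2018/.../<id>_p0

        // Form 1: img-original, without "_master1200", .png extension (assumed from the example)
        result.Add(host + "/img-original/" + stem + ".png");
        // Form 2: the thumbnail link with the "/c/240x240" size segment removed
        result.Add(host + "/img-master/" + rest);
        return result;
    }
}
```

Each candidate can then be passed to CheckUrlVisit in order, and the first one that returns true is the link to download.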

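Sigh. After all that analysis, the crawler is written.

However, all that theory turned out to be hot air: downloads crawl along at a snail's pace (looks like I'll have to mess with multithreading).
Sigh.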
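
For the multithreading part, one possible direction is to run several downloads at once with Parallel.ForEach. This is an untuned sketch and not something the crawler above already does: ParallelDownloader, DownloadAll and the downloadOne delegate are hypothetical names, and the right value for maxParallel is a guess you would have to test against the site.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelDownloader
{
    // Hypothetical helper: runs downloadOne for every URL, with at most
    // maxParallel downloads in flight at the same time.
    // downloadOne stands in for whatever performs a single download.
    public static void DownloadAll(IEnumerable<string> urls, Action<string> downloadOne, int maxParallel)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        Parallel.ForEach(urls, options, url =>
        {
            try
            {
                downloadOne(url);
            }
            catch (Exception ex)
            {
                // One failed download should not abort the others
                Console.Error.WriteLine("Failed: " + url + " (" + ex.Message + ")");
            }
        });
    }
}
```

Each call to downloadOne should create its own WebClient, since a single WebClient instance does not support concurrent operations.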