404 Page Detection

2023-06-12 20:54 Author: chaojilaji123

[Original article](https://mp.weixin.qq.com/s?__biz=MzI5MjY4OTQ2Nw==&mid=2247484195&idx=1&sn=3b0d9fc36b3d8e67dffa3c612f2bcbaf&chksm=ec7cc077db0b4961f0ab4a3672b288dc13533bb6a4453d0c8aa6428cd4fd8003003dc6bbdb90#rd)

[Page analysis series](https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzI5MjY4OTQ2Nw==&action=getalbum&album_id=2968591493026283521&scene=173&from_msgid=2247484195&from_itemidx=1&count=3&nolastread=1#wechat_redirect)

# Page Content Analysis: 404 Page Detection

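Let's first see what a chatbot has to say:

> To check whether a web page is a 404 page, you can try the following methods:
>
> HTTP status code check: send an HTTP request and inspect the returned status code. If the page is a 404 page, the server returns status code 404. You can use a programming language or tool (such as Python's requests library) to send the request and check whether the returned status code is 404.
>
> Page content check: send an HTTP request and fetch the page content, then check whether it contains 404-related text or keywords. A 404 page usually contains keywords such as "404", "page not found", or "Not Found". You can run text matching or a keyword search over the fetched content to decide whether it is a 404 page.
>
> Screenshot comparison: with an automated testing tool such as Selenium, you can simulate a browser, open the page, and take a screenshot. You can then compare that screenshot with a screenshot of the expected 404 page. If the two are highly similar, the page is probably a 404 page.
>
> You can choose among these methods according to your needs and technical background. Note that some pages generate their content dynamically with JavaScript, which may require more advanced techniques to handle.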

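## What is a 404 page

In the narrow sense, a 404 page is a page whose HTTP response carries a status code such as 404 or 503, signalling that the resource is unavailable.

In the broad sense, "404 page" is an umbrella term that covers both protocol-level 404s (the status code says so) and content-level 404s (the page content says the resource does not exist, whatever the status code).

## Characteristics of 404 pages

1. The returned HTTP status code is 404, 503, 401, or another code indicating that the resource is unavailable.

2. The title contains keywords such as "不存在" (does not exist) or "不可達" (unreachable).

3. The body contains keywords such as "不存在" (does not exist) or "找不到" (cannot be found).

4. The link is redirected to a dedicated 404 or 503 domain or page.

5. Within a single website, all 404 pages share a fairly uniform structure.

## Detection approach

We implement a check for each of the characteristics above, one by one.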

### Status code is 404, 503, etc.

```java
public Boolean checkHttpStatusCode(String code) {
    // Treat 404 and 503 as "page unavailable"; a null code never matches
    return "404".equals(code) || "503".equals(code);
}
```

### Title contains keywords such as "不存在" (does not exist) or "不可達" (unreachable)

First, we parse the HTML with the jsoup library and read the content of the title tag.

```java
// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
public static String getHtmlTitle(String html) {
    try {
        // Parse the HTML with jsoup
        Document doc = Jsoup.parse(html);
        // Select the first <title> element
        Element titleElement = doc.select("title").first();
        // Return its text if the element exists
        if (titleElement != null) {
            return titleElement.text();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    // No <title> element, or parsing failed
    return "";
}
```

Once we have the title, we compare it against the keywords; Java's `String.contains` method is enough for this.
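
The keyword comparison can be sketched as follows. This is a minimal illustration rather than the author's code: the class name `TitleKeywordCheck`, the method `containsAnyKeyword`, and the sample keywords are assumptions, and the match is made case-insensitive so that "Not Found" also matches "not found".

```java
import java.util.List;
import java.util.Locale;

class TitleKeywordCheck {
    // Hypothetical helper: true if the title contains any keyword, ignoring case
    static boolean containsAnyKeyword(String title, List<String> keywords) {
        if (title == null || title.isEmpty()) {
            return false;
        }
        String normalized = title.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (normalized.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample keywords chosen for illustration only
        List<String> keywords = List.of("404", "Not Found");
        System.out.println(containsAnyKeyword("404 - Page not found", keywords)); // true
        System.out.println(containsAnyKeyword("Welcome", keywords));              // false
    }
}
```

The same helper can be applied to the extracted body text, although plain substring matching there is more prone to false positives than on the title.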

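### Body contains keywords such as "不存在" (does not exist) or "找不到" (cannot be found)

First we need to obtain the content area of the HTML. Until smart extraction is in place, we can simply strip the tags and use the remaining text, although this does produce a lot of false positives.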

```java
import org.apache.commons.lang3.StringEscapeUtils; // deprecated since commons-lang3 3.6 in favor of commons-text

import java.util.regex.Pattern;

public class HtmlUtil {
    // Matches <script> blocks, including their contents
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    // Matches <style> blocks, including their contents
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    // Matches any remaining HTML tag
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static String delHTMLTag(String htmlStr) {
        htmlStr = SCRIPT.matcher(htmlStr).replaceAll(""); // remove script blocks
        htmlStr = STYLE.matcher(htmlStr).replaceAll("");  // remove style blocks
        htmlStr = TAG.matcher(htmlStr).replaceAll("");    // remove remaining tags
        return htmlStr.trim();
    }

    public static String htmlTextFormat(String htmlText) {
        return htmlText
                .replaceAll("(\\\\n|\\\\t)+", " ") // literal "\n" and "\t" escape sequences
                .replaceAll("[\\t\\n]+", " ")      // real tabs and newlines
                .replaceAll(" +", " ");            // collapse runs of spaces
    }

    public static String getContent(String html) {
        try {
            // Strip tags first and decode entities afterwards, so that escaped
            // markup such as &lt;b&gt; survives as text instead of being removed as a tag
            String text = delHTMLTag(html);
            text = StringEscapeUtils.unescapeHtml4(text);
            return htmlTextFormat(text);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }
}
```

The `getContent` method above is how we obtain the text content of the HTML.

### Link is redirected to a dedicated 404 or 503 domain or page

This check takes the URL as input and parses it.

```java
// Requires java.net.URL, java.net.MalformedURLException and java.util.regex.Pattern.
// StringUtils.hasText is assumed to be Spring's org.springframework.util.StringUtils.

public class UrlPattern {
    private String pattern;     // regular expression
    private String description; // human-readable description
    private String location;    // URL part to test: "host", "path" or "query"

    // Getters used by checkOneUrlPattern
    public String getPattern() { return pattern; }
    public String getDescription() { return description; }
    public String getLocation() { return location; }
}

private Boolean checkOneUrlPattern(UrlPattern urlPattern, String url) {
    URL parsed;
    try {
        parsed = new URL(url);
    } catch (MalformedURLException e) {
        e.printStackTrace();
        return false; // an unparseable URL cannot match
    }
    // Pick the URL component named by the rule's location
    String val = "";
    String location = urlPattern.getLocation();
    if ("path".equalsIgnoreCase(location)) {
        val = parsed.getPath();
    } else if ("query".equalsIgnoreCase(location)) {
        val = parsed.getQuery(); // null when the URL has no query string
    } else if ("host".equalsIgnoreCase(location)) {
        val = parsed.getHost();
    }
    if (!StringUtils.hasText(val)) return false;
    // find() looks for the pattern anywhere inside the selected component
    return Pattern.compile(urlPattern.getPattern()).matcher(val).find();
}
```

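`UrlPattern` is the object class that represents a detection rule. A rule looks roughly like this:

```yml
- pattern: '\b(404|503)\b'
  description: 'Host part of the link contains 404 or 503'
  location: 'host'
```

This rule checks whether the host part of the URL contains 404 or 503 as a separate token.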
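
To make the behavior of this host rule concrete, here is a small sketch that applies the same regular expression to a few sample URLs. The class `HostRuleDemo`, the helper `hostLooksLikeErrorPage`, and the example hosts are made up for illustration; it uses `java.net.URI` from the JDK rather than the rule loader described in this article.

```java
import java.net.URI;
import java.util.regex.Pattern;

class HostRuleDemo {
    // Same expression as in the host rule
    private static final Pattern HOST_RULE = Pattern.compile("\\b(404|503)\\b");

    // Hypothetical helper: true if the host contains 404 or 503 as a separate token
    static boolean hostLooksLikeErrorPage(String url) {
        String host = URI.create(url).getHost();
        return host != null && HOST_RULE.matcher(host).find();
    }

    public static void main(String[] args) {
        // "404" is its own label, delimited by the start of the host and a dot
        System.out.println(hostLooksLikeErrorPage("https://404.example.com/index.html")); // true
        // "example404" has no word boundary in front of "404"
        System.out.println(hostLooksLikeErrorPage("https://www.example404.com/"));        // false
        // 404 appears only in the path, which the host rule does not inspect
        System.out.println(hostLooksLikeErrorPage("https://www.example.com/404"));        // false
    }
}
```

Because `\b` needs a boundary between a word character and a non-word character, the rule matches labels such as `404.` or `err-503.` but not digits glued to letters.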

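### Compare page structure to decide whether a page is a 404 page

There are two options here.

Option 1: build our own library of 404 page structures and compare the page against it when the user calls us.

Option 2: the user supplies a 404 page and a normal page, and we use a page-structure similarity algorithm to decide whether the page is a 404 page.

Let's look at the page-structure algorithm first.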

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class PageStructUtil {
    /** Returns the node names of all elements, in document order. */
    public static List<String> getAllLabelsFromHtml(String html) {
        Document document = Jsoup.parse(html);
        Elements elements = document.getAllElements();
        List<String> elementList = new ArrayList<>();
        for (Element element : elements) {
            elementList.add(element.nodeName());
        }
        return elementList;
    }

    /**
     * Similarity of a relative to base.
     *
     * @param a    tag sequence of the page under test
     * @param base tag sequence of the reference page
     * @return similarity score; higher means more similar
     */
    public static Double pageStructScore(List<String> a, List<String> base) {
        // SequenceUtils is the author's helper class and is not shown in this article
        SequenceUtils<String> stringSequenceUtils = new SequenceUtils<>();
        Integer length = stringSequenceUtils.getLongestCommonSequence(a, base);
        int n = base.size();
        int m = a.size();
        if (n + m == 0) return 0.0; // avoid 0/0 when both sequences are empty
        // Definition: similarity = (2.0 * length of the common sequence)
        //                          / (length of the old sequence + length of the new sequence)
        return (2.0 * length) / (n + m);
    }
}
```

The page-structure algorithm is explained in detail in another post in this series; please refer to it.
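
Since `SequenceUtils` is not shown here, the following is a minimal sketch of what `getLongestCommonSequence` might compute, assuming it returns the length of the classic longest common subsequence. The class `LcsSketch` and its method names are hypothetical; the score formula is the one defined in `pageStructScore`.

```java
import java.util.List;

class LcsSketch {
    // Length of the longest common subsequence of a and b (dynamic programming, two rows)
    static <T> int lcsLength(List<T> a, List<T> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            // The row just filled becomes the previous row for the next iteration
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Similarity = 2 * LCS length / (size of a + size of base)
    static <T> double structScore(List<T> a, List<T> base) {
        int total = a.size() + base.size();
        if (total == 0) {
            return 0.0;
        }
        return 2.0 * lcsLength(a, base) / total;
    }

    public static void main(String[] args) {
        List<String> page = List.of("html", "head", "title", "body", "div", "h1", "p");
        List<String> reference = List.of("html", "head", "title", "body", "div", "p", "a");
        System.out.println(structScore(page, reference));
    }
}
```

With this definition, two identical tag sequences score 1.0 and two sequences with no tag in common score 0.0, so a threshold such as 0.8 asks for a large shared backbone of tags in the same order.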

Next, let's look at the calling method.

```java
/**
 * Page-structure check against a user-supplied 404 page.
 *
 * @param user404Html HTML of the 404 page supplied by the user
 * @param html        HTML of the page under test
 * @param score       minimum similarity for the page to count as a 404 page
 * @return true if the similarity reaches the threshold
 */
public Boolean checkUserPage404Struct(String user404Html, String html, Double score) {
    try {
        return PageStructUtil.pageStructScore(
                PageStructUtil.getAllLabelsFromHtml(user404Html),
                PageStructUtil.getAllLabelsFromHtml(html)) >= score;
    } catch (Exception e) {
        e.printStackTrace(); // log the failure instead of swallowing it silently
    }
    return false;
}
```

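## Iterating and optimizing

### Making detection configurable

In the code above, everything we check for is hard-coded, which is not extensible. The way forward is to turn every decision condition into configuration instead of special-casing it in code.

For example, for links that contain 404 or 503 we can write the following configuration: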

```yml
rule_name: 'Host part of the link contains 404 or 503'
url_patterns:
    - pattern: '\b(404|503)\b'
      description: 'Host part of the link contains 404 or 503'
      location: 'host'
---
rule_name: 'Path contains 404 or 503'
url_patterns:
    - pattern: '\/(?:.*\/)?(404|503)\/'
      description: 'Path contains 404 or 503'
      location: 'path'
---
rule_name: 'Query string contains 404 or 503'
url_patterns:
    # URL.getQuery() returns the query without the leading '?', so the
    # pattern must also be allowed to match at the start of the string
    - pattern: '(?:^|[?&])(?:[^&=]+=[^&=]+&)*(?:[^&=]+=(404|503))(?:&|$)'
      description: 'Query string contains 404 or 503'
      location: 'query'
```

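Add the following detection keywords for the title (the Chinese entries mean, in order: page not found, service unavailable, error, error page, access denied, permission denied, page does not exist, link does not exist, does not exist):

```
404
503
頁面未找到
服務不可用
錯誤
錯誤頁面
訪問被拒絕
權限拒絕
頁面不存在
鏈接不存在
不存在
Not Found
```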

Configuration for HTTP status codes:

```
404
503
401
400
500
```
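
A list like this can replace the hard-coded comparison in `checkHttpStatusCode`. The sketch below assumes the configuration is available as plain text with one status code per line; the class `StatusCodeConfig` and the method `isErrorStatus` are hypothetical names, and how the text is loaded (file, database, config center) is left open.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

class StatusCodeConfig {
    private final Set<Integer> errorCodes;

    // Parses one status code per line; blank lines are ignored
    StatusCodeConfig(String configText) {
        this.errorCodes = Arrays.stream(configText.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toSet());
    }

    // True if the given status code is listed in the configuration
    boolean isErrorStatus(int statusCode) {
        return errorCodes.contains(statusCode);
    }

    public static void main(String[] args) {
        StatusCodeConfig config = new StatusCodeConfig("404\n503\n401\n400\n500\n");
        System.out.println(config.isErrorStatus(404)); // true
        System.out.println(config.isErrorStatus(200)); // false
    }
}
```

Adding or removing a code then only requires a configuration change, not a code change.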

Configuration for page structure:

```yml
rule_name: 'pan.baidu.com'
rules:
    - struct: '#document-html-head-body-meta-title-link-link-style-style-style-div-header-a-nav-a-span-div-a-em-div-div-a-span-span-a-span-span-a-span-span-a-span-span-a-span-span-a-span-span-a-span-span-a-span-span-iframe-div-ul-li-a-li-em-span-div-iframe-em-div-span-a-span-span-a-span-a-span-a-span-a-span-a-span-a-span-a-section-div-div-div-div-h3-div-p-p-p-div-p-a-p-a-style-div'
      score: '0.8'
      name: '404 page'
    - struct: '#document-html-head-body-title-meta-meta-meta-meta-meta-meta-meta-meta-link-link-link-link-link-link-link-link-link-style-link-link-link-link-style-style-style-style-div-div-div-div-dl-dt-a-dd-span-a-span-span-a-span-span-a-span-span-a-span-dl-dt-dd-span-a-p-span-a-p-span-a-p-span-a-p-span-a-p-span-a-p-span-a-p-span-a-p-span-a-p-span-a-p-dd-i-i-dd-span-span-i-i-span-a-i-dl-dt-i-i-span-span-i-i-span-span-a-p-i-span-dd-div-a-div-a-a-a-a-a-ul-li-a-li-a-li-a-li-a-span-span-li-a-li-a-i-a-a-i-a-div-div-div-span-a-div-span-dd-a-div-section-section-div-aside-dl-div-dl-div-a-p-p-a-p-p-a-p-p-a-p-p-div-div-div-img-div-section-link-div-div-a-a-a-a-a-a-a-style-div'
      score: '0.8'
      name: 'Expired link page'
```

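### What's next

1. Collect a large number of 404 pages and extract their structure, title, and status-code features to enrich the configuration above.

   How do we get many 404 pages quickly? Take a set of domains and append a series of path suffixes that cannot possibly exist; the pages returned will most likely be 404 pages.

2. Apply smart parsing and extraction to the page content to obtain the title and body of the content area, then analyze them further.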
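
The idea in step 1 of appending suffixes that cannot exist can be sketched as below. This only generates candidate URLs; fetching them and recording status code, title, and structure is left out. The class `ProbeUrlGenerator`, the `probe-` prefix, and the example domains are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class ProbeUrlGenerator {
    // Hypothetical helper: builds URLs under each base that almost certainly do not exist
    static List<String> probeUrls(List<String> baseUrls, int perBase) {
        List<String> result = new ArrayList<>();
        for (String base : baseUrls) {
            String prefix = base.endsWith("/") ? base : base + "/";
            for (int i = 0; i < perBase; i++) {
                // A random UUID makes a collision with a real path practically impossible
                result.add(prefix + "probe-" + UUID.randomUUID() + ".html");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> bases = List.of("https://example.com", "https://example.org/");
        for (String url : probeUrls(bases, 2)) {
            System.out.println(url);
        }
    }
}
```

Using several probes per domain helps confirm that the responses share one structure before it is added to the page-structure configuration.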
