Extracting web page content with C# regular expressions

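An important step in a search engine is extracting the meaningful content from a web page. Put simply, this means stripping the HTML markup from the HTML text and keeping only what we would see when opening the document in a browser such as IE (images are not considered here).
    The markup in HTML text falls into four groups: comments, script, style, and all other tags. Each group is removed in turn:
    1. Remove comments. The regex is: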
    output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    2. Remove script. The regexes are:
    output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    output2 = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    3. Remove style. The regex is:
    output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    4. Remove the remaining HTML tags:
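    // Turn line-break tags into real line breaks before the tags are stripped.
    result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
    result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
    // Decode entities after stripping, so escaped text such as "&lt;b&gt;" is not removed as a tag.
    result = result.Replace("&nbsp;", " ");
    result = result.Replace("&quot;", "\"");
    result = result.Replace("&lt;", "<");
    result = result.Replace("&gt;", ">");
    // "&amp;" goes last so that "&amp;lt;" becomes "&lt;" and not "<".
    result = result.Replace("&amp;", "&");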
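    As the code above shows, I used the RegexOptions.Singleline option. It is essential: it lets "." (the dot) match line breaks. Script blocks, style blocks, and comments usually span several lines, so without this option the regular expressions above fail to remove the markup in most cases.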
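    The effect of RegexOptions.Singleline can be checked with a small experiment. The following is a minimal sketch, not part of the original code: the class name SinglelineDemo, the method StripScripts, and the sample HTML are hypothetical names chosen for illustration. It runs the same script pattern twice, once without the option and once with it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, introduced only for this illustration.
public static class SinglelineDemo
{
    // Removes <script> blocks; the options argument decides whether "." may cross line breaks.
    public static string StripScripts(string html, RegexOptions options)
    {
        return Regex.Replace(html, @"<script[^>]*?>.*?</script>", string.Empty, options);
    }

    public static void Main()
    {
        // A script block that spans several lines, as on most real pages.
        string html = "<p>Hello</p><script>\nvar x = 1;\n</script><p>World</p>";

        // Without Singleline, "." stops at "\n", so the pattern never reaches </script>.
        string withoutFlag = StripScripts(html, RegexOptions.IgnoreCase);

        // With Singleline, "." also matches "\n" and the whole block is removed.
        string withFlag = StripScripts(html, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine(withoutFlag == html); // True: nothing was removed
        Console.WriteLine(withFlag);            // <p>Hello</p><p>World</p>
    }
}
```

    The first call leaves the input untouched because the match is abandoned at the first line break; only the second call removes the block.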
    HTML syntax has grown quite complex, and the list above covers only the most common kinds of markup. I will add more tag-stripping regexes as development of
    Rost WebSpider continues.
    Below is a C# class that extracts the meaningful content from an HTML string:
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    class HtmlExtract
    {
        #region private attributes
        private string _strHtml;
        #endregion

        #region public methods
        public HtmlExtract(string inStrHtml)
        {
            _strHtml = inStrHtml;
        }

        // Applies each removal step in turn and returns the remaining text.
        public string ExtractText()
        {
            string result = _strHtml;
            result = RemoveComment(result);
            result = RemoveScript(result);
            result = RemoveStyle(result);
            result = RemoveTags(result);
            return result.Trim();
        }
        #endregion

        #region private methods
        private string RemoveComment(string input)
        {
            string result = input;
            // remove comments; Singleline lets "." span multi-line comments
            result = Regex.Replace(result, @"<!--.*?-->", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveStyle(string input)
        {
            string result = input;
            // remove all styles
            result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveScript(string input)
        {
            string result = input;
            // remove script blocks and their noscript fallbacks
            result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveTags(string input)
        {
            string result = input;
            // turn line-break tags into real line breaks, then strip every remaining tag
            result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
            // decode entities after stripping, so escaped text such as "&lt;b&gt;" is not removed as a tag
            result = result.Replace("&nbsp;", " ");
            result = result.Replace("&quot;", "\"");
            result = result.Replace("&lt;", "<");
            result = result.Replace("&gt;", ">");
            // "&amp;" goes last so that "&amp;lt;" becomes "&lt;" and not "<"
            result = result.Replace("&amp;", "&");
            return result;
        }
        #endregion
    }
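
    The RemoveTags method of the class above handles a handful of entities with hand-written Replace calls. As a possible alternative, the sketch below delegates entity decoding to System.Net.WebUtility.HtmlDecode, which the original code does not use; the class name EntityDecodeSketch and the method ToText are hypothetical. It assumes decoding should happen after the tags are stripped, so that escaped markup in the text is not mistaken for a real tag.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical sketch class; not part of the original HtmlExtract.
public static class EntityDecodeSketch
{
    public static string ToText(string html)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Same removal order as in the article: comments, script, style, other tags.
        string result = Regex.Replace(html, @"<!--.*?-->", string.Empty, opts);
        result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, opts);
        result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, opts);
        result = Regex.Replace(result, @"<br\s*/?>", "\r\n", opts);
        result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, opts);

        // Decode entities last; HtmlDecode covers named and numeric entities.
        return WebUtility.HtmlDecode(result).Trim();
    }

    public static void Main()
    {
        string html = "<!-- nav --><style>p { color: red; }</style><p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>";
        Console.WriteLine(ToText(html)); // 1 < 2 && 3 > 2
    }
}
```

    One difference from the hand-written version: HtmlDecode turns &nbsp; into the non-breaking space character U+00A0 rather than an ordinary space, which may matter if the text is later split on spaces.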
