Extracting Web Page Content with C# Regular Expressions

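    An important step in a search engine is extracting the meaningful content from a web page. Put simply, this means stripping the HTML markup from the HTML text and keeping the part we would see when opening the document in a browser such as IE (images are not considered here).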
    The markup in an HTML document can be divided into comments, script, style, and all other tags, and each group is removed separately:
    1. Remove comments (the lazy .*? lets the comment body contain hyphens). The regex is:
    output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
    2. Remove script and noscript blocks. The regexes are:
    output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    output = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    3. Remove style blocks. The regex is:
    output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline); 
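    4. Remove the remaining HTML tags, then decode the common entities (tags first, so that decoded text such as "a < b" is not mistaken for a tag; &amp; last, so that "&amp;lt;" is not decoded twice):
    result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
    result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
    result = result.Replace("&nbsp;", " ");
    result = result.Replace("&quot;", "\"");
    result = result.Replace("&lt;", "<");
    result = result.Replace("&gt;", ">");
    result = result.Replace("&amp;", "&");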
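    As you can see, the script, noscript, and style patterns above use the RegexOptions.Singleline option. This option is important: it makes "." (the dot) match newline characters as well. Without it, the regular expressions listed above fail to remove the markup in most cases, because script and style blocks usually span several lines.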
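    To make the effect of Singleline concrete, here is a minimal sketch that runs the same script pattern with and without the option. The class SinglelineDemo and the helper StripScript are names made up for this illustration; only Regex.Replace and RegexOptions come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: strips <script> blocks using the given options.
    public static string StripScript(string html, RegexOptions options)
    {
        return Regex.Replace(html, @"<script[^>]*?>.*?</script>", string.Empty, options);
    }

    public static void Main()
    {
        // A script block that spans several lines, as most real ones do.
        string html = "<p>before</p><script>\nvar x = 1;\n</script><p>after</p>";

        // Without Singleline, "." stops at '\n', so the pattern never reaches </script>.
        string withoutFlag = StripScript(html, RegexOptions.IgnoreCase);

        // With Singleline, "." also matches '\n' and the whole block is removed.
        string withFlag = StripScript(html, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine(withoutFlag == html); // True: nothing was removed
        Console.WriteLine(withFlag);            // <p>before</p><p>after</p>
    }
}
```

    A script block written entirely on one line would be removed either way, which is why the problem is easy to miss when testing with short samples.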
    HTML syntax has become quite complex over the years, and the list above covers only the most common kinds of markup. I will add more tag-stripping regexes
    as the development of Rost WebSpider continues.
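    One small example of that complexity is a comment whose body contains a hyphen. The sketch below compares two candidate comment patterns: one that excludes every "-" from the body, and one that lazily matches up to the first "-->". The class CommentDemo and its members are hypothetical names used only for this comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentDemo
{
    // Excludes every '-' from the comment body.
    public const string HyphenFreePattern = @"<!--[^-]*-->";

    // Lazily matches any text up to the first "-->".
    public const string LazyPattern = @"<!--.*?-->";

    // Hypothetical helper: removes whatever the given pattern matches.
    public static string Strip(string html, string pattern)
    {
        // Singleline lets "." cross line breaks inside multi-line comments.
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "a<!-- left-nav start -->b<!-- plain -->c";

        // The first comment contains a hyphen, so the hyphen-free pattern skips it.
        Console.WriteLine(Strip(html, HyphenFreePattern)); // a<!-- left-nav start -->bc

        // The lazy pattern removes both comments.
        Console.WriteLine(Strip(html, LazyPattern));       // abc
    }
}
```

    Neither pattern is a full HTML parser. Comment-like text inside a script block or an attribute value, for instance, would still be treated as a comment.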
    The following C# class extracts the meaningful content from an HTML string:
    using System; 
    using System.Collections.Generic; 
    using System.Text; 
    using System.Text.RegularExpressions; 
    class HtmlExtract
    {
        #region private attributes
        private string _strHtml;
        #endregion

        #region public methods
        public HtmlExtract(string inStrHtml)
        {
            _strHtml = inStrHtml;
        }

        // Returns the visible text of the HTML passed to the constructor.
        public string ExtractText()
        {
            string result = _strHtml;
            result = RemoveComment(result);
            result = RemoveScript(result);
            result = RemoveStyle(result);
            result = RemoveTags(result);
            return result.Trim();
        }
        #endregion

        #region private methods
        private string RemoveComment(string input)
        {
            string result = input;
            // Remove comments; the lazy .*? allows hyphens and line breaks in the body.
            result = Regex.Replace(result, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
            return result;
        }

        private string RemoveStyle(string input)
        {
            string result = input;
            // Remove all style blocks, including multi-line ones.
            result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveScript(string input)
        {
            string result = input;
            // Remove script blocks and the noscript fallback content.
            result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveTags(string input)
        {
            string result = input;
            // Turn line-break tags into real line breaks before stripping tags.
            result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
            // Strip every remaining tag while the entities are still encoded.
            result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
            // Decode the common entities; &amp; goes last to avoid double decoding.
            result = result.Replace("&nbsp;", " ");
            result = result.Replace("&quot;", "\"");
            result = result.Replace("&lt;", "<");
            result = result.Replace("&gt;", ">");
            result = result.Replace("&amp;", "&");
            return result;
        }
        #endregion
    }
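
    RemoveTags handles only a handful of entities by hand. As a possible alternative, and not part of the class above, the sketch below strips the tags first and then hands entity decoding to WebUtility.HtmlDecode from System.Net, which also covers numeric and less common named entities. The class EntityDemo and the method StripThenDecode are hypothetical names. Note that HtmlDecode turns &nbsp; into the non-breaking space character U+00A0 rather than an ordinary space.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    // Hypothetical helper: removes tags, then decodes entities.
    public static string StripThenDecode(string html)
    {
        // Turn <br>, <br/> and <br /> into line breaks before removing tags.
        string text = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);

        // Remove the remaining tags while the entities are still encoded,
        // so an encoded "<" in the text cannot be mistaken for a tag.
        text = Regex.Replace(text, @"<[^>]*>", string.Empty);

        // Decode entities last, in a single pass.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        string html = "<p>if a &lt; b &amp;&amp; b &gt; c<br>then &quot;ok&quot;</p>";
        // Prints two lines: if a < b && b > c / then "ok"
        Console.WriteLine(StripThenDecode(html));
    }
}
```

    Decoding after stripping matters here: if the entities were decoded first, the text "a < b && b > c" would contain a "<" and a ">" that the tag pattern would then delete along with everything between them.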