A small PHP crawler for scraping a Sohu blog

I have a vague feeling that Sohu Blog is not long for this world either, so when I spotted a good English-learning section there, I scraped it for safekeeping.

The crawler itself is simple. The key point is that the list of post titles on a Sohu blog is loaded via Ajax, so once you locate that list request you are halfway done.

It is actually easy: just use HttpWatch to find the request URL, for example:

http://liuyongli99.blog.sohu.com/action/v_frag-ebi_93b2b93792-pg_112-c_2729466/entry/
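The URL above embeds a fragment ID, a page index (pg_112), and a category ID. Assuming the page numbers simply count up from 1, the full set of entry-list URLs (the $allEntryList used below) can be built with a small sketch like this; the helper name and parameters are my own, not part of phpspider:

```php
<?php
// Sketch (assumption): build one Ajax entry-list URL per page, using the
// fragment ID and category ID taken from the example URL in this post.
function buildEntryListUrls($blogHost, $fragId, $categoryId, $pageCount)
{
    $urls = array();
    for ($pg = 1; $pg <= $pageCount; $pg++) {
        $urls[] = "http://{$blogHost}/action/v_frag-ebi_{$fragId}-pg_{$pg}-c_{$categoryId}/entry/";
    }
    return $urls;
}

// Values from the example URL; pg_112 suggests there are 112 list pages.
$allEntryList = buildEntryListUrls(
    "liuyongli99.blog.sohu.com", "93b2b93792", "2729466", 112);
```

Each generated URL has the same shape as the one captured with HttpWatch, just with a different pg_ index.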

The rest is straightforward. I used phpspider; the code is as follows:

// $allEntryList holds the Ajax entry-list URL for every page.
$i = 1;
foreach ($allEntryList as $entryList) {
    echo "page: " . $i++ . "\r\n";
    echo "page URL: " . $entryList . "\r\n";
    $page = requests::get($entryList);
    $links = selector::select($page, "//div[@class='newBlog-list-title']//a/@href");
    // If the list fails to load, wait, rotate the user agent, and retry.
    // Cap the retries so a dead page cannot loop forever.
    $retries = 0;
    while (!is_array($links) && $retries++ < 5) {
        sleep(2);
        requests::set_useragent(array(
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201'));
        $page = requests::get($entryList);
        $links = selector::select($page, "//div[@class='newBlog-list-title']//a/@href");
    }
    if (!is_array($links)) {
        continue; // give up on this page
    }
    foreach ($links as $link) {
        $entry = requests::get($link);
        $title = selector::select($entry, "//div[@class='newBlog-title']//h2//span");
        // Grab the post body, then strip the footer widgets below it.
        $content = selector::select($entry, "//div[@class='item-content']");
        $content = selector::remove($content, "//div[@class='newBlog-bom']");
        // selector::select may return an array or a plain string.
        $title1 = is_array($title) ? $title[0] : $title;
        echo $title1 . "\r\n";

        file_put_contents("s.txt", $title1 . PHP_EOL, FILE_APPEND);
        file_put_contents("s.txt", $link . PHP_EOL, FILE_APPEND);
        file_put_contents("s.txt", strip_tags($content) . PHP_EOL, FILE_APPEND);
        file_put_contents("s.txt", '------------------------------------------' . PHP_EOL, FILE_APPEND);
    }
    sleep(2); // be polite between pages
}