Crawling with curl

Preface
Recently I needed to perform a lot of repetitive operations on the same website, so I got the idea of replacing the manual browser work with a script. The curl tool can simulate what a browser does: send POST and GET requests, submit forms, upload files, download files and so on, which happens to cover exactly what I need.

The commonly used commands are listed below; a short combined example follows the list.

Plain GET request:       curl localhost/index.php
GET with parameters:     curl localhost/index.php?name=hucd\&password=hucd
POST with parameters:    curl -d "name=hucd&password=hucd" localhost/index.php
Upload a file:           curl -F file=@./test.jpeg localhost/index.php
Save the cookie:         curl -c cookie.txt localhost/index.php
Send the cookie:         curl -b cookie.txt localhost/index.php
Follow redirects:        curl -L -w '%{url_effective}\n' localhost/index.php
Fake the browser:        curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" -o out.txt localhost/index.php
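A rough sketch of how these options combine (the URL and field names here are placeholders, not the real site): log in once with -c to capture the session cookie, then replay later requests with -b.

curl -c cookie.txt -d "name=hucd&password=hucd" localhost/login.php
curl -b cookie.txt -o result.html localhost/index.php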

Below is a record of how I crawled data from an educational administration website.

Installing the required packages

The environment is Ubuntu 14.04.

apt-get install php5-cli
apt-get install curl libcurl3 libcurl3-dev php5-curl
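As a quick sanity check (my own addition), confirm that both the curl binary and the PHP curl extension are available; on some systems the CLI binary may be named php5 instead of php.

curl --version
php -m | grep -i curl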

Simulating a POST form submission

# Target login URL and the form fields to submit (credentials as in the original post).
loginurl='http://kfsj.bjedu.cn/Public/login';
curlPost='username=huangcaodian_zkyrj&password=12345678';
# POST the credentials and save the response body to out.txt.
curl -o out.txt -d "$curlPost" "$loginurl";
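For this site the login response is a small JSON document carrying the next URL to visit; the payload in the comment below is only an illustration of the shape that the sed expression in the next step expects (a field whose key ends in science, holding a JSON-escaped URL).

cat out.txt
# illustrative shape only: {"status":1,"science":"http:\/\/kfsj.bjedu.cn\/next\/step"}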

Following the redirect chain automatically to obtain the cookie

# Extract the next URL from the JSON response (the value of the field whose key ends in "science").
nextUrl=`sed -n 's/\(.*\)science":"\(.*\)"}/\2/p' out.txt`;
# Strip the JSON escaping, turning "\/" back into "/".
nextUrl=`echo $nextUrl | sed -n 's/\\\//gp'`;
echo $nextUrl;
# Escape "&" so the URL survives being written into a shell command line.
nextUrl=`echo $nextUrl | sed -n 's/&/\\\&/gp'`;
# Write the redirect-following curl command (saving the cookie) into getcooke.txt and run it.
echo "curl -A \"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\" -o out.txt -c cookie.txt -L -w '%{url_effective}\n' $nextUrl">getcooke.txt;
source getcooke.txt;
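A quick look at the cookie jar confirms the session was captured; curl writes it in the standard Netscape cookie file format.

cat cookie.txt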

simple_html_dom, a PHP package for parsing HTML

simple_html_dom download address
Parsing the fetched HTML data:

<?php
// Parse the HTML saved by curl and rebuild the form submission from its fields.
include('./simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('../out.txt');

// Locate the form and its target URL.
$ret = $html->find('form[id=form1]');
$postUrl = $ret[0]->attr["action"];

// Concatenate every child input's name=value pair into the POST body.
$arr = $ret[0]->children;
$name = $arr[0]->attr["name"]; $value = $arr[0]->attr["value"];
$parm = $name."=".$value;
for ($i = 1; $i < count($arr); $i++) {
    $name = $arr[$i]->attr["name"]; $value = $arr[$i]->attr["value"];
    $parm = $parm."&".$name."=".$value;
}

// Print the curl command that submits the form with the saved cookie.
echo "curl -b cookie.txt -c cookie1.txt -A \"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\" -o out.txt -d "."'".$parm."' -L ".$postUrl;
?>

Simulating the JavaScript login submission to obtain the cookie

# Run the PHP script to generate the form-submission command, then execute it.
cd simple_html_dom-master;
php phphtml.php > ../getcooke1.txt;
cd ..;
source getcooke1.txt;
# Confirm the new session cookie was written.
cat cookie1.txt;

With the cookie in hand, any page on the site can be crawled.

curl -b cookie1.txt -o out.txt http://211.153.78.168/admin/jgteacher/uklist;
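From here the same cookie can drive a loop over many pages. A minimal sketch, assuming the listing accepts a page query parameter (the parameter name and range are hypothetical):

for page in $(seq 1 10); do
    # "?page=" is a hypothetical pagination parameter; adapt it to the real site.
    curl -b cookie1.txt -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" \
         -o "uklist_${page}.html" "http://211.153.78.168/admin/jgteacher/uklist?page=${page}";
    sleep 1;   # be gentle with the server
done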