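Preface
Recently I needed to perform many repetitive operations on the same website, which gave me the idea of replacing manual browser work with a script. curl can simulate what a browser sends: GET and POST requests, form submissions, file uploads, file downloads, and more, which is exactly what I needed.
The most commonly used commands are shown below.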
    # Plain request
    curl localhost/index.php
    # GET with query parameters (quoted so the shell does not interpret ? and &)
    curl 'localhost/index.php?name=hucd&password=hucd'
    # POST with form parameters
    curl -d "name=hucd&password=hucd" localhost/index.php
    # Upload a file
    curl -F file=@./test.jpeg localhost/index.php
    # Save the cookies sent by the server
    curl -c cookie.txt localhost/index.php
    # Send previously saved cookies
    curl -b cookie.txt localhost/index.php
    # Follow redirects and print the final URL
    curl -L -w '%{url_effective}\n' localhost/index.php
    # Send a different browser User-Agent and write the response to a file
    curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" -o out.txt localhost/index.php
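The rest of this post records how I scraped data from an academic administration website.
Installing the required packages
The environment is Ubuntu 14.04.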
    apt-get install php5-cli
    apt-get install curl libcurl3 libcurl3-dev php5-curl
Simulating a POST form submission
    loginurl='http://kfsj.bjedu.cn/Public/login'
    curlPost='username=huangcaodian_zkyrj&password=12345678'
    # Send the form body with -d and save the response to out.txt
    curl -o out.txt -d "$curlPost" "$loginurl"
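If the user name or password contains characters such as `&` or `=`, a hand-built `-d` string sends the wrong form body. The following is a minimal sketch of the same login using curl's `--data-urlencode` option, which URL-encodes the content part of each `name=content` pair. The function name `post_login` and the `CURL` override (handy for a dry run with `CURL=echo`) are my own assumptions, not part of the original script.

```shell
#!/bin/sh
# post_login URL USER PASSWORD [OUTFILE]
# Hypothetical helper: POST a login form with URL-encoded field values.
# CURL is an assumed override; set CURL=echo to print the arguments
# instead of sending a request.
post_login() {
    ${CURL:-curl} -o "${4:-out.txt}" \
        --data-urlencode "username=$2" \
        --data-urlencode "password=$3" \
        "$1"
}

# Usage (sends a real request, so it is left commented out):
# post_login 'http://kfsj.bjedu.cn/Public/login' 'huangcaodian_zkyrj' '12345678'
```

The response still lands in out.txt by default, so the later steps work unchanged.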
Following redirects automatically to obtain the cookie
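    # Extract the redirect URL that follows science":" in the JSON login response
    nextUrl=`sed -n 's/\(.*\)science":"\(.*\)"}/\2/p' out.txt`
    # Remove the backslashes that JSON puts in front of each slash
    nextUrl=`echo "$nextUrl" | sed 's/\\\//g'`
    echo "$nextUrl"
    # Escape every & so the shell does not treat it as a control operator when the command is sourced
    nextUrl=`echo "$nextUrl" | sed 's/&/\\\&/g'`
    # Write the curl command to a file: follow redirects (-L), store cookies in cookie.txt, print the final URL
    echo 'curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" -o out.txt -c cookie.txt -L -w '"'"'%{url_effective}\n'"'"" $nextUrl">getcooke.txt
    # Run the generated command; tmp captures the final URL printed by -w
    tmp=`source getcooke.txt &`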
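The extraction step can be tried offline on a saved response. The sketch below assumes, as the sed pattern above suggests, that the login response is a single line of JSON whose redirect URL follows `science":"`; the function name `extract_next_url` is hypothetical. Passing the result to curl as a quoted variable is a swapped-in alternative to writing the command into getcooke.txt and sourcing it, and it removes the need to escape `&`.

```shell
#!/bin/sh
# extract_next_url FILE
# Print the URL that follows science":" in a one-line JSON response,
# turning JSON's escaped slashes (\/) back into plain slashes.
extract_next_url() {
    sed -n 's/.*science":"\([^"]*\)".*/\1/p' "$1" | sed 's/\\\//\//g'
}

# Usage (sends a real request, so it is left commented out):
# nextUrl=$(extract_next_url out.txt)
# curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" \
#      -o out.txt -c cookie.txt -L -w '%{url_effective}\n' "$nextUrl"
```

Because the URL is never re-parsed by the shell, characters such as `&` and `?` pass through untouched.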
simple_html_dom: a PHP toolkit for parsing HTML
simple_html_dom download link
Parsing the fetched HTML
    <?php
    include('./simple_html_dom.php');

    // Load the page saved by curl and locate the login form
    $html = new simple_html_dom();
    $html->load_file('../out.txt');
    $ret = $html->find('form[id=form1]');
    $postUrl = $ret[0]->attr["action"];

    // Join the name/value pairs of the form's child elements into a POST body
    $arr = $ret[0]->children;
    $name = $arr[0]->attr["name"];
    $value = $arr[0]->attr["value"];
    $parm = $name . "=" . $value;
    for ($i = 1; $i < count($arr); $i++) {
        $name = $arr[$i]->attr["name"];
        $value = $arr[$i]->attr["value"];
        $parm = $parm . "&" . $name . "=" . $value;
    }

    // Print the curl command that submits the form with the cookie from the previous step
    echo "curl -b cookie.txt -c cookie1.txt -A \"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\" -o out.txt -d " . "'" . $parm . "' -L " . $postUrl;
    ?>
Simulating the JavaScript login submission to obtain the cookie
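    cd simple_html_dom-master
    # Let the PHP script generate the curl command and save it one directory up
    php phphtml.php > ../getcooke1.txt
    cd ..
    # Run the generated command, then inspect the new cookie jar
    source getcooke1.txt
    cat cookie1.txt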
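Instead of reading cookie1.txt by eye, the script can check whether the login actually produced a session cookie. The file curl writes with `-c` uses the Netscape cookie format: seven tab-separated fields per line, with the cookie name in field 6, and HttpOnly cookies stored behind a `#HttpOnly_` prefix. The sketch below relies only on that format; the function names are hypothetical, and `PHPSESSID` in the usage line is an assumed cookie name, since the post does not say what the site calls its session cookie.

```shell
#!/bin/sh
# cookie_names FILE
# List the names of all cookies stored in a curl cookie jar.
cookie_names() {
    awk -F '\t' '
        /^#HttpOnly_/ { print $6; next }   # HttpOnly cookies are stored behind this prefix
        /^#/ || NF < 7 { next }            # skip comments and blank lines
        { print $6 }
    ' "$1"
}

# has_cookie FILE NAME
# Succeed only if the jar contains a cookie with exactly this name.
has_cookie() {
    cookie_names "$1" | grep -qxF -- "$2"
}

# Usage (PHPSESSID is an assumed name):
# has_cookie cookie1.txt PHPSESSID || echo 'login failed: no session cookie' >&2
```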
Once the cookie has been obtained, any page on the site can be scraped
    curl -b cookie1.txt -o out.txt http://211.153.78.168/admin/jgteacher/uklist
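Since the whole point was to automate repetitive requests, the same cookie jar can be reused in a loop. This is only a sketch: the helper names `path_to_file` and `fetch_all` are hypothetical, and the `.html` suffix on the output files is my own choice. Only the `/admin/jgteacher/uklist` path comes from the command above; any other path would have to be filled in from the real site.

```shell
#!/bin/sh
# path_to_file PATH
# Turn a URL path such as /admin/jgteacher/uklist into a flat file name.
path_to_file() {
    printf '%s\n' "$1" | sed 's/^\///; s/\//_/g; s/$/.html/'
}

# fetch_all BASE_URL PATH...
# Download every path with the logged-in cookie, one output file per path.
fetch_all() {
    base=$1
    shift
    for p in "$@"; do
        curl -b cookie1.txt -o "$(path_to_file "$p")" "$base$p"
    done
}

# Usage (sends real requests, so it is left commented out):
# fetch_all 'http://211.153.78.168' /admin/jgteacher/uklist
```

Writing each page to its own file avoids overwriting out.txt on every request.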