From http://menno.b10m.net/blog/blosxom/perl
This article shows how to parse fetched HTML, using XPath (and CSS) selectors.
Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko
Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast.
Since the documentation is scarce (there are the POD and the slides of a presentation I missed), I'll post
this blog entry in which I'll show how to effectively scrape Yahoo! Search.
First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to
fetch the following things:
* title (the linked text)
* url (the actual link)
* description (the text beneath the link)
So let's start our first little script:
[code]
use Data::Dumper;    # used to dump the resulting data structure
use URI;
use Web::Scraper;

my $yahoo = scraper {
    process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
    process "div.yschabstr", 'description' => "TEXT";
    result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
[/code]
Now what happens here? The important stuff can be found in the process statements. Basically, you may
translate those lines to: "Fetch an A-element with the CSS class named 'yschttl' and put its text in 'title'
and its href value in 'url'. Then fetch the text of the div with the class named 'yschabstr' and put that in
'description'."
The result looks something like this:
$VAR1 = {
    'url' => 'http://www.perl.com/',
    'title' => 'Perl.com',
    'description' => 'Central resource for Perl developers. It contains
the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited
by Clay Irving.'
};
Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a
loop!
The slides tell you to append '[]' to the key, to enable looping. The process lines then look like this:
process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
process "div.yschabstr", 'description[]' => "TEXT";
And when we run it now, the result looks like this:
$VAR1 = {
    'url' => [
        'http://www.perl.com/',
        'http://www.perl.org/',
        'http://www.perl.com/download.csp',
        ...
    ],
    'title' => [
        'Perl.com',
        'Perl Mongers',
        'Getting Perl',
        ...
    ],
    'description' => [
        'Central resource for Perl developers. It contains
the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by
Clay Irving.',
        'Nonprofit organization, established to support the
Perl community.',
        'Instructions on downloading a Perl interpreter for
your computer platform. ... On CPAN, you will find Perl source in the /src
directory. ...',
        ...
    ]
};
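Incidentally, if you wanted to work with those parallel arrays directly, a short index loop stitches them back into one array of hashes. This is a minimal sketch using sample data shaped like the output above, not code from the article:

```perl
use strict;
use warnings;

# Sample parallel arrays, shaped like the looping scraper's output
my $res = {
    url   => [ 'http://www.perl.com/', 'http://www.perl.org/' ],
    title => [ 'Perl.com',             'Perl Mongers' ],
};

# Zip the arrays together by index into one array of hashes
my @results;
for my $i ( 0 .. $#{ $res->{url} } ) {
    push @results, { url => $res->{url}[$i], title => $res->{title}[$i] };
}

print "$_->{title} => $_->{url}\n" for @results;
```

As we'll see next, though, Web::Scraper can build that structure for us.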
That looks a lot better! We now get all the search results and could loop through the different arrays to
match the right title with the right url. But we still shouldn't be satisfied, for we don't want three
arrays, we want one array of hashes! For that we need a little trickery: another process line. All the stuff
we grab is located in a big ordered list (the OL element), so let's find that one first, and for each list
element (LI) find our title, url and description. For this we won't use the CSS selectors, but the XPath
selectors (heck, we can use both, so why not?).
To grab an XPath I really suggest Firebug, a Firefox add-on. With its easy point-and-click interface, you can
grab the path within seconds.
[code]
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
    process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
        process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
        process "div.yschabstr", 'description' => "TEXT";
        result 'description', 'title', 'url';
    };
    result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
[/code]
You see that we switched our title, url and description fields back to the old notation (without '[]'), for
we don't want to loop over those fields. We've moved the looping one step up, to the LI elements. Then we
open another scraper, which dumps its hashes into the results array (note the '[]' in 'results[]').
The result is exactly what we wanted:
$VAR1 = [
    {
        'url' => 'http://www.perl.com/',
        'title' => 'Perl.com',
        'description' => 'Central resource for Perl developers. It
contains the Perl Language, edited by Tom Christiansen, and the Perl Reference,
edited by Clay Irving.'
    },
    {
        'url' => 'http://www.perl.org/',
        'title' => 'Perl Mongers',
        'description' => 'Nonprofit organization, established to support
the Perl community.'
    },
    {
        'url' => 'http://www.perl.com/download.csp',
        'title' => 'Getting Perl',
        'description' => 'Instructions on downloading a Perl interpreter
for your computer platform. ... On CPAN, you will find Perl source in the /src
directory. ...'
    },
    ...
];
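With the results in that shape, iterating is trivial; here is a minimal sketch using sample data in place of a live scrape:

```perl
use strict;
use warnings;

# Sample results, shaped like the nested scraper's output
my $results = [
    { title => 'Perl.com',     url => 'http://www.perl.com/' },
    { title => 'Perl Mongers', url => 'http://www.perl.org/' },
];

# Each hash keeps its title and url together, so no index juggling
my @lines = map { sprintf "%s <%s>", $_->{title}, $_->{url} } @$results;
print "$_\n" for @lines;
```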
Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!
Update: Tatsuhiko had some wise words on this article:
A couple of things:
You might just skip the result() stuff if you're returning the entire hash, which is the default. (The API is
stolen from Ruby's, which needs result() for some reason, but my Perl port doesn't require it.) Now with less
code :)
The use of a nested scraper in your example seems pretty good, but using a hash reference could also be
useful, like:
[code]
my $yahoo = scraper {
    process "a.yschttl", 'results[]' => {
        title => 'TEXT',
        url   => '@href',
    };
};
[/code]
This way you'll get title and url from TEXT and @href from a.yschttl, which would be handier if you don't
need the description. TIMTOWTDI :)
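Putting both of those tips together, the flat version of the script shrinks to something like this. This is an untested sketch; the Yahoo! class names are the ones from the examples above, and it omits result() since scrape() returns the whole hash by default:

```perl
use Data::Dumper;
use URI;
use Web::Scraper;

# One process line maps each matched link to a { title, url } hash;
# no result() call is needed when returning the entire hash.
my $yahoo = scraper {
    process "a.yschttl", 'results[]' => {
        title => 'TEXT',
        url   => '@href',
    };
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );
```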