Scrapy example
0, Install Scrapy
For detailed instructions, see here.
1, Create the project
scrapy startproject dmoz
2, Write the spider
In the dmoz/spiders directory under the dmoz project root, create dmoz_spider.py
with the following content:
# Written against the old Scrapy 0.x API (BaseSpider, HtmlXPathSelector).
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

# DmozItem must be defined in dmoz/items.py with title, link and desc fields.
from dmoz.items import DmozItem


class DmozSpider(BaseSpider):
    # The name is what "scrapy crawl" expects on the command line.
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each <li> inside a <ul> is treated as one site entry.
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            # extract() returns a list of strings for every field.
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
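To see what the XPath expressions in parse() pick out, here is a minimal sketch that runs without Scrapy. It swaps in the standard library's xml.etree.ElementTree for HtmlXPathSelector, so it only approximates Scrapy's behavior. The SAMPLE markup, the example.com URLs and the extract_items helper are made up for illustration and are not part of the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a dmoz listing page.
SAMPLE = """<html><body><ul>
<li><a href="http://example.com/book1">Book One</a> - A first example description.</li>
<li><a href="http://example.com/book2">Book Two</a> - A second example description.</li>
</ul></body></html>"""


def extract_items(markup):
    """Approximate the spider's parse() with ElementTree instead of Scrapy."""
    root = ET.fromstring(markup)
    items = []
    # Rough equivalent of hxs.select('//ul/li')
    for li in root.findall(".//ul/li"):
        anchors = li.findall("a")
        # Direct text nodes of <li>: its leading text plus the tail after each child.
        text_nodes = [li.text] + [child.tail for child in li]
        items.append({
            # 'a/text()': the text inside each link
            "title": [a.text for a in anchors if a.text],
            # link attribute of each <a>
            "link": [a.get("href") for a in anchors if a.get("href")],
            # 'text()': text that sits directly inside <li>
            "desc": [t for t in text_nodes if t],
        })
    return items
```

Each field comes back as a list, which mirrors what extract() returns in the spider. Calling extract_items(SAMPLE) gives one dictionary per list entry.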
3, Run the spider
From the project's top-level directory, run:
scrapy crawl dmoz.org --set FEED_URI=items.json --set FEED_FORMAT=json
4, Check the results
An items.json file appears in the directory where you ran the command:
cat items.json
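Besides cat, the file can be inspected from Python. The sketch below assumes items.json holds a single JSON array of objects whose title, link and desc fields are lists of strings, which is what the spider's extract() calls should produce. The helper names load_items and summarize are hypothetical.

```python
import json


def load_items(path="items.json"):
    """Load the exported feed, assumed to be one JSON array of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(items):
    """Return one 'title -> link' line per scraped item."""
    lines = []
    for item in items:
        # Every field is a list of strings, so join the pieces for display.
        title = " ".join(t.strip() for t in item.get("title", []))
        link = " ".join(item.get("link", []))
        lines.append(f"{title} -> {link}")
    return lines
```

Running print("\n".join(summarize(load_items()))) from the directory that contains items.json prints one line per item.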