Basic Usage
1. Project overview

```
scrapy startproject tutorial    # creates a new Scrapy project
```
The project directory looks like this:
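A rough sketch of the default layout (assuming a recent Scrapy version and the project name tutorial from the command above; minor files may vary by version):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py
```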

- The spiders folder holds your own spider (crawl task) files; the other files are project configuration.
2. Creating a spider (crawl task)

```
scrapy genspider [spider-name] [start-domain]
```
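For the Weibo hot-search spider shown below, the invocation would look something like this (the name hot and the domain s.weibo.com are taken from the spider code later in this section):

```
scrapy genspider hot s.weibo.com
```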
① In the project's items.py file, define the fields you want to scrape:

```python
import scrapy


class WeiboTopItem(scrapy.Item):
    ranking = scrapy.Field()   # rank on the hot-search list
    content = scrapy.Field()   # topic text
    count = scrapy.Field()     # heat/search count
    desc = scrapy.Field()      # label shown next to the topic
```
② In the spider file under spiders, set the crawl target and parsing logic:

```python
import scrapy
from weibo.items import WeiboTopItem


class HotSpider(scrapy.Spider):
    name = 'hot'
    allowed_domains = ['s.weibo.com']   # domains only, not full URLs
    start_urls = ['https://s.weibo.com/top/summary/']

    def parse(self, response):
        try:
            hots = response.xpath("//tr")
            for hot in hots:
                item = WeiboTopItem()
                item['ranking'] = hot.xpath('td[@class="td-01 ranktop"]/text()').extract_first()
                item['content'] = hot.xpath('td[@class="td-02"]/a/text()').extract_first()
                item['count'] = hot.xpath('td[@class="td-02"]/span/text()').extract_first()
                item['desc'] = hot.xpath('td[@class="td-03"]/i/text()').extract_first()
                yield item
        except Exception:
            print("something went wrong")
            return
```
③ Run the spider

```
scrapy crawl [spider-name]
scrapy crawl [spider-name] -O res.json    # save the results as JSON; -O overwrites the file, -o appends to it
```
④ Note

If the results contain Chinese text, add FEED_EXPORT_ENCODING = 'UTF8' to the project's settings.py, otherwise the output will not be saved correctly.
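A minimal sketch of the settings.py addition (the rest of the generated settings file stays unchanged):

```python
# settings.py
FEED_EXPORT_ENCODING = 'UTF8'   # export Chinese characters directly instead of \uXXXX escapes
```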
3. Batch crawling (link following)

① Setting the default start URLs

Define the default list of URLs to crawl in def start_requests(self):, for example:

```python
# requires: from scrapy import Request
def start_requests(self):
    tweet_ids = ['IDl56i8av', 'IDkNerVCG', 'IDkJ83QaY']
    urls = [f"{self.base_url}/comment/hot/{tweet_id}?rl=1&page=1" for tweet_id in tweet_ids]
    for url in urls:
        yield Request(url, callback=self.parse)
```
② Adding links during the crawl

In def parse(self, response):, yield new URLs based on the content you have just scraped. Note that every scrapy.Request(url, callback=self.parse) you yield triggers another call to parse, so the URLs added here are crawled after the first URL from start_requests(). For example:

```python
# requires: import re / from scrapy import Request
def parse(self, response):
    if response.url.endswith('page=1'):
        # read the total page count from the first page of results
        all_page = re.search(r'/> 1/(\d+)页</div>', response.text)
        if all_page:
            all_page = int(all_page.group(1))
            all_page = all_page if all_page <= 50 else 50   # crawl at most 50 pages
            for page_num in range(2, all_page + 1):
                page_url = response.url.replace('page=1', 'page={}'.format(page_num))
                yield Request(page_url, self.parse, dont_filter=True, meta=response.meta)
```
③ Joining URLs

If the next URL is just a relative path or query-string variation of the default start_url, you can build it directly with response.urljoin():

```python
url = response.urljoin(next)   # next is a relative link extracted from the page
yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
```
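A fuller sketch of how this might look inside parse (the XPath for the "next page" link is a hypothetical placeholder, not taken from the original code):

```python
def parse(self, response):
    # ... extract and yield items here ...
    # hypothetical selector for a relative "next page" href such as "?page=2"
    next_href = response.xpath('//a[@class="next"]/@href').extract_first()
    if next_href:
        url = response.urljoin(next_href)   # resolved relative to response.url
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
```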