
Scrapy Usage

Basic Usage

1. Project overview

scrapy startproject tutorial    # create a new project

The project directory looks like this:

  • The spiders folder is where your own spider (crawl task) code lives; the other files are project configuration files. A typical layout is sketched below.
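
For reference, a typical layout generated by scrapy startproject tutorial (details may vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spider code lives here
            __init__.py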

2. Creating a spider

scrapy genspider [spider name] [start domain]
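
For example, the hot spider used below could be generated with (the domain argument here is an assumption based on the target site):

scrapy genspider hot s.weibo.com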

① Define the fields to be scraped in the project's items.py file

import scrapy


class WeiboTopItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ranking = scrapy.Field()   # position on the hot-search list
    content = scrapy.Field()   # hot-search topic text
    count = scrapy.Field()     # heat/search count
    desc = scrapy.Field()      # label shown next to the topic (if any)

② Modify the spider in the spiders folder to set the crawl targets, as follows

import scrapy
from weibo.items import WeiboTopItem


class HotSpider(scrapy.Spider):
    name = 'hot'
    # allowed_domains should contain domains, not full URLs
    allowed_domains = ['s.weibo.com']
    start_urls = ['https://s.weibo.com/top/summary/']

    def parse(self, response):
        try:
            hots = response.xpath("//tr")
            for hot in hots:
                item = WeiboTopItem()
                item['ranking'] = hot.xpath('td[@class="td-01 ranktop"]/text()').extract_first()
                item['content'] = hot.xpath('td[@class="td-02"]/a/text()').extract_first()
                item['count'] = hot.xpath('td[@class="td-02"]/span/text()').extract_first()
                item['desc'] = hot.xpath('td[@class="td-03"]/i/text()').extract_first()
                yield item

        except Exception:
            print("parse error")
            return

③ Run the crawl command

scrapy crawl [spider name]
scrapy crawl [spider name] -O res.json    # save the results as a JSON file; -O overwrites the file, -o appends to it
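
For the hot spider above, for example:

scrapy crawl hot -O res.json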

④ Note

If the results contain Chinese text, add FEED_EXPORT_ENCODING = 'UTF8' to the project's settings.py; otherwise the exported file will not be saved correctly.
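
That is, in settings.py:

# export non-ASCII (e.g. Chinese) characters as readable UTF-8 instead of \u escapes
FEED_EXPORT_ENCODING = 'UTF8'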

3. Batch crawling (link following)

① Setting the initial crawl URLs

Set the default list of URLs to crawl in def start_requests(self):, as follows:

    # requires: from scrapy import Request
    def start_requests(self):
        tweet_ids = ['IDl56i8av', 'IDkNerVCG', 'IDkJ83QaY']
        # self.base_url is assumed to be defined elsewhere on the spider
        urls = [f"{self.base_url}/comment/hot/{tweet_id}?rl=1&page=1" for tweet_id in tweet_ids]
        for url in urls:
            yield Request(url, callback=self.parse)  # enqueue the URL of the next page to crawl

② Adding links during the crawl

In def parse(self, response):, yield further URLs as Requests based on the content just scraped. Note that each scrapy.Request(url, callback=self.parse) triggers another call to parse for that URL, so the URLs added here are crawled after the initial URL(s) produced by start_requests().

For example:

    # requires: import re  and  from scrapy import Request
    def parse(self, response):
        if response.url.endswith('page=1'):
            # on the first page, read the total page count from the pagination text
            all_page = re.search(r'/>&nbsp;1/(\d+)页</div>', response.text)
            if all_page:
                all_page = int(all_page.group(1))
                all_page = all_page if all_page <= 50 else 50  # crawl at most 50 pages
                for page_num in range(2, all_page + 1):
                    page_url = response.url.replace('page=1', 'page={}'.format(page_num))
                    yield Request(page_url, self.parse, dont_filter=True, meta=response.meta)

③ URL joining

If the next URL is related to the default start_url (for example it differs only in a path or query parameter), you can build it directly with response.urljoin as follows:

url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
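
A minimal sketch of how this might look inside parse, assuming the next-page link is available as a relative href on the page (the selector below is hypothetical and should be adapted to the actual page structure):

def parse(self, response):
    # ... extract items from the current page here ...

    # hypothetical selector for the "next page" link
    next = response.xpath('//a[@class="next"]/@href').extract_first()
    if next:
        url = response.urljoin(next)  # resolve the relative href against the current page URL
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)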