Data Collection and Cleaning (Part 2)

Practice 6 – Installing and Configuring Scrapy

Installing Scrapy on Linux

  1. Activate the Python virtual environment
  2. Install Twisted, Scrapy's networking engine (on recent systems pip pulls it in automatically as a Scrapy dependency, so the manual build below is only needed if that fails)
wget https://twistedmatrix.com/Releases/Twisted/17.1/Twisted-17.1.0.tar.bz2
tar -jxvf Twisted-17.1.0.tar.bz2
cd Twisted-17.1.0
python3 setup.py install
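
To confirm the build worked, import the package from the same virtualenv; `twisted.__version__` is a standard attribute, so this should print the installed version:

python3 -c "import twisted; print(twisted.__version__)"
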
  3. Install Scrapy

pip install -i https://pypi.doubanio.com/simple/ scrapy
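
If the installation succeeded, the scrapy command is now on the virtualenv's PATH; printing the version is a quick sanity check:

scrapy version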

Creating a Scrapy Project

  1. Create a Scrapy project

scrapy startproject douban
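
startproject generates a standard project skeleton. The layout below is what a stock Scrapy install produces; the exact file contents vary slightly between Scrapy versions:

douban/
    scrapy.cfg            # deploy configuration
    douban/
        __init__.py
        items.py          # item definitions go here
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines (used for MongoDB below)
        settings.py       # project settings
        spiders/
            __init__.py   # spiders are added under this package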

  2. Add the first spider
cd douban
scrapy genspider douban_spider movie.douban.com  # domain the spider is allowed to crawl
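
genspider writes douban/spiders/douban_spider.py with a minimal skeleton. The exact template depends on the Scrapy version, but it looks roughly like this:

# douban/spiders/douban_spider.py (generated skeleton)
import scrapy


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        pass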

Practice 7 – Using Scrapy to Crawl Douban Movie Short Reviews

  • The four steps of a Scrapy crawl:
    1. Create a project
    2. Define the targets
    3. Build the spider
    4. Store the content
  1. Modify the settings in settings.py
# settings.py
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
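
Douban rate-limits aggressive clients, so it is also worth throttling the crawl. These are standard Scrapy settings; the values here are just a conservative starting point, not part of the original tutorial:

# settings.py (optional politeness settings)
DOWNLOAD_DELAY = 0.5                  # pause between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # lower this to reduce load on the site
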
  2. Define the targets
# items.py
import scrapy


class DoubanItem(scrapy.Item):
    serial_number = scrapy.Field()   # rank on the Top 250 list
    movie_name = scrapy.Field()      # movie title
    introduce = scrapy.Field()       # director / cast / year / genre line
    stars = scrapy.Field()           # rating score
    evaluate = scrapy.Field()        # number of ratings
    describe = scrapy.Field()        # one-line tagline quote
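
A scrapy.Item behaves like a dict with a fixed key set, which is what the pipeline later relies on when it calls dict(item). A quick illustrative session:

>>> item = DoubanItem()
>>> item["movie_name"] = "肖申克的救赎"
>>> dict(item)
{'movie_name': '肖申克的救赎'}
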
  3. Build the spider
# douban_spider.py
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Parse the movie entries on the current page
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        for item in movie_list:
            douban_item = DoubanItem()
            douban_item["serial_number"] = item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item["movie_name"] = item.xpath(".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            content = item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            # Strip whitespace and join the multi-line description with "/"
            content_s = ""
            for i_content in content:
                line = "".join(i_content.split())
                content_s = content_s + line + "/"
            douban_item["introduce"] = content_s
            douban_item["stars"] = item.xpath(".//div[@class='info']//div[@class='star']/span[@class='rating_num']/text()").extract_first()
            douban_item["evaluate"] = item.xpath(".//div[@class='info']//div[@class='star']/span[4]/text()").extract_first()
            douban_item["describe"] = item.xpath(".//div[@class='info']/div[@class='bd']//span[@class='inq']/text()").extract_first()

            yield douban_item

        # Follow the next-page link, if present
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)
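
When an XPath expression misbehaves, scrapy shell is the quickest way to debug it: it fetches a page and drops you into a Python prompt with response already bound. Douban may reject Scrapy's default user agent, so passing one via -s (a standard Scrapy option) can be necessary:

scrapy shell -s USER_AGENT='Mozilla/5.0' "https://movie.douban.com/top250"
>>> response.xpath("//ol[@class='grid_view']/li").extract_first()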

To make running the spider more convenient, you can add a script that wraps the following command:

scrapy crawl douban_spider

Create a main.py file:

# main.py
from scrapy import cmdline


# Equivalent to running "scrapy crawl douban_spider" from the project root
cmdline.execute('scrapy crawl douban_spider'.split())
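
As an aside, if you only need the results in a file rather than a database, Scrapy's built-in feed exports can write items directly from the command line, with no pipeline at all:

scrapy crawl douban_spider -o douban.csv
scrapy crawl douban_spider -o douban.json
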
  4. Store the content

Add the database connection settings to settings.py:

# settings.py
mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'douban'
mongo_db_collection = 'douban_movie'

Add your own pipeline in pipelines.py:

# pipelines.py
import pymongo

from douban.settings import mongo_host, mongo_port, mongo_db_name, mongo_db_collection


class DoubanPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_db_name
        sheetname = mongo_db_collection
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert() was removed in pymongo 4; insert_one() works in both 3.x and 4.x
        self.post.insert_one(data)
        return item
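
For reference, a more idiomatic variant reads the connection values through crawler.settings instead of importing the settings module, and opens/closes the Mongo client with the spider. This is only a sketch: it assumes the four settings are renamed to upper case (MONGO_HOST and so on), because Scrapy's settings loader only picks up UPPER_CASE names from settings.py; that is also why the original code imports them directly.

# pipelines.py (alternative sketch, assumes MONGO_* settings in settings.py)
import pymongo


class MongoPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # Read values from settings.py via the settings API
        s = crawler.settings
        return cls(s.get('MONGO_HOST'), s.getint('MONGO_PORT'),
                   s.get('MONGO_DB_NAME'), s.get('MONGO_DB_COLLECTION'))

    def __init__(self, host, port, db_name, collection):
        self.host, self.port = host, port
        self.db_name, self.collection = db_name, collection

    def open_spider(self, spider):
        # Called once when the spider starts
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        self.post = self.client[self.db_name][self.collection]

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item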

Enable your pipeline in settings.py:

# settings.py
# Configure item pipelines (lower numbers run earlier in the pipeline chain)
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
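
After a crawl finishes, a few lines of pymongo are enough to sanity-check what landed in MongoDB. This is a hypothetical helper script using the connection values configured above:

# check_mongo.py (hypothetical verification helper)
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
collection = client['douban']['douban_movie']
print(collection.count_documents({}))  # 250 expected for the full Top 250
print(collection.find_one({}, {'_id': 0, 'movie_name': 1, 'stars': 1}))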
