Practice 6 – Installing and Configuring Scrapy
Install Scrapy on Linux
Activate a Python virtual environment
Install Twisted
```shell
wget https://twistedmatrix.com/Releases/Twisted/17.1/Twisted-17.1.0.tar.bz2
tar -jxvf Twisted-17.1.0.tar.bz2
cd Twisted-17.1.0
python3 setup.py install
```

(On recent systems, `pip install twisted` also works and is simpler than building from source.)
Install Scrapy

Note the package name is `scrapy`, not `scrapy3`:

```shell
pip install -i https://pypi.doubanio.com/simple/ scrapy
```
Create a Scrapy Project

```shell
scrapy startproject douban
```
Add the first spider

```shell
cd douban
scrapy genspider douban_spider movie.douban.com  # the allowed domain
```
Practice 7 – Scraping Douban Movie Short Reviews with Scrapy

A Scrapy crawl follows four steps:
1. Create a new project
2. Define the target items
3. Build the spider
4. Store the content
Modify the settings

In `settings.py`, set a real browser User-Agent and disable robots.txt compliance:

```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```
Define the targets

```python
# items.py
import scrapy


class DoubanItem(scrapy.Item):
    serial_number = scrapy.Field()  # rank on the list
    movie_name = scrapy.Field()
    introduce = scrapy.Field()     # director / cast / year summary
    stars = scrapy.Field()         # rating
    evaluate = scrapy.Field()      # number of ratings
    describe = scrapy.Field()      # one-line tagline
```
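A `scrapy.Item` behaves like a dictionary restricted to its declared fields: assigning a key that was not declared with `scrapy.Field()` raises a `KeyError`. A rough pure-Python sketch of that behavior (illustrative only, not Scrapy's actual implementation):

```python
class ItemSketch:
    """Toy stand-in for scrapy.Item: a dict limited to declared fields."""
    fields = {"serial_number", "movie_name", "introduce",
              "stars", "evaluate", "describe"}

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            # scrapy.Item raises KeyError for undeclared fields too
            raise KeyError(f"{key!r} is not a declared field")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


item = ItemSketch()
item["movie_name"] = "肖申克的救赎"
print(item["movie_name"])  # → 肖申克的救赎
```

This is why the spider below can only fill in the six fields declared in `DoubanItem`.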
Build the spider

```python
# douban_spider.py
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Parse the current page
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        for item in movie_list:
            douban_item = DoubanItem()
            douban_item["serial_number"] = item.xpath(".//div[@class='pic']/em/text()").extract_first()
            douban_item["movie_name"] = item.xpath(".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            # Strip whitespace from each line of the intro, join lines with "/"
            content = item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            content_s = ""
            for i_content in content:
                line = "".join(i_content.split())
                content_s = content_s + line + "/"
            douban_item["introduce"] = content_s
            douban_item["stars"] = item.xpath(".//div[@class='info']//div[@class='star']/span[@class='rating_num']/text()").extract_first()
            douban_item["evaluate"] = item.xpath(".//div[@class='info']//div[@class='star']/span[4]/text()").extract_first()
            douban_item["describe"] = item.xpath(".//div[@class='info']/div[@class='bd']//span[@class='inq']/text()").extract_first()
            yield douban_item

        # Follow the next page, if there is one
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request("https://movie.douban.com/top250" + next_link,
                                 callback=self.parse)
```
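The cleanup loop for `introduce` collapses all whitespace inside each extracted text line and then joins the lines with `/`. Isolated as a plain function, with sample input modeled on what the Douban page returns (the sample strings are illustrative):

```python
def normalize_intro(lines):
    """Collapse internal whitespace in each line, join lines with '/'."""
    out = ""
    for text in lines:
        # str.split() with no argument splits on any run of whitespace,
        # so join("") removes spaces, tabs, and newlines entirely
        out = out + "".join(text.split()) + "/"
    return out


raw = ["\n    导演: 弗兰克·德拉邦特   主演: 蒂姆·罗宾斯",
       "\n    1994 / 美国 / 犯罪 剧情"]
print(normalize_intro(raw))
# → 导演:弗兰克·德拉邦特主演:蒂姆·罗宾斯/1994/美国/犯罪剧情/
```

Note that this also removes the spaces around the `/` separators already present in the source text, which is why the stored `introduce` string looks densely packed.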
To make running the spider more convenient, you can add a script that executes the following command:

```shell
scrapy crawl douban_spider
```

Create a `main.py` file:

```python
# main.py
from scrapy import cmdline

cmdline.execute('scrapy crawl douban_spider'.split())
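`cmdline.execute` expects an argv-style list rather than a single command string, which is why `.split()` is applied first. What actually gets passed (pure Python, no Scrapy needed to see it):

```python
# .split() with no argument splits the command on whitespace
argv = 'scrapy crawl douban_spider'.split()
print(argv)  # → ['scrapy', 'crawl', 'douban_spider']
```

Running `python main.py` is then equivalent to typing `scrapy crawl douban_spider` in the project directory, and also makes the spider easy to launch from an IDE debugger.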
Store the content

Add the database connection settings to `settings.py`:

```python
mongo_host = '127.0.0.1'
mongo_port = 27017
mongo_db_name = 'douban'
mongo_db_collection = 'douban_movie'
```
Add your own pipeline to `pipelines.py`:

```python
import pymongo

from douban.settings import mongo_host, mongo_port, mongo_db_name, mongo_db_collection


class DoubanPipeline(object):
    def __init__(self):
        host = mongo_host
        port = mongo_port
        dbname = mongo_db_name
        sheetname = mongo_db_collection
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert() was deprecated and removed in pymongo 4; use insert_one()
        self.post.insert_one(data)
        return item
```
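`process_item` receives every item the spider yields, converts it to a plain dict, and writes it to the collection; returning the item passes it on to any later pipelines. The flow can be exercised without a running MongoDB by swapping in a stub collection (the stub class and the sample item below are hypothetical, for illustration only):

```python
class StubCollection:
    """Stands in for a pymongo collection; records inserted documents."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class PipelineSketch:
    """Simplified pipeline that takes its collection as a parameter."""
    def __init__(self, collection):
        self.post = collection

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item  # returning the item lets later pipelines process it too


pipeline = PipelineSketch(StubCollection())
pipeline.process_item({"movie_name": "霸王别姬", "stars": "9.6"}, spider=None)
print(pipeline.post.docs[0]["movie_name"])  # → 霸王别姬
```

The real `DoubanPipeline` above does the same thing, except it builds the `pymongo` client itself from the settings constants.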
Enable your pipeline in `settings.py`. The number is the pipeline's priority within the 0–1000 range; lower values run earlier:

```python
# Configure item pipelines
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
```