Practice 1 – Setting Up a Python Development Environment
Installing and Configuring Python 3 on Linux (Multiple Versions)
- Before compiling and installing Python 3 on CentOS, it is recommended to first switch the yum repository to a domestic (China-based) mirror, for example as sketched below
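A rough sketch of switching to the Aliyun mirror, assuming CentOS 7 (the mirror URL and release number are assumptions; back up the original repo file first):
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum makecache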
- Install the tools required for compilation
yum install -y gcc zlib zlib-devel libffi-devel openssl openssl-devel
- Download, compile, and install Python 3
wget https://www.python.org/ftp/python/3.6.2/Python-3.6.2.tgz
tar -xvf Python-3.6.2.tgz
cd Python-3.6.2
./configure --with-ssl
make && make install
- Verify the installation
python3
pip3
Using Virtualenv to Create an Isolated Python Environment
- When installing with pip, it is recommended to use a domestic PyPI mirror; otherwise module installation often fails due to network timeouts
1. Install virtualenv
pip3 install -i https://pypi.doubanio.com/simple/ virtualenv
2. Create a Python virtual environment (see the sketch below)
mkdir myspace  # working directory
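The original snippet is truncated after the first line. A minimal sketch of the remaining steps, assuming the environment is created as venv inside myspace (consistent with the activation path below):
cd myspace
virtualenv -p python3 venv  # create an isolated environment named venv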
- Activate the Python virtual environment
. venv/bin/activate
- Deactivate the Python virtual environment
deactivate
Practice 2 – Using urllib and Requests
Using urllib
- Make a GET request
# http_get.py
from urllib import request

response = request.urlopen("http://www.baidu.com/")
print(response.read())
- Make a GET request with parameters
# http_params.py
from urllib import request, parse

url = 'http://www.baidu.com/s?'
params = {'word': 'Python爬虫',
          'tn': 'baiduhome_pg',
          'ie': 'utf-8'}
url = url + parse.urlencode(params)
# print(url)
with request.urlopen(url) as response:
    with open("response.html", "wb") as file:
        file.write(response.read())
- Make a POST request
# http_post.py
from urllib import request, parse

data = parse.urlencode({'terms': 'here is test'}).encode()
req = request.Request('http://httpbin.org/post?q=Python', data=data)
with request.urlopen(req) as response:
    print(response.read())
Using Requests
- Install the Requests module before use
pip install -i https://pypi.doubanio.com/simple/ requests
- Documentation: https://2.python-requests.org//zh_CN/latest/index.html
- Make HTTP requests
# req_http.py
import requests

# Make a GET request, passing query parameters
r1 = requests.get('https://httpbin.org/get', params={'terms': 'here is test'})
print(r1.url)
print(r1.status_code)
print(r1.content)

# Make a POST request
r2 = requests.post('https://httpbin.org/post', data={'terms': 'here is test'})
print(r2.content)
- Common settings
# req_header.py
import requests

# Custom request headers
headers = {'user-agent': 'Mozilla/5.0'}
r1 = requests.get('http://httpbin.org/headers', headers=headers)
print(r1.text)

# Send cookies with the request
cookies = {'from-my': 'browser'}
r2 = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(r2.text)

# Set a timeout (in seconds)
r3 = requests.get('https://www.baidu.com', timeout=5)

# Use proxies (Xici free-proxy list: https://www.xicidaili.com/)
proxy = {
    'http': 'http://112.85.170.175:9999',
    'https': 'https://118.190.73.168:808',
}
r4 = requests.get('http://www.kuaidaili.com/free/', proxies=proxy, timeout=2)
print(r4.content)

# Session: cookies persist across requests made with the same session
s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a": "c"})
r5 = s.get('http://httpbin.org/cookies')
print(r5.text)
r5 = s.get('http://httpbin.org/cookies')
print(r5.text)
Practice 3 – Using Beautiful Soup
- Install the beautifulsoup4 module before use
pip install -i https://pypi.doubanio.com/simple/ beautifulsoup4
- Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Parsing Web Pages with bs4
- Search by tag
# douban_top250.py
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
rsq = requests.get('https://movie.douban.com/top250', headers=headers)
html = rsq.text
soup = BeautifulSoup(html, 'html.parser')

# Search by tag
divs = soup.find_all('div', class_='hd')
for div in divs:
    print(div.a.span.string)
next_link = soup.find('span', class_='next')
if next_link is not None:
    print(next_link.a['href'])
- Search with CSS selectors (see the sketch below)
# douban_top250.py
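The original snippet is truncated. A minimal sketch of the same extraction using CSS selectors via soup.select/select_one; the filename and selectors are assumptions based on the page structure used above, not taken from the source:
# douban_top250_css.py (hypothetical filename)
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
rsq = requests.get('https://movie.douban.com/top250', headers=headers)
soup = BeautifulSoup(rsq.text, 'html.parser')

# select() takes a CSS selector string and returns all matches
for span in soup.select('div.hd > a > span:nth-of-type(1)'):
    print(span.string)
# select_one() returns the first match or None
next_link = soup.select_one('span.next > a')
if next_link is not None:
    print(next_link['href'])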
Practice 4 – MongoDB Basics
Basic MongoDB Database Operations
- Start/connect to the database (see the sketch below)
cd /path/for/mongodb/bin
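The original snippet is truncated after the first line. A minimal sketch of starting the server and opening the mongo shell; the data directory is a placeholder, not taken from the source:
./mongod --dbpath /path/for/data &  # start the server; data path is a placeholder
./mongo                             # connect to the default localhost:27017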
- Create a database
use spider
- Create a collection
db.createCollection('douban')
- Create a document
db.douban.insert({'title': '豆瓣'})  # the douban collection is created automatically if it does not exist
Operating the Database with pymongo
- Install the pymongo module before use
pip install -i https://pypi.doubanio.com/simple/ pymongo
- Operate the database with pymongo (see the sketch below)
import pymongo
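The original snippet is truncated after the import. A minimal sketch of connecting, inserting, and querying, assuming a local MongoDB on the default port and mirroring the shell example above:
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['spider']      # databases and collections are created lazily on first write
collection = db['douban']
collection.insert_one({'title': '豆瓣'})
for doc in collection.find():
    print(doc)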
Practice 5 – Scraping Dynamic Web Pages
Scraping the Douban Movie Top 250
- Install the selenium module before use
pip install -i https://pypi.doubanio.com/simple/ selenium
- Documentation: https://selenium-python-zh.readthedocs.io/en/latest/
- Download the Chrome WebDriver: http://chromedriver.storage.googleapis.com/index.html
- Simulate a login (see the sketch below)
# login.py
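The original snippet is truncated. A minimal sketch of a Selenium login flow; the URL and element locators are hypothetical placeholders, not taken from the source:
# login.py (sketch)
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on PATH
driver.get('https://example.com/login')  # hypothetical login page
driver.find_element_by_name('username').send_keys('your-username')  # hypothetical field names
driver.find_element_by_name('password').send_keys('your-password')
driver.find_element_by_name('submit').click()  # hypothetical submit button
print(driver.current_url)
driver.quit()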
- Execute JavaScript (see the sketch below)
# pulldown.py
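The original snippet is truncated. A minimal sketch of executing JavaScript to scroll a dynamically loaded page to the bottom; the target URL is an assumption:
# pulldown.py (sketch)
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://movie.douban.com/explore')  # assumed: a page that loads more items on scroll
for _ in range(3):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # wait for new content to load
print(len(driver.page_source))
driver.quit()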