As one of this year's Spring Festival releases, 《你好,李焕英》 (Hi, Mom) has been a runaway hit, its box office charging toward 5 billion yuan.
So this time, let's scrape data on all seven Spring Festival films and analyze it.
The seven Spring Festival films are:
- 你好,李焕英 (Hi, Mom)
- 唐人街探案3 (Detective Chinatown 3)
- 刺杀小说家 (A Writer's Odyssey)
- 人潮汹涌 (Endgame)
- 新神榜:哪吒重生 (New Gods: Nezha Reborn)
- 侍神令 (The Yin-Yang Master)
- 熊出没·狂野大陆 (Boonie Bears: The Wild Life)
The main data to collect:
- movie title
- movie details (director, cast, etc.)
- Douban score
- number of Douban raters
- Douban short comments
- box office
The data comes from Douban; since Douban carries no box-office figures, those are scraped from Maoyan instead.
The project's GitHub repo: movie_spider
Scraping Douban movie info and short comments
Both the movie info pages and the short-comment pages are easy to scrape: nothing on them is loaded dynamically, so they can be fetched directly. One thing to note: the first few pages of short comments are available without logging in, but later pages may require a logged-in session, so you need to supply the cookie from your own account (done in the comment spider's settings.py below).
Douban movie info
The movie info page URL is: https://movie.douban.com/subject/{movie id}/?from=showing
Swapping in each film's id is all it takes. The ids are:
- 你好,李焕英: 34841067
- 唐人街探案3: 27619748
- 刺杀小说家: 26826330
- 人潮汹涌: 34880302
- 新神榜:哪吒重生: 34779692
- 侍神令: 26935283
- 熊出没·狂野大陆: 34825886
The fields to scrape:
- movie_id: movie id (embedded in the URL)
- movie_name: movie title
- movie_year: release year
- movie_info: movie details (director, cast, etc.)
- rating_num: Douban score (out of 10)
- rating: Douban star rating (out of five stars)
- rating_sum: total number of raters
- rating_info: breakdown of the star-rating distribution
We use Scrapy.
Create the douban project:
scrapy startproject douban
Enter the project and generate a spider:
scrapy genspider movie_info movie.douban.com
Edit the item to define the fields to scrape.
Here is the code:
item.py
import scrapy


class MovieItem(scrapy.Item):
    """
    movie_id: movie id (embedded in the URL)
    movie_name: movie title
    movie_year: release year
    movie_info: movie details (director, cast, etc.)
    rating_num: Douban score (out of 10)
    rating: Douban star rating (out of five stars)
    rating_sum: total number of raters
    rating_info: breakdown of the star-rating distribution
    """
    movie_id = scrapy.Field()
    movie_name = scrapy.Field()
    movie_year = scrapy.Field()
    movie_info = scrapy.Field()
    rating_num = scrapy.Field()
    rating = scrapy.Field()
    rating_sum = scrapy.Field()
    rating_info = scrapy.Field()
movie_info.py
import re

import scrapy

from douban.items import MovieItem


class MovieInfoSpider(scrapy.Spider):
    """
    Douban movie info spider.
    Movie ids:
    你好,李焕英: 34841067
    唐人街探案3: 27619748
    刺杀小说家: 26826330
    人潮汹涌: 34880302
    新神榜:哪吒重生: 34779692
    侍神令: 26935283
    熊出没·狂野大陆: 34825886
    """
    name = 'movie_info'
    allowed_domains = ['movie.douban.com']
    # Run only the MovieInfoPipeline for this spider
    custom_settings = {
        'ITEM_PIPELINES': {'douban.pipelines.MovieInfoPipeline': 301},
    }
    # Build the start URLs by filling each movie id into the URL template
    movie_ids = [
        '34841067',
        '27619748',
        '26826330',
        '34880302',
        '34779692',
        '26935283',
        '34825886']
    start_url = 'https://movie.douban.com/subject/{}/?from=showing'
    start_urls = []
    for movie_id in movie_ids:
        start_urls.append(start_url.format(movie_id))
    def parse(self, response):
        item = MovieItem()
        movie_url = response.url
        # Pull the movie_id out of the URL with a regex
        pattern = re.compile(r'\d+(\.\d+)?')
        item['movie_id'] = pattern.search(movie_url).group()
        item['movie_name'] = response.xpath(
            '//div[@id = "content"]/h1/span/text()').extract_first()
        # The release year is scraped as "(2021)"; extract the year from the parentheses
        year = response.xpath(
            '//div[@id = "content"]/h1/span/text()').extract()[1]
        pattern = re.compile(r'(?<=\()[^)]*(?=\))')
        item['movie_year'] = pattern.search(year).group()
        # The raw movie info contains spaces and newlines (\n); clean them with replace()
        movie_info = response.xpath('//div[@id = "info"]//text()').extract()
        item['movie_info'] = ''.join(movie_info).replace(
            ' ', '').replace(
            '\n', '')
        item['rating_num'] = response.xpath(
            '//strong[@class="ll rating_num"]/text()').extract_first()
        # The star rating appears nowhere as text, so read the div's class attribute
        # and pull the digits out with a regex. For 《你好,李焕英》 the class is
        # "ll bigstar bigstar40", where 40 means a four-star rating.
        rating_star = response.xpath(
            '//div[@class = "rating_right "]/div/@class').extract_first()
        pattern = re.compile(r'\d+(\.\d+)?')
        if pattern.search(rating_star):
            rating = pattern.search(rating_star).group()
            # float, because half stars are possible (e.g. 3.5 stars)
            item['rating'] = float(rating) / 10
        else:
            item['rating'] = None
        item['rating_sum'] = response.xpath(
            '//div[@class = "rating_sum"]//span/text()').extract_first()
        # As with movie_info, strip the spaces and newlines (\n)
        rating_info = response.xpath(
            '//div[@class = "ratings-on-weight"]//text()').extract()
        item['rating_info'] = ''.join(rating_info).replace(
            ' ', '').replace(
            '\n', '')
        yield item
pipelines.py
Save the data to movie_info.csv:
import csv


class MovieInfoPipeline(object):
    def open_spider(self, spider):
        self.file = open(
            'movie_info.csv',
            'w',
            newline='',
            encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['movie_id',
                              'movie_name',
                              'movie_year',
                              'movie_info',
                              'rating_num',
                              'rating',
                              'rating_sum',
                              'rating_info'])

    def process_item(self, item, spider):
        self.writer.writerow([item['movie_id'],
                              item['movie_name'],
                              item['movie_year'],
                              item['movie_info'],
                              item['rating_num'],
                              item['rating'],
                              item['rating_sum'],
                              item['rating_info']])
        return item

    def close_spider(self, spider):
        self.file.close()
middlewares.py
Set a random User-Agent on each request:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUseragentMiddleware(UserAgentMiddleware):
    """
    Set a random User-Agent on every request.
    """

    def __init__(self, user_agent):
        super().__init__(user_agent)
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list defined in settings.py
        return cls(
            user_agent=crawler.settings.get('USER_AGENTS')
        )

    def process_request(self, request, spider):
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent
settings.py
Enable the pipeline and middleware in settings:
# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban.middlewares.DoubanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'douban.middlewares.MyUseragentMiddleware': 400,
}
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'douban.pipelines.MovieInfoPipeline': 301,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run:
scrapy crawl movie_info
When it finishes, the scraped data is in movie_info.csv.
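As a quick sanity check on the output (a minimal sketch, assuming pandas is installed; the column names are the ones written by MovieInfoPipeline above):

import pandas as pd

# Load the CSV written by MovieInfoPipeline
df = pd.read_csv('movie_info.csv')
# Expect one row per film: 7 rows, 8 columns
print(df.shape)
print(df[['movie_name', 'rating_num', 'rating_sum']])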
Douban short comments
The short-comment page URL is: https://movie.douban.com/subject/{movie id}/comments?start={comment offset}&limit=20&status=P&sort=new_score
Filling in the movie id and the start offset fetches each page of 20 comments (start=20 returns comments 21–40, and so on). Since this project is for learning only, we scrape roughly the first 500 comments per film for the analysis.
The fields to scrape:
- movie_id: movie id (embedded in the URL)
- user_name: username
- rating: the user's star rating
- comment_time: comment timestamp
- comment_info: comment text
- votes_num: number of upvotes
- user_url: URL of the user's profile
- comment_date: comment date
Generate another spider inside the project:
scrapy genspider comment movie.douban.com
Edit the item to define the fields to scrape.
Here is the code:
item.py
import scrapy


class CommentItem(scrapy.Item):
    """
    movie_id: movie id (embedded in the URL)
    user_name: username
    rating: the user's star rating
    comment_time: comment timestamp
    comment_info: comment text
    votes_num: number of upvotes
    user_url: URL of the user's profile
    comment_date: comment date
    """
    movie_id = scrapy.Field()
    user_name = scrapy.Field()
    rating = scrapy.Field()
    comment_time = scrapy.Field()
    comment_info = scrapy.Field()
    votes_num = scrapy.Field()
    user_url = scrapy.Field()
    comment_date = scrapy.Field()
comment.py
import re

import scrapy

from douban.items import CommentItem


class CommentSpider(scrapy.Spider):
    """
    Douban short-comment spider.
    """
    name = 'comment'
    allowed_domains = ['movie.douban.com']
    # Run only the CommentPipeline for this spider
    custom_settings = {
        'ITEM_PIPELINES': {'douban.pipelines.CommentPipeline': 302},
    }
    # Build the start URLs by filling each movie id and start offset into the template
    movie_ids = [
        '34841067',
        '27619748',
        '26826330',
        '34880302',
        '34779692',
        '26935283',
        '34825886']
    start_url = 'https://movie.douban.com/subject/{}/comments?start={}&limit=20&sort=new_score&status=P'
    start_urls = []
    for movie_id in movie_ids:
        # 26 pages of 20 comments each, roughly the first 500 comments per film
        for i in range(0, 26):
            start_urls.append(start_url.format(movie_id, str(i * 20)))
    def parse(self, response):
        comment_url = response.url
        pattern = re.compile(r'\d+(\.\d+)?')
        comments = response.xpath('//div[@class="comment"]')
        for comment in comments:
            item = CommentItem()
            # Pull the movie_id out of the URL with a regex
            item['movie_id'] = pattern.search(comment_url).group()
            item['user_name'] = comment.xpath(
                './/span[@class="comment-info"]/a/text()').extract_first()
            # As in movie_info: the star rating appears nowhere as text, so read the
            # span's class attribute and pull the digits out with a regex
            rating_star = comment.xpath(
                './/span[@class="comment-info"]/span[2]/@class').extract_first()
            # Users who left no star rating have no matching span
            if rating_star and pattern.search(rating_star):
                rating = pattern.search(rating_star).group()
                item['rating'] = float(rating) / 10
            else:
                item['rating'] = None
            item['comment_time'] = comment.xpath(
                './/span[@class="comment-time "]/@title').extract_first()
            item['comment_info'] = comment.xpath(
                './p[@class=" comment-content"]/span/text()').extract_first()
            item['votes_num'] = comment.xpath(
                './/span[@class="votes vote-count"]/text()').extract_first()
            item['user_url'] = comment.xpath(
                './/span[@class="comment-info"]/a/@href').extract_first()
            # Strip the spaces and newlines (\n) to get a clean date
            comment_date = comment.xpath(
                './/span[@class="comment-time "]/text()').extract_first()
            item['comment_date'] = comment_date.replace('\n', '').strip()
            yield item
pipelines.py
import csv


class CommentPipeline(object):
    def open_spider(self, spider):
        self.file = open('comment.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['movie_id',
                              'user_name',
                              'rating',
                              'comment_time',
                              'comment_info',
                              'votes_num',
                              'user_url',
                              'comment_date'])

    def process_item(self, item, spider):
        self.writer.writerow([item['movie_id'],
                              item['user_name'],
                              item['rating'],
                              item['comment_time'],
                              item['comment_info'],
                              item['votes_num'],
                              item['user_url'],
                              item['comment_date']])
        return item

    def close_spider(self, spider):
        self.file.close()
middlewares.py
Random User-Agents again; the code is identical to the movie-info middleware above.
settings.py
Enable the pipeline and middleware in settings, and set your cookie:
# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'Accept-Encoding': 'gzip',
'Cookie': 'paste your own cookie here'
}
DOWNLOADER_MIDDLEWARES = {
'douban.middlewares.MyUseragentMiddleware': 400,
}
USER_AGENTS = [
    # (the same User-Agent list as in the movie-info settings.py above)
]
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban.middlewares.DoubanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'douban.middlewares.DoubanDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'douban.pipelines.CommentPipeline': 302
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run:
scrapy crawl comment
When it finishes, the scraped data is in comment.csv.
That wraps up the movie info and comments.
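Since both CSVs carry movie_id, they join cleanly for later analysis. A minimal sketch, again assuming pandas:

import pandas as pd

movies = pd.read_csv('movie_info.csv')
comments = pd.read_csv('comment.csv')
# Attach each comment to its film via the shared movie_id column
merged = comments.merge(movies[['movie_id', 'movie_name']], on='movie_id')
# Average short-comment star rating per film (unrated comments are NaN and ignored)
print(merged.groupby('movie_name')['rating'].mean())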
Next, the box-office data.
Scraping Maoyan box-office data
Maoyan's desktop box-office pages protect their figures with font obfuscation, but the mobile site has an endpoint that returns the same live numbers as plain, unencrypted JSON: http://piaofang.maoyan.com/getBoxList?date=1&isSplit=true. So we simply request that endpoint to get the box-office data we need.
The JSON can be parsed with Python's json module.
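Before writing the spider it helps to poke at the payload by hand (a sketch, assuming the requests library; the key path below is the one the spider relies on):

import json

import requests

url = 'http://piaofang.maoyan.com/getBoxList?date=1&isSplit=true'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(resp.text)
# The per-film records sit under boxOffice -> data -> list
for record in data['boxOffice']['data']['list']:
    print(record['movieInfo']['movieName'], record['sumBoxDesc'])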
The fields to scrape:
- movie_id: Maoyan movie id
- movie_name: movie title
- sum_box_desc: cumulative box office
- box_desc: the day's combined gross
- box_rate: share of the day's combined gross
- show_count_rate: share of screenings
- seat_count_rate: share of scheduled seats
The scrape itself is just as simple; here is the code.
We use Scrapy.
Create the maoyan project:
scrapy startproject maoyan
Generate a spider inside the project:
scrapy genspider piaofang piaofang.maoyan.com
Edit the item to define the fields to scrape.
item.py
import scrapy


class MaoyanItem(scrapy.Item):
    """
    movie_id: Maoyan movie id
    movie_name: movie title
    sum_box_desc: cumulative box office
    box_desc: the day's combined gross
    box_rate: share of the day's combined gross
    show_count_rate: share of screenings
    seat_count_rate: share of scheduled seats
    """
    movie_id = scrapy.Field()
    movie_name = scrapy.Field()
    sum_box_desc = scrapy.Field()
    box_desc = scrapy.Field()
    box_rate = scrapy.Field()
    show_count_rate = scrapy.Field()
    seat_count_rate = scrapy.Field()
piaofang.py
import json

import scrapy

from maoyan.items import MaoyanItem


class PiaofangSpider(scrapy.Spider):
    name = 'piaofang'
    allowed_domains = ['piaofang.maoyan.com']
    start_urls = ['http://piaofang.maoyan.com/getBoxList?date=1&isSplit=true']

    def parse(self, response):
        # The endpoint returns plain JSON; the per-film records
        # live under boxOffice -> data -> list
        data = json.loads(response.text)
        data_list = data['boxOffice']['data']['list']
        for main_data in data_list:
            item = MaoyanItem()
            item['movie_id'] = main_data['movieInfo']['movieId']
            item['movie_name'] = main_data['movieInfo']['movieName']
            item['sum_box_desc'] = main_data['sumBoxDesc']
            item['box_desc'] = main_data['boxDesc']
            item['box_rate'] = main_data['boxRate']
            item['show_count_rate'] = main_data['showCountRate']
            item['seat_count_rate'] = main_data['seatCountRate']
            yield item
pipelines.py
import csv


class MaoyanPipeline:
    def open_spider(self, spider):
        self.file = open('maoyan.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['movie_id',
                              'movie_name',
                              'sum_box_desc',
                              'box_desc',
                              'box_rate',
                              'show_count_rate',
                              'seat_count_rate'])

    def process_item(self, item, spider):
        self.writer.writerow([item['movie_id'],
                              item['movie_name'],
                              item['sum_box_desc'],
                              item['box_desc'],
                              item['box_rate'],
                              item['show_count_rate'],
                              item['seat_count_rate']])
        return item

    def close_spider(self, spider):
        self.file.close()
settings.py
Enable the pipeline in settings:
# Scrapy settings for maoyan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'maoyan (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'maoyan.middlewares.MaoyanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'maoyan.pipelines.MaoyanPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run:
scrapy crawl piaofang
When it finishes, the scraped data is in maoyan.csv (the filename set in MaoyanPipeline above).
That wraps up the box-office data.
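Note that Maoyan's movie ids are its own and do not match Douban's, so when the two datasets are combined later, the film title is the natural join key (a sketch, assuming pandas and that the titles match exactly across both sources):

import pandas as pd

douban = pd.read_csv('movie_info.csv')
maoyan = pd.read_csv('maoyan.csv')
# Douban and Maoyan use different ids, so join on the title instead
combined = douban.merge(maoyan, on='movie_name', suffixes=('_douban', '_maoyan'))
print(combined[['movie_name', 'rating_num', 'sum_box_desc']])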
Next comes the data analysis.
Data analysis
The analysis itself will be shared in a separate article.