「Scrapy」Scraping Spring Festival Movie Info, Douban Ratings, Douban Short Comments, and Box Office Data


As one of this year's Spring Festival releases, 《你好,李焕英》 (Hi, Mom) has been a massive hit, with its box office surging past 5 billion yuan.

So this time, let's scrape data on the seven Spring Festival releases and analyze it.

The Spring Festival lineup:

  • 你好,李焕英 (Hi, Mom)
  • 唐人街探案3 (Detective Chinatown 3)
  • 刺杀小说家 (A Writer's Odyssey)
  • 人潮汹涌 (Endgame)
  • 新神榜:哪吒重生 (New Gods: Nezha Reborn)
  • 侍神令 (The Yin-Yang Master)
  • 熊出没·狂野大陆 (Boonie Bears: The Wild Life)

The main data to collect:

  • movie details and Douban ratings
  • Douban short comments
  • box office figures

The data come from Douban; since Douban carries no box office numbers, those are scraped from Maoyan.

The project's code is on GitHub: movie_spider

Scraping Douban movie info and short comments

The movie info and short-comment pages on Douban are straightforward to scrape: nothing is loaded dynamically, so the pages can be fetched directly. One thing to note: the first few pages of short comments are available without logging in, but later pages may require a login, so you need to add your own account's cookie.
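A note on how the cookie gets wired in: with COOKIES_ENABLED = False (as in the settings below), Scrapy's cookie middleware is off and a cookies= argument on a Request would be ignored, which is why the raw Cookie string goes into DEFAULT_REQUEST_HEADERS instead. If you would rather keep the middleware enabled and pass a dict, here is a minimal helper sketch (parse_cookie_string is my own hypothetical name; the raw string is whatever you copy from your browser's DevTools):

def parse_cookie_string(raw_cookie):
    """Turn a raw browser Cookie header ('a=1; b=2') into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw_cookie.split(';'):
        #   skip malformed fragments; keep everything after the first '='
        if '=' in pair:
            key, _, value = pair.strip().partition('=')
            cookies[key] = value
    return cookies

You would then pass the result as scrapy.Request(url, cookies=parse_cookie_string(raw)) and drop COOKIES_ENABLED = False.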

Scraping Douban movie info

A Douban movie info page's URL is: https://movie.douban.com/subject/{movie_id}/?from=showing

Swapping in the movie id is all it takes; the ids are:

  • 你好,李焕英: 34841067
  • 唐人街探案3: 27619748
  • 刺杀小说家: 26826330
  • 人潮汹涌: 34880302
  • 新神榜:哪吒重生: 34779692
  • 侍神令: 26935283
  • 熊出没·狂野大陆: 34825886

The fields to scrape:

  • movie_id: the movie's id (taken from the URL)
  • movie_name: movie title
  • movie_year: release year
  • movie_info: movie details (director, cast, etc.)
  • rating_num: Douban rating (score out of 10)
  • rating: Douban star rating (out of five stars)
  • rating_sum: number of raters
  • rating_info: breakdown of the star ratings

We use Scrapy.

Create the douban project.

scrapy startproject douban

Create a new spider inside the project.

scrapy genspider movie_info movie.douban.com

Edit the item to define the fields we want to scrape.

Here is the code:

items.py

import scrapy


class MovieItem(scrapy.Item):
    """
    movie_id: the movie's id (taken from the URL)
    movie_name: movie title
    movie_year: release year
    movie_info: movie details (director, cast, etc.)
    rating_num: Douban rating (score out of 10)
    rating: Douban star rating (out of five stars)
    rating_sum: number of raters
    rating_info: breakdown of the star ratings
    """
    movie_id = scrapy.Field()
    movie_name = scrapy.Field()
    movie_year = scrapy.Field()
    movie_info = scrapy.Field()
    rating_num = scrapy.Field()
    rating = scrapy.Field()
    rating_sum = scrapy.Field()
    rating_info = scrapy.Field()

movie_info.py

import scrapy
import re

from douban.items import MovieItem


class MovieInfoSpider(scrapy.Spider):
    """
    Douban movie info spider
    Movie ids:
    你好,李焕英:34841067
    唐人街探案3:27619748
    刺杀小说家:26826330
    人潮汹涌:34880302
    新神榜:哪吒重生:34779692
    侍神令:26935283
    熊出没·狂野大陆:34825886
    """
    name = 'movie_info'
    allowed_domains = ['movie.douban.com']

    #   route this spider's items to MovieInfoPipeline only
    custom_settings = {
        'ITEM_PIPELINES': {'douban.pipelines.MovieInfoPipeline': 301},
    }

    #   build the list of start URLs by filling each movie id into the URL template
    movie_ids = [
        '34841067',
        '27619748',
        '26826330',
        '34880302',
        '34779692',
        '26935283',
        '34825886']
    start_url = 'https://movie.douban.com/subject/{}/?from=showing'
    start_urls = []
    for movie_id in movie_ids:
        start_urls.append(start_url.format(movie_id))

    def parse(self, response):
        item = MovieItem()
        movie_url = response.url
        #   pull the movie_id out of the URL with a regex
        pattern = re.compile(r'\d+(\.\d+)?')
        item['movie_id'] = pattern.search(movie_url).group()
        item['movie_name'] = response.xpath(
            '//div[@id = "content"]/h1/span/text()').extract_first()
        #   the scraped release year looks like "(2021)"; match the year inside the parentheses
        year = response.xpath(
            '//div[@id = "content"]/h1/span/text()').extract()[1]
        pattern = re.compile(r'(?<=\()[^)]*(?=\))')
        item['movie_year'] = pattern.search(year).group()
        #   the raw movie info contains spaces and newlines (\n); clean them with replace()
        movie_info = response.xpath('//div[@id = "info"]//text()').extract()
        item['movie_info'] = ''.join(movie_info).replace(
            ' ', '').replace(
            '\n', '')
        item['rating_num'] = response.xpath(
            '//strong[@class="ll rating_num"]/text()').extract_first()
        #   the star rating has no directly readable number, so grab the div's class
        #   attribute and pull the digits out with a regex
        #   e.g. for 《你好,李焕英》 the class is "ll bigstar bigstar40", where 40 means 4 stars
        rating_star = response.xpath(
            '//div[@class = "rating_right "]/div/@class').extract_first()
        pattern = re.compile(r'\d+(\.\d+)?')
        if rating_star and pattern.search(rating_star):
            rating_value = pattern.search(rating_star).group()
            #   stored as float because half stars are possible, e.g. 3.5 stars
            item['rating'] = float(rating_value) / 10
        else:
            item['rating'] = None
        item['rating_sum'] = response.xpath(
            '//div[@class = "rating_sum"]//span/text()').extract_first()
        #   as with movie_info, strip the spaces and newlines (\n)
        rating_info = response.xpath(
            '//div[@class = "ratings-on-weight"]//text()').extract()
        item['rating_info'] = ''.join(rating_info).replace(
            ' ', '').replace(
            '\n', '')

        yield item

pipelines.py

Write the scraped data to movie_info.csv.

import csv


class MovieInfoPipeline(object):
    def open_spider(self, spider):
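        #   utf-8-sig writes a BOM so Excel opens the UTF-8 Chinese text correctly;
        #   newline='' stops the csv module from adding blank rows on Windows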
        self.file = open(
            'movie_info.csv',
            'w',
            newline='',
            encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['movie_id',
                              'movie_name',
                              'movie_year',
                              'movie_info',
                              'rating_num',
                              'rating',
                              'rating_sum',
                              'rating_info'])

    def process_item(self, item, spider):
        self.writer.writerow([item['movie_id'],
                              item['movie_name'],
                              item['movie_year'],
                              item['movie_info'],
                              item['rating_num'],
                              item['rating'],
                              item['rating_sum'],
                              item['rating_info']])
        return item

    def close_spider(self, spider):
        self.file.close()

middlewares.py

Set a random User-Agent on each request.

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random



class MyUseragentMiddleware(UserAgentMiddleware):
    """
    Pick a random User-Agent for every request
    """

    def __init__(self, user_agent):
        super().__init__(user_agent)
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent=crawler.settings.get('USER_AGENTS')
        )

    def process_request(self, request, spider):
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent

settings.py

Enable the pipelines and middlewares in settings.

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
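# 5 seconds is deliberately conservative: Douban rate-limits and can ban IPs
# that crawl too aggressively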
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {
   'douban.middlewares.MyUseragentMiddleware': 400,
}

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.MovieInfoPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run it:

scrapy crawl movie_info

Once it finishes, the scraped data is in movie_info.csv.
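As a quick sanity check (a minimal sketch, assuming pandas is installed), you can load the CSV and confirm all seven movies came through:

import pandas as pd

#   movie_info.csv is the file written by MovieInfoPipeline above
df = pd.read_csv('movie_info.csv')
print(len(df), 'movies scraped')
print(df[['movie_name', 'rating_num', 'rating_sum']])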

Scraping Douban short comments

The URL for a movie's short-comment pages is: https://movie.douban.com/subject/{movie_id}/comments?start={offset}&limit=20&status=P&sort=new_score

Changing the movie id and the start offset gets you each page of comments. Since this project is purely for learning, we only scrape about 500 comments per movie (26 pages of 20) for the analysis.

The fields to scrape:

  • movie_id: the movie's id (taken from the URL)
  • user_name: username
  • rating: the user's star rating
  • comment_time: comment timestamp
  • comment_info: comment text
  • votes_num: number of upvotes
  • user_url: the user's profile URL
  • comment_date: comment date

Create a new spider inside the project.

scrapy genspider comment movie.douban.com

Edit the item to define the fields we want to scrape.

Here is the code:

items.py

import scrapy


class CommentItem(scrapy.Item):
    """
    movie_id: the movie's id (taken from the URL)
    user_name: username
    rating: the user's star rating
    comment_time: comment timestamp
    comment_info: comment text
    votes_num: number of upvotes
    user_url: the user's profile URL
    comment_date: comment date
    """
    movie_id = scrapy.Field()
    user_name = scrapy.Field()
    rating = scrapy.Field()
    comment_time = scrapy.Field()
    comment_info = scrapy.Field()
    votes_num = scrapy.Field()
    user_url = scrapy.Field()
    comment_date = scrapy.Field()

comment.py

import scrapy
import re


from douban.items import CommentItem


class CommentSpider(scrapy.Spider):
    """
    Douban short-comment spider
    """
    name = 'comment'

    allowed_domains = ['movie.douban.com']

    #   route this spider's items to CommentPipeline only
    custom_settings = {
        'ITEM_PIPELINES': {'douban.pipelines.CommentPipeline': 302},
    }

    #   build the start URLs from each movie id and page offset
    movie_ids = [
        '34841067',
        '27619748',
        '26826330',
        '34880302',
        '34779692',
        '26935283',
        '34825886']
    start_url = 'https://movie.douban.com/subject/{}/comments?start={}&limit=20&sort=new_score&status=P'
    start_urls = []
    for movie_id in movie_ids:
        for i in range(0, 26):
            start_urls.append(start_url.format(movie_id, str(i * 20)))

    def parse(self, response):
        comment_url = response.url
        pattern = re.compile(r'\d+(\.\d+)?')
        comments = response.xpath('//div[@class="comment"]')
        for comment in comments:
            item = CommentItem()
            #   pull the movie_id out of the URL with a regex
            item['movie_id'] = pattern.search(comment_url).group()
            item['user_name'] = comment.xpath(
                './/span[@class="comment-info"]/a/text()').extract_first()
            #   as in movie_info, the star rating has no readable number; take the
            #   span's class attribute and pull the digits out with a regex
            rating_star = comment.xpath(
                './/span[@class="comment-info"]/span[2]/@class').extract_first()
            if rating_star and pattern.search(rating_star):
                rating_value = pattern.search(rating_star).group()
                item['rating'] = float(rating_value) / 10
            else:
                item['rating'] = None
            item['comment_time'] = comment.xpath(
                './/span[@class="comment-time "]/@title').extract_first()
            item['comment_info'] = comment.xpath(
                './p[@class=" comment-content"]/span/text()').extract_first()
            item['votes_num'] = comment.xpath(
                './/span[@class="votes vote-count"]/text()').extract_first()
            item['user_url'] = comment.xpath(
                './/span[@class="comment-info"]/a/@href').extract_first()
            #   strip the spaces and newlines (\n) to get a clean date
            comment_date = comment.xpath(
                './/span[@class="comment-time "]/text()').extract_first()
            item['comment_date'] = (comment_date or '').replace('\n', '').strip()
            yield item

pipelines.py

import csv


class CommentPipeline(object):

    def open_spider(self, spider):
        self.file = open('comment.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['movie_id',
                              'user_name',
                              'rating',
                              'comment_time',
                              'comment_info',
                              'votes_num',
                              'user_url',
                              'comment_date'])

    def process_item(self, item, spider):
        self.writer.writerow([item['movie_id'],
                              item['user_name'],
                              item['rating'],
                              item['comment_time'],
                              item['comment_info'],
                              item['votes_num'],
                              item['user_url'],
                              item['comment_date']])
        return item

    def close_spider(self, spider):
        self.file.close()

middlewares.py

Set a random User-Agent; the code is identical to the middlewares used for the movie info spider.

settings.py

Enable the pipelines and middlewares in settings, and set your cookie.

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Accept-Encoding': 'gzip',
    'Cookie': 'paste your own cookie here'
}

DOWNLOADER_MIDDLEWARES = {
   'douban.middlewares.MyUseragentMiddleware': 400,
}

(The USER_AGENTS list is identical to the one in the previous settings.py and is omitted here.)


# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.DoubanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.CommentPipeline': 302
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run it:

scrapy crawl comment

Once it finishes, the scraped comments are in comment.csv.

That completes the movie info and comment scraping.

Next up is the box office data.

Scraping Maoyan box office data

Maoyan's desktop box office pages guard against scraping with font obfuscation, but the mobile site exposes an endpoint that returns plain, unencrypted JSON with live, continuously updated data: http://piaofang.maoyan.com/getBoxList?date=1&isSplit=true. So we can simply request this endpoint for the box office numbers.

The JSON is easy to parse with the json module.
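Before writing the spider, it is worth hitting the endpoint once to confirm the JSON layout (a minimal sketch, assuming the requests library is installed; the endpoint and its key layout may change over time):

import requests

url = 'http://piaofang.maoyan.com/getBoxList?date=1&isSplit=true'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()
#   the entries the spider parses live under boxOffice -> data -> list
for movie in data['boxOffice']['data']['list']:
    print(movie['movieInfo']['movieName'], movie['sumBoxDesc'])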

The fields to scrape:

  • movie_id: movie id
  • movie_name: movie title
  • sum_box_desc: cumulative box office
  • box_desc: the day's box office
  • box_rate: share of the day's box office
  • show_count_rate: share of screenings scheduled
  • seat_count_rate: share of seats scheduled

This scrape is simple too; here is the code.

We use Scrapy.

Create the maoyan project.

scrapy startproject maoyan

Create a new spider inside the project.

scrapy genspider piaofang piaofang.maoyan.com

Edit the item to define the fields we want to scrape.

items.py

import scrapy


class MaoyanItem(scrapy.Item):
    """
    movie_id: movie id
    movie_name: movie title
    sum_box_desc: cumulative box office
    box_desc: the day's box office
    box_rate: share of the day's box office
    show_count_rate: share of screenings scheduled
    seat_count_rate: share of seats scheduled
    """
    movie_id = scrapy.Field()
    movie_name = scrapy.Field()
    sum_box_desc = scrapy.Field()
    box_desc = scrapy.Field()
    box_rate = scrapy.Field()
    show_count_rate = scrapy.Field()
    seat_count_rate = scrapy.Field()

piaofang.py

import scrapy
import json

from maoyan.items import MaoyanItem


class PiaofangSpider(scrapy.Spider):
    name = 'piaofang'
    allowed_domains = ['piaofang.maoyan.com']
    start_urls = ['http://piaofang.maoyan.com/getBoxList?date=1&isSplit=true']

    def parse(self, response):
        #   the endpoint returns plain JSON; the box office entries sit under
        #   boxOffice -> data -> list
        data = json.loads(response.text)
        data_list = data['boxOffice']['data']['list']
        for main_data in data_list:
            item = MaoyanItem()
            item['movie_id'] = main_data['movieInfo']['movieId']
            item['movie_name'] = main_data['movieInfo']['movieName']
            item['sum_box_desc'] = main_data['sumBoxDesc']
            item['box_desc'] = main_data['boxDesc']
            item['box_rate'] = main_data['boxRate']
            item['show_count_rate'] = main_data['showCountRate']
            item['seat_count_rate'] = main_data['seatCountRate']
            yield item

pipelines.py

import csv


class MaoyanPipeline:
    def open_spider(self, spider):
        self.file = open('maoyan.csv', 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['movie_id',
                              'movie_name',
                              'sum_box_desc',
                              'box_desc',
                              'box_rate',
                              'show_count_rate',
                              'seat_count_rate'])

    def process_item(self, item, spider):
        self.writer.writerow([item['movie_id'],
                              item['movie_name'],
                              item['sum_box_desc'],
                              item['box_desc'],
                              item['box_rate'],
                              item['show_count_rate'],
                              item['seat_count_rate']])
        return item

    def close_spider(self, spider):
        self.file.close()

settings.py

Enable the pipeline in settings.

# Scrapy settings for maoyan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maoyan'

SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'maoyan (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maoyan.middlewares.MaoyanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run it:

scrapy crawl piaofang

Once it finishes, the scraped data is in maoyan.csv (the file written by MaoyanPipeline).

That wraps up the box office scraping.

Next comes the data analysis.

Data analysis

The analysis process is shared in a separate article:

「Data Analysis」Spring Festival Movie Data Analysis