「Data Analysis」Spring Festival Movie Data Analysis

Overview

As one of this year's Spring Festival releases, 《你好,李焕英》 has been a huge hit, with its box office charging toward 5 billion yuan (50亿).

This post analyzes data on the seven Spring Festival films.

The Spring Festival line-up is: 《你好,李焕英》, 《唐人街探案3》, 《刺杀小说家》, 《人潮汹涌》, 《新神榜:哪吒重生》, 《侍神令》, and 《熊出没·狂野大陆》.

Ratings come from Douban; since Douban does not publish box-office figures, box-office data was scraped from Maoyan.

The movie-data crawler code is covered in
「Scrapy」春节档电影信息、豆瓣评分、豆瓣短评及票房数据爬取

The GitHub repository for this analysis project is
movie_data_analysis

Quick BI dashboard version
Quick BI - Spring Festival Movie Data Analysis

Data processing and analysis

import pandas as pd
import numpy as np

movie_info = pd.read_csv("movie_info.csv")
movie_info.head()
movie_id movie_name movie_year movie_info rating_num rating rating_sum rating_info
0 34841067 你好,李焕英 2021 导演:贾玲编剧:贾玲/孙集斌/王宇/刘宏禄/卜钰/郭宇鹏主演:贾玲/张小斐/沈腾/陈赫/刘佳... 8.1 4.0 835297 5星29.8%4星46.9%3星20.6%2星2.2%1星0.6%
1 27619748 唐人街探案3 2021 导演:陈思诚编剧:陈思诚/张淳/刘吾驷/莲舟/严以宁主演:王宝强/刘昊然/妻夫木聪/托尼·贾... 5.6 3.0 736832 5星4.3%4星16.8%3星44.5%2星24.6%1星9.8%
2 26826330 刺杀小说家 2021 导演:路阳编剧:陈舒/禹扬/秦海燕/路阳主演:雷佳音/杨幂/董子健/于和伟/郭京飞/佟丽娅/... 7.0 3.5 327218 5星10.5%4星41.7%3星37.3%2星8.4%1星2.2%
3 34880302 人潮汹涌 2021 导演:饶晓志编剧:饶晓志/范翔/李想主演:刘德华/肖央/万茜/程怡/黄小蕾/国义骞/狄志杰/... 7.1 3.5 154197 5星11.9%4星41.9%3星37.1%2星7.8%1星1.3%
4 34779692 新神榜:哪吒重生 2021 导演:赵霁编剧:沐川主演:杨天翔/张赫/宣晓鸣/李诗萌/朱可儿/凌振赫/刘若班/张遥函/张喆... 7.3 3.5 71704 5星16.8%4星42.3%3星32.4%2星7.1%1星1.4%
movie_info.shape
(7, 8)
movie_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   movie_id     7 non-null      int64  
 1   movie_name   7 non-null      object 
 2   movie_year   7 non-null      int64  
 3   movie_info   7 non-null      object 
 4   rating_num   7 non-null      float64
 5   rating       7 non-null      float64
 6   rating_sum   7 non-null      int64  
 7   rating_info  7 non-null      object 
dtypes: float64(2), int64(3), object(3)
memory usage: 576.0+ bytes

Visualizing Douban ratings and total box office for the Spring Festival films

The films covered are the seven Spring Festival titles listed above.

from pyecharts import options as opts
from pyecharts.charts import Bar

# Sort films by Douban rating (ascending) for the horizontal bar chart
rating_num_df = movie_info.sort_values(by='rating_num')

Douban rating ranking of the Spring Festival films

c = (
    Bar(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(list(rating_num_df['movie_name']))
    .add_yaxis("豆瓣评分", list(rating_num_df['rating_num']))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="春节档电影豆瓣评分排名"),
        yaxis_opts=opts.AxisOpts(name="电影名称"),
        xaxis_opts=opts.AxisOpts(name="评分/10"),
        legend_opts=opts.LegendOpts(is_show=False),
    )
    .reversal_axis()
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
    .render("春节档电影豆瓣评分排名.html")
)
#c.render_notebook()

[Figure: 春节档电影豆瓣评分排名]

Total box-office ranking of the Spring Festival films

Box-office data comes from Maoyan's real-time box-office page.

piaofang = pd.read_csv('maoyan.csv')
piaofang.head()
movie_id movie_name sum_box_desc box_desc box_rate show_count_rate seat_count_rate
0 1299372 你好,李焕英 50.41亿 6397.48 36.0% 28.9% 35.6%
1 1300936 人潮汹涌 5.58亿 2880.32 16.2% 15.1% 13.9%
2 894008 寻龙传说 3372.5万 2470.68 13.9% 16.0% 17.3%
3 1217023 唐人街探案3 44.04亿 2285.49 12.8% 14.0% 12.3%
4 1048268 刺杀小说家 9.53亿 1611.41 9.0% 9.2% 7.9%
# Collect each film's total box-office string in the same order as the rating ranking
sum_box_desc = []
for name in list(rating_num_df['movie_name']):
    sum_box_desc = sum_box_desc + list(piaofang[piaofang['movie_name'] == name]['sum_box_desc'])

piaofang_dict = { 'movie_name': list(rating_num_df['movie_name']),
                  'sum_box_desc': sum_box_desc}
piaofang = pd.DataFrame(piaofang_dict)
sum_box_desc_float = []
for i in piaofang['sum_box_desc'].values:
    sum_box_desc_float.append(i.replace('亿', ''))
piaofang['sum_box_desc_float'] = sum_box_desc_float
piaofang['sum_box_desc_float'] = piaofang['sum_box_desc_float'].astype(float)
piaofang = piaofang.sort_values(by='sum_box_desc_float')
piaofang
movie_name sum_box_desc sum_box_desc_float
1 侍神令 2.65亿 2.65
5 新神榜:哪吒重生 4.18亿 4.18
4 人潮汹涌 5.58亿 5.58
2 熊出没·狂野大陆 5.69亿 5.69
3 刺杀小说家 9.53亿 9.53
0 唐人街探案3 44.04亿 44.04
6 你好,李焕英 50.41亿 50.41
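
Note that the raw Maoyan strings mix units: 《寻龙传说》 is reported as 3372.5万 while the seven Spring Festival titles are all in 亿, so simply stripping the 亿 suffix is enough here. A more general converter might look like the sketch below (box_office_to_yi is a hypothetical helper, not part of the original notebook):

def box_office_to_yi(desc):
    # Convert a Maoyan box-office string such as '50.41亿' or '3372.5万' into 亿
    if desc.endswith('亿'):
        return float(desc[:-1])
    if desc.endswith('万'):
        return float(desc[:-1]) / 10000
    return float(desc)

# box_office_to_yi('3372.5万')  # -> 0.33725
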
c = (
    Bar(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(list(piaofang['movie_name']))
    .add_yaxis("综合票房", list(piaofang['sum_box_desc_float']))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="春节档电影总票房排名"),
        yaxis_opts=opts.AxisOpts(name="电影名称"),
        xaxis_opts=opts.AxisOpts(name="票房/亿"),
        legend_opts=opts.LegendOpts(is_show=False),
    )
    .reversal_axis()
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
    .render("春节档电影总票房排名.html")
)
#c.render_notebook()

[Figure: 春节档电影总票房排名]

Looking at the box-office ranking, 《你好,李焕英》 and 《唐人街探案3》 are not far apart in total box office.

Both, however, are far ahead of the remaining five films, which shows just how popular 《你好,李焕英》 and 《唐人街探案3》 are.
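
The gap can also be quantified directly from the piaofang DataFrame built above; a minimal sketch (not in the original notebook):

# Each film's share of the combined box office of the seven titles
total = piaofang['sum_box_desc_float'].sum()
share = (piaofang.set_index('movie_name')['sum_box_desc_float'] / total).sort_values(ascending=False)
print(share.round(3))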

Analysis of Douban comment ratings

movie id:

你好,李焕英:34841067

唐人街探案3:27619748

刺杀小说家:26826330

人潮汹涌:34880302

新神榜:哪吒重生:34779692

侍神令:26935283

熊出没·狂野大陆:34825886

你好,李焕英

comments = pd.read_csv('comment.csv')
comments.head()
movie_id user_name rating comment_time comment_info votes_num user_url comment_date
0 34841067 蛋黄已跑路 3.0 2021-02-12 10:39:37 沈腾的戏份约等于欢乐颂男主,其他不评论 10398 https://www.douban.com/people/157383058/ 2021-02-12
1 34841067 大隐隐于没注销 NaN 2021-02-12 13:00:53 这次感受注定是感性压倒理性的,所以就不打分了。\n贾玲还是适合做小品,她的表演方式、她的叙事... 11774 https://www.douban.com/people/momopeach/ 2021-02-12
2 34841067 韦斯安徒生 3.0 2021-02-12 11:20:22 贾玲水平有限,奈何感情无比真挚。虽然结尾让我哭的稀里哗啦,但也没能改变前半段就是个低配版夏洛... 4055 https://www.douban.com/people/80797429/ 2021-02-12
3 34841067 Augenstern 5.0 2021-02-12 10:10:39 贾玲:我给你们讲个笑话,你们别哭。 22777 https://www.douban.com/people/domisodagreen/ 2021-02-12
4 34841067 Raremore 5.0 2021-02-12 17:09:11 “我宝”那句出来的时候真的直接泪奔\n我以为只有我回到了1981,我以为我可以牺牲我自己改变... 18743 https://www.douban.com/people/205525018/ 2021-02-12
comments.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movie_id      3500 non-null   int64  
 1   user_name     3500 non-null   object 
 2   rating        3454 non-null   float64
 3   comment_time  3500 non-null   object 
 4   comment_info  3500 non-null   object 
 5   votes_num     3500 non-null   int64  
 6   user_url      3500 non-null   object 
 7   comment_date  3500 non-null   object 
dtypes: float64(1), int64(2), object(5)
memory usage: 218.9+ KB
lhy = comments[comments['movie_id'] == 34841067]
lhy.dropna().groupby('rating').count()['movie_id']
rating
1.0     32
2.0     60
3.0     83
4.0    130
5.0    191
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(lhy.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《你好,李焕英》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-lhy.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《你好,李焕英》]

唐人街探案3

trj = comments[comments['movie_id'] == 27619748]
trj.dropna().groupby('rating').count()['movie_id']
rating
1.0    219
2.0    144
3.0     62
4.0     35
5.0     40
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(trj.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《唐人街探案3》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-trj.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《唐人街探案3》]

刺杀小说家

csxsj = comments[comments['movie_id'] == 26826330]
csxsj.dropna().groupby('rating').count()['movie_id']
rating
1.0     43
2.0     36
3.0     25
4.0    168
5.0    227
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(csxsj.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《刺杀小说家》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-csxsj.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《刺杀小说家》]

人潮汹涌

rcxy = comments[comments['movie_id'] == 34880302]
rcxy.dropna().groupby('rating').count()['movie_id']
rating
1.0     28
2.0     76
3.0    132
4.0    182
5.0     78
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(rcxy.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《人潮汹涌》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-rcxr.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《人潮汹涌》]

新神榜:哪吒重生

nzcs = comments[comments['movie_id'] == 34779692]
nzcs.dropna().groupby('rating').count()['movie_id']
rating
1.0     40
2.0     77
3.0    151
4.0    154
5.0     64
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(nzcs.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《新神榜:哪吒重生》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-nzcs.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《新神榜:哪吒重生》]

侍神令

ssl = comments[comments['movie_id'] == 26935283]
ssl.dropna().groupby('rating').count()['movie_id']
rating
1.0    148
2.0    163
3.0     62
4.0     86
5.0     35
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(ssl.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《侍神令》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-ssl.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《侍神令》]

熊出没·狂野大陆

xcm = comments[comments['movie_id'] == 34825886]
xcm.dropna().groupby('rating').count()['movie_id']
rating
1.0     14
2.0     46
3.0    204
4.0    149
5.0     70
Name: movie_id, dtype: int64
from pyecharts import options as opts
from pyecharts.charts import Pie

c = (
    Pie()
    .add(
        "豆瓣评论评分",
        [list(z) for z in zip(['1星','2星','3星','4星','5星'], list(xcm.dropna().groupby('rating').count()['movie_id']))],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="豆瓣评论评分分布", subtitle="《熊出没·狂野大陆》"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .render("豆瓣评论评分分布-xcm.html")
)
#c.render_notebook()

[Figure: 豆瓣评论评分分布 - 《熊出没·狂野大陆》]

Combining the comment-rating data across films

你好,李焕英:34841067

唐人街探案3:27619748

刺杀小说家:26826330

人潮汹涌:34880302

新神榜:哪吒重生:34779692

侍神令:26935283

熊出没·狂野大陆:34825886

# Collect the movie names
movie_name = ['你好,李焕英', '唐人街探案3', '刺杀小说家', '人潮汹涌', '新神榜:哪吒重生', '侍神令', '熊出没·狂野大陆']
# Collect each film's comment-rating counts
rating_one = []
rating_two = []
rating_three = []
rating_four = []
rating_five = []
movie_id_list = [34841067, 27619748, 26826330, 34880302, 34779692, 26935283, 34825886]
for movie_id in movie_id_list:
    # Counts of 1-5 star comment ratings for this movie, in ascending star order
    rating_list = list(comments[comments['movie_id'] == movie_id].dropna().groupby('rating').count()['movie_id'])
    rating_one.append(rating_list[0])
    rating_two.append(rating_list[1])
    rating_three.append(rating_list[2])
    rating_four.append(rating_list[3])
    rating_five.append(rating_list[4])
from pyecharts import options as opts
from pyecharts.charts import Bar

c = (
    Bar(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(movie_name)
    .add_yaxis("1星", rating_one, stack="stack1")
    .add_yaxis("2星", rating_two, stack="stack1")
    .add_yaxis("3星", rating_three, stack="stack1")
    .add_yaxis("4星", rating_four, stack="stack1")
    .add_yaxis("5星", rating_five, stack="stack1")
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各电影豆瓣评论评分分布"),
        xaxis_opts=opts.AxisOpts(name="电影名称"),
        yaxis_opts=opts.AxisOpts(name="对应评分人数"),
    )
    .set_series_opts(
        label_opts=opts.LabelOpts(formatter="{c}", position="inside", color="white")
    )
    .render("各电影豆瓣评论评分分布.html")
)
#c.render_notebook()

[Figure: 各电影豆瓣评论评分分布]

Box-office analysis

Correlation between Douban rating, movie buzz, and box office

Movie buzz here means the number of Douban raters (rating_sum).

import matplotlib.pyplot as plt
movie_info['rating_sum']
0    835297
1    736832
2    327218
3    154197
4     71704
5     78015
6     16338
Name: rating_sum, dtype: int64
movie_info
movie_id movie_name movie_year movie_info rating_num rating rating_sum rating_info
0 34841067 你好,李焕英 2021 导演:贾玲编剧:贾玲/孙集斌/王宇/刘宏禄/卜钰/郭宇鹏主演:贾玲/张小斐/沈腾/陈赫/刘佳... 8.1 4.0 835297 5星29.8%4星46.9%3星20.6%2星2.2%1星0.6%
1 27619748 唐人街探案3 2021 导演:陈思诚编剧:陈思诚/张淳/刘吾驷/莲舟/严以宁主演:王宝强/刘昊然/妻夫木聪/托尼·贾... 5.6 3.0 736832 5星4.3%4星16.8%3星44.5%2星24.6%1星9.8%
2 26826330 刺杀小说家 2021 导演:路阳编剧:陈舒/禹扬/秦海燕/路阳主演:雷佳音/杨幂/董子健/于和伟/郭京飞/佟丽娅/... 7.0 3.5 327218 5星10.5%4星41.7%3星37.3%2星8.4%1星2.2%
3 34880302 人潮汹涌 2021 导演:饶晓志编剧:饶晓志/范翔/李想主演:刘德华/肖央/万茜/程怡/黄小蕾/国义骞/狄志杰/... 7.1 3.5 154197 5星11.9%4星41.9%3星37.1%2星7.8%1星1.3%
4 34779692 新神榜:哪吒重生 2021 导演:赵霁编剧:沐川主演:杨天翔/张赫/宣晓鸣/李诗萌/朱可儿/凌振赫/刘若班/张遥函/张喆... 7.3 3.5 71704 5星16.8%4星42.3%3星32.4%2星7.1%1星1.4%
5 26935283 侍神令 2021 导演:李蔚然编剧:张家鲁/翦以玟主演:陈坤/周迅/陈伟霆/屈楚萧/王丽坤/沈月/王紫璇/王悦... 5.8 3.0 78015 5星5.0%4星19.0%3星45.8%2星23.9%1星6.4%
6 34825886 熊出没·狂野大陆 2020 导演:丁亮/邵和麒编剧:徐芸/崔铁志/张宇主演:张伟/张秉君/谭笑类型:喜剧/科幻/动画制片... 6.6 3.5 16338 5星7.6%4星29.6%3星50.4%2星10.9%1星1.6%
# Box office (in 亿) aligned with the row order of movie_info
piaofang_sum = []
for name in movie_info['movie_name'].tolist():
    piaofang_sum = piaofang_sum + list(piaofang[piaofang['movie_name'] == name]['sum_box_desc_float'])
# Scale the values so they can double as scatter marker sizes
for i in range(len(piaofang_sum)):
    piaofang_sum[i] = piaofang_sum[i] * 100
piaofang_sum
[5041.0, 4404.0, 952.9999999999999, 558.0, 418.0, 265.0, 569.0]
# Register pandas datetime converters to suppress the matplotlib warning
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

font = {
    "family":"SimHei",
    "size":"15"
}
plt.rc("font",**font)

plt.figure(figsize=(15,10))
plt.scatter(movie_info['rating_sum'], movie_info['rating_num'] ,s=piaofang_sum, marker="o")
plt.xlabel('豆瓣热议度')
plt.ylabel('豆瓣评分')
plt.title('豆瓣评分、电影热议度与票房关联性分析')

plt.show()

[Figure: 豆瓣评分、电影热议度与票房关联性分析 scatter plot]

The chart suggests that buzz and box office are related: the more buzz a film has, the higher its box office.

The link between Douban rating and box office is less clear; a higher rating does not necessarily translate into a higher gross.
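
This weaker link can be checked numerically with the same data (movie_info and the piaofang_sum list built above); a quick sketch:

# Correlation between Douban rating and box office, expected to be much weaker than the buzz correlation below
rating_box_corr = np.corrcoef(movie_info['rating_num'], piaofang_sum)[0, 1]
print(rating_box_corr)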

Box office vs. buzz

pccs = np.corrcoef(movie_info['rating_sum'], piaofang_sum)
pccs
array([[1.        , 0.97514093],
       [0.97514093, 1.        ]])

The computed Pearson correlation coefficient is about 0.98, indicating that box office and buzz are highly correlated.
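
With only seven films the sample is tiny, so a significance estimate may be worth adding; a sketch assuming SciPy is installed:

from scipy import stats

# Pearson r and two-sided p-value for buzz (rating_sum) vs. box office; n = 7, so interpret with caution
r, p = stats.pearsonr(movie_info['rating_sum'], piaofang_sum)
print(r, p)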

Analysis of Douban comments

The crawler collected 500 Douban short comments for each Spring Festival film (3,500 in total).

comments
movie_id user_name rating comment_time comment_info votes_num user_url comment_date
0 34841067 蛋黄已跑路 3.0 2021-02-12 10:39:37 沈腾的戏份约等于欢乐颂男主,其他不评论 10398 https://www.douban.com/people/157383058/ 2021-02-12
1 34841067 大隐隐于没注销 NaN 2021-02-12 13:00:53 这次感受注定是感性压倒理性的,所以就不打分了。\n贾玲还是适合做小品,她的表演方式、她的叙事... 11774 https://www.douban.com/people/momopeach/ 2021-02-12
2 34841067 韦斯安徒生 3.0 2021-02-12 11:20:22 贾玲水平有限,奈何感情无比真挚。虽然结尾让我哭的稀里哗啦,但也没能改变前半段就是个低配版夏洛... 4055 https://www.douban.com/people/80797429/ 2021-02-12
3 34841067 Augenstern 5.0 2021-02-12 10:10:39 贾玲:我给你们讲个笑话,你们别哭。 22777 https://www.douban.com/people/domisodagreen/ 2021-02-12
4 34841067 Raremore 5.0 2021-02-12 17:09:11 “我宝”那句出来的时候真的直接泪奔\n我以为只有我回到了1981,我以为我可以牺牲我自己改变... 18743 https://www.douban.com/people/205525018/ 2021-02-12
... ... ... ... ... ... ... ... ...
3495 34825886 Neo 4.0 2021-02-21 16:20:56 超出期待 三点六分,四舍五入四分 0 https://www.douban.com/people/4087576/ 2021-02-21
3496 34825886 James 3.0 2021-02-20 13:04:24 陪闺女看的,嘻嘻嘻,哈哈哈哈,看着玩还可以,小朋友很喜欢 0 https://www.douban.com/people/56219178/ 2021-02-20
3497 34825886 海绵宝宝 4.0 2021-02-14 21:00:03 今天一整天的票都售空了,只剩熊出没一排一座。好看。 0 https://www.douban.com/people/haimianbao123/ 2021-02-14
3498 34825886 啊逗 4.0 2021-02-12 13:11:48 小孩子的电影,陪看的,主角很开心。前几部剧场版我也都没看过,比剧集节奏还是好一些,没有太多的... 0 https://www.douban.com/people/177602573/ 2021-02-12
3499 34825886 4.0 2021-02-20 21:10:50 挺有趣的! 0 https://www.douban.com/people/46232830/ 2021-02-20

3500 rows × 8 columns
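
A quick sanity check that each of the seven films contributes 500 of the 3,500 rows (a sketch, not in the original notebook):

# Number of scraped comments per movie_id
print(comments.groupby('movie_id').size())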

comments['comment_time'] = pd.to_datetime(comments['comment_time'])
comments['comment_date'] = pd.to_datetime(comments['comment_date'])
comments.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   movie_id      3500 non-null   int64         
 1   user_name     3500 non-null   object        
 2   rating        3454 non-null   float64       
 3   comment_time  3500 non-null   datetime64[ns]
 4   comment_info  3500 non-null   object        
 5   votes_num     3500 non-null   int64         
 6   user_url      3500 non-null   object        
 7   comment_date  3500 non-null   datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(2), object(3)
memory usage: 218.9+ KB

你好,李焕英:34841067

唐人街探案3:27619748

刺杀小说家:26826330

人潮汹涌:34880302

新神榜:哪吒重生:34779692

侍神令:26935283

熊出没·狂野大陆:34825886

Daily comment-count trend

# Daily comment counts for each movie
lhy = comments[comments['movie_id'] == 34841067].groupby('comment_date').count()
trj = comments[comments['movie_id'] == 27619748].groupby('comment_date').count()
csxsj = comments[comments['movie_id'] == 26826330].groupby('comment_date').count()
rcxr = comments[comments['movie_id'] == 34880302].groupby('comment_date').count()
nzcs = comments[comments['movie_id'] == 34779692].groupby('comment_date').count()
ssl = comments[comments['movie_id'] == 26935283].groupby('comment_date').count()
xcm = comments[comments['movie_id'] == 34825886].groupby('comment_date').count()
from datetime import datetime

lhy_index = lhy.index
lhy_index = pd.DataFrame(lhy_index)
lhy_index['comment_date']= lhy_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
trj_index = trj.index
trj_index = pd.DataFrame(trj_index)
trj_index['comment_date']= trj_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
csxsj_index = csxsj.index
csxsj_index = pd.DataFrame(csxsj_index)
csxsj_index['comment_date']= csxsj_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
rcxr_index = rcxr.index
rcxr_index = pd.DataFrame(rcxr_index)
rcxr_index['comment_date']= rcxr_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
nzcs_index = nzcs.index
nzcs_index = pd.DataFrame(nzcs_index)
nzcs_index['comment_date']= nzcs_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
ssl_index = ssl.index
ssl_index = pd.DataFrame(ssl_index)
ssl_index['comment_date']= ssl_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
xcm_index = xcm.index
xcm_index = pd.DataFrame(xcm_index)
xcm_index['comment_date']= xcm_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
import pyecharts.options as opts
from pyecharts.charts import Line


c1 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(lhy_index['comment_date'].tolist())
    .add_yaxis(
        series_name="你好,李焕英",
        stack="评论数1",
        y_axis=lhy['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
)
c2 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(trj_index['comment_date'].tolist())
    .add_yaxis(
        series_name="唐人街探案3",
        stack="评论数2",
        y_axis=trj['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
)
c3 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(csxsj_index['comment_date'].tolist())
    .add_yaxis(
        series_name="刺杀小说家",
        stack="评论数3",
        y_axis=csxsj['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
)
c4 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(rcxr_index['comment_date'].tolist())
    .add_yaxis(
        series_name="人潮汹涌",
        stack="评论数4",
        y_axis=rcxr['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
)
c5 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(nzcs_index['comment_date'].tolist())
    .add_yaxis(
        series_name="新神榜:哪吒重生",
        stack="评论数5",
        y_axis=nzcs['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
)
c6 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(ssl_index['comment_date'].tolist())
    .add_yaxis(
        series_name="侍神令",
        stack="评论数6",
        y_axis=ssl['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
)
c7 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(xcm_index['comment_date'].tolist())
    .add_yaxis(
        series_name="熊出没·狂野大陆",
        stack="评论数7",
        y_axis=xcm['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
     .set_global_opts(
        title_opts=opts.TitleOpts(title="日评论数量趋势"),
        tooltip_opts=opts.TooltipOpts(trigger="axis"),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
    )
)

c7.overlap(c1)
c7.overlap(c2)
c7.overlap(c3)
c7.overlap(c4)
c7.overlap(c5)
c7.overlap(c6)
c7.render("日评论数量趋势.html")
#c7.render_notebook()

[Figure: 日评论数量趋势]

Counting the 3,500 comments on the seven films by date shows that February 12 (the first day of the Lunar New Year, and the films' opening day) had the most comments.
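
The peak date can also be read off programmatically; a minimal sketch:

# Date with the highest total comment count across all seven films
daily_total = comments.groupby('comment_date').size()
print(daily_total.idxmax(), daily_total.max())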

Rating trends by star level

Here we look at the star-level rating trends of the two highest-grossing films, 《你好,李焕英》 and 《唐人街探案3》.
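
Since the same filtering and grouping is repeated for every star level below, a small helper could express the idea once; a minimal sketch (star_daily_counts is a hypothetical helper, the original code spells each case out):

def star_daily_counts(df, movie_id, star):
    # Daily number of comments for one movie at one star level
    sub = df[(df['movie_id'] == movie_id) & (df['rating'] == star)]
    return sub.groupby('comment_date').size()

# Example: daily 5-star comment counts for 《你好,李焕英》
# star_daily_counts(comments, 34841067, 5)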

lhy = comments[comments['movie_id'] == 34841067]
# Daily comment counts for each star level of 《你好,李焕英》
lhy_1 = lhy[lhy['rating'] == 1].groupby('comment_date').count()
lhy_2 = lhy[lhy['rating'] == 2].groupby('comment_date').count()
lhy_3 = lhy[lhy['rating'] == 3].groupby('comment_date').count()
lhy_4 = lhy[lhy['rating'] == 4].groupby('comment_date').count()
lhy_5 = lhy[lhy['rating'] == 5].groupby('comment_date').count()
lhy_1_index = lhy_1.index
lhy_1_index = pd.DataFrame(lhy_1_index)
lhy_1_index['comment_date']= lhy_1_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

lhy_2_index = lhy_2.index
lhy_2_index = pd.DataFrame(lhy_2_index)
lhy_2_index['comment_date']= lhy_2_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

lhy_3_index = lhy_3.index
lhy_3_index = pd.DataFrame(lhy_3_index)
lhy_3_index['comment_date']= lhy_3_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

lhy_4_index = lhy_4.index
lhy_4_index = pd.DataFrame(lhy_4_index)
lhy_4_index['comment_date']= lhy_4_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

lhy_5_index = lhy_5.index
lhy_5_index = pd.DataFrame(lhy_5_index)
lhy_5_index['comment_date']= lhy_5_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
trj = comments[comments['movie_id'] == 27619748]
# Daily comment counts for each star level of 《唐人街探案3》
trj_1 = trj[trj['rating'] == 1].groupby('comment_date').count()
trj_2 = trj[trj['rating'] == 2].groupby('comment_date').count()
trj_3 = trj[trj['rating'] == 3].groupby('comment_date').count()
trj_4 = trj[trj['rating'] == 4].groupby('comment_date').count()
trj_5 = trj[trj['rating'] == 5].groupby('comment_date').count()
trj_1_index = trj_1.index
trj_1_index = pd.DataFrame(trj_1_index)
trj_1_index['comment_date']= trj_1_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

trj_2_index = trj_2.index
trj_2_index = pd.DataFrame(trj_2_index)
trj_2_index['comment_date']= trj_2_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

trj_3_index = trj_3.index
trj_3_index = pd.DataFrame(trj_3_index)
trj_3_index['comment_date']= trj_3_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

trj_4_index = trj_4.index
trj_4_index = pd.DataFrame(trj_4_index)
trj_4_index['comment_date']= trj_4_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))

trj_5_index = trj_5.index
trj_5_index = pd.DataFrame(trj_5_index)
trj_5_index['comment_date']= trj_5_index['comment_date'].apply(lambda x: datetime.strftime(x,'%Y-%m-%d'))
import pyecharts.options as opts
from pyecharts.charts import Line


c1 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(lhy_1_index['comment_date'].tolist())
    .add_yaxis(
        series_name="1星",
        stack="评分数1",
        y_axis=lhy_1['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各级评分趋势-《你好,李焕英》"),
        tooltip_opts=opts.TooltipOpts(trigger="axis"),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
    )
)
c2 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(lhy_2_index['comment_date'].tolist())
    .add_yaxis(
        series_name="2星",
        stack="评分数2",
        y_axis=lhy_2['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)
c3 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(lhy_3_index['comment_date'].tolist())
    .add_yaxis(
        series_name="3星",
        stack="评分数3",
        y_axis=lhy_3['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)
c4 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(lhy_4_index['comment_date'].tolist())
    .add_yaxis(
        series_name="4星",
        stack="评分数4",
        y_axis=lhy_4['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)
c5 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(lhy_5_index['comment_date'].tolist())
    .add_yaxis(
        series_name="5星",
        stack="评分数5",
        y_axis=lhy_5['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)

c1.overlap(c2)
c1.overlap(c3)
c1.overlap(c4)
c1.overlap(c5)
c1.render("各级评分趋势-《你好,李焕英》.html")
#c1.render_notebook()

[Figure: 各级评分趋势-《你好,李焕英》]

import pyecharts.options as opts
from pyecharts.charts import Line


c1 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(trj_1_index['comment_date'].tolist())
    .add_yaxis(
        series_name="1星",
        stack="评分数1",
        y_axis=trj_1['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各级评分趋势-《唐人街探案3》"),
        tooltip_opts=opts.TooltipOpts(trigger="axis"),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
    )
)
c2 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(trj_2_index['comment_date'].tolist())
    .add_yaxis(
        series_name="2星",
        stack="评分数2",
        y_axis=trj_2['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)
c3 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(trj_3_index['comment_date'].tolist())
    .add_yaxis(
        series_name="3星",
        stack="评分数3",
        y_axis=trj_3['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)
c4 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(trj_4_index['comment_date'].tolist())
    .add_yaxis(
        series_name="4星",
        stack="评分数4",
        y_axis=trj_4['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)
c5 = (
    Line(init_opts=opts.InitOpts(width='1100px'))
    .add_xaxis(trj_5_index['comment_date'].tolist())
    .add_yaxis(
        series_name="5星",
        stack="评分数5",
        y_axis=trj_5['movie_id'].tolist(),
        label_opts=opts.LabelOpts(is_show=False),
        is_connect_nones=True
    )
)

c1.overlap(c2)
c1.overlap(c3)
c1.overlap(c4)
c1.overlap(c5)
c1.render("各级评分趋势-《唐人街探案3》.html")
#c1.render_notebook()

[Figure: 各级评分趋势-《唐人街探案3》]

The charts above count the ratings at each star level over time for 《你好,李焕英》 and 《唐人街探案3》.

For 《你好,李焕英》, 5-star and 4-star ratings have been the mainstream since its release, suggesting both high expectations and good word of mouth after viewing.

For 《唐人街探案3》, by contrast, 1-star and 2-star ratings have dominated since release: expectations were high going in, but the film fell short of them, which drags its score down.
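
To complement the per-star counts, the daily mean rating of the two films could also be compared; a minimal sketch using the lhy and trj frames defined above:

# Daily average star rating (unrated comments dropped)
lhy_daily_mean = lhy.dropna(subset=['rating']).groupby('comment_date')['rating'].mean()
trj_daily_mean = trj.dropna(subset=['rating']).groupby('comment_date')['rating'].mean()
print(lhy_daily_mean.head())
print(trj_daily_mean.head())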

Topic focus of the Douban comments

import jieba

def stopwordslist(filepath):   # Build the stop-word list
    # Read the stop-word file line by line into a list (UTF-8 assumed)
    with open(filepath, 'r', encoding='utf-8') as f:
        stopword = [line.strip() for line in f.readlines()]
    return stopword

def cutsentences(sentences):     # Tokenize a sentence and drop stop words
    cutsentence = jieba.lcut(sentences.strip())     # jieba precise mode
    stopwords = stopwordslist(filepath)     # load the stop words from filepath
    lastsentences = []
    for word in cutsentence:     # loop over each token
        if word not in stopwords:     # keep only tokens not in the stop-word list
            if word != '\t':
                lastsentences.append(word)
    return lastsentences

filepath= 'stop_words.txt'  

text = None
lhy_comment = comments[comments['movie_id'] == 34841067]['comment_info'].tolist()
text = ''.join(lhy_comment).replace('\n','').replace(' ','')

stopwordslist(filepath)
seg_list = cutsentences(text)
dict_list = {}

for seg in seg_list: 
    if(dict_list.get(seg) != None): 
        dict_list[seg] += 1 
    else: dict_list[seg] = 1
            
sort_list = sorted(dict_list.items(), key=lambda item: item[1], reverse=True)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\KAME\AppData\Local\Temp\jieba.cache
Loading model cost 0.635 seconds.
Prefix dict has been built successfully.
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType

c = (
    WordCloud()
    .add("", sort_list[:50], word_size_range=[20, 100], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="电影豆瓣评论话题焦点分析", subtitle="《你好,李焕英》"))
    .render("电影豆瓣评论话题焦点分析-lhy.html")
)
#c.render_notebook()

[Figure: 电影豆瓣评论话题焦点分析 word cloud - 《你好,李焕英》]

The word cloud suggests why 《你好,李焕英》 won on both reviews and box office: the theme of "mother", combined with 贾玲's sketch-comedy style of storytelling, delivers both laughs and genuine emotion.
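
As an aside, the word-frequency tallying above (and repeated below for 《唐人街探案3》) could be written more compactly with collections.Counter; a sketch equivalent to the manual dict loop:

from collections import Counter

# Top 50 (word, count) pairs, equivalent to sort_list[:50] above
top_words = Counter(seg_list).most_common(50)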

text = None
trj_comment = comments[comments['movie_id'] == 27619748]['comment_info'].tolist()
text = ''.join(trj_comment).replace('\n','').replace(' ','')

stopwordslist(filepath)
seg_list = cutsentences(text)
dict_list = {}

for seg in seg_list: 
    if(dict_list.get(seg) != None): 
        dict_list[seg] += 1 
    else: dict_list[seg] = 1
            
sort_list = sorted(dict_list.items(), key=lambda item: item[1], reverse=True)
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType

c = (
    WordCloud()
    .add("", sort_list[:50], word_size_range=[20, 100], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="电影豆瓣评论话题焦点分析", subtitle="《唐人街探案3》"))
    .render("电影豆瓣评论话题焦点分析-trj.html")
)
#c.render_notebook()

[Figure: 电影豆瓣评论话题焦点分析 word cloud - 《唐人街探案3》]

The word cloud suggests that although 《唐人街探案3》 grossed a lot, its low ratings stem mainly from a plot widely described as vulgar and greasy; the greater the expectations, the greater the disappointment.