[Project Practice] Amazon Kindle E-book Business Analysis
in Study Notes with 0 comments


Project Background

image-20210512142513599.png

Line chart of the share of American adults who own an e-reader or a tablet over time

As the chart shows, e-readers and tablets, the main devices for reading e-books, are now very widespread, indicating that the e-book is already a highly popular sales channel.

Data description:

image-20210512142810308.png

Product detail page of a Kindle fiction e-book (sold through both the print and the e-book channel) and its related information

image-20210512142912209.png

ASIN, publication (listing) date, sales rank, and review information in the Product details section

image-20210512143412980.png

Review information on the listing detail page (including when each review was posted and its rating)

image-20210512143436900.png

Rank and price-fluctuation information displayed by Helium 10, a third-party Amazon tool

Data-Processing Approach

image-20210512144210685.png

Distribution of the difference between the Kindle e-book listing date and the paperback listing date

Workflow: data collection → data cleaning → data analysis

Data Processing

The code below takes the K-H-P (Kindle-hardcover-paperback) release order as its example.
There are six release-order channels in total: H-K-P, H-P-K, K-H-P, K-P-H, P-H-K, P-K-H.
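
Only the K-H-P folder is processed in what follows; a minimal sketch for iterating over all six release-order folders (the folder layout is an assumption inferred from the K-H-P path used below) might look like this:

import os

# Assumed layout: one sub-folder per release order under the kindle directory
base_path = 'C:\\Users\\KAME\\Desktop\\Python\\kindle'
orders = ['H-K-P', 'H-P-K', 'K-H-P', 'K-P-H', 'P-H-K', 'P-K-H']

# Collect the Excel file list of every release-order folder
files_by_order = {order: os.listdir(os.path.join(base_path, order)) for order in orders}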

Batch reading of files/data

import os

# Read all Excel files in the K-H-P folder
file_path='C:\\Users\\KAME\\Desktop\\Python\\kindle\\K-H-P'
file_list=os.listdir(file_path)
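
os.listdir returns every entry in the folder, so anything that is not a data workbook would also be picked up. A small defensive refinement (only needed if the folder is not guaranteed to contain Excel files exclusively) is to filter the list and build full paths:

# Keep only Excel workbooks and build absolute paths
excel_files = [os.path.join(file_path, f) for f in file_list if f.lower().endswith(('.xls', '.xlsx'))]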

Data processing

# Dictionary 1: container for the regression variables
data_regression={'kindleBSR':[],'lnkindleBSR':[],'kindleNP':[],'lnkindleNP':[],'dummy_hardbook':[],
                     'hardbookBSR':[],'hardbookNP':[],'dummy_paperbook':[],'paperbookBSR':[],'paperbookNP':[],'time':[],'lntime':[],
                     'time_lnprice':[],'sub_paper_lnprice':[],'sub_hard_lnprice':[],'dummy_best_paperbook':[],'dummy_best_hardbook':[],
                'dummy_printedbook':[],'dummy_best_printedbook':[],'sub_printed_lnprice':[],'constant':[],'Nreview':[],
                 'lnNreview':[],'review':[],'lnreview':[],'high review fraction':[],'ln_high review fraction':[],
                'ratio_review':[],'ln_ratio_review':[],'RP':[],'DPR':[],'lnRP':[],'lnDPR':[]}
import numpy as np
import pandas as pd

maximum=0
timer=0
average_sum=0
ASIN=''
number=0
sum_R_Statistic=0
list_R_Statistic=[]
number_lowR_Statistic=0

for asin in file_list:
    data=pd.read_excel(file_path+'\\'+asin) 
    
    # each Excel file contains 547 days of observations
    for j in range(547):
        # Step 1: keep the day only if kindleBSR is present (non-null)
        if data.loc[j,'kindleBSR']>=0: 
            data_regression['kindleBSR'].append(data.loc[j,'kindleBSR'])
            data_regression['kindleNP'].append(data.loc[j,'kindleNP'])
            data_regression['time'].append(data.loc[j,'KindleStarttime'])
            data_regression['constant'].append(1)
            data_regression['RP'].append(data.loc[j,'PSUM'])
            if data.loc[j,'review评分']>=0:
                data_regression['review'].append(data.loc[j,'review评分'])
                data_regression['high review fraction'].append((data.loc[j,'5.0']+data.loc[j,'4.0'])/data.loc[j,'review总数'])
            else:
                data_regression['review'].append(0)
                data_regression['high review fraction'].append(0)
            if j==0:
                data_regression['ratio_review'].append(1)
            else:
                data_regression['ratio_review'].append(data.loc[j,'review总数']/data.loc[j-1,'review总数'])
            if data.loc[j,'review总数']>=0:
                data_regression['Nreview'].append(data.loc[j,'review总数'])
            else:
                data_regression['Nreview'].append(0)
            
            # Step 2: check whether the matching paperbookBSR and hardbookBSR are present
            if data.loc[j,'paperbookBSR']>=0:
                data_regression['dummy_paperbook'].append(1)
                data_regression['paperbookBSR'].append(data.loc[j,'paperbookBSR'])
                data_regression['paperbookNP'].append(data.loc[j,'paperbookNP'])
                if data.loc[j,'paperbookBSR']<=80845:
                    data_regression['dummy_best_paperbook'].append(1)
                else:
                    data_regression['dummy_best_paperbook'].append(0)
            else:
                data_regression['dummy_paperbook'].append(0)
                data_regression['paperbookBSR'].append(np.nan)
                data_regression['paperbookNP'].append(np.nan)
                data_regression['dummy_best_paperbook'].append(0)
            if data.loc[j,'hardbookBSR']>=0:
                data_regression['dummy_hardbook'].append(1)
                data_regression['hardbookBSR'].append(data.loc[j,'hardbookBSR'])
                data_regression['hardbookNP'].append(data.loc[j,'hardbookNP'])
                if data.loc[j,'hardbookBSR']<=146049:
                    data_regression['dummy_best_hardbook'].append(1)
                else:
                    data_regression['dummy_best_hardbook'].append(0)
            else:
                data_regression['dummy_hardbook'].append(0)
                data_regression['hardbookBSR'].append(np.nan)
                data_regression['hardbookNP'].append(np.nan)
                data_regression['dummy_best_hardbook'].append(0)
            if (data.loc[j,'hardbookBSR']>=0) or (data.loc[j,'paperbookBSR']>=0):
                data_regression['dummy_printedbook'].append(1)
            else:
                data_regression['dummy_printedbook'].append(0)
            if (data.loc[j,'paperbookBSR']<=80845) or (data.loc[j,'hardbookBSR']<=146049):
                data_regression['dummy_best_printedbook'].append(1)
            else:
                data_regression['dummy_best_printedbook'].append(0)
    # progress indicator
    timer=timer+1
    if timer in (50, 100, 200, 400, 600, 800, 1000, 1200):
        print(f"Has been executed {timer} times")
D:\Program\Anaconda\lib\site-packages\ipykernel_launcher.py:27: RuntimeWarning: invalid value encountered in longlong_scalars
D:\Program\Anaconda\lib\site-packages\ipykernel_launcher.py:34: RuntimeWarning: invalid value encountered in longlong_scalars
D:\Program\Anaconda\lib\site-packages\ipykernel_launcher.py:34: RuntimeWarning: divide by zero encountered in longlong_scalars


Has been executed 50 times
Has been executed 100 times
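
The RuntimeWarnings above are most likely triggered by the two divisions inside the loop (the high review fraction and the ratio_review ratio) when review总数 is zero or missing. A minimal guard for the ratio_review case, assuming a zero or missing denominator should fall back to a ratio of 1, could use a small hypothetical helper:

# Sketch: guard the review-count ratio against a zero/NaN denominator
def safe_review_ratio(curr_total, prev_total):
    # NaN > 0 and 0 > 0 are both False, so either case falls back to 1
    if prev_total > 0:
        return curr_total / prev_total
    return 1

# inside the loop it would be called as, e.g.:
# data_regression['ratio_review'].append(
#     safe_review_ratio(data.loc[j, 'review总数'], data.loc[j-1, 'review总数']) if j > 0 else 1)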

Mathematical Modeling

image-20210512171959445.png

image-20210512172008325.png

image-20210512172054069.png

RP: reference price

AP: average price of all e-books on day t

H_BP: price of the hardcover edition corresponding to fiction e-book i

P_BP: price of the paperback edition corresponding to fiction e-book i

H_t: release date of the hardcover edition corresponding to fiction e-book i

P_t: release date of the paperback edition corresponding to fiction e-book i

for i_dummy_paperback in range (len(data_regression['kindleBSR'])):
    data_regression['DPR'].append('nan')
    data_regression['time_lnprice'].append('nan')
    data_regression['sub_paper_lnprice'].append('nan')
    data_regression['sub_hard_lnprice'].append('nan')
    data_regression['lnkindleBSR'].append('nan')
    data_regression['lnkindleNP'].append('nan')
    data_regression['sub_printed_lnprice'].append('nan')
    data_regression['lnNreview'].append('nan')
    data_regression['lnreview'].append('nan')
    data_regression['ln_high review fraction'].append('nan')
    data_regression['lnRP'].append('nan')
    data_regression['lnDPR'].append('nan')
for j in range(len(data_regression['kindleBSR'])):
    # log-transform kindleBSR
    data_regression['lnkindleBSR'][j]=np.log(data_regression['kindleBSR'][j])
    data_regression['DPR'][j]=np.abs(data_regression['kindleNP'][j]-data_regression['RP'][j])
    
    data_regression['lnRP'][j]=np.log(data_regression['RP'][j])
    data_regression['lnDPR'][j]=np.log(data_regression['DPR'][j])
    
    # log-transform kindleNP
    data_regression['lnkindleNP'][j]=np.log(data_regression['kindleNP'][j])
    # log-transform time
    data_regression['lntime'].append(np.log(data_regression['time'][j]))
    # log-transform ratio_review (append, so the whole column is filled rather than overwritten by a scalar)
    data_regression['ln_ratio_review'].append(np.log(data_regression['ratio_review'][j]))
    if data_regression['Nreview'][j]!=0:
        data_regression['lnNreview'][j]=np.log(data_regression['Nreview'][j])
        data_regression['ln_high review fraction'][j]=np.log(data_regression['high review fraction'][j])
    else:
        data_regression['lnNreview'][j]=0
        data_regression['ln_high review fraction'][j]=0
    if data_regression['review'][j]!=0:
        data_regression['lnreview'][j]=np.log(data_regression['review'][j])
    else:
        data_regression['lnreview'][j]=0
for n in range(len(data_regression['kindleBSR'])):
    data_regression['time_lnprice'][n]=data_regression['time'][n]*data_regression['lnkindleNP'][n]
    data_regression['sub_paper_lnprice'][n]=data_regression['dummy_paperbook'][n]*data_regression['lnkindleNP'][n]
    data_regression['sub_hard_lnprice'][n]=data_regression['dummy_hardbook'][n]*data_regression['lnkindleNP'][n]
    data_regression['sub_printed_lnprice'][n]=data_regression['dummy_printedbook'][n]*data_regression['lnkindleNP'][n]
df = pd.DataFrame(data_regression)
df
kindleBSR lnkindleBSR kindleNP lnkindleNP dummy_hardbook hardbookBSR hardbookNP dummy_paperbook paperbookBSR paperbookNP ... review lnreview high review fraction ln_high review fraction ratio_review ln_ratio_review RP DPR lnRP lnDPR
0 72021.0 11.184713 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.380960 0.750760 -0.286669 1.000000 0.0 9.93186 1.94186 2.295748 0.663646
1 8095.0 8.999002 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.380960 0.750760 -0.286669 1.000000 0.0 9.92274 1.93274 2.294829 0.658939
2 14060.0 9.551089 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.380960 0.750760 -0.286669 1.000000 0.0 9.88697 1.89697 2.291218 0.640258
3 23748.0 10.075254 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.380960 0.750760 -0.286669 1.000000 0.0 9.87099 1.88099 2.289600 0.631798
4 31617.0 10.361450 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.380960 0.750760 -0.286669 1.000000 0.0 9.88284 1.89284 2.290800 0.638078
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24556 179752.0 12.099333 3.99 1.383791 0 NaN NaN 0 NaN NaN ... 4.66667 1.540446 1.000000 0.000000 1.000000 0.0 9.25201 5.26201 2.224841 1.660513
24557 308305.0 12.638845 3.99 1.383791 0 NaN NaN 0 NaN NaN ... 4.54545 1.514127 0.909091 -0.095310 1.222222 0.0 9.25598 5.26598 2.225270 1.661267
24558 390891.0 12.876184 3.99 1.383791 0 NaN NaN 0 NaN NaN ... 4.54545 1.514127 0.909091 -0.095310 1.000000 0.0 9.25923 5.26923 2.225621 1.661884
24559 428523.0 12.968100 3.99 1.383791 0 NaN NaN 0 NaN NaN ... 4.50000 1.504077 0.916667 -0.087011 1.090909 0.0 9.24369 5.25369 2.223941 1.658931
24560 437319.0 12.988418 3.99 1.383791 0 NaN NaN 0 NaN NaN ... 4.50000 1.504077 0.916667 -0.087011 1.000000 0.0 9.24944 5.25944 2.224563 1.660025

24561 rows × 33 columns

df.head()
kindleBSR lnkindleBSR kindleNP lnkindleNP dummy_hardbook hardbookBSR hardbookNP dummy_paperbook paperbookBSR paperbookNP ... review lnreview high review fraction ln_high review fraction ratio_review ln_ratio_review RP DPR lnRP lnDPR
0 72021.0 11.184713 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.38096 0.75076 -0.286669 1.0 0.0 9.93186 1.94186 2.295748 0.663646
1 8095.0 8.999002 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.38096 0.75076 -0.286669 1.0 0.0 9.92274 1.93274 2.294829 0.658939
2 14060.0 9.551089 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.38096 0.75076 -0.286669 1.0 0.0 9.88697 1.89697 2.291218 0.640258
3 23748.0 10.075254 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.38096 0.75076 -0.286669 1.0 0.0 9.87099 1.88099 2.289600 0.631798
4 31617.0 10.361450 7.99 2.078191 0 NaN NaN 0 NaN NaN ... 3.97872 1.38096 0.75076 -0.286669 1.0 0.0 9.88284 1.89284 2.290800 0.638078

5 rows × 33 columns
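
Before fitting the model it is worth a quick check that the columns entering the regression are numeric and contain no missing values (the print-channel columns were deliberately padded with NaN where no print edition exists). A small sanity-check sketch:

# Sanity check: regression columns should be numeric and complete
reg_cols = ['lnkindleBSR', 'lnkindleNP', 'lnNreview', 'lnreview', 'time_lnprice', 'dummy_best_printedbook', 'constant']
print(df[reg_cols].dtypes)
print(df[reg_cols].isna().sum())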

import statsmodels.api as sm

# Fit a multiple linear regression (OLS) model
model = sm.OLS(df['lnkindleBSR'], df[['constant','lnkindleNP','lnNreview','lnreview','time_lnprice','dummy_best_printedbook']])
results1= model.fit()
# Inspect the regression results
results1.summary()
OLS Regression Results
Dep. Variable: lnkindleBSR R-squared: 0.469
Model: OLS Adj. R-squared: 0.469
Method: Least Squares F-statistic: 4331.
Date: Wed, 12 May 2021 Prob (F-statistic): 0.00
Time: 18:53:41 Log-Likelihood: -50986.
No. Observations: 24561 AIC: 1.020e+05
Df Residuals: 24555 BIC: 1.020e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
constant 11.6742 0.059 198.788 0.000 11.559 11.789
lnkindleNP 0.1209 0.024 4.997 0.000 0.073 0.168
lnNreview -0.5993 0.008 -71.798 0.000 -0.616 -0.583
lnreview 0.9002 0.033 27.040 0.000 0.835 0.965
time_lnprice 0.0008 3.1e-05 26.797 0.000 0.001 0.001
dummy_best_printedbook -2.5432 0.030 -85.362 0.000 -2.602 -2.485
Omnibus: 1312.749 Durbin-Watson: 0.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1536.444
Skew: -0.589 Prob(JB): 0.00
Kurtosis: 3.335 Cond. No. 4.58e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
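
Warning [2] about the large condition number hints at possible multicollinearity; one hedged way to quantify it is to compute variance inflation factors for the same design matrix:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each regressor (values far above ~10 are usually read as a multicollinearity signal)
X = df[['constant', 'lnkindleNP', 'lnNreview', 'lnreview', 'time_lnprice', 'dummy_best_printedbook']]
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))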

Evaluating Regression Quality and Validity

from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson test
durbin_watson(results1.resid)
0.08445253608344577

The Durbin-Watson statistic of the first regression is far too low (close to 0, indicating strong positive autocorrelation in the residuals), so the model needs to be improved.
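
As a rule of thumb the Durbin-Watson statistic is roughly 2(1 - rho), where rho is the first-order autocorrelation of the residuals, so a value near 0 means rho is close to 1. A quick sketch to recompute the statistic and back out the implied autocorrelation:

# DW = sum of squared successive residual differences / sum of squared residuals
resid = results1.resid.values
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
rho = 1 - dw / 2
print(dw, rho)   # dw is about 0.084 here, implying rho of roughly 0.96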

image-20210512172112698.png

Mathematical model 1 after first-differencing:

image-20210512172133304.png

Mathematical model 2 after first-differencing:

image-20210512172144821.png

Regress the squared OLS residuals on the original regressors and judge heteroskedasticity from n·R^2 (the Breusch-Pagan LM statistic), as computed below.

residuals = results1.resid**2
model_2 = sm.OLS(residuals, df[['constant','lnkindleNP','lnNreview','lnreview','time_lnprice','dummy_best_printedbook']])
result2 = model_2.fit()
result2.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.045
Model: OLS Adj. R-squared: 0.045
Method: Least Squares F-statistic: 231.2
Date: Wed, 12 May 2021 Prob (F-statistic): 5.40e-242
Time: 18:53:41 Log-Likelihood: -76972.
No. Observations: 24561 AIC: 1.540e+05
Df Residuals: 24555 BIC: 1.540e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
constant -0.2855 0.169 -1.688 0.091 -0.617 0.046
lnkindleNP 1.4146 0.070 20.303 0.000 1.278 1.551
lnNreview 0.4002 0.024 16.646 0.000 0.353 0.447
lnreview -0.0090 0.096 -0.094 0.925 -0.197 0.179
time_lnprice -0.0013 8.93e-05 -14.924 0.000 -0.002 -0.001
dummy_best_printedbook 0.4848 0.086 5.649 0.000 0.317 0.653
Omnibus: 17861.639 Durbin-Watson: 0.151
Prob(Omnibus): 0.000 Jarque-Bera (JB): 452302.595
Skew: 3.236 Prob(JB): 0.00
Kurtosis: 23.002 Cond. No. 4.58e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
24561*0.045
1105.245
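
n·R^2 = 24561 × 0.045 ≈ 1105, far above the 5% chi-squared critical value with 5 degrees of freedom (about 11.07), so the null of homoskedasticity is clearly rejected. The same LM statistic can be cross-checked with the built-in Breusch-Pagan test, for example:

from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import chi2

# het_breuschpagan regresses the squared residuals on the regressors and reports n*R^2 as the LM statistic
exog = df[['constant', 'lnkindleNP', 'lnNreview', 'lnreview', 'time_lnprice', 'dummy_best_printedbook']]
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results1.resid, exog)
print(lm_stat, lm_pvalue)
print(chi2.ppf(0.95, df=5))   # about 11.07, the 5% critical value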

Correcting for Heteroskedasticity

# Dictionary 2: first-differenced ("sanjiao", i.e. delta) variables
data_regression2={'sanjiao_lnkindleBSR':[],'sanjiao_lnkindleNP':[],'sanjiao_lnNreview':[],'sanjiao_lnreview':[],
                  'sanjiao_time_lnprice':[],'sanjiao_dummy_printedbook':[],'sanjiao_dummy_best_printedbook':[],
                  'constant':[],'sanjiao_lnRP':[],'sanjiao_lnDPR':[]}

# 24561 observations yield 24560 first differences
for i in range(24560):
    data_regression2['sanjiao_lnkindleBSR'].append(data_regression['lnkindleBSR'][i+1]-data_regression['lnkindleBSR'][i])
    data_regression2['sanjiao_lnkindleNP'].append(data_regression['lnkindleNP'][i+1]-data_regression['lnkindleNP'][i])
    data_regression2['sanjiao_lnNreview'].append(data_regression['lnNreview'][i+1]-data_regression['lnNreview'][i])
    data_regression2['sanjiao_lnreview'].append(data_regression['lnreview'][i+1]-data_regression['lnreview'][i])
    data_regression2['sanjiao_time_lnprice'].append(data_regression['time_lnprice'][i+1]-data_regression['time_lnprice'][i])
    data_regression2['sanjiao_dummy_printedbook'].append(data_regression['dummy_printedbook'][i+1]-data_regression['dummy_printedbook'][i])
    data_regression2['sanjiao_dummy_best_printedbook'].append(data_regression['dummy_best_printedbook'][i+1]-data_regression['dummy_best_printedbook'][i])
    data_regression2['constant'].append(1)
    data_regression2['sanjiao_lnRP'].append(data_regression['lnRP'][i+1]-data_regression['lnRP'][i])
    data_regression2['sanjiao_lnDPR'].append(data_regression['lnDPR'][i+1]-data_regression['lnDPR'][i])

df2=pd.DataFrame(data_regression2)
df2
sanjiao_lnkindleBSR sanjiao_lnkindleNP sanjiao_lnNreview sanjiao_lnreview sanjiao_time_lnprice sanjiao_dummy_printedbook sanjiao_dummy_best_printedbook constant sanjiao_lnRP sanjiao_lnDPR
0 -2.185711 0.0 0.000000 0.000000 2.078191 0 0 1 -0.000919 -0.004708
1 0.552087 0.0 0.000000 0.000000 2.078191 0 0 1 -0.003611 -0.018681
2 0.524164 0.0 0.000000 0.000000 2.078191 0 0 1 -0.001618 -0.008460
3 0.286197 0.0 0.000000 0.000000 2.078191 0 0 1 0.001200 0.006280
4 0.670985 0.0 0.000000 0.000000 2.078191 0 0 1 0.000858 0.004470
... ... ... ... ... ... ... ... ... ... ...
24555 -0.358797 0.0 0.000000 0.000000 1.383791 0 0 1 0.002964 0.005217
24556 0.539511 0.0 0.200671 -0.026319 1.383791 0 0 1 0.000429 0.000754
24557 0.237339 0.0 0.000000 0.000000 1.383791 0 0 1 0.000351 0.000617
24558 0.091916 0.0 0.087011 -0.010049 1.383791 0 0 1 -0.001680 -0.002954
24559 0.020318 0.0 0.000000 0.000000 1.383791 0 0 1 0.000622 0.001094

24560 rows × 10 columns
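
Before applying the weights it can be informative to re-fit plain OLS on the differenced data and re-check the Durbin-Watson statistic, to confirm that differencing has dealt with the autocorrelation; a brief sketch:

# Re-check autocorrelation on the first-differenced model (unweighted OLS)
diff_cols = ['constant', 'sanjiao_lnkindleNP', 'sanjiao_lnNreview', 'sanjiao_lnreview', 'sanjiao_time_lnprice', 'sanjiao_dummy_best_printedbook']
ols_diff = sm.OLS(df2['sanjiao_lnkindleBSR'], df2[diff_cols]).fit()
print(durbin_watson(ols_diff.resid))   # should now sit much closer to 2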

from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# First pass: plain OLS on the differenced data to obtain residuals
LR=LinearRegression()
LR.fit(df2[['sanjiao_lnkindleNP','sanjiao_lnNreview','sanjiao_lnreview','sanjiao_time_lnprice','sanjiao_dummy_best_printedbook']],df2['sanjiao_lnkindleBSR'])

# Use the inverse of the absolute first-pass residuals as WLS weights
Resid=(df2['sanjiao_lnkindleBSR'].values-LR.predict(df2[['sanjiao_lnkindleNP','sanjiao_lnNreview','sanjiao_lnreview','sanjiao_time_lnprice','sanjiao_dummy_best_printedbook']]))
Resid = np.abs(Resid)
Resid=1/(Resid)

# Second pass: weighted least squares with the estimated weights
wls_model = sm.WLS(df2['sanjiao_lnkindleBSR'], df2[['constant','sanjiao_lnkindleNP','sanjiao_lnNreview','sanjiao_lnreview','sanjiao_time_lnprice','sanjiao_dummy_best_printedbook']], weights=Resid)
results= wls_model.fit()

results.summary()
WLS Regression Results
Dep. Variable: sanjiao_lnkindleBSR R-squared: 0.205
Model: WLS Adj. R-squared: 0.205
Method: Least Squares F-statistic: 1264.
Date: Wed, 12 May 2021 Prob (F-statistic): 0.00
Time: 18:53:41 Log-Likelihood: 7862.7
No. Observations: 24560 AIC: -1.571e+04
Df Residuals: 24554 BIC: -1.566e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
constant 0.0021 0.000 6.289 0.000 0.001 0.003
sanjiao_lnkindleNP -0.2992 0.029 -10.482 0.000 -0.355 -0.243
sanjiao_lnNreview -0.5843 0.009 -63.746 0.000 -0.602 -0.566
sanjiao_lnreview 0.3774 0.012 30.445 0.000 0.353 0.402
sanjiao_time_lnprice 0.0006 5.17e-05 11.096 0.000 0.000 0.001
sanjiao_dummy_best_printedbook -0.1749 0.006 -30.498 0.000 -0.186 -0.164
Omnibus: 1143.328 Durbin-Watson: 2.385
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1308.666
Skew: -0.552 Prob(JB): 6.71e-285
Kurtosis: 3.245 Cond. No. 1.04e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
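
As a hedged alternative to the manual two-step FWLS above, one could keep the differenced OLS point estimates and simply base inference on heteroskedasticity-robust standard errors:

# Sketch: same differenced regressors, but with an HC3 (heteroskedasticity-consistent) covariance estimator
robust_model = sm.OLS(df2['sanjiao_lnkindleBSR'],
                      df2[['constant', 'sanjiao_lnkindleNP', 'sanjiao_lnNreview', 'sanjiao_lnreview',
                           'sanjiao_time_lnprice', 'sanjiao_dummy_best_printedbook']])
robust_results = robust_model.fit(cov_type='HC3')
robust_results.summary()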

Judging the Degree to Which E-books Affect Print-Book Sales and Identifying the Influencing Factors

The effect of different factors on e-book sales under the six release-order strategies

image-20210512190712424.png

image-20210512190720144.png

Applications to Amazon's Multi-channel E-book Management

  1. Because consumers' price elasticity for e-books declines over time, e-book retailers can raise prices gradually and capture more profit.
  2. When the hardcover or paperback edition corresponding to an e-book has just been released, e-book retailers can exploit the shift in consumers' reference price to increase e-book profit.
  3. When the Kindle edition and the hardcover are the first two channels to be released, this signals that the e-book has relatively high content quality, so e-book retailers can gradually raise its price.