Time series modeling and forecasting of store sales in Python with LSTM and XGBoost

2021-06-09 11:09 Author: 拓端tecdat

Original link: http://tecdat.cn/?p=17748

Original source: 拓端數據部落 WeChat official account

In my data science learning journey, I regularly work with time series datasets in my daily work and make forecasts from them.

I will go through the following steps:

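Exploratory data analysis (EDA)

Problem definition (what we want to solve)
Variable identification (what data we have)
Univariate analysis (understanding each field in the dataset)
Multivariate analysis (understanding how the fields interact with each other and with the target)
Missing value treatment
Outlier treatment
Variable transformation

Predictive modeling

LSTM
XGBoost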

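Problem definition

We are given the following information about the stores in two different tables:

Store: the ID of each store
Sales: the turnover on a given date (our target variable)
Customers: the number of customers on a given date
StateHoliday: state holiday
SchoolHoliday: school holiday
StoreType: four different store types: a, b, c, d
CompetitionDistance: distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year]: the approximate year and month in which the nearest competitor opened
Promo: whether the store ran a promotion that day
Promo2: a continuing and consecutive promotion for some stores: 0 = the store is not participating, 1 = the store is participating
PromoInterval: the consecutive intervals in which Promo2 is started, naming the months in which the promotion starts anew

Using all of this information, we forecast sales for the next 6 weeks.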

# Import the libraries needed for the EDA:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
plt.style.use("ggplot") # plot style
# Load the training and test files:
train_df = pd.read_csv("../Data/train.csv")
test_df = pd.read_csv("../Data/test.csv")
# Assumption: the original never shows store_df being loaded; this path is a guess
store_df = pd.read_csv("../Data/store.csv")
# How much data is in each file:
print("In the training set we have", train_df.shape[0], "observations and", train_df.shape[1], "columns/variables.")
print("In the test set we have", test_df.shape[0], "observations and", test_df.shape[1], "columns/variables.")
print("In the store set we have", store_df.shape[0], "observations and", store_df.shape[1], "columns/variables.")

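In the training set we have 1017209 observations and 9 columns/variables.
In the test set we have 41088 observations and 8 columns/variables.
In the store set we have 1115 observations and 10 columns/variables.

First, let's clean the training dataset.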

# Inspect the data
pd.concat([train_df.head(), train_df.tail()]) # first and last 5 rows (DataFrame.append was removed in pandas 2.0)

train_df.isnull().all()
Out[5]:
Store            False
DayOfWeek        False
Date             False
Sales            False
Customers        False
Open             False
Promo            False
StateHoliday     False
SchoolHoliday    False
dtype: bool

Let's start with the first variable -> Sales

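# Keep only the rows where the store was open. The source line is cut off after
# this condition, so any further filter it applied is unknown and the statistics
# below may not match this simplified selection.
opened_sales = train_df[train_df.Open == 1]
opened_sales.Sales.describe()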
Out[6]:
count    422307.000000
mean       6951.782199
std        3101.768685
min         133.000000
25%        4853.000000
50%        6367.000000
75%        8355.000000
max       41551.000000
Name: Sales, dtype: float64

Now let's look at the Customers variable

In [9]:
train_df.Customers.describe()
Out[9]:
count    1.017209e+06
mean     6.331459e+02
std      4.644117e+02
min      0.000000e+00
25%      4.050000e+02
50%      6.090000e+02
75%      8.370000e+02
max      7.388000e+03
Name: Customers, dtype: float64

train_df[(train_df.Customers > 6000)]

Let's look at the holiday variable.

train_df.StateHoliday.value_counts()

0    855087
0    131072
a     20260
b      6690
c      4100
Name: StateHoliday, dtype: int64
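The two separate rows labelled 0 suggest that the column mixes the integer 0 with the string "0". A minimal sketch of one way to collapse them and derive a categorical column follows. The toy data is made up, and the integer encoding is an assumption: the source only shows that a StateHoliday_cat column exists afterwards, not how it was built.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy = pd.DataFrame({"StateHoliday": [0, "0", "a", "b", "c", 0]})

# Cast everything to str so the integer 0 and the string "0" become one label.
toy["StateHoliday"] = toy["StateHoliday"].astype(str)

# Hypothetical encoding: integer codes in sorted label order ("0", "a", "b", "c").
toy["StateHoliday_cat"] = toy["StateHoliday"].astype("category").cat.codes

counts = toy["StateHoliday"].value_counts()
print(counts)
```

After the cast, value_counts reports a single "0" row, which makes the later count of the new column easier to sanity-check.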

train_df.StateHoliday_cat.count()

1017209

train_df.tail()

train_df.isnull().all() # check for missing values
Out[18]:
Store               False
DayOfWeek           False
Date                False
Sales               False
Customers           False
Open                False
Promo               False
SchoolHoliday       False
StateHoliday_cat    False
dtype: bool

Let's move on to the store analysis

pd.concat([store_df.head(), store_df.tail()]) # first and last 5 rows

# Missing data (percentage of rows per column):
Store                         0.000000
StoreType                     0.000000
Assortment                    0.000000
CompetitionDistance           0.269058
CompetitionOpenSinceMonth    31.748879
CompetitionOpenSinceYear     31.748879
Promo2                        0.000000
Promo2SinceWeek              48.789238
Promo2SinceYear              48.789238
PromoInterval                48.789238
dtype: float64
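The code that produced this table is not shown in the source. The figures are consistent with a per-column percentage of missing rows, which could be computed roughly as follows. This is a sketch on a made-up toy frame, not the author's exact code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store_df; the values are made up for illustration.
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, np.nan, 570.0, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 14.0],
})

# Share of missing rows per column, expressed as a percentage.
missing_pct = toy_store.isnull().sum() / len(toy_store) * 100
print(missing_pct)
```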

Let's start with the missing data. The first column is CompetitionDistance

store_df.CompetitionDistance.plot.box()

Let's look at the outliers so that we can choose between the mean and the median for filling the NaN values
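To illustrate why the outliers matter for this choice, here is a small sketch on made-up distances. The source does not state which statistic was finally used; the median is shown only as the usual robust option when a few very large values are present.

```python
import numpy as np
import pandas as pd

# Made-up distances in meters with one extreme outlier and one missing value.
dist = pd.Series([100.0, 200.0, 300.0, 400.0, 50000.0, np.nan],
                 name="CompetitionDistance")

print(dist.mean())    # dragged upward by the single outlier
print(dist.median())  # stays close to the typical store

# Fill the gap with the median (an assumption, not the author's confirmed choice).
filled = dist.fillna(dist.median())
```

A mean fill would place the store with the missing value far beyond any typical store, while the median keeps it near the bulk of the distribution.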

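The CompetitionOpenSinceMonth values are missing because the store has no competition. I therefore suggest filling the missing values with zero.

store_df["CompetitionOpenSinceMonth"] = store_df["CompetitionOpenSinceMonth"].fillna(0)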

Let's look at the promotions.

store_df.groupby("Promo2").count()

Where a store runs no promotion, the NaN values in the promotion columns should be replaced with zero.
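A minimal sketch of that replacement follows, assuming the three affected columns are Promo2SinceWeek, Promo2SinceYear and PromoInterval (the columns that share the same missing percentage in the table above). The toy frame is made up, and the check that NaN only occurs where Promo2 is 0 is an extra safeguard of mine, not part of the source.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store_df; the values are made up for illustration.
toy_store = pd.DataFrame({
    "Store": [1, 2, 3],
    "Promo2": [0, 1, 0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", None],
})

promo_cols = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]

# Confirm that rows with missing promo fields all belong to non-participating stores.
missing_rows = toy_store[promo_cols].isnull().any(axis=1)
only_non_participants = toy_store.loc[missing_rows, "Promo2"].eq(0).all()
print(only_non_participants)

# Replace the missing promo fields with zero.
toy_store[promo_cols] = toy_store[promo_cols].fillna(0)
```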

We merge the store data with the training data and then continue the analysis.
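The merge itself is not shown in the source. A plausible sketch, assuming both tables share the Store key and that the result is the train_store_df used later, is a left join from the training rows onto the store attributes. The toy frames are made up.

```python
import pandas as pd

# Toy stand-ins for train_df and store_df; the values are made up.
toy_train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Sales": [5263, 5020, 6064],
    "Customers": [555, 546, 625],
})
toy_store = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# Left join keeps every training row; validate guards against duplicate store IDs.
train_store_df = pd.merge(toy_train, toy_store, how="left", on="Store",
                          validate="many_to_one")
print(train_store_df)
```

A left join is chosen here so that no daily observation is dropped if a store were missing from the store table; an inner join would be an equally reasonable choice when the keys are known to match.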

First, let's compare the store types by sales, customers, and so on.

f, ax = plt.subplots(2, 3, figsize = (20,10))
plt.subplots_adjust(hspace = 0.3)
plt.show()

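The plots show that StoreType A has the most stores, sales and customers. StoreType D, however, has the highest average spend per customer. StoreType B, with only 17 stores, has the highest average number of customers.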
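The plotting code is elided in the source, but the quantities behind these statements can be sketched with a groupby. The column names n_stores, total_sales, total_customers, avg_customers and sales_per_customer are hypothetical, the data is made up, and spend per customer is taken here as total sales divided by total customers, which is only one possible definition.

```python
import pandas as pd

# Toy stand-in for the merged train_store_df; the values are made up.
toy = pd.DataFrame({
    "Store": [1, 1, 2, 3, 3],
    "StoreType": ["a", "a", "b", "d", "d"],
    "Sales": [5000, 6000, 9000, 7000, 8000],
    "Customers": [500, 600, 1500, 500, 500],
})

# One row per store type with the figures the comparison relies on.
by_type = toy.groupby("StoreType").agg(
    n_stores=("Store", "nunique"),
    total_sales=("Sales", "sum"),
    total_customers=("Customers", "sum"),
    avg_customers=("Customers", "mean"),
)
by_type["sales_per_customer"] = by_type["total_sales"] / by_type["total_customers"]
print(by_type)
```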

Let's look at the trend year by year.

sns.factorplot(data = train_store_df,
# We can see seasonality but no trend. Sales stay roughly the same from year to year
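The plotting call above is truncated in the source, so its arguments are unknown. Note also that sns.factorplot was renamed to sns.catplot in seaborn 0.9. As a plot-free way to check for seasonality versus trend, mean sales can be tabulated by month and year. The Year and Month columns are hypothetical helpers derived from Date, and the data is made up.

```python
import pandas as pd

# Toy stand-in for train_store_df; the values are made up.
toy = pd.DataFrame({
    "Date": ["2013-01-15", "2013-12-15", "2014-01-15", "2014-12-15"],
    "Sales": [5000, 9000, 5200, 9100],
})
toy["Date"] = pd.to_datetime(toy["Date"])

# Hypothetical helper columns derived from the date.
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month

# Rows are months, columns are years: similar columns mean no trend,
# differing rows mean a seasonal pattern.
monthly = toy.pivot_table(index="Month", columns="Year", values="Sales",
                          aggfunc="mean")
print(monthly)
```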

Let's look at the correlation plot.

"CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2
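Only a fragment of the column list survives in the source. A correlation matrix of the numeric columns can be sketched as below, on made-up data. Passing the result to sns.heatmap(corr, annot=True) would draw it, which is presumably what the original plot did.

```python
import pandas as pd

# Toy stand-in for train_store_df; the values are made up.
toy = pd.DataFrame({
    "Sales": [5000, 6000, 7000, 8000, 9000],
    "Customers": [500, 610, 690, 805, 900],
    "Promo": [0, 0, 1, 1, 1],
    "StoreType": ["a", "b", "a", "d", "c"],
})

# Pearson correlation between numeric columns; text columns are skipped.
corr = toy.corr(numeric_only=True)
print(corr)
```

Categorical columns such as StoreType need a numeric encoding before they can appear in the matrix, as in the StoreType correlation quoted in the article.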

The plot gives us the following correlations:

Customers vs Sales (0.82)
Promo vs Sales (0.82)
Average sales per customer vs Promo (0.28)
StoreType vs average sales per customer (0.44)

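My conclusions from the analysis:

StoreType A has the most sales and customers.
StoreType B has the lowest average sales per customer, so I think customers come there only for small items.
StoreType D has the largest shopping basket per customer.
Promotions run only on weekdays.
Customers tend to buy more on Mondays (when there is a promotion) and on Sundays (when there is none).
I see no yearly trend, only seasonal patterns.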
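The weekday conclusions above can be checked with a cross-tabulation of promotions by day of week plus mean sales per group. This is a sketch on made-up data and assumes that DayOfWeek runs from 1 (Monday) to 7 (Sunday).

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up.
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 2, 6, 7, 7],
    "Promo": [1, 0, 1, 0, 0, 0],
    "Sales": [9000, 6000, 7000, 5000, 8000, 8200],
})

# Count of rows per weekday with and without a promotion.
promo_by_day = pd.crosstab(toy["DayOfWeek"], toy["Promo"])

# Mean sales for each (weekday, promo) combination.
mean_sales = toy.groupby(["DayOfWeek", "Promo"])["Sales"].mean()
print(promo_by_day)
print(mean_sales)
```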
