

COVID-19 (2019-nCoV) Outbreak Analysis with Python (Part 1)

2020-08-04 15:22 Author: 祈LHL

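Important notes

Grading weights: analysis write-up : completeness : code quality = 3:5:2

The analysis write-up covers your reasoning for each question and your interpretation and explanation of the results (be concise; do not write for the sake of writing).

Because the dataset is large, inspect it with head() or tail() where possible, so that the program does not hang for a long time.

=======================

The data for this project come from DXY (丁香園). The main goal is **to analyze historical outbreak data in order to better understand the epidemic and its trajectory, and to provide data support for decisions in fighting it.**

The dataset used in this chapter can be obtained from the comments section of my video.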

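I. Posing the Questions

From three angles (the whole country, your own province or city, and other countries), study the following questions:

(1) How do the national cumulative confirmed/suspected/cured/death counts change over time?

(2) How do the national daily new confirmed/suspected/cured/death counts change over time?

(3) How does the national daily count of newly imported cases change over time?

(4) What is the situation in your own province or city?

(5) What is the situation in other countries?

(6) Based on your analysis, what advice do you have for individuals and society in fighting the epidemic?

II. Understanding the Data

Raw dataset: AreaInfo.csv. Import the required packages and read the data: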

r_hex = '#dc2624'     # red,       RGB = 220,38,36
dt_hex = '#2b4750'    # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2'    # teal,      RGB = 69,160,162
r1_hex = '#e87a59'    # red,       RGB = 232,122,89
tl1_hex = '#7dcaa9'   # teal,      RGB = 125,202,169
g_hex = '#649E7D'     # green,     RGB = 100,158,125
o_hex = '#dc8018'     # orange,    RGB = 220,128,24
tn_hex = '#C89F91'    # tan,       RGB = 200,159,145
g50_hex = '#6c6d6c'   # grey-50,   RGB = 108,109,108
bg_hex = '#4f6268'    # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf'   # grey-25,   RGB = 199,204,207

import re
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator  # MultipleLocator is defined in matplotlib.ticker

data = pd.read_csv(r'data/AreaInfo.csv')

**View and summarize the data to get a rough overview**

data.head()

III. Data Cleaning

(1) Basic data processing

Data cleaning mainly includes **selecting subsets, handling missing data, converting data formats, and handling outliers**.

Selecting domestic data (the final selection is named china)

1. Select the domestic outbreak data.

2. Convert the update time column (updateTime) to a date type, extract year-month-day, and check the result. (Hint: dt.date)

3. Because the data are updated hourly, each day contains many duplicate records. Remove the duplicates and keep only the latest record of each day.

> Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

> where df is the DataFrame of domestic data you selected
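Steps 2 and 3 above can be sketched on a few made-up rows. This is only a minimal illustration under stated assumptions: the rows and the helper column `ts` are invented, and the timestamps are assumed to look like `2020-02-01 20:00:00`. Sorting newest first is what makes `keep='first'` retain the latest record of each day.

```python
import pandas as pd

# Hypothetical toy records; the real AreaInfo.csv has many more columns.
df = pd.DataFrame({
    "provinceName": ["广东省", "广东省", "广东省", "湖北省"],
    "updateTime": ["2020-02-01 20:00:00", "2020-02-01 08:00:00",
                   "2020-02-02 09:00:00", "2020-02-01 12:00:00"],
    "province_confirmedCount": [520, 436, 604, 7153],
})

# Step 2: parse the timestamps; keep the full timestamp in a helper column for sorting.
df["ts"] = pd.to_datetime(df["updateTime"])
df["updateTime"] = df["ts"].dt.date

# Step 3: newest first, then keep the first row per province and day.
daily = (df.sort_values("ts", ascending=False)
           .drop_duplicates(subset=["provinceName", "updateTime"], keep="first")
           .sort_values(["provinceName", "updateTime"])
           .drop(columns="ts")
           .reset_index(drop=True))
print(daily)
```

If the rows were sorted oldest first instead, `keep='last'` would select the same records.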

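Analysis: select the rows whose countryName is 中国 (China) to form CHINA.

# Names in the dataset are assumed to be in Simplified Chinese, as in the DXY source data.
CHINA = data.loc[data['countryName'] == '中国'].copy()  # copy so later in-place edits do not touch data
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA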

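Analysis: get the list of all Chinese cities.

cities = list(set(CHINA['cityName']))

Analysis: sort the records by updateTime, newest first, so that within each city the first record of a day is that day's latest. sort_values returns a new DataFrame, so the result is assigned back.

CHINA = CHINA.sort_values(by='updateTime', ascending=False)

Analysis: drop the rows that contain missing values.

CHINA.dropna(subset=['cityName'], inplace=True)
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(20)
# spot-check part of the data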

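Analysis: format the updateTime column of CHINA, keeping only the date.

# Let pandas infer the timestamp layout (assumed to be ISO-like, e.g. 2020-02-01 08:00:00);
# values that cannot be parsed become NaT.
CHINA['updateTime'] = pd.to_datetime(CHINA['updateTime'], errors='coerce').dt.date
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(15)
# spot-check part of the data
CHINA.head()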
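As a small check of what the date conversion does, the sketch below parses three invented strings. It assumes ISO-like timestamps; the third value is deliberately invalid to show that `errors='coerce'` turns it into `NaT` instead of raising, which makes it easy to count how many rows failed to parse.

```python
import pandas as pd

# Hypothetical raw values; the last one is intentionally not a date.
raw = pd.Series(["2020-02-01 08:00:00", "2020-02-01 20:30:00", "not a date"])

parsed = pd.to_datetime(raw, errors="coerce")  # unparseable entries become NaT
days = parsed.dt.date                          # keep only year-month-day

n_bad = int(parsed.isna().sum())               # number of rows that failed to parse
print(days.tolist(), n_bad)
```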

Analysis: when removing duplicates within a day, keep only the first record; because the data were sorted by time earlier, the first record is the latest of that day.

Analysis: merging DataFrames requires concat, so create an initial china first.

real = CHINA.loc[CHINA['cityName'] == cities[0]]
china = real.drop_duplicates(subset='updateTime', keep='first')

Analysis: deduplicate the daily records city by city; otherwise only one city's record would be kept for each date.

for city in cities[1:]:
    real_data = CHINA.loc[CHINA['cityName'] == city]
    real_data = real_data.drop_duplicates(subset='updateTime', keep='first')
    china = pd.concat([real_data, china], sort=False)
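The reason for deduplicating city by city can be seen on a toy table. The sketch below uses invented city names and counts; it compares deduplicating on the date alone with deduplicating on the city and the date together, the latter being one possible vectorized equivalent of the per-city loop.

```python
import pandas as pd

# Hypothetical toy data: two cities, two records each on the same day,
# already ordered newest first within each city.
toy = pd.DataFrame({
    "cityName": ["A", "A", "B", "B"],
    "updateTime": ["2020-02-01"] * 4,
    "city_confirmedCount": [12, 10, 7, 5],
})

# Deduplicating on the date alone keeps a single row for the whole day.
by_date_only = toy.drop_duplicates(subset="updateTime", keep="first")

# Adding cityName to the key keeps one row per city per day.
by_city_and_date = toy.drop_duplicates(subset=["cityName", "updateTime"], keep="first")

print(len(by_date_only), len(by_city_and_date))
```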

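Check the data info: is any data missing, and are the data types correct?

Hint: if you do not know how to handle missing values, you can drop them.

Analysis: some cities do not report every day. If a given day counted only the cities that reported, cities with patients but no report would be ignored and the totals would be distorted. Every city therefore needs a record for every day, including the days without a report, so the missing records are filled in by interpolation. The method is described under "Data pivoting and analysis".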

china.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 32812 entries, 96106 to 208267
    Data columns (total 19 columns):
     #   Column                   Non-Null Count  Dtype
    ---  ------                   --------------  -----
     0   continentName            32812 non-null  object
     1   continentEnglishName     32812 non-null  object
     2   countryName              32812 non-null  object
     3   countryEnglishName       32812 non-null  object
     4   provinceName             32812 non-null  object
     5   provinceEnglishName      32812 non-null  object
     6   province_zipCode         32812 non-null  int64
     7   province_confirmedCount  32812 non-null  int64
     8   province_suspectedCount  32812 non-null  float64
     9   province_curedCount      32812 non-null  int64
     10  province_deadCount       32812 non-null  int64
     11  updateTime               32812 non-null  object
     12  cityName                 32812 non-null  object
     13  cityEnglishName          31968 non-null  object
     14  city_zipCode             32502 non-null  float64
     15  city_confirmedCount      32812 non-null  float64
     16  city_suspectedCount      32812 non-null  float64
     17  city_curedCount          32812 non-null  float64
     18  city_deadCount           32812 non-null  float64
    dtypes: float64(6), int64(4), object(9)
    memory usage: 5.0+ MB

china.head()

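Selecting data for your own province or city (the final selection is named myhome)

This step can also be done later, when the data are needed.

# Province names are assumed to be in Simplified Chinese, as in the DXY source data.
myhome = china.loc[china['provinceName'] == '广东省']
myhome.head()

Selecting overseas data (the final selection is named world)

This step can also be done later, when the data are needed.

world = data.loc[data['countryName'] != '中国']
world.head()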

Data pivoting and analysis

Analysis: interpolate china to fill in part of the missing data.

Analysis: first create the province list and the date list, and initialize a draft.

province = list(set(china['provinceName']))  # every province
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))  # cities of each province

count_cols = ['province_confirmedCount', 'province_suspectedCount',
              'province_curedCount', 'province_deadCount']

first = china.loc[china['provinceName'] == province[0]]
date_0 = sorted(set(str(dt) for dt in first['updateTime']))
start = first['updateTime'].min()
end = first['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))
# one placeholder row per calendar day for this province
aid_frame = pd.DataFrame({'updateTime': dates.date, 'provinceName': province[0]})
#draft = pd.merge(china.loc[china['provinceName'] == province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
# a stable sort keeps the reported rows ahead of the placeholder row of the same day
draft = pd.concat([first, aid_frame], join='outer').sort_values('updateTime', kind='mergesort')
# carry the previous counts forward into the placeholder rows
draft[count_cols] = draft[count_cols].ffill()

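Analysis: fill in the missing dates, using the previous day's values. Some provinces stopped reporting once they had no new cases from late April onward, so their data can only be filled in up to late April; after that the data gradually lose accuracy.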
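The idea of filling unreported days with the previous day's value can be sketched for a single series. The numbers below are invented, and the sketch uses `reindex` on a complete calendar as an alternative to concatenating placeholder rows; both approaches end with a forward fill.

```python
import pandas as pd

# Hypothetical cumulative counts for one province; 02-03 and 02-04 were not reported.
reported = pd.Series(
    [100, 130, 180],
    index=pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-05"]),
    name="province_confirmedCount",
)

# Build the full calendar between the first and the last report ...
calendar = pd.date_range(reported.index.min(), reported.index.max(), freq="D")

# ... and carry the last known cumulative value forward into the gaps.
filled = reported.reindex(calendar).ffill()
print(filled)
```

Forward filling suits cumulative counts because an unreported day most plausibly means "no change"; it cannot recover anything after a province's last report.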

Analysis: format the dates at the same time.

count_cols = ['province_confirmedCount', 'province_suspectedCount',
              'province_curedCount', 'province_deadCount']

for p in range(1, len(province)):
    X = china.loc[china['provinceName'] == province[p]]
    start = X['updateTime'].min()
    end = X['updateTime'].max()
    dates = pd.date_range(start=str(start), end=str(end))
    # one placeholder row per calendar day for this province
    aid_frame = pd.DataFrame({'updateTime': dates.date, 'provinceName': province[p]})
    # a stable sort keeps the reported rows ahead of the placeholder row of the same day
    draft_d = pd.concat([X, aid_frame], join='outer').sort_values('updateTime', kind='mergesort')
    # fill within this province only, so values never leak in from the previous province
    draft_d[count_cols] = draft_d[count_cols].ffill()
    draft = pd.concat([draft, draft_d])

china = draft

china.head()
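One pitfall when forward filling a table that stacks several provinces is that a value can spill from the end of one province into the start of the next. The sketch below, on invented data with hypothetical province names A and B, contrasts a plain forward fill with a fill performed inside each province via `groupby`.

```python
import pandas as pd

# Hypothetical long-format table; province B has no value on its first day.
df = pd.DataFrame({
    "provinceName": ["A", "A", "A", "B", "B", "B"],
    "updateTime": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"] * 2),
    "province_confirmedCount": [10, None, 30, None, 5, None],
})
df = df.sort_values(["provinceName", "updateTime"])

# A plain ffill copies A's last value (30) into B's first row.
naive = df["province_confirmedCount"].ffill()

# Filling inside each province leaves B's first row empty instead.
grouped = df.groupby("provinceName")["province_confirmedCount"].ffill()

print(pd.DataFrame({"naive": naive, "grouped": grouped}))
```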

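IV. Data Analysis and Visualization

For each question, select the variables you need and build a new DataFrame before analyzing and plotting; this keeps the data tidy and the logic clear.

Basic analysis

For the basic analysis, only the numpy, pandas, and matplotlib libraries may be used.

Results may be shown in several axes on one figure or in several separate figures.

Choose the chart type (line chart, pie chart, histogram, scatter plot, and so on) according to the purpose of the analysis. If you are out of ideas, look at the Baidu epidemic map or other outbreak-analysis sites for inspiration.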
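As one way of showing several axes on a single figure, the sketch below stacks three line charts that share the date axis. The totals are invented for illustration, the column names are hypothetical, and the `Agg` backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical national totals, for illustration only.
idx = pd.date_range("2020-02-01", periods=5, freq="D")
totals = pd.DataFrame(
    {"confirmed": [100, 150, 210, 260, 300],
     "cured": [5, 9, 20, 41, 70],
     "dead": [1, 2, 4, 5, 6]},
    index=idx,
)

# One figure, three stacked axes sharing the same x axis.
fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
for ax, column in zip(axes, totals.columns):
    ax.plot(totals.index, totals[column], linewidth=2)
    ax.set_title(column)
fig.tight_layout()
```

Sharing the x axis makes it easier to compare when each curve bends, at the cost of a taller figure.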

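(1) How do the national cumulative confirmed/suspected/cured/death counts change over time?

Analysis: to see how the national cumulative counts change over time, first aggregate the national cumulative confirmed count for each day into date_confirmed.

Analysis: to aggregate the national cumulative confirmed count for each day, first extract each province's latest cumulative confirmed count for that day, sum over the provinces to form a DataFrame, and concatenate it to date_confirmed in a for loop.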
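The aggregation described above (one value per province per day, then a sum over provinces) can also be expressed with `groupby`. The sketch below is an alternative formulation on invented rows, not the loop used in this article; province and city names are placeholders.

```python
import pandas as pd

# Hypothetical city-level rows: the province total is repeated on every city row.
china_toy = pd.DataFrame({
    "updateTime": ["2020-02-01"] * 3 + ["2020-02-02"] * 3,
    "provinceName": ["A", "A", "B", "A", "A", "B"],
    "cityName": ["a1", "a2", "b1", "a1", "a2", "b1"],
    "province_confirmedCount": [30, 30, 12, 45, 45, 20],
})

# Take one value per province per day (the first row), so that the repeated
# province totals are not counted once per city ...
per_province = (china_toy
                .groupby(["updateTime", "provinceName"])["province_confirmedCount"]
                .first())

# ... then sum the provinces for each day.
national = per_province.groupby(level="updateTime").sum()
national.name = "China_confirmedCount"
print(national)
```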

date = list(set(china['updateTime']))
date.sort()
china = china.set_index('provinceName')
china = china.reset_index()

Analysis: loop over the provinces and dates to get each province's cumulative confirmed count for each day. Because the results are concatenated, first initialize date_confirmed with the first date.

list_p = []
for prov in province:
    rows = china.loc[(china['updateTime'] == date[0]) & (china['provinceName'] == prov)]
    if len(rows) > 0:
        list_p.append(rows.iloc[0]['province_confirmedCount'])  # cumulative confirmed count of this province on that day
date_confirmed = pd.DataFrame([sum(list_p)], index=[str(date[0])])
date_confirmed.index.name = "date"
date_confirmed.columns = ["China_confirmedCount"]
#date_confirmed

Analysis: for each remaining date, sum over the provinces and concatenate the day's national total to date_confirmed.

for day in date[1:]:
    list_p = []
    for prov in province:
        rows = china.loc[(china['updateTime'] == day) & (china['provinceName'] == prov)]
        if len(rows) > 0:
            list_p.append(rows.iloc[0]['province_confirmedCount'])  # cumulative confirmed count of this province on that day
    confirmed = pd.DataFrame([sum(list_p)], index=[str(day)])
    confirmed.index.name = "date"
    confirmed.columns = ["China_confirmedCount"]
    date_confirmed = pd.concat([date_confirmed, confirmed], sort=False)
#date_confirmed

Analysis: drop empty and incomplete values.

date_confirmed.dropna(subset=['China_confirmedCount'], inplace=True)
date_confirmed.tail(20)

Analysis: from late April through late May the data are distorted because too many provinces are missing (some provinces have had no new cases since late April), and from 2020-06-06 onward they are no longer reliable at all, so I removed the data from 2020-06-06 onward.

# the index holds 'YYYY-MM-DD' strings, so a string comparison orders them chronologically
date_confirmed = date_confirmed.loc[date_confirmed.index < '2020-06-06']

Analysis: define a function that builds and concatenates the daily totals.

def data_frame(frame, china, element):
    # append the national daily total of `element` for every date after the first to `frame`
    for day in date[1:]:
        list_p = []
        for prov in province:
            rows = china.loc[(china['updateTime'] == day) & (china['provinceName'] == prov)]
            if len(rows) > 0:
                list_p.append(rows.iloc[0][element])
        link = pd.DataFrame([sum(list_p)], index=[str(day)])
        link.index.name = "date"
        link.columns = ["China"]
        frame = pd.concat([frame, link], sort=False)
    frame = frame.dropna(subset=['China'])
    # remove the unreliable data from 2020-06-06 onward ('YYYY-MM-DD' strings compare chronologically)
    frame = frame.loc[frame.index < '2020-06-06']
    return frame

Analysis: initialize each variable with the totals of the first date.

# cumulative cured count: date_cured
list_p = []
for prov in province:
    rows = china.loc[(china['updateTime'] == date[0]) & (china['provinceName'] == prov)]
    if len(rows) > 0:
        list_p.append(rows.iloc[0]['province_curedCount'])
date_cured = pd.DataFrame([sum(list_p)], index=[str(date[0])])
date_cured.index.name = "date"
date_cured.columns = ["China"]

# cumulative death count: date_dead
list_p = []
for prov in province:
    rows = china.loc[(china['updateTime'] == date[0]) & (china['provinceName'] == prov)]
    if len(rows) > 0:
        list_p.append(rows.iloc[0]['province_deadCount'])
date_dead = pd.DataFrame([sum(list_p)], index=[str(date[0])])
date_dead.index.name = "date"
date_dead.columns = ["China"]

# cumulative confirmed cases: date_confirmed
plt.rcParams['font.sans-serif'] = ['SimHei']  # switch the font, otherwise Chinese characters cannot be displayed
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_confirmed.index
y = date_confirmed.values
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_title('累計確診患者', fontdict={'color': 'black', 'size': 24})  # title: "Cumulative confirmed cases"
ax.set_xticks(range(0, len(x), 30))  # one tick every 30 days

# cumulative cured cases: date_cured
date_cured = data_frame(date_cured, china, 'province_curedCount')
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_cured.index
y = date_cured.values
ax.set_title('累計治愈患者', fontdict={'color': 'black', 'size': 24})  # title: "Cumulative cured cases"
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_xticks(range(0, len(x), 30))

Analysis: the cumulative suspected count cannot be obtained by filling in the data.

# cumulative deaths: date_dead
date_dead = data_frame(date_dead, china, 'province_deadCount')
fig = plt.figure(figsize=(16, 6), dpi=100)
ax = fig.add_subplot(1, 1, 1)
x = date_dead.index
y = date_dead.values
ax.plot(x, y, color=dt_hex, linewidth=2, linestyle='-')
ax.set_title('累計死亡患者', fontdict={'color': 'black', 'size': 24})  # title: "Cumulative deaths"
ax.set_xticks(range(0, len(x), 30))  # same tick spacing as the other two charts

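Analysis: the outbreak began in early January, its growth started to slow in late February, and the curve leveled off in late April. The number of cured patients rose sharply from early February and leveled off in late March. The number of deaths began to rise in late January and leveled off in late February; in late April it jumped because of statistical factors and then leveled off again.

Summary: the confirmed and cured series are distorted from late April through late May because too many provinces are missing (some provinces have had no new cases since then). The remaining data were filled in as far as possible, and the closer to the end of the series, the less reliable they are. The death series was filled in fairly successfully, with almost no errors or gaps.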
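One way to see where the series start to lose provinces is to count how many distinct provinces have a record on each day. The sketch below does this on invented records with hypothetical province names; it is a diagnostic aid under those assumptions, not part of the analysis above.

```python
import pandas as pd

# Hypothetical per-day records; province B stops reporting after 2020-04-29.
records = pd.DataFrame({
    "updateTime": ["2020-04-28", "2020-04-28", "2020-04-29", "2020-04-29", "2020-04-30"],
    "provinceName": ["A", "B", "A", "B", "A"],
})

# Number of distinct provinces with at least one record on each day.
coverage = records.groupby("updateTime")["provinceName"].nunique()

# Days on which fewer provinces report than on the best-covered day.
incomplete_days = coverage[coverage < coverage.max()].index.tolist()
print(coverage)
print(incomplete_days)
```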
