
Big Data Analysis in R: Statistics, Visualization, and Time Series Analysis of New York City 311 Complaints

2021-03-04 11:00 Author: 拓端tecdat

Original link: http://tecdat.cn/?p=9800


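Introduction

This article does not claim that R is better or faster than Python for data analysis; I use both languages every day. It simply offers a chance to compare the two.

The data used in this article is updated daily; my version of the file is larger, at 4.63 GB.

The CSV file contains New York City's 311 complaints. It is the most popular dataset on the NYC Open Data portal.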

Data workflow

install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")
library(plotly)

An account is needed to connect to the plotly API. Alternatively, you can simply use the default ggplot2 graphics.

set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key

Analysis in R using dplyr

This assumes sqlite3 is installed (and is therefore accessible from the terminal).

$ sqlite3 data.db
sqlite> .databases
sqlite> .mode csv
sqlite> .import <filename> <tablename>
sqlite> .quit

The first command creates the database, and .databases lists the databases to make sure it works. In .import, <filename> is the name of the CSV file and <tablename> is the name of the new database table.

Load the data into memory.

library(data.table)  # provides fread()
library(readr)
library(knitr)       # provides kable()

# data.table, selecting a subset of columns
time_data.table <- system.time(
  fread('/users/ryankelly/NYC_data.csv',
        select = c('Agency', 'Created Date', 'Closed Date', 'Complaint Type', 'Descriptor', 'City'),
        showProgress = TRUE))

# time_data.table_full and time_readr are further timings; their code is not shown here
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))

                      user.self  sys.self  elapsed  user.child  sys.child
time_data.table          63.588     1.952   65.633           0          0
time_data.table_full    205.571     3.124  208.880           0          0
time_readr              277.720     5.018  283.029           0          0

I will use data.table to read the data. Its fread function greatly improves read speed.

About dplyr

By default, a dplyr query pulls only the first 10 rows from the database.

library(dplyr)  # Used here as the counterpart to pandas

# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
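To make the 10-row default concrete, here is a minimal sketch that uses an in-memory SQLite database rather than the real data.db. The table name complaints and its contents are made up for illustration, and the sketch assumes the DBI, RSQLite, and dbplyr packages are installed. Connecting through DBI::dbConnect() instead of src_sqlite() is simply a choice made for this sketch.

```r
library(dplyr)
library(DBI)

# Hypothetical in-memory database with a 25-row toy table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "complaints",
             data.frame(id = 1:25, Agency = rep("NYPD", 25)))

# A lazy reference to the table: nothing is pulled into R yet
remote <- tbl(con, "complaints")
print(remote)  # previews only the first rows

# collect() runs the query and pulls every row into a local data frame
local_df <- collect(remote)
nrow(local_df)

dbDisconnect(con)
```

Calling collect() is only worth it once the query has been narrowed down, because the whole result is loaded into memory.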

The two best options for data processing (besides base R) are:

  • data.table

  • dplyr

Preview the data

# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))

head_(data)

Agency  CreatedDate             ClosedDate  ComplaintType            Descriptor        City
NYPD    04/11/2015 02:13:04 AM              Noise - Street/Sidewalk  Loud Music/Party  BROOKLYN
DFTA    04/11/2015 02:12:05 AM              Senior Center Complaint  N/A               ELMHURST
NYPD    04/11/2015 02:11:46 AM              Noise - Commercial       Loud Music/Party  JAMAICA
NYPD    04/11/2015 02:11:02 AM              Noise - Street/Sidewalk  Loud Talking      BROOKLYN
NYPD    04/11/2015 02:10:45 AM              Noise - Street/Sidewalk  Loud Music/Party  NEW YORK

Select a few columns

ComplaintType            Descriptor        Agency
Noise - Street/Sidewalk  Loud Music/Party  NYPD
Senior Center Complaint  N/A               DFTA
Noise - Commercial       Loud Music/Party  NYPD
Noise - Street/Sidewalk  Loud Talking      NYPD
Noise - Street/Sidewalk  Loud Music/Party  NYPD

ComplaintType            Descriptor                     Agency
Noise - Street/Sidewalk  Loud Music/Party               NYPD
Senior Center Complaint  N/A                            DFTA
Noise - Commercial       Loud Music/Party               NYPD
Noise - Street/Sidewalk  Loud Talking                   NYPD
Noise - Street/Sidewalk  Loud Music/Party               NYPD
Noise - Street/Sidewalk  Loud Talking                   NYPD
Noise - Commercial       Loud Music/Party               NYPD
HPD Literature Request   The ABCs of Housing - Spanish  HPD
Noise - Street/Sidewalk  Loud Talking                   NYPD
Street Condition         Plate Condition - Noisy        DOT
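A column selection like the one above can be sketched with dplyr's select(). This is a minimal example on a small in-memory data frame named complaints (a made-up stand-in for the real table, with values copied from the preview), not the exact query behind the tables.

```r
library(dplyr)

# Toy stand-in for the complaints table
complaints <- data.frame(
  Agency        = c("NYPD", "DFTA", "NYPD"),
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint",
                    "Noise - Commercial"),
  Descriptor    = c("Loud Music/Party", "N/A", "Loud Music/Party"),
  City          = c("BROOKLYN", "ELMHURST", "JAMAICA"),
  stringsAsFactors = FALSE
)

# Rough SQL equivalent: SELECT ComplaintType, Descriptor, Agency FROM complaints
q <- complaints %>% select(ComplaintType, Descriptor, Agency)

# head() limits how many rows are displayed
head(q, 2)
```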

Filter rows with WHERE

ComplaintType            Descriptor        Agency
Noise - Street/Sidewalk  Loud Music/Party  NYPD
Noise - Commercial       Loud Music/Party  NYPD
Noise - Street/Sidewalk  Loud Talking      NYPD
Noise - Street/Sidewalk  Loud Music/Party  NYPD
Noise - Street/Sidewalk  Loud Talking      NYPD
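In dplyr, the counterpart of a SQL WHERE clause is filter(). The sketch below assumes the filter behind the table above was on Agency == "NYPD", which fits the output but is not confirmed by the text; the complaints data frame is a toy stand-in.

```r
library(dplyr)

# Toy stand-in for the complaints table
complaints <- data.frame(
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint",
                    "Noise - Commercial"),
  Descriptor    = c("Loud Music/Party", "N/A", "Loud Music/Party"),
  Agency        = c("NYPD", "DFTA", "NYPD"),
  stringsAsFactors = FALSE
)

# Rough SQL equivalent: SELECT ... FROM complaints WHERE Agency = 'NYPD'
q <- complaints %>%
  select(ComplaintType, Descriptor, Agency) %>%
  filter(Agency == "NYPD")

q
```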

Filter multiple values in a column with WHERE and IN

ComplaintType            Descriptor        Agency
Noise - Street/Sidewalk  Loud Music/Party  NYPD
Noise - Commercial       Loud Music/Party  NYPD
Noise - Street/Sidewalk  Loud Talking      NYPD
Noise - Street/Sidewalk  Loud Music/Party  NYPD
Noise - Street/Sidewalk  Loud Talking      NYPD
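The dplyr counterpart of WHERE ... IN (...) is filter() combined with %in%. A minimal sketch follows; the agency pair c("NYPD", "DOT") and the toy complaints data frame are chosen for illustration only.

```r
library(dplyr)

# Toy stand-in for the complaints table
complaints <- data.frame(
  ComplaintType = c("Noise - Street/Sidewalk", "Senior Center Complaint",
                    "Noise - Commercial", "HPD Literature Request",
                    "Street Condition"),
  Agency        = c("NYPD", "DFTA", "NYPD", "HPD", "DOT"),
  stringsAsFactors = FALSE
)

# Rough SQL equivalent: ... WHERE Agency IN ('NYPD', 'DOT')
q <- complaints %>% filter(Agency %in% c("NYPD", "DOT"))

q
```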

Find unique values in a column with DISTINCT

##       City
## 1 BROOKLYN
## 2 ELMHURST
## 3  JAMAICA
## 4 NEW YORK
## 5
## 6  BAYSIDE
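Output like the listing above can be produced with dplyr's distinct(). A minimal sketch on a toy complaints data frame (made up for illustration):

```r
library(dplyr)

# Toy stand-in with a repeated city
complaints <- data.frame(
  City = c("BROOKLYN", "ELMHURST", "JAMAICA", "BROOKLYN", "NEW YORK"),
  stringsAsFactors = FALSE
)

# Rough SQL equivalent: SELECT DISTINCT City FROM complaints
q <- complaints %>% select(City) %>% distinct()

q
```

distinct() keeps the first occurrence of each value, so the result follows the order in which the cities first appear.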

Query value counts with COUNT(*) and GROUP BY

# dt[, .(No.Complaints = .N), Agency]
# setkey(dt, No.Complaints) # setkey indexes the data

q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())
head_(q)

Agency  No.Complaints
3-1-1   22499
ACS     3
AJC     7
ART     3
CAU     8

Sort results with ORDER and -
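A minimal sketch of sorting counts in descending order, using a toy complaints data frame made up for illustration. In dplyr, arrange(desc(x)) plays the role of ORDER BY x DESC; for a numeric column, arrange(-x) gives the same ordering, which is what the "-" in the heading refers to.

```r
library(dplyr)

# Toy stand-in: NYPD appears three times, DOT twice, HPD once
complaints <- data.frame(
  Agency = c("NYPD", "HPD", "NYPD", "DOT", "NYPD", "DOT"),
  stringsAsFactors = FALSE
)

# Count complaints per agency, then sort with the largest count first
q <- complaints %>%
  group_by(Agency) %>%
  summarise(No.Complaints = n()) %>%
  arrange(desc(No.Complaints))

q
```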

How many cities are in the database?

# dt[, unique(City)]

q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)

##   Number.of.Cities
## 1             1818

Let's plot the 10 cities with the most complaints

City           No.Complaints
BROOKLYN       2671085
NEW YORK       1692514
BRONX          1624292
               766378
STATEN ISLAND  437395
JAMAICA        147133
FLUSHING       117669
ASTORIA        90570
Jamaica        67083
RIDGEWOOD      66411

  • Use UPPER to convert CITY to a consistent case.

CITY           No.Complaints
BROOKLYN       2671085
NEW YORK       1692514
BRONX          1624292
               766378
STATEN ISLAND  437395
JAMAICA        147133
FLUSHING       117669
ASTORIA        90570
JAMAICA        67083
RIDGEWOOD      66411
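The case problem is visible in the first table, where "JAMAICA" and "Jamaica" are counted separately. One way to merge such rows, sketched here on a toy complaints data frame made up for illustration, is to upper-case the city before grouping. In R the function is toupper(); on a database-backed table dplyr should translate it to SQL UPPER, though I have not verified that against this data.

```r
library(dplyr)

# Toy stand-in with mixed-case city names
complaints <- data.frame(
  City = c("JAMAICA", "Jamaica", "BROOKLYN", "Jamaica"),
  stringsAsFactors = FALSE
)

# Normalize the case first, then group on the normalized column
q <- complaints %>%
  mutate(CITY = toupper(City)) %>%
  group_by(CITY) %>%
  summarise(No.Complaints = n()) %>%
  arrange(desc(No.Complaints))

q
```

Grouping on the normalized CITY column, rather than on the original City, is what collapses the two spellings into a single row.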

Complaint types (by city)


library(ggplot2)

# Plot result
# q_f holds complaint counts by ComplaintType and CITY; its query is not shown here
plt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) +
  geom_bar(stat = 'identity') +
  theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

plt

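Part 2: Time series operations

The dates as provided do not fit SQLite's standard date format.

There are two ways to handle this: create a new column in the SQL database and re-insert the data using a formatted date statement, or create a new table and insert the formatted dates under the original column name.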
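Whichever option is used, the dates first have to be rewritten as YYYY-MM-DD hh:mm:ss. A minimal base R sketch of that conversion follows. The day-first field order is an assumption inferred from the outputs in this article, where 04/11/2015 later appears as 2015-11-04, and parsing AM/PM with %p assumes an English or C locale.

```r
# Raw timestamps in the format shown in the data preview
raw <- c("04/11/2015 02:13:04 AM", "04/11/2015 02:12:05 PM")

# Parse with an explicit format; %I is the 12-hour clock and %p is AM/PM
parsed <- as.POSIXct(raw, format = "%d/%m/%Y %I:%M:%S %p", tz = "UTC")

# Re-emit in the layout SQLite's date functions understand
iso <- format(parsed, "%Y-%m-%d %H:%M:%S")
iso
```

The converted strings can then be written back to the database by either of the two routes described above.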

Filter SQLite rows using timestamp strings: YYYY-MM-DD hh:mm:ss

# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00',
#      .(ComplaintType, CreatedDate, City)]

q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00", CreatedDate > "2014-09-16 23:45:00") %>%
  select(ComplaintType, CreatedDate, City)

head_(q)

ComplaintType            CreatedDate          City
Noise - Street/Sidewalk  2014-11-12 11:59:56  BRONX
Taxi Complaint           2014-11-12 11:59:40  BROOKLYN
Noise - Commercial       2014-11-12 11:58:53  BROOKLYN
Noise - Commercial       2014-11-12 11:58:26  NEW YORK
Noise - Street/Sidewalk  2014-11-12 11:58:14  NEW YORK
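Filtering on plain strings works because timestamps written as YYYY-MM-DD hh:mm:ss sort in the same order as the moments they represent. A small base R sketch with made-up values shows the same comparison outside the database:

```r
# Three timestamps: just before the window, inside it, and on its upper bound
stamps <- c("2014-09-16 23:44:59", "2014-11-12 11:59:56", "2014-11-26 23:47:00")

# Plain string comparison, using the same strict bounds as the query above
keep <- stamps > "2014-09-16 23:45:00" & stamps < "2014-11-26 23:47:00"

stamps[keep]
```

Both bounds are strict, so a row stamped exactly 2014-11-26 23:47:00 is excluded.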

Pull the hour unit out of a timestamp with strftime

# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]

# strftime() is evaluated by SQLite here, so the format string comes first
q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>%
  select(ComplaintType, CreatedDate, City, hour)

head_(q)

ComplaintType            CreatedDate          City      hour
Noise - Street/Sidewalk  2015-11-04 02:13:04  BROOKLYN  02
Senior Center Complaint  2015-11-04 02:12:05  ELMHURST  02
Noise - Commercial       2015-11-04 02:11:46  JAMAICA   02
Noise - Street/Sidewalk  2015-11-04 02:11:02  BROOKLYN  02
Noise - Street/Sidewalk  2015-11-04 02:10:45  NEW YORK  02
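For comparison, here is the same extraction done locally in base R, on made-up timestamps. Note the argument order: R's strftime() and format() take the time value first and the format string second, the reverse of SQLite's strftime(format, timestamp).

```r
# Two timestamps already in YYYY-MM-DD hh:mm:ss form
created <- as.POSIXct(c("2015-11-04 02:13:04", "2015-11-04 14:10:45"), tz = "UTC")

# Extract the two-digit hour as a string
hour <- format(created, "%H")
hour

# strftime() is an equivalent spelling in R: value first, then format
hour_alt <- strftime(created, format = "%H", tz = "UTC")
```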

Aggregating time series

First, create a new column with the timestamp rounded down to the previous 15-minute interval.

# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]
q <- data %>%
  mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
  select(CreatedDate, interval)
head_(q, 10)

CreatedDate          interval
2015-11-04 02:13:04  2015-11-04 02:00:00
2015-11-04 02:12:05  2015-11-04 02:00:00
2015-11-04 02:11:46  2015-11-04 02:00:00
2015-11-04 02:11:02  2015-11-04 02:00:00
2015-11-04 02:10:45  2015-11-04 02:00:00
2015-11-04 02:09:07  2015-11-04 02:00:00
2015-11-04 02:05:47  2015-11-04 02:00:00
2015-11-04 02:03:43  2015-11-04 02:00:00
2015-11-04 02:03:29  2015-11-04 02:00:00
2015-11-04 02:02:17  2015-11-04 02:00:00
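The SQL expression above works by converting the timestamp to seconds since the epoch, integer-dividing by 900 (15 minutes), and multiplying back. The same arithmetic can be sketched in base R; the timestamps are made up, and UTC is assumed so that the epoch arithmetic lines up with clock quarter-hours.

```r
created <- as.POSIXct(c("2015-11-04 02:13:04", "2015-11-04 02:16:30"), tz = "UTC")

# Seconds since 1970-01-01, floored to a multiple of 900 seconds
secs    <- as.numeric(created)
floored <- (secs %/% 900) * 900

# Convert the floored seconds back into timestamps
interval <- as.POSIXct(floored, origin = "1970-01-01", tz = "UTC")
format(interval, "%Y-%m-%d %H:%M:%S")
```

Grouping on the interval column then gives one row per 15-minute bucket.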

Plot the results for 2003

