手機(jī)站首頁(yè)散文詩(shī)歌雜文隨筆日記小小說

散文網(wǎng) » 科技 »學(xué)習(xí) » 拼多多面經(jīng)分享：24個(gè)「數(shù)據(jù)分析師」崗位面試題和答案解析

拼多多面經(jīng)分享：24個(gè)「數(shù)據(jù)分析師」崗位面試題和答案解析

2021-12-30 10:19 作者:愛數(shù)據(jù)分析社區(qū) 0人讀過 | 我要投稿

來? 源：CSDN/作? 者：稻蛙

01?問：貝葉斯公式復(fù)述并解釋應(yīng)用場(chǎng)景

P（A|B) = P(B|A)*P(A) / P(B)

如搜索query糾錯(cuò)，設(shè)A為正確的詞，B為輸入的詞，那么：

P(A|B)表示輸入詞B實(shí)際為A的概率
P(B|A)表示詞A錯(cuò)輸為B的概率，可以根據(jù)AB的相似度計(jì)算（如編輯距離）
P(A)是詞A出現(xiàn)的頻率，統(tǒng)計(jì)獲得
P(B)對(duì)于所有候選的A都一樣，所以可以省去

02?問：如何寫SQL求出中位數(shù)平均數(shù)和眾數(shù)（除了用count之外的方法）

1. 中位數(shù)

方案1（沒考慮到偶數(shù)個(gè)數(shù)的情況）：

set?@m?=?(select?count(*)/2?from?table)select column from table order by column limit @m, 1
方案2（考慮偶數(shù)個(gè)數(shù)，中位數(shù)是中間兩個(gè)數(shù)的平均）：set?@index?=?-1select?avg(table.column)from(select?@index:=@index+1?as?index,?columnfrom?table?order?by?column)?as?twhere t.index in (floor(@index/2),ceiling(@index/2))

2. 平均數(shù)

select avg(distinct column) from table

3. 眾數(shù)

select column, count(*) from table group by column order by column desc limit 1(emmm，好像用到count了）

03?問：如何避免決策樹過擬合

限制樹深
剪枝
限制葉節(jié)點(diǎn)數(shù)量
正則化項(xiàng)
增加數(shù)據(jù)
bagging（subsample、subfeature、低維空間投影）
數(shù)據(jù)增強(qiáng)（加入有雜質(zhì)的數(shù)據(jù)）
早停

04?問：樸素貝葉斯的理解

理解：樸素貝葉斯是在已知一些先驗(yàn)概率的情況下，由果索因的一種方法
其它：樸素的意思是假設(shè)了事件相互獨(dú)立

05?問：SVM的優(yōu)點(diǎn)

優(yōu)點(diǎn)：

能應(yīng)用于非線性可分的情況
最后分類時(shí)由支持向量決定，復(fù)雜度取決于支持向量的數(shù)目而不是樣本空間的維度，避免了維度災(zāi)難
具有魯棒性：因?yàn)橹皇褂蒙倭恐С窒蛄浚プ￡P(guān)鍵樣本，剔除冗余樣本
高維低樣本下性能好，如文本分類

缺點(diǎn)：

模型訓(xùn)練復(fù)雜度高
難以適應(yīng)多分類問題
核函數(shù)選擇沒有較好的方法論

06?問：Kmeans的原理

初始化k個(gè)點(diǎn)
根據(jù)距離點(diǎn)歸入k個(gè)類中
更新k個(gè)類的類中心
重復(fù)②③，直到收斂或達(dá)到迭代次數(shù)

07?問：口答一個(gè)SQL題（要用到row number）

mysql中設(shè)置row number：

SET @row_number = 0; SELECT (@row_number:=@row_number + 1) AS num FROM table

08?問：業(yè)務(wù)場(chǎng)景題，如何分析次日留存率下降的問題

業(yè)務(wù)問題關(guān)鍵是問對(duì)問題，然后才是拆解問題去解決。

1. 兩層模型

從用戶畫像、渠道、產(chǎn)品、行為環(huán)節(jié)等角度細(xì)分，明確到底是哪里的次日留存率下降了

2. 指標(biāo)拆解

次日留存率 = Σ 次日留存數(shù) / 今日獲客人數(shù)

3. 原因分析

內(nèi)部：

運(yùn)營(yíng)活動(dòng)
產(chǎn)品變動(dòng)
技術(shù)故障
設(shè)計(jì)漏洞（如產(chǎn)生可以擼羊毛的設(shè)計(jì)）

外部：

競(jìng)品
用戶偏好
節(jié)假日
社會(huì)事件（如產(chǎn)生輿論）

09?問：處理需求時(shí)的一般思路是什么，并舉例

明確需求，需求方的目的是什么
拆解任務(wù)
制定可執(zhí)行方案
推進(jìn)
驗(yàn)收

10?問：hadoop原理和mapreduce原理

1. Hadoop原理

采用HDFS分布式存儲(chǔ)文件，MapReduce分解計(jì)算，其它先略

2. MapReduce原理

map階段：讀取HDFS中的文件，解析成<k,v>的形式，并對(duì)<k,v>進(jìn)行分區(qū)（默認(rèn)一個(gè)區(qū)），將相同k的value放在一個(gè)集合中
reduce階段：將map的輸出copy到不同的reduce節(jié)點(diǎn)上，節(jié)點(diǎn)對(duì)map的輸出進(jìn)行合并、排序

11?問：現(xiàn)有一個(gè)數(shù)據(jù)庫(kù)表Tourists，記錄了某個(gè)景點(diǎn)7月份每天來訪游客的數(shù)量如下：id date visits 1 2017-07-01 100 …… 非常巧，id字段剛好等于日期里面的幾號(hào)。現(xiàn)在請(qǐng)篩選出連續(xù)三天都有大于100天的日期。上面例子的輸出為：date 2017-07-01 ……

select t1.datefrom Tourists as t1, Tourists as t2, Tourists as t3on t1.id = (t2.id+1) and t2.id = (t3.id+1)where t1.visits >100 and t2.visits>100 and t3.visits>100

12?問：在一張工資表salary里面，發(fā)現(xiàn)2017-07這個(gè)月的性別字段男m和女f寫反了，請(qǐng)用一個(gè)Updae語句修復(fù)數(shù)據(jù)。例如表格數(shù)據(jù)是：id name gender salary month 1 A m 1000 2017-06 2 B f 1010 2017-06

update salaryset gender = replace('mf', gender, '')

13?問：現(xiàn)有A表，有21個(gè)列，第一列id，剩余列為特征字段，列名從d1-d20，共10W條數(shù)據(jù)！另外一個(gè)表B稱為模式表，和A表結(jié)構(gòu)一樣，共5W條數(shù)據(jù) 請(qǐng)找到A表中的特征符合B表中模式的數(shù)據(jù)，并記錄下相對(duì)應(yīng)的id。

有兩種情況滿足要求：?

每個(gè)特征列都完全匹配的情況下
最多有一個(gè)特征列不匹配，其他19個(gè)特征列都完全匹配，但哪個(gè)列不匹配未知

select A.id,((case A.d1 when B.d1 then 1 else 0) +(case A.d2 when B.d2 then 1 else 0) +...) as count_matchfrom A left join Bon A.d1 = B.d1

14?問：我們把用戶對(duì)商品的評(píng)分用稀疏向量表示，保存在數(shù)據(jù)庫(kù)表t里面：t的字段有：uid，goods_id，star uid是用戶id；goodsid是商品id；star是用戶對(duì)該商品的評(píng)分，值為1-5?，F(xiàn)在我們想要計(jì)算向量?jī)蓛芍g的內(nèi)積，內(nèi)積在這里的語義為：對(duì)于兩個(gè)不同的用戶，如果他們都對(duì)同樣的一批商品打了分，那么對(duì)于這里面的每個(gè)人的分?jǐn)?shù)乘起來，并對(duì)這些乘積求和。?

例子，數(shù)據(jù)庫(kù)表里有以下的數(shù)據(jù)：U0 g0 2 U0 g1 4 U1 g0 3 U1 g1 1 計(jì)算后的結(jié)果為：U0 U1 2*3+4*1=10 ……

select uid1, uid2, sum(result) as dotfrom(select t1.uid as uid1, t2.uid as uid2, t1.goods_id, t1.star*t2.star as resultfrom t as t1, t as t2on t1.goods_id = t2.goods_id) as tgroup by goods_id

15?問：統(tǒng)計(jì)教授多門課老師數(shù)量并輸出每位老師教授課程數(shù)統(tǒng)計(jì)表

設(shè)表class中字段為id，teacher，course

1. 統(tǒng)計(jì)教授多門課老師數(shù)量

select count(*) from classgroup by teacher having count(*) > 1

2. 輸出每位老師教授課程數(shù)統(tǒng)計(jì)

select teacher, count(course) as count_coursefrom classgroup by teacher

16?問：四個(gè)人選舉出一個(gè)騎士，統(tǒng)計(jì)投票數(shù)，并輸出真正的騎士名字

設(shè)表tabe中字段為id，knight，vote_knight

select knight from tablegroup by vote_knightorder by count(vote_knight) limit 1

17?問：?jiǎn)T工表，宿舍表，部門表，統(tǒng)計(jì)出宿舍樓各部門人數(shù)表

設(shè)：

員工表為employee，字段為id，employee_name，belong_dormitory_id，belong_department_id；

宿舍表為dormitory，字段為id，dormitory_number；
部門表為department，字段為id，department_name

select dormitory_number, department_name, count(employee_name) as count_employeefrom employee as eleft join dormitory as dor on e.belong_dormitory_id = dor.idleft join department as dep on e.belong_department_id = dep.id

18?問：給出一堆數(shù)和頻數(shù)的表格，統(tǒng)計(jì)這一堆數(shù)中位數(shù)

設(shè)表table中字段為id,number,frequency

set @sum = (select sum(frequency)+1 as sum from table)set @index = 0set @last_index = 0select avg(distinct t.frequecy)from(select @last_index := @index, @index := @index+frequency as index, frequencyfrom table) as twhere t.index in (floor(@sum/2), ceiling(@sum/2))or (floor(@sum/2) > t.last_index and ceiling(@sum.2) <= t.index)

19?問：中位數(shù)，三個(gè)班級(jí)合在一起的一張成績(jī)單，統(tǒng)計(jì)每個(gè)班級(jí)成績(jī)中位數(shù)

設(shè)表table中字段為id，class，score

select t1.class, avg(distinct t1.score) as medianfrom table t1, table t2 on t1.id = t2.idgroup by t1.class, t1.scorehaving sum(case when t1.score >= t2.score then 1else 0 end) >=(select count(*)/2 from table where table.class = t1.class)andhaving sum(case when t1.score <= t2.score then 1else 0 end) >=(select count(*)/2 from table where table.class = t1.class)

20?問：交易表結(jié)構(gòu)為user_id,order_id,pay_time,order_amount

寫sql查詢過去一個(gè)月付款用戶量（提示：用戶量需去重）最高的3天分別是哪幾天

寫sql查詢做昨天每個(gè)用戶最后付款的訂單ID及金額

select count(distinct user_id) as c from table group by month(pay_time) order by c desc limit 3

select order_id, order_amount from ((select user_id, max(pay_time) as mt from table group by user_id where DATEDIFF(pay_time, NOW()) = -1 as t1) left join table as t2 where t1.user_id = t2.user_id and t1.mt == t2.pay_time)

21?問：PV表a(表結(jié)構(gòu)為user_id,goods_id),點(diǎn)擊表b(user_id,goods_id),數(shù)據(jù)量各為50萬條，在防止數(shù)據(jù)傾斜的情況下，寫一句sql找出兩個(gè)表共同的user_id和相應(yīng)的goods_id

select * from awhere a.user_id exsit (select user_id from b)

22?問：表結(jié)構(gòu)為user_id,reg_time,age, 寫一句sql按user_id隨機(jī)抽樣2000個(gè)用戶? 寫一句sql取出按各年齡段（每10歲一個(gè)分段，如（0,10））分別抽樣1%的用戶

1. 隨機(jī)抽樣2000個(gè)用戶

select * from table order by rand() limit 2000

2. 取出各年齡段抽樣1%的用戶

set @target = 0set @count_user = 0select @target:=@target+10 as age_right, *from table as t1where t1.age >=@target-10 and t1.age < (@target)and t1.id in(select floor(count(*)*0.1） from table as t2where t1.age >=@target-10 and t1.age < (@target)order by rand() limit ??)

注：mysql下按百分比取數(shù)沒有想到比較好的方法，因?yàn)閘imit后面不能接變量。想到的方法是先計(jì)算出每個(gè)年齡段的總數(shù)，然后計(jì)算出1%是多少，接著給每一行加一個(gè)遞增+1的行標(biāo)，當(dāng)行標(biāo)=1%時(shí)，結(jié)束

23?問：用戶登錄日志表為user_id,log_id,session_id,plat,visit_date 用sql查詢近30天每天平均登錄用戶數(shù)量? 用sql查詢出近30天連續(xù)訪問7天以上的用戶數(shù)量

1. 近三十天每天平均登錄用戶數(shù)量

select visit_date, count(distince user_id)group by visit_date

2. 近30天連續(xù)訪問7天以上的用戶數(shù)量

select t1.datefrom table t1, table t2, ..., table t7on t1.visit_date = (t2.visit_date+1) and t2.visit_date = (t3.visit_date+1)and ... and t6.visit_date = (t7.visit_date+1）

24?問：表user_id,visit_date,page_name,plat ?統(tǒng)計(jì)近7天每天到訪的新用戶數(shù) 統(tǒng)計(jì)每個(gè)訪問渠道plat7天前的新用戶的3日留存率和7日留存率

1. 近7天每天到訪的新用戶數(shù)

select day(visit_date), count(distinct user_id)from tablewhere user_id not in(select user_id from tablewhere day(visit_date) < date_sub(visit_date, interval 7day))

2. 每個(gè)渠道7天前用戶的3日留存和7日留存

# 三日留存# 先計(jì)算每個(gè)平臺(tái)7日前的新用戶數(shù)量select t1.plat, t1.c/t2.c as retention_3(select plat, count(distinct user_id)from tablegroup by plat, user_idhaving day(min(visit_date)) = date_sub(now(), interval 7 day)) as t1left join(select plat, count(distinct user_id) as cfrom tablegroup by user_id having count(user_id) > 0having day(min(visit_date)) = date_sub(now(), interval 7 day)and day(max(visit_date)) > date_sub(now(), interval 7 day)and day(max(visit_date)) <= date_sub(now(), interval 4day)) as t2on t1.plat = t2.plat