拼多多面經(jīng)分享:24個(gè)「數(shù)據(jù)分析師」崗位面試題和答案解析
來? 源:CSDN/作? 者:稻蛙
01?問:貝葉斯公式復(fù)述并解釋應(yīng)用場(chǎng)景
P(A|B) = P(B|A)*P(A) / P(B)
如搜索query糾錯(cuò),設(shè)A為正確的詞,B為輸入的詞,那么:
P(A|B)表示輸入詞B實(shí)際為A的概率
P(B|A)表示詞A錯(cuò)輸為B的概率,可以根據(jù)AB的相似度計(jì)算(如編輯距離)
P(A)是詞A出現(xiàn)的頻率,統(tǒng)計(jì)獲得
P(B)對(duì)于所有候選的A都一樣,所以可以省去
02?問:如何寫SQL求出中位數(shù)平均數(shù)和眾數(shù)(除了用count之外的方法)
1. 中位數(shù)
方案1(沒考慮到偶數(shù)個(gè)數(shù)的情況):
set?@m?=?(select?count(*)/2?from?table)
select column from table order by column limit @m, 1
方案2(考慮偶數(shù)個(gè)數(shù),中位數(shù)是中間兩個(gè)數(shù)的平均):
set?@index?=?-1
select?avg(table.column)
from
(select?@index:=@index+1?as?index,?column
from?table?order?by?column)?as?t
where t.index in (floor(@index/2),ceiling(@index/2))
2. 平均數(shù)
select avg(distinct column) from table
3. 眾數(shù)
select column, count(*) from table group by column order by column desc limit 1(emmm,好像用到count了)
03?問:如何避免決策樹過擬合
限制樹深
剪枝
限制葉節(jié)點(diǎn)數(shù)量
正則化項(xiàng)
增加數(shù)據(jù)
bagging(subsample、subfeature、低維空間投影)
數(shù)據(jù)增強(qiáng)(加入有雜質(zhì)的數(shù)據(jù))
早停
04?問:樸素貝葉斯的理解
理解:樸素貝葉斯是在已知一些先驗(yàn)概率的情況下,由果索因的一種方法
其它:樸素的意思是假設(shè)了事件相互獨(dú)立
05?問:SVM的優(yōu)點(diǎn)
優(yōu)點(diǎn):
能應(yīng)用于非線性可分的情況
最后分類時(shí)由支持向量決定,復(fù)雜度取決于支持向量的數(shù)目而不是樣本空間的維度,避免了維度災(zāi)難
具有魯棒性:因?yàn)橹皇褂蒙倭恐С窒蛄浚プ£P(guān)鍵樣本,剔除冗余樣本
高維低樣本下性能好,如文本分類
缺點(diǎn):
模型訓(xùn)練復(fù)雜度高
難以適應(yīng)多分類問題
核函數(shù)選擇沒有較好的方法論
06?問:Kmeans的原理
初始化k個(gè)點(diǎn)
根據(jù)距離點(diǎn)歸入k個(gè)類中
更新k個(gè)類的類中心
重復(fù)②③,直到收斂或達(dá)到迭代次數(shù)
07?問:口答一個(gè)SQL題(要用到row number)
mysql中設(shè)置row number:
SET @row_number = 0; SELECT (@row_number:=@row_number + 1) AS num FROM table
08?問:業(yè)務(wù)場(chǎng)景題,如何分析次日留存率下降的問題
業(yè)務(wù)問題關(guān)鍵是問對(duì)問題,然后才是拆解問題去解決。
1. 兩層模型
從用戶畫像、渠道、產(chǎn)品、行為環(huán)節(jié)等角度細(xì)分,明確到底是哪里的次日留存率下降了
2. 指標(biāo)拆解
次日留存率 = Σ 次日留存數(shù) / 今日獲客人數(shù)
3. 原因分析
內(nèi)部:
運(yùn)營(yíng)活動(dòng)
產(chǎn)品變動(dòng)
技術(shù)故障
設(shè)計(jì)漏洞(如產(chǎn)生可以擼羊毛的設(shè)計(jì))
外部:
競(jìng)品
用戶偏好
節(jié)假日
社會(huì)事件(如產(chǎn)生輿論)
09?問:處理需求時(shí)的一般思路是什么,并舉例
明確需求,需求方的目的是什么
拆解任務(wù)
制定可執(zhí)行方案
推進(jìn)
驗(yàn)收
10?問:hadoop原理和mapreduce原理
1. Hadoop原理
采用HDFS分布式存儲(chǔ)文件,MapReduce分解計(jì)算,其它先略
2. MapReduce原理
map階段:讀取HDFS中的文件,解析成<k,v>的形式,并對(duì)<k,v>進(jìn)行分區(qū)(默認(rèn)一個(gè)區(qū)),將相同k的value放在一個(gè)集合中
reduce階段:將map的輸出copy到不同的reduce節(jié)點(diǎn)上,節(jié)點(diǎn)對(duì)map的輸出進(jìn)行合并、排序
11?問:現(xiàn)有一個(gè)數(shù)據(jù)庫(kù)表Tourists,記錄了某個(gè)景點(diǎn)7月份每天來訪游客的數(shù)量如下:id date visits 1 2017-07-01 100 …… 非常巧,id字段剛好等于日期里面的幾號(hào)。現(xiàn)在請(qǐng)篩選出連續(xù)三天都有大于100天的日期。上面例子的輸出為:date 2017-07-01 ……
select t1.date
from Tourists as t1, Tourists as t2, Tourists as t3
on t1.id = (t2.id+1) and t2.id = (t3.id+1)
where t1.visits >100 and t2.visits>100 and t3.visits>100
12?問:在一張工資表salary里面,發(fā)現(xiàn)2017-07這個(gè)月的性別字段男m和女f寫反了,請(qǐng)用一個(gè)Updae語句修復(fù)數(shù)據(jù)。例如表格數(shù)據(jù)是:id name gender salary month 1 A m 1000 2017-06 2 B f 1010 2017-06
update salary
set gender = replace('mf', gender, '')
13?問:現(xiàn)有A表,有21個(gè)列,第一列id,剩余列為特征字段,列名從d1-d20,共10W條數(shù)據(jù)!另外一個(gè)表B稱為模式表,和A表結(jié)構(gòu)一樣,共5W條數(shù)據(jù) 請(qǐng)找到A表中的特征符合B表中模式的數(shù)據(jù),并記錄下相對(duì)應(yīng)的id。
有兩種情況滿足要求:?
每個(gè)特征列都完全匹配的情況下
最多有一個(gè)特征列不匹配,其他19個(gè)特征列都完全匹配,但哪個(gè)列不匹配未知
select A.id,
((case A.d1 when B.d1 then 1 else 0) +
(case A.d2 when B.d2 then 1 else 0) +
...) as count_match
from A left join B
on A.d1 = B.d1
14?問:我們把用戶對(duì)商品的評(píng)分用稀疏向量表示,保存在數(shù)據(jù)庫(kù)表t里面:t的字段有:uid,goods_id,star uid是用戶id;goodsid是商品id;star是用戶對(duì)該商品的評(píng)分,值為1-5?,F(xiàn)在我們想要計(jì)算向量?jī)蓛芍g的內(nèi)積,內(nèi)積在這里的語義為:對(duì)于兩個(gè)不同的用戶,如果他們都對(duì)同樣的一批商品打了分,那么對(duì)于這里面的每個(gè)人的分?jǐn)?shù)乘起來,并對(duì)這些乘積求和。?
例子,數(shù)據(jù)庫(kù)表里有以下的數(shù)據(jù):U0 g0 2 U0 g1 4 U1 g0 3 U1 g1 1 計(jì)算后的結(jié)果為:U0 U1 2*3+4*1=10 ……
select uid1, uid2, sum(result) as dot
from
(select t1.uid as uid1, t2.uid as uid2, t1.goods_id, t1.star*t2.star as result
from t as t1, t as t2
on t1.goods_id = t2.goods_id) as t
group by goods_id
15?問:統(tǒng)計(jì)教授多門課老師數(shù)量并輸出每位老師教授課程數(shù)統(tǒng)計(jì)表
設(shè)表class中字段為id,teacher,course
1. 統(tǒng)計(jì)教授多門課老師數(shù)量
select count(*) from class
group by teacher having count(*) > 1
2. 輸出每位老師教授課程數(shù)統(tǒng)計(jì)
select teacher, count(course) as count_course
from class
group by teacher
16?問:四個(gè)人選舉出一個(gè)騎士,統(tǒng)計(jì)投票數(shù),并輸出真正的騎士名字
設(shè)表tabe中字段為id,knight,vote_knight
select knight from table
group by vote_knight
order by count(vote_knight) limit 1
17?問:?jiǎn)T工表,宿舍表,部門表,統(tǒng)計(jì)出宿舍樓各部門人數(shù)表
設(shè):
員工表為employee,字段為id,employee_name,belong_dormitory_id,belong_department_id;
宿舍表為dormitory,字段為id,dormitory_number;
部門表為department,字段為id,department_name
select dormitory_number, department_name, count(employee_name) as count_employee
from employee as e
left join dormitory as dor on e.belong_dormitory_id = dor.id
left join department as dep on e.belong_department_id = dep.id
18?問:給出一堆數(shù)和頻數(shù)的表格,統(tǒng)計(jì)這一堆數(shù)中位數(shù)
設(shè)表table中字段為id,number,frequency
set @sum = (select sum(frequency)+1 as sum from table)
set @index = 0
set @last_index = 0
select avg(distinct t.frequecy)
from
(select @last_index := @index, @index := @index+frequency as index, frequency
from table) as t
where t.index in (floor(@sum/2), ceiling(@sum/2))
or (floor(@sum/2) > t.last_index and ceiling(@sum.2) <= t.index)
19?問:中位數(shù),三個(gè)班級(jí)合在一起的一張成績(jī)單,統(tǒng)計(jì)每個(gè)班級(jí)成績(jī)中位數(shù)
設(shè)表table中字段為id,class,score
select t1.class, avg(distinct t1.score) as median
from table t1, table t2 on t1.id = t2.id
group by t1.class, t1.score
having sum(case when t1.score >= t2.score then 1else 0 end) >=
(select count(*)/2 from table where table.class = t1.class)
and
having sum(case when t1.score <= t2.score then 1else 0 end) >=
(select count(*)/2 from table where table.class = t1.class)
20?問:交易表結(jié)構(gòu)為user_id,order_id,pay_time,order_amount
寫sql查詢過去一個(gè)月付款用戶量(提示:用戶量需去重)最高的3天分別是哪幾天
寫sql查詢做昨天每個(gè)用戶最后付款的訂單ID及金額
select count(distinct user_id) as c from table group by month(pay_time) order by c desc limit 3
select order_id, order_amount from ((select user_id, max(pay_time) as mt from table group by user_id where DATEDIFF(pay_time, NOW()) = -1 as t1) left join table as t2 where t1.user_id = t2.user_id and t1.mt == t2.pay_time)
21?問:PV表a(表結(jié)構(gòu)為user_id,goods_id),點(diǎn)擊表b(user_id,goods_id),數(shù)據(jù)量各為50萬條,在防止數(shù)據(jù)傾斜的情況下,寫一句sql找出兩個(gè)表共同的user_id和相應(yīng)的goods_id
select * from a
where a.user_id exsit (select user_id from b)
22?問:表結(jié)構(gòu)為user_id,reg_time,age, 寫一句sql按user_id隨機(jī)抽樣2000個(gè)用戶? 寫一句sql取出按各年齡段(每10歲一個(gè)分段,如(0,10))分別抽樣1%的用戶
1. 隨機(jī)抽樣2000個(gè)用戶
select * from table order by rand() limit 2000
2. 取出各年齡段抽樣1%的用戶
set @target = 0
set @count_user = 0
select @target:=@target+10 as age_right, *
from table as t1
where t1.age >=@target-10 and t1.age < (@target)
and t1.id in
(select floor(count(*)*0.1) from table as t2
where t1.age >=@target-10 and t1.age < (@target)
order by rand() limit ??)
注:mysql下按百分比取數(shù)沒有想到比較好的方法,因?yàn)閘imit后面不能接變量。想到的方法是先計(jì)算出每個(gè)年齡段的總數(shù),然后計(jì)算出1%是多少,接著給每一行加一個(gè)遞增+1的行標(biāo),當(dāng)行標(biāo)=1%時(shí),結(jié)束
23?問:用戶登錄日志表為user_id,log_id,session_id,plat,visit_date 用sql查詢近30天每天平均登錄用戶數(shù)量? 用sql查詢出近30天連續(xù)訪問7天以上的用戶數(shù)量
1. 近三十天每天平均登錄用戶數(shù)量
select visit_date, count(distince user_id)
group by visit_date
2. 近30天連續(xù)訪問7天以上的用戶數(shù)量
select t1.date
from table t1, table t2, ..., table t7
on t1.visit_date = (t2.visit_date+1) and t2.visit_date = (t3.visit_date+1)
and ... and t6.visit_date = (t7.visit_date+1)
24?問:表user_id,visit_date,page_name,plat ?統(tǒng)計(jì)近7天每天到訪的新用戶數(shù) 統(tǒng)計(jì)每個(gè)訪問渠道plat7天前的新用戶的3日留存率和7日留存率
1. 近7天每天到訪的新用戶數(shù)
select day(visit_date), count(distinct user_id)
from table
where user_id not in
(select user_id from table
where day(visit_date) < date_sub(visit_date, interval 7day))
2. 每個(gè)渠道7天前用戶的3日留存和7日留存
# 三日留存
# 先計(jì)算每個(gè)平臺(tái)7日前的新用戶數(shù)量
select t1.plat, t1.c/t2.c as retention_3
(select plat, count(distinct user_id)
from table
group by plat, user_id
having day(min(visit_date)) = date_sub(now(), interval 7 day)) as t1
left join
(select plat, count(distinct user_id) as c
from table
group by user_id having count(user_id) > 0
having day(min(visit_date)) = date_sub(now(), interval 7 day)
and day(max(visit_date)) > date_sub(now(), interval 7 day)
and day(max(visit_date)) <= date_sub(now(), interval 4day)) as t2
on t1.plat = t2.plat
-END-
掃碼回復(fù)【大廠】即可免費(fèi)獲取7個(gè)大廠筆試&面經(jīng)合集
