觸類旁通!一文掌握泛癌之感興趣基因集SNV分析

Pancancer一詞,在生信文章中出現(xiàn)的頻率越來(lái)越高,已經(jīng)成為一個(gè)網(wǎng)紅分析內(nèi)容,趁著今天的機(jī)會(huì),今天小云想為大家分享一下泛癌分析之感興趣基因集SNV分析,接下來(lái)小云帶著大家開(kāi)始今天的學(xué)習(xí)之旅。
1.?如何進(jìn)行泛癌中感興趣基因集SNV分析?
在進(jìn)行分析之前,小云想為大家科普一下何為SNV,SNV為單核苷酸多態(tài)性,主要包括在coding區(qū)域和非coding區(qū)域發(fā)生突變,編碼區(qū)的SNV可能影響蛋白質(zhì)序列,而非編碼區(qū)的的SNV可能影響基因的表達(dá)和剪接過(guò)程,這就是小云對(duì)SNV的簡(jiǎn)單介紹哈!
接下來(lái)小云為大家介紹一哈咋進(jìn)行該分析,其實(shí)很簡(jiǎn)單,只需要下載pancer突變數(shù)據(jù),以及自己感興趣的基因集,就可以來(lái)統(tǒng)計(jì)感興趣基因集在不同腫瘤組織中發(fā)生突變的頻率,以及發(fā)生的突變類型,小云已經(jīng)迫不及待想開(kāi)始實(shí)操啦,那就開(kāi)始吧!
2.需要準(zhǔn)備的R包
#安裝需要的R包
install.packages("ggplot2")
install.packages("readxl")
install.packages("stringr")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("reshape2")
install.packages("RcolorBrewer")
install.packages("magrittr")
BiocManager::install("maftools")
#導(dǎo)入需要的R包
library(magrittr)
library(readxl)
library(stringr)
library(ggplot2)
library(maftools)
library(dplyr)
library(reshape2)
library(tidyverse)
library(RColorBrewer)
3.?數(shù)據(jù)準(zhǔn)備
在進(jìn)行感興趣基因在泛癌中的突變情況時(shí),我們只需要兩個(gè)文件就可以順利的完成分析,包括TCGA pancancer突變數(shù)據(jù)和TCGA pancancer 樣本信息文件,小云為大家提供了數(shù)據(jù)下載網(wǎng)址:https://gdc.cancer.gov/about-data/publications/pancanatlas,下載起來(lái)so esay啦!,小云并為大家貼上該網(wǎng)址,方便大家迅速完成數(shù)據(jù)準(zhǔn)備工作。

#TCGA pancancer突變數(shù)據(jù)下載
mc3.v0.2.8.PUBLIC.maf.gz,泛癌突變數(shù)據(jù)文件,需要關(guān)注基因名,變異類型以及樣本名這三列信息。

#泛癌樣本信息文件下載
merged_sample_quality_annotations.tsv,只需要關(guān)注前三列數(shù)據(jù)就可以,樣本信息和腫瘤類型。

?4.?泛癌中感興趣基因集SNV分析
如果小伙伴已經(jīng)下載好了所需要的數(shù)據(jù),小云將通過(guò)樣本信息文件,泛癌突變數(shù)據(jù)文件和感興趣基因集,來(lái)提取不同腫瘤組織中目標(biāo)基因集發(fā)生的突變文件,統(tǒng)計(jì)目標(biāo)基因發(fā)生在不同腫瘤中的突變頻率和突變類型,通過(guò)該分析小伙伴會(huì)掌握很多數(shù)據(jù)處理的小技巧,小云強(qiáng)烈安利小伙伴認(rèn)真學(xué)習(xí),真的是干活滿滿。
# 修正TCGA名稱
#讀取樣本信息文件
rawAnno <- read.delim("merged_sample_quality_annotations.tsv",sep = "\t",row.names = NULL,check.names = F,stringsAsFactors = F,header = T)
#提取1-15位字符串
rawAnno$simple_barcode <- substr(rawAnno$aliquot_barcode,1,15)
#去除重復(fù)的樣本名
samAnno<- rawAnno[!duplicated(rawAnno$simple_barcode),c("cancer?type", "simple_barcode")]
#去除cancer type 為空的數(shù)據(jù)
samAnno <- samAnno[which(samAnno$`cancer type` != ""),]
#保存簡(jiǎn)化的樣本信息文件
write.table(samAnno,"simple_sample_annotation.txt",sep = "\t",row.names = F,col.names = T,quote = F)
# 加載泛癌突變MAF文件,讀取文件有點(diǎn)慢,耐心等待奧。
panmaf <- read_tsv("mc3.v0.2.8.PUBLIC.maf.gz", comment = "#")
# 感興趣基因列表
genelist<- c("PER3","PER2","GPR50","CYP2C19","MTNR1B","CLOCK","CYP1A2","RORA","ARNTL","MTNR1A","ASMT","AANAT")
#codeing區(qū)域
vc_nonSyn <- c("Frame_Shift_Del",
???????????????"Frame_Shift_Ins",
???????????????"Splice_Site",
???????????????"Translation_Start_Site",
???????????????"Nonsense_Mutation",
???????????????"Nonstop_Mutation",
???????????????"In_Frame_Del",
???????????????"In_Frame_Ins",
???????????????"Missense_Mutation")
## 非coding區(qū)域
vc_Syn <- c("3'UTR",
????????????"5'UTR",
????????????"3'Flank",
????????????"5'Flank",
????????????"IGR",
????????????"Intron",
????????????"Silent")
# 獲取不同癌種的樣本數(shù)
mafsam <- data.frame(samID = panmaf$Tumor_Sample_Barcode,
?????????????????????simple_barcode = substr(panmaf$Tumor_Sample_Barcode,1,15),
?????????????????????stringsAsFactors = F)
#去除重復(fù)的樣本
mafsam <- mafsam[!duplicated(mafsam$simple_barcode),]
#合并兩個(gè)文件
mafsam <- merge(mafsam,samAnno, by = "simple_barcode", all.x = TRUE)
mafsam <- as.data.frame(na.omit(mafsam))
?
txt <- data.frame(label = paste0(names(table(mafsam$`cancer type`))," (n=", as.numeric(table(mafsam$`cancer type`)),")"),
??????????????????"cancer type" = names(table(mafsam$`cancer type`)),
??????????????????check.names = F) # 將腫瘤和樣本數(shù)標(biāo)簽合并
#合并兩個(gè)文件
mafsam <- merge(mafsam, txt, by = "cancer type", all.x = TRUE)
rownames(mafsam) <- mafsam$simple_barcode
# 取感興趣的突變子集
panmaf <- panmaf[which(panmaf$Hugo_Symbol %in% genelist),]
panmaf$Tumor_Sample_Barcode <- substr(panmaf$Tumor_Sample_Barcode,1,15)
write.table(panmaf, file = "maf_of_genes_of_interest.txt",sep = "\t",row.names = F,col.names = T,quote = F)
# 創(chuàng)建突變矩陣
panmaf <- read_tsv("maf_of_genes_of_interest.txt", comment = "#")
nonSyncount <- matrix(0, nrow = length(genelist), ncol = length(mafsam$simple_barcode),
???????????????????dimnames = list(genelist, mafsam$simple_barcode))
Syncount <- matrix(0, nrow = length(genelist), ncol = length(mafsam$simple_barcode),
???????????????????dimnames = list(genelist, mafsam$simple_barcode))
nonSyncount <- as.data.frame(t(nonSyncount))
Syncount <- as.data.frame(t(Syncount))
#通過(guò)循環(huán)計(jì)算感興趣基因在每種腫瘤中出現(xiàn)的coding和非coding的數(shù)目
for (i in genelist) {
??message(i," starts...")
??submaf <- panmaf[which(panmaf$Hugo_Symbol == i),] # 取出當(dāng)前基因的突變
??for (j in 1:nrow(submaf)) { # 循環(huán)每一行
????tmp <- submaf[j,]
????mutGene <- i # 獲取當(dāng)前突變的基因
????mutSam <- tmp$Tumor_Sample_Barcode # 獲取當(dāng)前突變的樣本
????if(is.element(tmp$Variant_Classification, vc_nonSyn)) { # 如果突變類型為coding類
??????nonSyncount[mutSam, mutGene] <- 1 # 則coding計(jì)數(shù)矩陣設(shè)為1
????}
????if(is.element(tmp$Variant_Classification, vc_Syn)) { # 如果突變類型為非coding類
??????Syncount[mutSam, mutGene] <- 1 # 則非coding計(jì)數(shù)矩陣設(shè)為1
????}
??}
}
# 根據(jù)癌種計(jì)算累積突變樣本量
nonSyncount$tumor <- mafsam[rownames(nonSyncount),"label"]
Syncount$tumor <- mafsam[rownames(Syncount),"label"]
nonSyncount.tumor <- as.data.frame(apply(nonSyncount[,setdiff(colnames(nonSyncount), "tumor")], 2, function(x) tapply(x, INDEX=factor(nonSyncount$tumor), FUN=sum, na.rm=TRUE))) # 對(duì)同一種癌種的非沉默突變求和
Syncount.tumor <- as.data.frame(apply(Syncount[,setdiff(colnames(Syncount), "tumor")], 2, function(x) tapply(x, INDEX=factor(Syncount$tumor), FUN=sum, na.rm=TRUE))) # 對(duì)同一種癌種的沉默突變求和
mutCount <- matrix(NA, nrow = length(genelist), ncol = length(unique(mafsam$label)),
???????????????????dimnames = list(genelist, unique(mafsam$label)))
mutCount <- as.data.frame(mutCount)
mutFreq <- mutCount
#通過(guò)循環(huán)計(jì)算感興趣基因在每種腫瘤中出現(xiàn)的coding和非coding的突變頻率。
for (i in genelist) {
??for(j in unique(mafsam$label)) {
????tumor <- sapply(strsplit(j," (", fixed = T),"[",1)
????if(nonSyncount.tumor[j,i] > 0) { # 若非沉默突變存在
??????mutCount[i,j] <- nonSyncount.tumor[j,i] # 記錄該癌種的非沉默突變數(shù)
??????mutFreq[i,j] <- nonSyncount.tumor[j,i]/as.numeric(table(mafsam$`cancer type`)[tumor]) * 100 # 根據(jù)該癌種的樣本數(shù)計(jì)算突變頻率(個(gè)人認(rèn)為原文并沒(méi)有如此計(jì)算,原文的顏色映射似乎是按照突變數(shù)目來(lái)的,而不是突變頻率)
????}
????if(nonSyncount.tumor[j,i] == 0 & Syncount.tumor[j,i] > 0) { # 若非沉默突變不存在,而沉默突變存在
??????mutCount[i,j] <- 0 # 則記錄該癌種的突變數(shù)目為0
??????mutFreq[i,j] <- 0 # 突變頻率為0
????}
????if(nonSyncount.tumor[j,i] == 0 & Syncount.tumor[j,i] == 0) { # 若非沉默突變不存在,沉默突變也不存在
??????mutCount[i,j] <- NA # 則此單元為空值
??????mutFreq[i,j] <- 0 # 突變頻率為0
????}
??}
}
5.繪制熱圖和瀑布圖
#長(zhǎng)寬數(shù)據(jù)轉(zhuǎn)換
ggCount <- melt(as.matrix(mutCount), varnames = c("Cancer", "Gene"), as.is = T)
ggFreq <- melt(as.matrix(mutFreq), varnames = c("Cancer", "Gene"), as.is = T)
#合并兩個(gè)文件
ggData <- cbind.data.frame(ggCount, Freq = ggFreq$value)
#設(shè)置列名
colnames(ggData) <- c("Cancer","Gene","Count","Freq")
#開(kāi)始繪制熱圖
p <-
??ggplot(data = ggData) +
??geom_tile(mapping = aes(Gene, Cancer, fill = Freq), col = "grey80") +
??geom_text(mapping = aes(Gene, Cancer, label = Count), size =3) +
??scale_fill_gradient(low = "white","high" = "red") +
??scale_x_discrete(position = "top") +
??labs(fill = "Mutation Frequency (%)") +
??theme_bw() + #theme(axis.text.x = element_text(angle = 45))
??theme(axis.title = element_blank(),
????????axis.text.x = element_text(hjust = 0, size = 10, color = "black", angle = 45),
????????axis.text.y = element_text(hjust = 1, size = 10, color = "black"),
????????axis.ticks = element_line(size=0.2, color="black"),
????????axis.ticks.length = unit(0.2, "cm"),
????????panel.background = element_blank(),
????????panel.grid = element_blank())
ggsave(p, filename = "frequency_heatmap.pdf", width = 10,height = 4)
#繪制瀑布圖
# 讀取感興趣的基因子集突變文件
panmaf <- read.maf(maf = "maf of genes of interest.txt",
???????????????????removeDuplicatedVariants = TRUE,
???????????????????isTCGA = FALSE,
???????????????????vc_nonSyn = c("Frame_Shift_Del",
?????????????????????????????????"Frame_Shift_Ins",
?????????????????????????????????"Splice_Site",
?????????????????????????????????"Translation_Start_Site",
?????????????????????????????????"Nonsense_Mutation",
?????????????????????????????????"Nonstop_Mutation",
?????????????????????????????????"In_Frame_Del",
?????????????????????????????????"In_Frame_Ins",
?????????????????????????????????"Missense_Mutation"))
mafAnno<-mafsam[which(mafsam$simple_barcode%in%panmaf@data$Tumor_Sample_Barcode),]
colnames(mafAnno)[1:2] <- c("Cancer","Tumor_Sample_Barcode")
# 設(shè)置突變顏色
mutcol <- brewer.pal(n = 10, name = 'Paired')
names(mutcol)<-c('Frame_Shift_Del','Missense_Mutation','Nonsense_Mutation', 'Frame_Shift_Ins','In_Frame_Ins','Splice_Site','In_Frame_Del','Nonstop_Mutation','Translation_Start_Site','Multi_Hit')
cancercolors<-NMF:::ccRamp(brewer.pal(n=12,name='Paired'),length(unique(mafAnno$Cancer)))
names(cancercolors) <- unique(mafAnno$Cancer)
#繪制瀑布圖
annocolors = list(Cancer = cancercolors)
pdf(file = "oncoprint.pdf", width = 10, height = 5)
oncoplot(maf = panmaf, # MAF文件
?????????colors = mutcol, # 突變類型的顏色
?????????minMut = 0.05 * nrow(mafAnno), # 所顯示的突變至少有5%的突變頻率
?????????bgCol = "grey95", # 瀑布圖背景色
?????????borderCol = NA, # 突變的邊框(由于樣本太多所以不設(shè)置邊框)
?????????annotationDat = mafAnno, # 樣本注釋文件
?????????annotationColor = annocolors,
?????????clinicalFeatures = "Cancer", # 從注釋文件中顯示的信息
?????????sortByAnnotation = T) # 按照mafAnno的第一列排序,即按照Cancer排序
invisible(dev.off())
6.?結(jié)果文件
1.?maf_of_genes_of_interest.txt
該文件為提取的感興趣基因集在泛癌中的突變數(shù)據(jù);重點(diǎn)關(guān)注基因名,變異類型和樣本信息這三列信息。

2.?simple_sample_annotation.txt
該文件為簡(jiǎn)化后的泛癌樣本信息文件,第一列樣本名,第二列腫瘤類型。

3.?frequency_heatmap.pdf
該圖為目標(biāo)基因集在不同腫瘤組織中突變頻率熱圖,橫坐標(biāo)為腫瘤類型以及每種腫瘤所包含的樣本數(shù)目,縱坐標(biāo)為目標(biāo)基因,通過(guò)顏色深淺來(lái)表示突變頻率高低。

4.?frequency_heatmap.pdf
該圖為繪制的感興趣基因在不同腫瘤組織中突變情況的瀑布圖,利用不同的顏色分別來(lái)表示不同腫瘤類型和發(fā)生的變異類型。

小云最終繪制了反映感興趣基因集突變情況的熱圖和瀑布圖,只需要輸入自己感興趣的基因集文件,就可以輕松的完成該分析,泛癌分析的其他分析內(nèi)容可以嘗試使用本公司新開(kāi)發(fā)的云平臺(tái)生物信息分析小工具,零代碼完成分析,云平臺(tái)網(wǎng)址:http://www.biocloudservice.com/home.html;主要包括泛癌中體細(xì)胞拷貝數(shù)變異分析(http://www.biocloudservice.com/704/704.php),基因在腫瘤與正常組織中差異分析(http://www.biocloudservice.com/710/710.php)等相關(guān)泛癌分析小工具,歡迎大家和小云一起討論學(xué)習(xí)奧,下期再見(jiàn)奧。
好了,今天小云的分享就到這里啦,完整版代碼上“云生信學(xué)生物信息學(xué)”公正號(hào)回復(fù)“代碼”領(lǐng)取喲~
