泛癌多組學之基因表達與體細胞拷貝數(shù)變異相關性分析

今天小云想帶著小伙伴進行泛癌聯(lián)多組學分析,通過該推文完全掌握泛癌中基因表達與體細胞拷貝數(shù)變異相關性分析,通過基因發(fā)生的拷貝數(shù)變異情況,來研究拷貝數(shù)變異對基因表達的影響,從而可以推斷出影響該疾病的分子機制,全方面來研究調(diào)控機制,分析內(nèi)容會更有說服性,接下來跟這小云開始今天的學習奧。
1.如何進行基因表達與體細胞拷貝數(shù)變異相關性分析?
其實該分析很簡單,只需要準備感興趣基因集文件,泛癌表達矩陣文件和泛癌樣本信息文件,分別提取基因集的基因表達和體細胞CNV文件,最后進行二者相關性分析,并繪制相關性氣泡圖,小云覺得小伙伴肯定能迅速掌握該分析奧,那就跟著小云開始實操學習吧!
2.需要的R包
#下載需要的R包
BiocManager::install("ComplexHeatmap")
install.packages("ggpubr", "randomcoloR")
install.packages("ggplot2")
install.packages("data.table")
BiocManager::install("impute")
#載入需要的R包
library(ggplot2)
library(data.table)
library(impute)
library(ComplexHeatmap)
library(randomcoloR)
library(ggpubr)
library(GSVA)
library(clusterProfiler)
3.數(shù)據(jù)下載
#泛癌樣本信息下載-https://gdc.cancer.gov/about-data/publications/pancanatlas
merged_sample_quality_annotations.tsv,第一列和第二列為樣本信息,第三列為腫瘤類型;其他列可以忽略奧。

#泛癌表達矩陣文件下載-https://gdc.cancer.gov/about-data/publications/pancanatlas
EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv,第一列為基因名,其他列為樣本名;注意,基因名是以“|”分割,需要做格式轉化。

#泛癌體細胞拷貝數(shù)變異文件-https://xenabrowser.net/datapages/?dataset=TCGA.PANCAN.sampleMap%2FGistic2_CopyNumber_Gistic2_all_data_by_genes&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
Gistic2_CopyNumber_Gistic2_all_data_by_genes.gz,第一列為基因名,其他列為樣本信息。

4.泛癌中基因表達與體細胞拷貝數(shù)變異相關性分析
# 獲得同時有腫瘤和正常樣本的腫瘤名
tumors <- c("BLCA","BRCA","CESC","CHOL","COAD",
????????????"ESCA","GBM","HNSC","KICH","KIRC",
????????????"KIRP","LIHC","LUAD","LUSC","PAAD",
????????????"PRAD","READ","STAD","THCA","UCEC")
# 獲得感興趣的基因集
frg<- c("CDKN1A","HSPA5","TTC35","SLC7A11","NFE2L2","MT1G","HSPB1","GPX4","FANCD2","CISD1","FDFT1","SLC1A5","SAT1",
?????????"TFRC","RPL8","NCOA4","LPCAT3","GLS2","DPP4","CS","CARS","ATP5G3","ALOX15","ACSL4","EMC2")
# 簡化泛癌樣本信息文件
rawAnno <- read.delim("merged_sample_quality_annotations.tsv",sep = "\t",row.names = NULL,check.names = F,stringsAsFactors = F,header = T) # 數(shù)據(jù)來自PanCanAtlas
#提取樣本名1-15個字符
rawAnno$simple_barcode <- substr(rawAnno$aliquot_barcode,1,15)
#去除重復的樣本名
samAnno <- rawAnno[!duplicated(rawAnno$simple_barcode),c("cancer type", "simple_barcode")]
#去除組織類型為空的數(shù)據(jù)
samAnno <- samAnno[which(samAnno$`cancer type` != ""),]
write.table(samAnno,"simple_sample_annotation.txt",sep = "\t",row.names = F,col.names = T,quote = F)
# 快速讀取表達矩陣文件
expr <- fread("EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv",sep = "\t",stringsAsFactors = F,check.names = F,header = T)
#將數(shù)據(jù)轉換為數(shù)據(jù)框
expr <- as.data.frame(expr)
#將第一列作為行名
rownames(expr) <- expr[,1]
#刪除第一列
expr <- expr[,-1]
# 調(diào)整基因名
gene <- sapply(strsplit(rownames(expr),"|",fixed = T), "[",1)
expr$gene <- gene
# 移除重復樣本
expr <- expr[!duplicated(expr$gene),]
rownames(expr) <- expr$gene
expr <- expr[,-ncol(expr)]
# 取部分表達譜,感興趣的基因集
comgene <- intersect(rownames(expr),frg)
expr_sub <- expr[comgene,]
colnames(expr_sub) <- substr(colnames(expr_sub),1,15)
expr_sub <- expr_sub[,!duplicated(colnames(expr_sub))]
#通過循環(huán)提取每種腫瘤組織的表達矩陣文件
for (i in tumors) {
??message("--",i,"...")
??sam <- samAnno[which(samAnno$`cancer type` == i),"simple_barcode"] # 獲取該腫瘤的全部樣本
??comsam <- intersect(colnames(expr_sub), sam) # 取出與基因表達泛癌數(shù)據(jù)交集的樣本
??
??tumsam <- comsam[substr(comsam,14,14) == "0"] # 獲得腫瘤樣本
??norsam <- comsam[substr(comsam,14,14) == "1"] # 獲得正常樣本
??expr_subset <- expr_sub[,c(tumsam,norsam)]
??expr_subset[expr_subset < 0] <- 0 # 這份數(shù)據(jù)里存在負值,即便負值比較小,但也要矯正,如果使用其他泛癌表達譜根據(jù)情況而定
??expr_subset <- as.data.frame(impute.knn(as.matrix(expr_subset))$data)
??write.table(expr_subset, paste0("TCGA_",i,"_expr_subset.txt"),sep = "\t",row.names = T,col.names = NA,quote = F)
}
#快速讀取CNV數(shù)據(jù)
cna <- fread("Gistic2_CopyNumber_Gistic2_all_data_by_genes.gz",sep = "\t",stringsAsFactors = F,header = T,check.names = F)
#將數(shù)據(jù)轉換為數(shù)據(jù)框
cna <- as.data.frame(cna)
comgene <- intersect(cna$Sample,frg)
# 只包含感興趣基因的CNA子集
cna <- cna[cna$Sample %in% comgene,]
rownames(cna) <- cna[,1]
cna <- cna[,-1]
#相關性計算
corCnaExpr <- NULL
for (i in tumors) {
??message("--",i,"...")
??# 讀取上面拆分好的表達數(shù)據(jù)
??expr_subset <- read.table(paste0("TCGA_",i,"_expr_subset.txt"),sep = "\t",row.names = 1,check.names = F,stringsAsFactors = F,header = T)
??# 取出同時具有表達且具有拷貝數(shù)的樣本
??comsam <- intersect(colnames(expr_subset),colnames(cna))
??cna_subset <- cna[,comsam] # 生成拷貝數(shù)子集
??rownames(cna_subset) <- gsub("EMC2","TTC35",rownames(cna_subset)) # EMC2是原文使用的TTC35基因的同名,使用子集數(shù)據(jù)時無須修正
??expr_subset <- expr_subset[,comsam] # 生成基因表達子集
??# 把cnv按癌癥名保存到單個文件
??write.table(cna_subset, paste0("TCGA_",i,"_broadcnv_subset_matched_expr.txt"),sep = "\t",row.names = T,col.names = T,quote = F)
??# 計算相關系數(shù)
??corTab <- NULL
??for (j in rownames(cna_subset)) {
????tmp1 <- as.numeric(cna_subset[j,]) # 相應的CNA數(shù)據(jù)
????tmp2 <- as.numeric(expr_subset[j,]) # 相應的表達數(shù)據(jù)
????cor.res <- cor.test(tmp1,tmp2, method = "spearman") # spearman相關性
????#將計算的相關性矩陣合并起來
????corTab <- rbind.data.frame(corTab,
???????????????????????????????data.frame(gene = j,
??????????????????????????????????????????tumor = i,
??????????????????????????????????????????Correlation = ifelse(is.na(cor.res$estimate), 0, cor.res$estimate),
??????????????????????????????????????????Pvalue = ifelse(is.na(cor.res$p.value), 1, cor.res$p.value),
??????????????????????????????????????????stringsAsFactors = F),
???????????????????????????????stringsAsFactors = F)
??}
??corCnaExpr <- rbind.data.frame(corCnaExpr,
?????????????????????????????????corTab,
?????????????????????????????????stringsAsFactors = F)
}
5.繪制相關性氣泡圖
# 設置顏色
blue <- "#4577FF"
red <- "#C2151A"
orange <- "#E45737"
#確定繪圖基因的順序
corCnaExpr$gene <- factor(corCnaExpr$gene,
??????????????????????????levels= rev(c("CDKN1A","HSPA5","TTC35","SLC7A11","NFE2L2","MT1G","HSPB1","GPX4","FANCD2","CISD1", ???????????????????????????????????????"FDFT1","SLC1A5","SAT1","TFRC","RPL8","NCOA4","LPCAT3","GLS2","DPP4","CS","CARS","ATP5G3","ALOX15","ACSL4")))
#繪制泡泡圖
my_palette <- colorRampPalette(c(blue,"white",orange), alpha=TRUE)(n=128)
ggplot(corCnaExpr, aes(x=tumor,y=gene)) +
??geom_point(aes(size=-log10(Pvalue), color=Correlation)) +
??scale_color_gradientn('Correlation',
????????????????????????colors=my_palette) +
??scale_size_continuous(range = c(1,12)) + #圓點的大小范圍
??#修改主題
??theme_bw() +
??theme(axis.text.x = element_text(angle = 45, size = 12, hjust = 0.3, vjust = 0.5, color = "black"),
????????axis.text.y = element_text(size = 12, color = rep(c(red,blue),c(14,10))),
????????axis.title = element_blank(),
????????panel.border = element_rect(size = 0.7, linetype = "solid", colour = "black"),
????????legend.position = "bottom",
????????plot.margin = unit(c(1,1,1,1), "lines"))
#保存圖片
ggsave("correlation_between_cna_and_expression_of_interested_genes_in_pancancer.pdf", width = 8,height = 8)
6.結果文件
1.?simple_sample_annotation.txt
該文件為提取的簡化的泛癌樣本信息文件,第一列為腫瘤組織,第二列為樣本名。

2.?TCGA_BLCA_expr_subset.txt
該文件為提取的感興趣基因集在特定腫瘤組織中的表達矩陣文件,行名為基因名,列名為樣本信息。

3.?TCGA_BLCA_broadcnv_subset_matched_expr.txt
該文件為提取的感興趣基因集在特定腫瘤組織中的CNV文件,行名為基因名,列名為樣本信息。

4.?TCGA_pancan_correlation_cnv_expr_subset.txt
該文件為計算的感興趣基因集在泛癌中基因表達與CNV的相關性分析結果文件,第一列為基因名,第二列為腫瘤類型,第三列為計算的相關系數(shù),第四列為Pvalue值。

5.?correlation_between_cna_and_expression_of_interested_genes_in_pancancer.pdf
該圖為感興趣基因在泛癌中基因表達與體細胞CNV的相關性氣泡圖,橫坐標為腫瘤組織名稱,縱坐標為基因名,點的大小表示pvalue值,用不同顏色表示相關系數(shù)大小。

小云通過一文完成了泛癌中感興趣基因集基因表達與體細胞拷貝數(shù)變異之間的相關性分析,該代碼只需要提供感興趣基因集文件就可以完成分析滴,泛癌分析的其他相關內(nèi)容可以嘗試使用本公司新開發(fā)的云平臺生物信息分析小工具,零代碼完成分析,云平臺網(wǎng)址:http://www.biocloudservice.com/home.html,主要包括泛癌中特定通路基因的突變分析(http://www.biocloudservice.com/740/740.php),泛癌中拷貝數(shù)變異分析(http://www.biocloudservice.com/700/700.php),泛癌中熱圖和箱線圖(http://www.biocloudservice.com/773/773.php),基因在腫瘤與正常組織中差異分析(http://www.biocloudservice.com/710/710.php)等相關泛癌分析小工具
好了,今天小云的分享就到這里啦,完整版代碼上“云生信學生物信息學”公正號回復“代碼”領取喲~小伙伴們有什么問題歡迎來和小云討論分享呀。
