【ROSALIND】【練Python,學生信】10 尋找DNA的consensus string

如果第一次閱讀本系列文檔請先移步閱讀【ROSALIND】【練Python,學生信】00 寫在前面 ?謝謝配合~

題目:
尋找最可能的祖先序列
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
所給:不超過十條等長DNA序列,長度均不超過1kb,以FASTA格式給出。
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
需得:這些DNA的一致序列和profile矩陣(如果存在多條一致序列,返回任意一條即可)。
?
測試數(shù)據(jù)
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
測試輸出
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
?
背景
在理想狀態(tài)下,當我們有多條長度相等的同源序列時,可以通過序列比對推到出最有可能的祖先序列,方法如下:
將m條長度為n的序列組成一個m×n的矩陣
???????? G G G C A A C T
???????? A T G G A T C T
? ? ? ? ?A A G C A A C C
???????? T T G G A A C T
???????? A T G C C A T T
???????? A T G G C A C T?
?比對之后可以得到一個4×n的矩陣,分別統(tǒng)計A、C、G、T在各位點出現(xiàn)的次數(shù):?
?
A?? ? 5 1 0 0 5 5 0 0
C?? ? 0 0 1 4 2 0 6 1
G?? ? 1 1 6 3 0 1 0 0
T?? ? 1 5 0 0 0 1 1 6
取每個位點出現(xiàn)次數(shù)最多的符號即可(出現(xiàn)次數(shù)最多的符號可能有多個,因此一致序列可能有多個可能)。
?
思路
需要統(tǒng)計各符號在每條序列每個位點上出現(xiàn)的頻率,因此考慮用兩層循環(huán),第一層用于掃描位點,第二層用于掃描各序列,即可統(tǒng)計出各符號出現(xiàn)的次數(shù)。
?
代碼
def readfasta(lines):? # 用封裝好的函數(shù)讀取FASTA文件
??? seq = []
??? index = []
??? seqplast = ""
??? numlines = 0
??? for i in lines:
??????? if '>' in i:
??????????? index.append(i.replace("\n", "").replace(">", ""))
??????????? seq.append(seqplast.replace("\n", ""))
??????????? seqplast = ""
??????????? numlines += 1
??????? else:
??????????? seqplast = seqplast + i.replace("\n", "")
??????????? numlines += 1
??????? if numlines == len(lines):
??????????? seq.append(seqplast.replace("\n", ""))
??? seq = seq[1:]
??? return index, seq
?
?
f = open('rosalind_cons.txt', 'r')
lines = f.readlines()
f.close()
[index, seq] = readfasta(lines)
?
A = []
G = []
T = []
C = []
i = 0
while i < len(seq[0]):? # 第一層循環(huán),掃描位點
??? j = 0
??? a = 0
??? g = 0
??? c = 0
??? t = 0
??? while j < len(seq):? # 第二層循環(huán),掃描各序列
??????? if seq[j][i] == 'A':? # 判斷該序列該位點是否為”A”
??????????? a += 1? # 是則相應計數(shù)加一
??????? elif seq[j][i] == 'G':
??????????? g += 1
??????? elif seq[j][i] == 'C':
??????????? c += 1
??????? elif seq[j][i] == 'T':
??????????? t += 1
??????? j += 1
??? i +=1
??? A.append(a)? # 記錄該位點出現(xiàn)”A”的次數(shù)
??? G.append(g)
??? C.append(c)
??? T.append(t)
?
# 推出一致序列
common = ''
base = 0
while base < len(A):
??? if max(A[base],G[base],C[base],T[base]) == A[base]:
??????? common += 'A'
??? elif max(A[base],G[base],C[base],T[base]) == G[base]:
??????? common += 'G'
??? elif max(A[base],G[base],C[base],T[base]) == T[base]:
??????? common += 'T'
??? elif max(A[base],G[base],C[base],T[base]) == C[base]:
??????? common += 'C'
base += 1
?
# 把計數(shù)各項從整數(shù)改為字符串型,方便寫入輸出文件
i = 0
while i < len(A):
??? A[i] = str(A[i])
??? G[i] = str(G[i])
??? C[i] = str(C[i])
??? T[i] = str(T[i])
??? i += 1
?
f = open('output.txt','a')
f.write(common + '\n')
f.write("A: ")
f.write(' '.join(A) + '\n')
f.write("C: ")
f.write(' '.join(C) + '\n')
f.write("G: ")
f.write(' '.join(G) + '\n')
f.write("T: ")
f.write(' '.join(T))
f.close()