【ROSALIND】【練Python,學生信】25 序列拼接

如果第一次閱讀本系列文檔請先移步閱讀【ROSALIND】【練Python,學生信】00 寫在前面 ?謝謝配合~

題目:
將基因組片段組裝為最短“超串”
Given: At most 50 DNA strings of approximately equal length, not exceeding 1 kbp, in FASTA format (which represent reads deriving from the same strand of a single linear chromosome). The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length..
所給:不超過50個DNA序列,長度相近且不會超過1kb,,以fasta格式給出,代表來自一個線性染色體上同一條鏈的片段。同時數(shù)據(jù)集滿足如下要求:肯定有唯一解法,使這些片段兩兩重合長度大于其一般長度,從而組裝成一個完整序列。
Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).
需得:由所給的所有序列組成的最短完整序列。
?
特別提醒
所給數(shù)據(jù)集是極其理想化的,在實際中reads不可能只來自同一條鏈。而且在實際中直接拼接出完整基因組的可能性極小,通常只能得到臨近的部分序列,稱為contigs(重疊群)。
?
測試數(shù)據(jù)
>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59
GCCGGAATAC
測試輸出
ATTAGACCTGCCGGAATAC
?
背景
測序使我們進行生物學研究的重要方法,現(xiàn)代高通量測序中,我們只能得到長度有限的部分序列片段,稱為reads,如何把大量reads拼接成完整的基因組成為亟待解決的重要問題。序列拼接的示意圖如下:

?
思路
我的想法是將所有序列兩兩比較,每次都將重合度最大的兩個序列合并,重復這個過程直至所有序列合并為一個序列。在這個思路的指導下,我用了大量的循環(huán),最多處達到三層循環(huán),但只要不被循環(huán)繞暈,整體思路還是較為簡單直接的。
?
?
代碼
def readfasta(lines):
"""讀入fasta格式文件的函數(shù)"""
??? seq = []
??? index = []
??? seqplast = ""
??? numlines = 0
??? for i in lines:
??????? if '>' in i:
??????????? index.append(i.replace("\n", "").replace(">", ""))
??????????? seq.append(seqplast.replace("\n", ""))
??????????? seqplast = ""
??????????? numlines += 1
??????? else:
??????????? seqplast = seqplast + i.replace("\n", "")
??????????? numlines += 1
??????? if numlines == len(lines):
??????????? seq.append(seqplast.replace("\n", ""))
??? seq = seq[1:]
??? return index, seq
?
def findoverlap(s1,s2):
??? """返回兩個序列重疊堿基數(shù)量的函數(shù)"""
??? n1 = len(s1)
??? n2 = len(s2)
??? if n1 >= n2:
??????? i = n2
??????? flag = 0
??????? while i > 0:
??????????? if s1[-i:] == s2[:i]: # 把前一個序列的后部與第二個序列的前部依次比較
??????????????? flag = i # 用flag變量記錄重疊的堿基數(shù)
??????????????? break
??????????? i -= 1
??? else:
??????? i = n1 -1
??????? flag = 0
??????? while i > 1:
??????????? if s1[-i:] == s2[:i]:
??????????????? flag = i
??????????????? break
?????????? ?i -= 1
??? return flag
def isoverlap(s1, s2):
??? """判斷兩個序列前后順序的函數(shù)"""
??? flag1 = findoverlap(s1, s2)
??? flag2 = findoverlap(s2, s1)
??? if flag1 >= flag2: # 比較兩個序列的先后位置對重疊堿基數(shù)的影響
??????? flag = flag1
??????? orders = 1 # 用orders變量記錄先后順序
??? elif flag2 > flag1:
??????? flag = flag2
??????? orders = 2
??? return flag, orders
def addseq(n, seq1, seq2):
??? """將兩個序列合并成一個的函數(shù)"""
??? n1 = len(seq1)
??? aseq = seq1[:n1 - n] + seq2
??? return aseq
f = open('input.txt', 'r')
lines = f.readlines()
f.close()
[index, seq] = readfasta(lines)
seq1 = ''
seq2 = ''
while len(seq) > 1: # 用一個大循環(huán)控制,直至所有片段合成一個
??? for i in range(len(seq)): # 下面的兩個for循環(huán)實現(xiàn)seq中所有序列兩兩比對
??????? miniseq = [] # miniseq用來防止序列自身比對
??????? k =0
??????? while k < len(seq):
??????????? if k != i:
??????????????? miniseq.append(seq[k])
??????????? k += 1
??????? maxoverlap = 0 # 用maxoverlap記錄當前輪比較最多的重合堿基數(shù)
??????? for j in range(len(miniseq)):
??????????? [overlap, orders] = isoverlap(seq[i], miniseq[j])
??????????? if overlap > maxoverlap: # 本次比較與之前的最大重合數(shù)比,若更大,則代替后者稱為新的最大重合堿基數(shù)
??????????????? maxoverlap = overlap
??????????????? if orders == 1: # seq1和seq2記錄帶來最大重合堿基數(shù)的兩個序列,orders指示順序
??????????????????? seq1 = seq[i]
??????????????????? seq2 = miniseq[j]
??????????????? else:
??????????????????? seq1 = miniseq[j]
??????????????????? seq2 = seq[i]
??? totalseq = addseq(maxoverlap, seq1, seq2) # 將本輪具有最大重合度的兩個序列合并
??? seq.remove(seq1) # 刪去已被合并的兩個序列
??? seq.remove(seq2)
??? seq.append(totalseq) # 將合并的新串加入seq序列,進入下一輪比較
f = open('output.txt', 'w')
f.write(totalseq)
f.close()
?