【ROSALIND】【練Python,學(xué)生信】34 測(cè)序片段錯(cuò)誤修正

如果第一次閱讀本系列文檔請(qǐng)先移步閱讀【ROSALIND】【練Python,學(xué)生信】00 寫(xiě)在前面 ?謝謝配合~

題目:
修正測(cè)序片段中的錯(cuò)誤(Error Correction in Reads)
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read s in the dataset, one of the following applies:
s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).
所給:約1000條等長(zhǎng)的測(cè)序片段,最多不長(zhǎng)過(guò)50bp,以FASTA格式給出。一些片段包含單個(gè)的堿基替換。單個(gè)片段s符合以下兩種情況之一:
????????s是正確的序列,在數(shù)據(jù)集中至少出現(xiàn)兩次(可能以反向互補(bǔ)序列的形式給出);
????????s包含錯(cuò)誤堿基,在數(shù)據(jù)集中只出現(xiàn)一次,與正確序列僅相差一個(gè)堿基。
Return: A list of all corrections in the form "[old read]->[new read]". (Each correction must be a single symbol substitution, and you may return the corrections in any order.)
需得:以錯(cuò)誤序列->修正序列形式給出的列表,可以用任何順序排列。
?
測(cè)試數(shù)據(jù)
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
測(cè)試輸出
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA
?
生物學(xué)背景
????????二代測(cè)序讀長(zhǎng)很短,錯(cuò)誤率較高且無(wú)法預(yù)測(cè)。當(dāng)發(fā)現(xiàn)有堿基替換時(shí),需要考慮是真正的突變,還是由測(cè)序錯(cuò)誤導(dǎo)致。
?
思路
????????我的思路是用兩重循環(huán)掃描數(shù)據(jù)集,通過(guò)挨個(gè)比較,將數(shù)據(jù)分為正確序列和錯(cuò)誤序列兩部分。輸出的時(shí)候,再用兩重循環(huán),遍歷正確和錯(cuò)誤的部分,得到結(jié)果。
?
代碼
def readfasta(lines):
??? """讀入fasta格式文件的函數(shù)"""
??? seq = []
??? index = []
??? seqplast = ""
??? numlines = 0
??? for i in lines:
??????? if '>' in i:
??????????? index.append(i.replace("\n", "").replace(">", ""))
??????????? seq.append(seqplast.replace("\n", ""))
??????????? seqplast = ""
??????????? numlines += 1
??????? else:
??????????? seqplast = seqplast + i.replace("\n", "")
??????????? numlines += 1
??????? if numlines == len(lines):
??????????? seq.append(seqplast.replace("\n", ""))
??? seq = seq[1:]
??? return index, seq
def recom(s):
??? """輸出原序列反向互補(bǔ)序列的函數(shù)"""
??? re = s[::-1]
??? c = ""
??? for i in re:
??????? if i == 'A':
??????????? c = c + 'T'
??????? elif i == 'G':
??????????? c = c + 'C'
??????? elif i == 'T':
??????????? c = c + 'A'
??????? elif i == 'C':
??????????? c = c + 'G'
??? return c
def hamm(s1, s2):
??? """計(jì)算漢明距離的函數(shù)"""
??? hs = 0
??? for i in range(len(s1)):
??????? if s1[i] != s2[i]:
??????????? hs += 1
??? return hs
?
?
f = open('input.txt', 'r')
lines = f.readlines()
f.close()
[index, seq] = readfasta(lines)
recomseq = [] # 存儲(chǔ)反向互補(bǔ)序列
correct = [] # 存儲(chǔ)正確的序列
wrong = [] # 存儲(chǔ)錯(cuò)誤的序列
length = len(seq)
for i in range(length): # 依次取出每個(gè)測(cè)序片段
??? flag = 0
??? recomseq = recom(seq[i]) # 將其轉(zhuǎn)為反向互補(bǔ)序列
??? if seq[i] not in correct and recomseq not in correct: # 如果正反序列都不在已有的正確序列中
??????? for j in range(i+1,len(seq)):
??????????? if hamm(seq[i], seq[j]) == 0: # 如果在還未掃描的讀長(zhǎng)中發(fā)現(xiàn)了相同的序列
??????????????? flag = 1 # flag標(biāo)記是否有相同序列
??????????? elif hamm(recomseq, seq[j]) == 0: # 如果在還未掃描的讀長(zhǎng)中發(fā)現(xiàn)了反向互補(bǔ)序列的相同序列
??????????????? flag = 1
??????? if flag == 1: # 如果有相同序列,這個(gè)讀長(zhǎng)是正確的
??????????? correct.append(seq[i])
??????? else: # 反之,是錯(cuò)誤序列
??????????? wrong.append(seq[i])
f = open('output.txt', 'a')
for i in wrong:
??? for j in correct: # 依次取出錯(cuò)誤序列,與正確序列挨個(gè)比較
??????? recomseq = recom(j)
??????? if hamm(i, j) == 1: # 如果存在一個(gè)堿基替換
??????????? f.write(i + '->' + j)
??????????? f.write('\n')
??????????? # print(i, end='->')
??????????? # print(j)
??????? elif hamm(i, recomseq) == 1:
??????????? f.write(i, + '->' + recomseq)
??????????? f.write('\n')
??????????? # print(i, end='->')
??????????? # print(recomseq)
f.close()