[ROSALIND] [Practice Python, Learn Bioinformatics] 18 Finding Open Reading Frames (ORF)

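If this is your first time reading this series, please start with [ROSALIND] [Practice Python, Learn Bioinformatics] 00 Preface. Thanks for your cooperation~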

Problem:
Find all open reading frames (ORFs).
Given: A DNA string s of length at most 1 kbp in FASTA format.
Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

Sample dataset
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
Sample output
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
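MTPRLGLESLLE

Background
Genetic information flows from DNA to protein, but not every DNA sequence encodes a protein. An open reading frame (ORF) begins with a start codon, ends with a stop codon, and contains no stop codon in between. Translating the sequence within an ORF into a polypeptide gives a candidate protein string.
A DNA sequence can be read and translated in six reading frames, depending on how the bases are grouped into codon triplets. For example, ...AUGCUGAC... can be read as ...AUGCUG..., ...UGCUGA..., or ...GCUGAC.... Either strand can serve as the template, so we also need the reverse complement of the original sequence, which yields the other three reading frames in the same way.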
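
To make the three groupings concrete, here is a minimal sketch that splits a sequence into codons at each of the three offsets. The helper name reading_frames is made up for this illustration and is not part of the original solution. Applying the same helper to the reverse complement would give the other three frames.

```python
def reading_frames(seq):
    """Split seq into codons for each of the three offsets (0, 1, 2).

    Hypothetical helper for illustration only. Trailing bases that do
    not fill a complete codon are dropped.
    """
    frames = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


frames = reading_frames("AUGCUGAC")
for offset, codons in enumerate(frames):
    print(offset, codons)
# 0 ['AUG', 'CUG']
# 1 ['UGC', 'UGA']
# 2 ['GCU', 'GAC']
```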
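
Approach
The problem breaks down into the following steps:
First, obtain the reverse complement of the DNA and transcribe both sequences into RNA. Earlier posts in this series already solve this step, so that solution can be reused.
Second, locate the ORFs and store each one. Scan the sequence position by position for a start codon; from each start codon, read codon by codon until a stop codon appears, then save the coding sequence in a list.
Third, translate the nucleotide sequences into protein strings, again reusing the earlier solution.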
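
The first step can be sketched compactly with Python's built-in str.maketrans and str.translate. This is only one possible way to write it and differs from the character-by-character loop used in the code for this post; the function names reverse_complement and transcribe are chosen here for illustration, and the input is assumed to contain only A, C, G and T.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Turn a DNA string into RNA by replacing T with U."""
    return dna.replace("T", "U")


dna = "ATGCCA"
print(transcribe(dna))                      # AUGCCA
print(transcribe(reverse_complement(dna)))  # UGGCAU
```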

Code
codon_table = {
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A', 'CGU':'R', 'CGC':'R',
    'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R', 'UCU':'S', 'UCC':'S',
    'UCA':'S', 'UCG':'S', 'AGU':'S', 'AGC':'S', 'AUU':'I', 'AUC':'I',
    'AUA':'I', 'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L',
    'CUG':'L', 'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G', 'GUU':'V',
    'GUC':'V', 'GUA':'V', 'GUG':'V', 'ACU':'T', 'ACC':'T', 'ACA':'T',
    'ACG':'T', 'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'AAU':'N',
    'AAC':'N', 'GAU':'D', 'GAC':'D', 'UGU':'C', 'UGC':'C', 'CAA':'Q',
    'CAG':'Q', 'GAA':'E', 'GAG':'E', 'CAU':'H', 'CAC':'H', 'AAA':'K',
    'AAG':'K', 'UUU':'F', 'UUC':'F', 'UAU':'Y', 'UAC':'Y', 'AUG':'M',
    'UGG':'W',
    'UAG':'Stop', 'UGA':'Stop', 'UAA':'Stop'
}


def readfasta(lines):
    '''Parse the lines of a FASTA file into a list of headers and a list of sequences.'''
    seq = []
    index = []
    seqplast = ""
    numlines = 0
    for i in lines:
        if '>' in i:
            index.append(i.replace("\n", "").replace(">", ""))
            seq.append(seqplast.replace("\n", ""))
            seqplast = ""
            numlines += 1
        else:
            seqplast = seqplast + i.replace("\n", "")
            numlines += 1
        if numlines == len(lines):
            seq.append(seqplast.replace("\n", ""))
    seq = seq[1:]  # drop the empty entry appended before the first header
    return index, seq


def trans(seq):
    '''Translate an RNA sequence into a peptide.'''
    i = 0
    p = ""
    while i < len(seq) // 3:  # integer division ignores an incomplete trailing codon
        n = seq[3 * i:3 * i + 3]
        r = codon_table[n]
        i += 1
        p = p + r
    return p


with open('input.txt', 'r') as f:
    lines = f.readlines()

index, seq = readfasta(lines)
seq = seq[0]
c = ''

for i in range(len(seq)):  # build the complement of the original sequence
    if seq[i] == 'A':
        c += 'T'
    elif seq[i] == 'G':
        c += 'C'
    elif seq[i] == 'T':
        c += 'A'
    elif seq[i] == 'C':
        c += 'G'

rseq = c
rseq = rseq[::-1]  # reverse it to get the reverse complement

seq = seq.replace('T','U')
rseq = rseq.replace('T','U')  # transcribe the DNA into RNA

start = 'AUG'
stop = ['UAG', 'UGA', 'UAA']
i = 0
j = 0  # j remembers the scan position while i walks along an ORF
result = []
while i < len(seq) - 2:
    if seq[i:i+3] == start:  # look for a start codon
        j = i
        sequence = ""
        while i < len(seq) - 2:
            if seq[i:i+3] in stop:  # look for a stop codon
                result.append(sequence)
                break
            sequence = sequence + seq[i:i+3]
            i += 3
    i = j + 1  # resume scanning one base after the last checked position
    j += 1

i = 0
j = 0
while i < len(rseq) - 2:  # same procedure on the reverse complement; better wrapped in a function for reuse
    if rseq[i:i+3] == start:
        j = i
        sequence = ""
        while i < len(rseq) - 2:
            if rseq[i:i+3] in stop:
                result.append(sequence)
                break
            sequence = sequence + rseq[i:i+3]
            i += 3
    i = j + 1
    j += 1

result2 = []
for i in result:
    if i not in result2:
        result2.append(i)
result = result2  # use result2 as a temporary list to drop duplicate sequences
# print(result)

i = 0
proteins = []
while i < len(result):  # translate each sequence into a peptide
    protein = trans(result[i])
    if protein not in proteins:  # different RNA sequences can encode the same peptide
        proteins.append(protein)
        print(protein)
    i += 1

f = open('output.txt', 'w')  # write to a file for easy submission; 'w' keeps results from earlier runs out
i = 0
while i < len(proteins):
    f.write(proteins[i])
    f.write('\n')
    i += 1
f.close()
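
As the comment in the search loop suggests, the scan can be wrapped in a function and reused for both strands. The following is one possible sketch, not the original solution: the names find_proteins and orf_proteins are invented here, the codon table is rebuilt in a compact form with '*' marking stop codons, and the input is assumed to contain only A, C, G and T. It translates while scanning and collects results in a set, so duplicates disappear without a separate pass.

```python
from itertools import product

# Compact standard codon table: codons enumerated in U, C, A, G order,
# '*' marks a stop codon (assumption: standard genetic code).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_proteins(rna):
    """Return the set of peptides from every ORF on one RNA strand."""
    proteins = set()
    for start in range(len(rna) - 2):
        if rna[start:start + 3] != "AUG":
            continue
        peptide = []
        # Read codon by codon from the start codon onward.
        for i in range(start, len(rna) - 2, 3):
            amino = CODON_TABLE[rna[i:i + 3]]
            if amino == "*":
                # A stop codon closes the ORF; keep the peptide.
                proteins.add("".join(peptide))
                break
            peptide.append(amino)
        # If the loop ends without a stop codon, the peptide is discarded.
    return proteins


def orf_proteins(dna):
    """Collect distinct peptides from both strands of a DNA string."""
    complement = str.maketrans("ACGT", "TGCA")
    forward = dna.replace("T", "U")
    reverse = dna.translate(complement)[::-1].replace("T", "U")
    return find_proteins(forward) | find_proteins(reverse)


sample = ("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGG"
          "AATAAGCCTGAATGATCCGAGTAGCATCTCAG")
for protein in sorted(orf_proteins(sample)):
    print(protein)
```

On the sample dataset this prints the same four strings as the sample output, in sorted order.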