二代測(cè)序數(shù)據(jù)篩選序列
一般來(lái)說(shuō)二代測(cè)序數(shù)據(jù)的數(shù)據(jù)量是非常龐大的,其中也包含了一些我們不需要的數(shù)據(jù),經(jīng)過(guò)fastp軟件篩選之后,還存在大量不符合我們期望的數(shù)據(jù),那如何進(jìn)一步篩選我們想要的數(shù)據(jù)呢,今天就讓小云帶大家看看數(shù)據(jù)是怎么被進(jìn)一步篩選的吧。
下面小云將用堿基編輯的二代測(cè)序數(shù)據(jù)做一個(gè)詳細(xì)的步驟介紹。首先我們要擁有堿基編輯前的原序列信息,這樣才能對(duì)我們編輯后的信息做對(duì)比,原信息如圖所示。

import pandas as pd
flank = pd.read_csv("./TCDG_TDGNG_CDGNGG-第三版.csv", index_col=None, header=0,sep = ',',encoding="utf-8")
flank
擁有序列原信息后,就能從編輯后的序列中挑選我們想要的序列了,我們拿五組已經(jīng)經(jīng)過(guò)barcode分組的序列信息,根據(jù)barcode分組的詳細(xì)步驟已經(jīng)在上一篇文章中講述,還沒(méi)有掌握的小伙伴記得去看哦,我們將這五組的數(shù)據(jù)進(jìn)行篩選。

篩選的原理就是我們利用GN20與N20的上下游序列不變,來(lái)根據(jù)上下游序列定位出GN20與N20,將定位出來(lái)的信息與原表格信息進(jìn)行對(duì)比,我們規(guī)定如果對(duì)比相似度在0.7以上則留下該序列。
from Bio.Seq import Seq#導(dǎo)入庫(kù)
from fuzzysearch import find_near_matches#序列對(duì)比
from difflib import SequenceMatcher#導(dǎo)入庫(kù)
def similarity(a, b):#相似度對(duì)比
????return SequenceMatcher(None, a, b).ratio()
import time#計(jì)時(shí)
GN20_list = list(flank['GN20'])#提取表格信息
for sample_barcode in ['A1','A2','A3','A4','A5']:#遍歷每個(gè)txt文件
????start = time.time()
????B4 = open("./1_%s.txt"%sample_barcode)
????N20 = open('./1_%s_N20.txt'%sample_barcode,'w')#創(chuàng)建新的txt,寫(xiě)入結(jié)果
????i = 0
????o = 0
????k = 0
????for line in B4:#遍歷每條序列
????????gN20_right = 'GTTTTAGAGCTAGAAATA'
????????N20_right = 'AAAGAATTCTCGACCT'
????????gN20_left = 'GTGGAAAGGACGAAACACC'
????????gN20_left_index = find_near_matches(gN20_left, line, max_l_dist=2)#序列對(duì)比
????????if gN20_left_index:
????????????gN20_right_index = find_near_matches(gN20_right, line, max_l_dist=2)
????????????if gN20_right_index: ?????????
????????????????gN20_left_end = gN20_left_index[0].end ???
????????????????gN20_right_start = gN20_right_index[0].start
????????????????gn20 = line[gN20_left_end:gN20_right_start]#確定GN20
????????????????if gn20 in GN20_list:
????????????????????o+=1 ?????????????
????????????????????N20_right_index = find_near_matches(N20_right, line, max_l_dist=2)
????????????????????if N20_right_index: ???????
????????????????????????N20_right_start = N20_right_index[0].start
????????????????????????#ACGT relative position
????????????????????????N20_left_index = line[N20_right_start-29:N20_right_start-17].rfind('ACGT')
????????????????????????if N20_left_index>=0:
????????????????????????????N20_left_end = N20_left_index + N20_right_start-29
????????????????????????????n20 = line[N20_left_end+4:N20_right_start-3] ?#確定N20
????????????????????????????if similarity(gn20,n20)>0.7:#相似度
????????????????????????????????N20.write(gn20+','+n20+','+line)
????????????????????????????????k+=1
????????if i%500000==0:#程序進(jìn)程
????????????print(sample_barcode,i,o,k)
????????i+=1
????end = time.time()
????print(end-start)#記錄時(shí)間
????B4.close()
N20.close()
當(dāng)腳本運(yùn)行結(jié)束,我們就得到了符合條件的所有序列。

好了,今天小云的分享就到這里啦,小伙伴們有什么問(wèn)題歡迎來(lái)和小云討論分享呀。

