STMLST: Serotype Identification By Multi-loci Sequence Typing
2 Methods
The analysis procedure of STMLST is depicted in Figure 1. STMLST first converts the input file to FASTA format and maps the input sequences against an allele sequence database. After parsing the mapping result, STMLST obtains formatted data that can be used to identify a list of candidate organisms. STMLST assigns a high score to an organism if the “Subject sequence length”, “Alignment length” and “Number of identical matches” fields in the formatted data are all equal, and a low score if they are not. This yields a list of organisms with their corresponding scores. These steps rest on the following principle: if the input sequences are highly similar to the alleles of an organism, they most likely belong to that organism. STMLST then uses the highest-scoring organism to construct a search statement and queries the sequence type and serotype databases with it. Finally, STMLST outputs the subtyping result for the input sequences. Data collection and the algorithm are described in detail in the subsections below.
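The exact-match rule described above can be sketched in Python. This is a minimal illustration, not the STMLST implementation. It assumes BLAST+ tabular output requested with `-outfmt "6 qseqid sseqid slen length nident"`, a hypothetical subject ID layout of `organism|locus_allele`, and placeholder high and low scores of 1.0 and 0.1, since the text does not state the actual values.

```python
from collections import defaultdict

# Assumed column order, as produced by:
#   blastn ... -outfmt "6 qseqid sseqid slen length nident"
FIELDS = ("qseqid", "sseqid", "slen", "length", "nident")

HIGH_SCORE = 1.0  # assumed value for an exact, full-length allele match
LOW_SCORE = 0.1   # assumed value for any other hit


def parse_hits(lines):
    """Yield one dict per non-empty, tab-separated BLAST result line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        for key in ("slen", "length", "nident"):
            row[key] = int(row[key])
        yield row


def score_organisms(hits):
    """Add a high or low score per hit, grouped by organism."""
    scores = defaultdict(float)
    for hit in hits:
        organism = hit["sseqid"].split("|", 1)[0]  # hypothetical ID layout
        # High score only when subject length, alignment length and
        # number of identical matches are all equal.
        exact = hit["slen"] == hit["length"] == hit["nident"]
        scores[organism] += HIGH_SCORE if exact else LOW_SCORE
    return dict(scores)


# Invented example lines, for illustration only.
example = [
    "contig1\tsalmonella|aroC_1\t501\t501\t501",
    "contig2\tsalmonella|dnaN_2\t501\t501\t499",
    "contig3\tecoli|adk_10\t536\t400\t390",
]
scores = score_organisms(parse_hits(example))
best = max(scores, key=scores.get)
```

Here `best` is the organism whose alleles collected the highest total score, which is the one that would be used to build the database query.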
2.1 Data Preprocessing
The data required to run the full functionality of STMLST fall into three parts: a key alleles database for finding similar key alleles, a sequence type database for finding sequence types from key alleles, and a serotype database for finding the serotypes that correspond to a sequence type. All three are downloaded from PUBMLST, and the relationship between them is shown in Figure 2. The key alleles database consists of key alleles downloaded from more than one hundred organisms. We use local scripts to download these gene sequences and build a BLAST index so that similar key alleles can be found by fast alignment. The sequence type database stores the mapping from combinations of key alleles to the sequence types of each organism; we download it with a local script and store it in the SQLite database. The mapping between serotypes and sequence types is not one-to-one; we extract it from PUBMLST and store it in the SQLite database.
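One possible layout for the two SQLite tables is sketched below with Python's standard `sqlite3` module. The table names, column names, the `profile_key` helper and all rows are hypothetical and chosen only for illustration; they are not taken from STMLST or from PUBMLST records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence_type (
    organism TEXT NOT NULL,
    st       INTEGER NOT NULL,
    profile  TEXT NOT NULL,      -- canonical "locus:allele" string
    PRIMARY KEY (organism, st)
);
CREATE TABLE serotype (
    organism TEXT NOT NULL,
    st       INTEGER NOT NULL,
    serotype TEXT NOT NULL       -- several rows per ST are allowed
);
""")


def profile_key(alleles):
    """Canonical string for a {locus: allele_number} mapping."""
    return ";".join(f"{locus}:{alleles[locus]}" for locus in sorted(alleles))


# Placeholder rows, not real typing data.
conn.execute("INSERT INTO sequence_type VALUES (?, ?, ?)",
             ("demo_organism", 1, profile_key({"locusA": 1, "locusB": 3})))
conn.executemany("INSERT INTO serotype VALUES (?, ?, ?)",
                 [("demo_organism", 1, "serotype_X"),
                  ("demo_organism", 1, "serotype_Y")])


def find_sequence_type(conn, organism, alleles):
    """Return the ST whose allele profile matches exactly, or None."""
    row = conn.execute(
        "SELECT st FROM sequence_type WHERE organism = ? AND profile = ?",
        (organism, profile_key(alleles))).fetchone()
    return row[0] if row else None


def find_serotypes(conn, organism, st):
    """Return every serotype recorded for one sequence type."""
    rows = conn.execute(
        "SELECT serotype FROM serotype WHERE organism = ? AND st = ? "
        "ORDER BY serotype", (organism, st)).fetchall()
    return [r[0] for r in rows]


st = find_sequence_type(conn, "demo_organism", {"locusB": 3, "locusA": 1})
serotypes = find_serotypes(conn, "demo_organism", st)
```

Keeping serotypes in a separate table with one row per (sequence type, serotype) pair is one simple way to represent a mapping that is not one-to-one.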

2.2 Identification Strategy
We first align the input sequences against the key alleles database, record the key alleles that align successfully, and assign each record one of three states according to the degree of similarity of the alignment. After all records are marked, each candidate organism is given a score based on the marks. The scoring rules are shown in Figure 3. In Equation 1, x is the number of different alleles found to be similar for a given allele; the larger x is, the higher the final score f. θ is the weight assigned to each degree of similarity; a larger θ corresponds to greater similarity and therefore a higher final score f. The resulting s is the score for the alignment result of one allele of the organism. According to Equation 2, the scores of all alleles are accumulated into the final score f, and STMLST takes the organism with the highest f as the most likely source of the input sequencing data. This organism is then looked up in the sequence type database using the recorded key alleles, and the sequence type to which the input sequencing data may belong is obtained from the mapping between key alleles and sequence types. Finally, the possible serotypes are obtained by searching the serotype database with that sequence type. Because the serotype data are not yet complete, we use SeqSero2 as a supplement: when insufficient data lead to a null result for Salmonella serotype identification, STMLST imports the serotype identified by SeqSero2. Combining the two subtyping implementations in this way can improve identification accuracy.
[Equation 1: per-allele score s, defined from x and θ] (1)

$$f = \sum_{i} s_i \qquad (2)$$
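The scoring and fallback steps can be illustrated with a short Python sketch. It is not the STMLST implementation. It assumes the simplest per-allele score consistent with the description, s = θ·x; the state names and weights in `THETA` are hypothetical stand-ins for the rules in Figure 3; and the SeqSero2 result is passed in as an already computed string rather than obtained by running the tool.

```python
# Hypothetical similarity states and weights (theta); the actual values
# are defined by the scoring rules in Figure 3.
THETA = {"exact": 1.0, "high": 0.5, "low": 0.1}


def allele_score(state, x):
    """Assumed per-allele score s = theta * x."""
    return THETA[state] * x


def organism_score(records):
    """Final score f: sum of s over all allele records of one organism.

    `records` is a list of (state, x) pairs, one per key allele.
    """
    return sum(allele_score(state, x) for state, x in records)


def most_likely_organism(candidates):
    """Return the candidate organism with the highest final score f."""
    return max(candidates, key=lambda name: organism_score(candidates[name]))


def resolve_serotypes(db_serotypes, seqsero2_serotype=None):
    """Prefer database serotypes; fall back to a SeqSero2 result if empty."""
    if db_serotypes:
        return list(db_serotypes)
    return [seqsero2_serotype] if seqsero2_serotype else []


# Invented records, for illustration only.
candidates = {
    "organism_A": [("exact", 1), ("exact", 1), ("high", 2)],
    "organism_B": [("high", 1), ("low", 3)],
}
best = most_likely_organism(candidates)
```

With these invented records, `organism_A` accumulates the larger score and would be the organism used for the sequence type lookup. In `resolve_serotypes`, the SeqSero2 value is used only when the database lookup returns nothing, matching the supplementary role described for Salmonella.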
