tRNAscan-SE Output Legend and Search Methods

Output Format

Tabular Output Format

Sequence                tRNA Bounds     tRNA    Anti    Intron Bounds   Cove
Name            tRNA #  Begin   End     Type    Codon   Begin   End     Score
--------        ------  -----   ---     ----    -----   -----   -----   -----
CELF22B7        1       12619   12738   Leu     CAA     12657   12692   60.01 
CELF22B7        2       19480   19561   Ser     AGA     0       0       80.44 
CELF22B7        3       26367   26439   Phe     GAA     0       0       80.32 
CELF22B7        4       26992   26920   Phe     GAA     0       0       80.32 
CELF22B7        5       23765   23694   Pro     CGG     0       0       75.76

Each new tRNA in a sequence is consecutively numbered in the "tRNA #" column. "tRNA Bounds" specify the starting (5') and ending (3') nucleotide bounds for the tRNA. tRNAs found on the reverse (lower) strand are indicated by having the Begin (5') bound greater than the End (3') bound (see tRNAs #4 & #5 in output above).

The "tRNA Type" is the amino acid predicted to be charged to the tRNA molecule, based on the predicted "Anticodon" (written 5'->3') displayed in the next column. tRNAs that fit the criteria for potential pseudogenes (poor primary or secondary structure; see Pseudogene Detection) are marked with "Pseudo" in the "tRNA Type" column. If there is a predicted intron in the tRNA, the next two columns give its nucleotide bounds; if there is no predicted intron, both columns contain zero. The final column is the Cove score for the tRNA in bits. This score varies somewhat depending on the tRNA covariance model used in the analysis (the search mode selects the model: eukaryote-specific, prokaryote-specific, archaeal-specific, or general). tRNAscan-SE counts any sequence that attains a score of >= 20.0 bits as a tRNA (based on empirical studies conducted by Eddy & Durbin, 1994).
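As an illustration of how the columns described above can be read programmatically, here is a minimal Python sketch. The names TrnaHit, parse_row, and COVE_CUTOFF are hypothetical (they are not part of tRNAscan-SE), and the sketch assumes whitespace-separated rows whose first nine fields follow the column order of the sample table, with no whitespace inside the sequence name.

```python
from dataclasses import dataclass

COVE_CUTOFF = 20.0  # bits; the tRNA cutoff quoted in the text


@dataclass
class TrnaHit:
    """One row of the tabular output (hypothetical container, for illustration)."""
    seq_name: str
    number: int
    begin: int
    end: int
    trna_type: str
    anticodon: str
    intron_begin: int
    intron_end: int
    score: float

    @property
    def strand(self) -> str:
        # Reverse-strand tRNAs are reported with Begin > End.
        return "-" if self.begin > self.end else "+"

    @property
    def length(self) -> int:
        # Bounds are inclusive, so add 1.
        return abs(self.end - self.begin) + 1

    @property
    def has_intron(self) -> bool:
        # Both intron columns are zero when no intron is predicted.
        return not (self.intron_begin == 0 and self.intron_end == 0)

    @property
    def is_pseudogene(self) -> bool:
        return self.trna_type == "Pseudo"

    @property
    def passes_cutoff(self) -> bool:
        return self.score >= COVE_CUTOFF


def parse_row(line: str) -> TrnaHit:
    """Parse one data row; only the first nine fields are used."""
    f = line.split()
    return TrnaHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                   int(f[6]), int(f[7]), float(f[8]))


rows = [
    "CELF22B7        1       12619   12738   Leu     CAA     12657   12692   60.01",
    "CELF22B7        4       26992   26920   Phe     GAA     0       0       80.32",
]
hits = [parse_row(r) for r in rows]
for h in hits:
    print(h.number, h.strand, h.length, h.has_intron, h.passes_cutoff)
```

Header and separator lines would need to be skipped before calling parse_row; that step is left out to keep the sketch short.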

Secondary Structure Format

The first line contains the sequence name, tRNA number, tRNA bounds (in parentheses), and length of the tRNA. The next line contains the isoacceptor tRNA Type, the Anticodon (with tRNA-relative and sequence-absolute bounds), and the Cove Score. This is the same information shown in the tabular output format, apart from the anticodon bounds, which the tabular format does not report. The next line contains ruler marks ("*" and "|") at 5 bp intervals to ease position identification in the tRNA sequence that appears on the following line. On the sequence line, nucleotides matching the "consensus" tRNA model used in Cove analysis appear in upper case, while introns and other nucleotides in non-conserved positions are printed in lower case. The last line contains the predicted secondary structure of the tRNA, with nested ">" and "<" symbols representing base pairings. The various tRNA features are labelled in this example.
CELF22B7.trna4 (26992-26920)    Length: 73 bp
Type: Phe       Anticodon: GAA at 34-36 (26959-26957)   Score: 73.88

         *    |    *    |    *    |    *    |    *    |    *    |    *    |
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |        D-stem/loop       Anticodon            TPC stem/loop    |
        |                           stem/loop                            |
                                Isoacceptor stem
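
The nested ">" and "<" notation can be turned into explicit base pairs with a simple stack. The following Python sketch is illustrative only: base_pairs is a hypothetical helper, not a tRNAscan-SE function, and STRUCTURE is rebuilt piece by piece from the "Str:" line of the example above so that each arm is visible.

```python
from typing import List, Tuple

# The "Str:" line from the example, assembled arm by arm (73 positions).
STRUCTURE = "".join([
    ">" * 7, "..",                     # outermost stem, 5' strand
    ">" * 4, "." * 8, "<" * 4, ".",    # D-stem/loop
    ">" * 5, "." * 7, "<" * 5,         # anticodon stem/loop
    "." * 5,                           # unpaired region between arms
    ">" * 5, "." * 7, "<" * 5,         # TPC stem/loop
    "<" * 7, ".",                      # outermost stem, 3' strand, plus final unpaired position
])


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return 1-based (5' position, 3' position) pairs from a ">"/"<" string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == ">":
            stack.append(pos)          # opening half of a pair
        elif symbol == "<":
            if not stack:
                raise ValueError(f"unmatched '<' at position {pos}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '>' at position {stack[-1]}")
    return sorted(pairs)


pairs = base_pairs(STRUCTURE)
print(len(STRUCTURE), len(pairs))
print(pairs[0])  # outermost pair
```

In this example the anticodon (positions 34-36) falls inside the seven unpaired positions of the anticodon loop, which is a quick sanity check on the parsed structure.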

tRNAscan-SE Detection Algorithm

tRNAscan-SE does no tRNA detection itself; instead, it combines the strengths of three independent tRNA prediction programs by negotiating the flow of information between them, performing a limited amount of post-processing, and outputting the results. The program works in three main phases. In the first phase, it runs two independent tRNA detection programs on the input DNA sequence. These relatively fast, first-pass detection programs are a modified, optimized version of tRNAscan 1.3 (1) and EufindtRNA, an implementation of a previously described tRNA search algorithm (3).

tRNAscan 1.3 detects tRNAs by first looking for short, well-conserved intragenic promoter sequences (A and B boxes in eukaryotes) found in the TPC and D arm regions of prototypic tRNAs. Once a specific number of nucleotides in the sequence match the consensus promoter (defined by an arbitrary score threshold), the program progressively attempts to identify the various stem-loop structures found in the tRNA "cloverleaf". As each arm is identified by the presence of base pairing in the stem, correct loop size, and several invariant and semi-invariant bases, a "general score" counter is incremented. If the final score exceeds an empirically determined threshold, the tRNA location, anticodon, and type are saved.

EufindtRNA, on the other hand, searches only for linear sequence signals. A step-wise algorithm uses newly developed log-odds score matrices to first identify A and B box promoter elements that exceed an empirically determined cutoff. The scores for these A and B boxes are then added to a log-odds score for the nucleotide distance between the two boxes to produce an intermediate score. Finally, a log-odds score for the distance to the nearest downstream poly-T pol III termination signal is added to the intermediate score to obtain a final score. If the final score is above the final score cutoff, the tRNA identity and location are saved. tRNAscan-SE uses a less selective version of this algorithm that does not look for pol III termination signals and therefore applies its cutoff to the intermediate score. In addition, the intermediate score cutoff is loosened slightly relative to the one described in the original algorithm (3). These modifications increase EufindtRNA's sensitivity but greatly reduce its selectivity. This does not reduce the final selectivity of tRNAscan-SE, because a secondary filter (Cove) eliminates false positives. The sensitivity of EufindtRNA is roughly comparable to that of tRNAscan 1.3, but the two programs appear to be complementary: EufindtRNA tends to identify tRNAs missed by tRNAscan 1.3, and vice versa (3). tRNAscan-SE takes advantage of this by saving the results from both tRNAscan 1.3 and EufindtRNA and merging them into one non-redundant list of "candidate" tRNA identifications.
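
The merging step at the end of the first phase can be pictured as an interval union. The sketch below is not tRNAscan-SE's actual code: merge_candidates is a hypothetical function, and it assumes (the text does not say this) that two hits are redundant when they overlap on the same strand, in which case they are replaced by a single hit spanning both. Hits are given as (start, end, strand) with start <= end.

```python
from typing import List, Tuple

Hit = Tuple[int, int, str]  # (start, end, strand), with start <= end


def merge_candidates(hits_a: List[Hit], hits_b: List[Hit]) -> List[Hit]:
    """Combine two hit lists into one non-redundant candidate list.

    Assumption for illustration: overlapping hits on the same strand
    are treated as the same candidate and merged into their union.
    """
    merged: List[Hit] = []
    # Sort by strand first so hits on different strands are never adjacent
    # candidates for merging, then by coordinates.
    for start, end, strand in sorted(hits_a + hits_b,
                                     key=lambda h: (h[2], h[0], h[1])):
        if merged and merged[-1][2] == strand and start <= merged[-1][1]:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), strand)
        else:
            merged.append((start, end, strand))
    return merged


# Made-up coordinates standing in for the two first-pass programs.
tscan_hits = [(100, 172, "+"), (500, 580, "+")]
eufind_hits = [(105, 175, "+"), (900, 972, "-")]
candidates = merge_candidates(tscan_hits, eufind_hits)
print(candidates)
```

Because each program misses tRNAs the other finds, the union is what matters here; false positives introduced by the looser EufindtRNA cutoff are left for the Cove filter in the next phase.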

In the second phase, tRNAscan-SE extracts the DNA subsequences identified as possible tRNAs and passes only these segments to an RNA search program in the Cove program suite (covels) for analysis. Cove programs look for tRNAs in a very different way. A probabilistic model for tRNA has been developed by aligning known tRNAs and giving a base-specific probability score to every nucleotide in the tRNA model. In addition, Cove captures RNA secondary structure information using a type of language referred to as a stochastic context-free grammar (SCFG). Cove applies this probabilistic model to the entire windowed sequence and produces a score, in bits, for how well the sequence matches the tRNA model. If the score is at least 20.0 bits, the candidate is considered a true tRNA (based on empirical studies in ref. 2).

In the final phase, tRNAscan-SE takes the candidates confirmed as tRNAs and runs another Cove program (coves) that displays RNA secondary structure. The tRNA type is predicted by identifying the anticodon within the structure output. Introns are also identified automatically from the structure output as runs of five or more consecutive non-consensus nucleotides within the anticodon loop.
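
The intron rule above can be sketched with a regular expression, relying on the convention that non-consensus nucleotides are printed in lower case. This is an illustration under assumptions, not the program's implementation: find_intron is a hypothetical helper, the anticodon loop bounds are supplied by the caller rather than derived from the structure line, and the sequences are made up.

```python
import re
from typing import Optional, Tuple


def find_intron(seq: str, loop_start: int, loop_end: int,
                min_run: int = 5) -> Optional[Tuple[int, int]]:
    """Return 1-based inclusive bounds of the first run of at least
    `min_run` lower-case (non-consensus) nucleotides inside the region
    loop_start..loop_end, or None if there is no such run."""
    region = seq[loop_start - 1:loop_end]
    match = re.search(r"[a-z]{%d,}" % min_run, region)
    if match is None:
        return None
    # Convert offsets within the region back to 1-based sequence positions.
    return (loop_start + match.start(), loop_start + match.end() - 1)


# Toy sequences (not real tRNAs): upper case = consensus, lower case = non-consensus.
with_intron = "GGCCTTGAAacgtacgtAATCC"
without_intron = "GGCCaacgTT"

print(find_intron(with_intron, 1, len(with_intron)))
print(find_intron(without_intron, 1, len(without_intron)))
```

The bounds returned here are relative to the tRNA; converting them to the sequence-absolute intron bounds shown in the tabular output would additionally require the tRNA's Begin coordinate and strand.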

Pseudogene Detection

tRNAscan-SE uses heuristics to try to distinguish pseudogenes from true tRNAs, primarily on the basis of a lack of tRNA-like secondary structure. A second tRNA covariance model was created from the original 1415-tRNA alignment, under the constraint that no secondary structure is conserved (this model is effectively just a sequence profile, or hidden Markov model (HMM)). Subtracting a tRNA's similarity score to the primary structure-only model ("HMM Score" column) from its score under the complete tRNA model yields a secondary structure-only score ("2'Str Score" column). We have observed that tRNAs with low scores for either component of the total score are often pseudogenes. Thus, tRNAs are marked as likely pseudogenes if they score less than 10 bits for the primary sequence component of the total score (HMM Score) or less than 5 bits for the secondary structure component (2'Str Score). Selenocysteine tRNAs are not checked by these rules, since they have atypical primary and secondary structure. Also, use of the -O option (search for organellar tRNAs) disables pseudogene checking, since these criteria are geared towards detecting cytoplasmic pseudogenes (some true non-eukaryotic tRNAs are marked as pseudogenes by this analysis).
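
The pseudogene rule reduces to one subtraction and two comparisons. The Python sketch below restates it; is_likely_pseudogene and its parameters are hypothetical names, and the "SeC" type label used to recognize selenocysteine tRNAs is an assumption about how that type is written in the output.

```python
HMM_CUTOFF = 10.0   # bits, primary sequence component
STR_CUTOFF = 5.0    # bits, secondary structure component


def secondary_structure_score(total_score: float, hmm_score: float) -> float:
    """2'Str Score = complete-model score minus primary-structure-only score."""
    return total_score - hmm_score


def is_likely_pseudogene(total_score: float, hmm_score: float,
                         trna_type: str, organellar_mode: bool = False) -> bool:
    """Apply the two cutoffs described in the text.

    The check is skipped for selenocysteine tRNAs (assumed here to be
    labelled "SeC") and when the organellar search option is in use.
    """
    if organellar_mode or trna_type == "SeC":
        return False
    str_score = secondary_structure_score(total_score, hmm_score)
    return hmm_score < HMM_CUTOFF or str_score < STR_CUTOFF


# Made-up scores for illustration.
print(is_likely_pseudogene(60.0, 40.0, "Leu"))   # both components high
print(is_likely_pseudogene(25.0, 8.0, "Ala"))    # weak primary sequence score
print(is_likely_pseudogene(30.0, 27.0, "Gly"))   # weak secondary structure score
```

Note that a candidate can pass the 20.0-bit tRNA cutoff on its total score and still be flagged here, because the rule looks at each component separately.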

tRNA with Anticodon CAT

The current version of tRNAscan-SE cannot distinguish among initiator tRNA-Met, elongator tRNA-Met, and tRNA-Ile with anticodon CAT. The annotations presented in this database therefore do not distinguish these tRNAs. This feature may be included in a future version of tRNAscan-SE.

For more details on the program algorithm & implementation, see the Nucleic Acids Research paper (Lowe & Eddy, 1997).