snoGPS README

Using the snoGPS Web Server

The snoGPS server is accessed via the Lowe Lab Webserver Interface at http://lowelab.ucsc.edu/snoGPS/.

User Interface

The snoGPS interface consists of four major components:
* Search mode selection
* Query sequence selection
* Target sequence selection
* Configuration of search-mode and options for displaying results

The search mode determines which probabilistic model to use in searches. Each model is based on snoRNA training data from selected species or phylogenetic groups (i.e., mammals, yeasts, archaea). If no explicit model for the species of interest is available in the user interface, specifying either a general model or a model from a related species generally yields good results. Different search modes can offer varying speed and sensitivity.

For snoGPS, search mode selection involves specifying either single hairpin (“one-stem”) mode or double hairpin (“two-stem”) mode. Because eukaryotic H/ACA snoRNAs have two hairpins, two-stem mode searches have greater specificity and are therefore usually preferable. However, in archaea, several known H/ACA RNAs have only a single hairpin and would be missed in two-stem mode. Moreover, in certain eukaryotic species (e.g. S. cerevisiae), one hairpin of some known snoRNAs has a highly irregular structure that the two-stem scanner fails to detect. In these situations, a scan in one-stem mode can be effective when the two-stem scanner yields no results.

Query sequence selection is used to specify the sequences to be searched for snoRNAs. Raw or formatted sequence data can be pasted directly into the query sequence box or uploaded from a local file. Sample query sequence data is available at http://lowelab.ucsc.edu/snoGPS/U65.raw for demonstration purposes.

snoGPS also expects “target sequences”, i.e. sequences that may base-pair with the query snoRNA sequence and be modified under its guidance. Preloaded target sequences may be chosen, including rRNA from human, yeast, and other model organisms. Alternatively, the user can specify a custom target RNA sequence. As with the query sequence, a custom target sequence can be pasted into a box in either raw or formatted form, or uploaded from a file. When a custom target sequence is used, every uridine in the sequence is treated as a potential target by default. Alternatively, the user can restrict the search to a subset of the target sequence nucleotides by uploading a custom “pseudouridylation file” that indicates which nucleotides to use as target sites. A sample pseudouridylation file is available on the server at http://lowelab.ucsc.edu/snoGPS/puSiteFile.html. When pseudouridylation positions are known, restricting the search space to these sites decreases both search time and the number of false positive “hits”.
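To illustrate the default behavior described above, in which every uridine of a custom target sequence is a potential target site, the following minimal Python sketch lists the candidate positions in a short sequence. It is not part of snoGPS: the function name uridine_positions, the 1-based numbering, and the treatment of 'T' as uridine (for sequences given in the DNA alphabet) are assumptions made for illustration only.

```python
def uridine_positions(target_seq):
    """Return the positions of every U (or T) in a target sequence.

    Assumptions (not taken from snoGPS itself): positions are 1-based,
    whitespace and line breaks are ignored, and 'T' is treated as uridine.
    """
    seq = "".join(target_seq.split()).upper()
    return [i for i, base in enumerate(seq, start=1) if base in ("U", "T")]


# Toy sequence, not real rRNA: uridines sit at positions 3, 4 and 7.
print(uridine_positions("GAUUCGU"))
```

Each additional candidate position adds to the search space, which is why supplying known pseudouridylation sites shortens the search.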

The server also has a set of program-specific search and output-display options such as limits on the minimum number of base pairings in the guide region. In addition, the server has an adjustable cutoff score enabling tradeoffs between scan sensitivity and specificity. In most cases, the default parameter choices will be satisfactory and should be selected – especially by new users. However, more experienced users are able to exert some control over the program’s results by manipulating these parameters.

Output formats

snoGPS output includes a summary listing of the candidate hits followed by graphical representations of their predicted secondary structure. The summary listing for each candidate sequence includes:

* Query sequence name and snoRNA start and end positions within the query sequence
* snoGPS score
* Target sequence name and target uridine position
* Total number of base pairings and mismatches in the guide region
* Number of base pairs in the hairpin surrounding the guide region
* Annotation as to whether the guide region is in the 5' or 3' hairpin
* The length and strand of the candidate subsequence

snoGPS scores for known snoRNA sequences for various species are available on the website for comparison. The predicted secondary structure of the sequence is depicted in two ways. First, the entire sequence is displayed along with an “annotation” string identifying the various sequence motifs. Then the snoRNA guide region is shown in graphical format, together with its base pairing to the predicted target sequence.

Sample (abbreviated) snoGPS Output

>YourSeq.3 33.73 (70-135) Cmpl: H_sapiens_LSU.U4417 U65 Pairs: 11/0/10/6/c 66 nt (W)
CCCCAGCTTAGGAAACAGGGTTGTTCTTCATGTGGATGACTCTGTGCCGAAAGCATGGGAACAGCT
X12 56XLLLLLLI12345678I           I87654321IRRRRR X65 21X AAAAAA
              ---
             /   \
             \   /
              G-C
              G.U
              G-C
              A-U
              C-G
              A-U
5'-(N7)UAGGAA    GCCGA(N11)ACA
       ||||||    |||||
       AUCCUU CY CGGCU

Interpreting snoGPS scores:

Since snoGPS is a hybrid deterministic-probabilistic model, its "bit scores" do not have the direct probabilistic interpretation of a purely stochastic model. In particular, snoGPS scores depend on the specific model being used. By running snoGPS on known H/ACA snoRNAs with the selected search model and run parameters in the species of interest, the user can determine what range of scores to expect for actual H/ACA sequences. For example, running the default 2-stem yeast model against 40 rRNA pseudouridylation sites for which snoGPS is able to make a prediction yields the following scores (numbers add up to more than the number of yeast H/ACA snoRNAs since some snoRNAs have more than one target site):

Score Range	Number of known yeast snoRNAs
0 - 25	4
25 - 35	7
35 - 45	18
45 - 55	8
55 - 65	3

Next, the scores for 78 H/ACA RNAs found by snoGPS in humans are shown. In this case, since there is no experimental verification available for any target assignments, each snoRNA score listed is the highest one for any known pseudouridylation site. These scores used the mammalian 2-stem model.

Score Range	Number of human snoRNAs
0 - 35	8
35 - 45	28
45 - 55	34
55 - 65	8

In both of these examples, most known snoRNAs are found with scores greater than 35 bits using the two-stem scanner.

Scores with the one-stem scanners are typically lower. For example, testing S. cerevisiae H/ACA snoRNAs with the one-stem scanner gives the following range of scores:

Score Range (Yeast 1 stem scanner)	Number of yeast snoRNAs
0 - 20	4
20 - 25	10
25 - 30	9
30 - 35	13
35 - 40	2

As can be seen, the one-stem scores are considerably lower; a cutoff of around 20 is needed to detect nearly all the known yeast H/ACA snoRNAs with the one-stem scanner.
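The effect of a cutoff score can be read off the three tables above with a little arithmetic. The Python sketch below is only a worked example: the bin counts are copied from the tables, while the function name fraction_at_or_above and the restriction of cutoffs to bin boundaries are illustrative choices, not part of snoGPS.

```python
# Binned counts copied from the tables above: (lower bound, upper bound, count).
YEAST_TWO_STEM = [(0, 25, 4), (25, 35, 7), (35, 45, 18), (45, 55, 8), (55, 65, 3)]
HUMAN_TWO_STEM = [(0, 35, 8), (35, 45, 28), (45, 55, 34), (55, 65, 8)]
YEAST_ONE_STEM = [(0, 20, 4), (20, 25, 10), (25, 30, 9), (30, 35, 13), (35, 40, 2)]


def fraction_at_or_above(bins, cutoff):
    """Fraction of entries whose score bin starts at or above the cutoff.

    Only meaningful when the cutoff coincides with a bin boundary,
    because the tables report binned counts rather than raw scores.
    """
    total = sum(count for _, _, count in bins)
    kept = sum(count for low, _, count in bins if low >= cutoff)
    return kept / total


print(f"yeast two-stem, cutoff 35: {fraction_at_or_above(YEAST_TWO_STEM, 35):.1%}")  # 72.5%
print(f"human two-stem, cutoff 35: {fraction_at_or_above(HUMAN_TWO_STEM, 35):.1%}")  # 89.7%
print(f"yeast one-stem, cutoff 20: {fraction_at_or_above(YEAST_ONE_STEM, 20):.1%}")  # 89.5%
```

In these tables, a cutoff of 35 retains 29 of 40 yeast entries and 70 of 78 human entries with the two-stem scanner, and a cutoff of 20 retains 34 of 38 yeast entries with the one-stem scanner.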

snoGPS structure annotations:

Each snoGPS record begins with a header line, e.g.:
>YourSeq.3 33.73 (70-135) Cmpl: H_sapiens_LSU.U4417 U65 Pairs: 11/0/10/6/c 66 nt (W)

Data in the header line are:
* sequence name,
* snoGPS score,
* coordinates of the predicted snoRNA within the sequence,
* predicted target uridine,
* any annotation that may be associated with the target,
* the number of base pairings and mismatches between the target and the guide region,
* the number of base pairings in the upper (or 'internal') and lower (or 'external') stems adjacent to the target region,
* an 'h' or a 'c' indicating whether snoGPS found the guide region in the 5' ("h") or the 3' ("aCa") hairpin,
* predicted snoRNA length in nucleotides,
* and whether the candidate was found on the positive ('W') or reverse-complement ('C') strand.
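For users who post-process results, the header fields listed above can be pulled apart with a regular expression. The following Python sketch is written against the single example header shown in this README. The pattern, the function name parse_header, and the field names are assumptions for illustration; headers produced by other snoGPS versions or options may be laid out differently.

```python
import re

# Pattern written against the example header in this README only (an assumption).
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+"                         # sequence name
    r"(?P<score>-?\d+(?:\.\d+)?)\s+"              # snoGPS score
    r"\((?P<start>\d+)-(?P<end>\d+)\)\s+"         # snoRNA coordinates
    r"Cmpl:\s+(?P<target>\S+)\s+"                 # predicted target uridine
    r"(?P<annotation>.*?)\s*"                     # optional target annotation
    r"Pairs:\s+(?P<pairs>\d+)/(?P<mismatches>\d+)"
    r"/(?P<upper_stem>\d+)/(?P<lower_stem>\d+)/(?P<hairpin>[hc])\s+"
    r"(?P<length>\d+)\s+nt\s+"                    # snoRNA length
    r"\((?P<strand>[WC])\)\s*$"                   # strand
)

INT_FIELDS = ("start", "end", "pairs", "mismatches",
              "upper_stem", "lower_stem", "length")


def parse_header(line):
    """Split one snoGPS header line into a dict of typed fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized snoGPS header: {line!r}")
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    fields["score"] = float(fields["score"])
    return fields


example = (">YourSeq.3 33.73 (70-135) Cmpl: H_sapiens_LSU.U4417 U65 "
           "Pairs: 11/0/10/6/c 66 nt (W)")
record = parse_header(example)
print(record["score"], record["target"], record["hairpin"], record["strand"])
```

For the example header, the coordinates 70-135 span 66 positions, which agrees with the reported length of 66 nt.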

The snoGPS output includes a predicted secondary structure for each identified H/ACA snoRNA, for example:

CCCCAGCTTAGGAAACAGGGTTGTTCTTCATGTGGATGACTCTGTGCCGAAAGCATGGGAACAGCT
X12 56XLLLLLLI12345678I           I87654321IRRRRR X65 21X AAAAAA

In this structure annotation, 'HHHHHH' and 'CCCCCC' represent H and ACA motifs, respectively. ('AAAAAA' may indicate either an H or an ACA motif, for example when only a single hairpin is scanned, as in the example above.) 'LLLL' and 'RRRR' are the two parts of the rRNA guide-sequence motif. Bases annotated with an 'X', or with a single digit between a pair of X's, belong to the lower (or 'external') stem of either the 5' or 3' hairpin; bases annotated with an 'I', or with a single digit between a pair of I's, belong to the upper (or 'internal') stem. The digits are intended to make it easier to see which bases are predicted to pair within a stem. Where there is an internal mismatch within a stem or guide region, the corresponding number or letter is deleted from the annotation.

In addition to the above annotation of the entire predicted secondary structure, a graphical representation of the guide region and its matching target region is also included. Here is the guide representation output by snoGPS for the same sequence:

              ---
             /   \
             \   /
              G-C
              G.U
              G-C
              A-U
              C-G
              A-U
5'-(N7)UAGGAA    GCCGA(N11)ACA
       ||||||    |||||
       AUCCUU CY CGGCU
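The guide representation can be checked against the pairing counts in the header. The Python sketch below pairs the two guide segments from the diagram column by column with the target bases drawn beneath them and tallies the result. The function name classify_duplex and the separate tally for G-U wobble pairs are assumptions for illustration; how snoGPS itself scores wobble pairs is not stated in this README.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_duplex(guide_5to3, target_3to5):
    """Tally pair types between a guide segment and the target bases below it.

    The target is given in the orientation drawn in the diagram (3' to 5'),
    so bases are compared column by column.
    """
    if len(guide_5to3) != len(target_3to5):
        raise ValueError("strands must have equal length")
    counts = {"watson_crick": 0, "wobble": 0, "mismatch": 0}
    for pair in zip(guide_5to3.upper(), target_3to5.upper()):
        if pair in WATSON_CRICK:
            counts["watson_crick"] += 1
        elif pair in WOBBLE:
            counts["wobble"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The two guide segments and their paired target bases, copied from the diagram.
left = classify_duplex("UAGGAA", "AUCCUU")
right = classify_duplex("GCCGA", "CGGCU")
total_pairs = left["watson_crick"] + right["watson_crick"]
total_mismatches = left["mismatch"] + right["mismatch"]
print(total_pairs, total_mismatches)  # 11 0
```

The totals of 11 pairings and 0 mismatches agree with the "Pairs: 11/0" part of the header line for this candidate.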

Further information

The snoGPS algorithm is described in:
Schattner P, Decatur WA, Davis C, Ares M, Fournier MJ and Lowe TM (2004) "Genome-wide Searching for Pseudouridylation Guide snoRNAs: Analysis of the Saccharomyces cerevisiae genome", Nucleic Acids Res vol. 32(14): pp. 4281-4296

Additional information can be found in the documentation to the stand-alone version of the program available at:
http://lowelab.ucsc.edu/software/snoGPS-0.2.tar.gz