GeneSeqer@PlantGDB - Guided Tutorial

This step-by-step tutorial covers the basic concepts behind setting up a GeneSeqer@PlantGDB gene structure analysis, understanding the results of an analysis, and applying these results to further investigations. It illustrates the individual steps necessary to use the GeneSeqer@PlantGDB web service and offers tips to make its use more efficient and rewarding.

Setting up GeneSeqer@PlantGDB

Interpreting the results

Further investigation & refined analysis

Setting up GeneSeqer@PlantGDB

Step 1. Getting started

The GeneSeqer@PlantGDB web service is intended primarily for performing spliced alignment of query sequences (sequences representing transcribed genes, i.e. ESTs, cDNAs, and proteins) with a target sequence (genomic DNA). The results of such an alignment are useful in many studies. Because spliced alignment can address a number of questions, the process by which you use GeneSeqer@PlantGDB may vary. For instance, your choice of input sequences can be based on similarity to other sequences (both genomic and transcribed), or you may already possess an uncharacterized sequence you wish to know more about. In this example, the latter case applies. The form elements found in this tutorial are identical to those of the standard web service. You may wish to open GeneSeqer@PlantGDB in a new window and follow along in parallel with this tutorial; however, this is not required.

Step 2. Entering genomic (target) sequence

To demonstrate a typical use of this service, we have chosen to characterize a segment of genomic sequence from Sorghum bicolor (GenBank accession AF503433). This bacterial artificial chromosome (BAC) represents approximately 142,000 bases of genomic sequence. While this sequence could be pasted into the large text area below or even saved to a local file and uploaded, we have chosen to simply enter the accession number in the appropriate field as shown. To finish this step, we select the GenBank format. A detailed description of the format options can be found at the Select format link.

  • Enter sequence accession number

  • Select appropriate format

STEP 2: Input genomic DNA sequence

Sequences should be in the one-letter-code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}), upper or lower case; all other characters are ignored during input. Multiple sequence input is accepted in FASTA format (sequences separated by identifier lines of the form ">SQ;name_of_sequence comments") or in GenBank format.

Paste your genomic DNA sequence here:

... or upload your sequence file (specify file name):

... or type in the GenBank accession number of your sequence:

Select format: plain FASTA GenBank
Sequence name: (optional, used in plain sequence format only)
From position: to position: Strand: original reverse both
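
The input rules stated in the form above (a fixed set of one-letter codes, any case, all other characters ignored, identifier lines of the form ">SQ;name_of_sequence comments") can be mimicked locally before pasting or uploading a sequence. The following is a minimal Python sketch, not the server's own code: the function names and the 60-character line width are assumptions made for illustration, and the letter set is copied from the form text as written.

```python
# Letters the form lists as accepted one-letter codes (copied as written).
ALLOWED = set("abcghkmnrstuwy")


def clean_sequence(raw: str) -> str:
    """Keep only the listed letters (either case); drop digits, spaces, etc."""
    return "".join(ch for ch in raw if ch.lower() in ALLOWED)


def to_fasta(name: str, raw: str, comments: str = "", width: int = 60) -> str:
    """Build one FASTA record with an identifier line of the form '>SQ;name comments'.

    The line width of 60 is an arbitrary choice for this sketch.
    """
    seq = clean_sequence(raw)
    header = f">SQ;{name} {comments}".rstrip()
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + body)


if __name__ == "__main__":
    # Line numbers and spacing, as found in many sequence listings, are removed.
    print(to_fasta("demo", "  1 acgtacgt nnnn\n 13 ACGT", comments="toy example"))
```

Cleaning the sequence yourself makes it easier to confirm that the positions entered in the "From position" and "to position" fields refer to the bases you expect.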

Step 3. Entering transcribed (query) sequence(s)

While an exhaustive alignment of "All Plants" ESTs and cDNAs is possible, the size of our genomic sequence in this example calls for a more efficient approach. Generally, when characterizing a large genomic sequence (as is being demonstrated), the detection of genic regions is the primary goal. Only after their detection are these regions examined in greater detail. Note the options for alignment of specific data types as well as individual or logical species groups. These functions, as well as alignment of sequences of your own choosing, are discussed in the refined analysis section later in this tutorial.
For now, we choose to align only the representative TUG collection of all plants. This sequence collection represents the Tentative Unique Gene clusters assembled using the PlantGDB contiging method.

  • Select "All Plants" TUG database

STEP 3: Select or input cDNA/EST sequences

Spliced Alignment: The output will show an optimal threading of a significantly matching cDNA/EST sequence into the genomic DNA by aligning putative exons only and displaying putative introns as (long) gaps in the cDNA/EST. You can supply your own EST/cDNA sequence as well as your own protein sequence to do the spliced alignment. However, your own EST/cDNA sequence can NOT be combined with our pre-processed databases (EST, TUG, or cDNA).
Database EST TUG cDNA
All Plants
All Monocots
All Dicots
All Grasses
Arabidopsis thaliana
Avena sativa
Beta vulgaris subsp. vulgaris
Glycine max
Gossypium arboreum
Gossypium hirsutum
Hordeum vulgare
Lotus japonicus
Lycopersicon esculentum
Lycopersicon hirsutum
Lycopersicon pennellii
Marchantia polymorpha
Medicago sativa
Medicago truncatula
Mesembryanthemum crystallinum
Oryza sativa
Pinus taeda
Populus tremula X Populus tremuloides
Secale cereale
Solanum tuberosum
Sorghum bicolor
Sorghum propinquum
Triticum aestivum
Zea mays
OR paste your own EST/cDNA sequence(s) here in FASTA format
... and/or paste your own Protein sequence(s) here in FASTA format
... and/or upload your EST/cDNA sequence file (specify file name):
... and/or upload your Protein sequence file (specify file name):

Step 4. Choosing parameters and options

We now choose the maize splice site model, as it is the most closely related model available for Sorghum.

  • Select maize splice site model

STEP 1: Select splice site model

Species (selects species-specific splice site model)

Step 5. Submission

And finally our sequences are submitted. For a large sequence such as this, analysis may take as long as 30 min. Thus results are available via email as well. The results of this demonstration have been cached and are therefore immediately available by clicking the submit button.

  • Submit for processing

STEP 4: Submit job

Select "Submit" to send the job to the server. By default, output will be posted to your browser. You may choose to have the output sent to you by email instead. Selection of this option is advised in the rare case that output posting is slow due to server overload.
Click here to send the output to this email address:
HTML formatted output [default: simple text]. This option will not work if your mailer wraps long lines.

Interpreting the results

If necessary, click here to open the results of the GeneSeqer@PlantGDB analysis described above.

NOTE: For large files, most browsers will take a while to correctly process all location tags. This may cause links on the graphic to appear non-functional. Once the browser has completely loaded the page, however, all links will be fully functional.


The summary graphic

Large genomic sequences are broken into fragments of 60,000 bases for visualization. Each of these segments can be viewed by selecting it in the drop-down menu on the left of the results window. The corresponding graphic summary for each segment is displayed in the upper pane of the results window. The summary graphic is clickable; when you select a structure (colored arrow) within the graphic, the alignment file in the lower pane scrolls to the section dealing with the element represented by your selection. Colored arrows represent aligned sequences and predicted gene structures according to their unique color. In this example, red arrows represent predicted open reading frames; green arrows represent possible gene structures (possibly alternative structures); and blue arrows represent the alignment of EST or cDNA sequences. In all arrow drawings, exons are represented as colored rectangles connected by thin lines which depict introns. A legend explaining the color scheme is shown when you move your cursor over the "PREDICTION SUMMARY" title above the graphic.
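
To get a feel for how many entries the segment menu will hold for a given input, the fragmentation can be approximated as below. This is a sketch under assumptions: it uses 1-based inclusive coordinates and non-overlapping fragments, neither of which is stated by the service, and the function name is hypothetical.

```python
def segment_ranges(length: int, size: int = 60000) -> list[tuple[int, int]]:
    """Split positions 1..length into consecutive fragments of at most `size` bases.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    return [
        (start, min(start + size - 1, length))
        for start in range(1, length + 1, size)
    ]


if __name__ == "__main__":
    # The BAC in this tutorial is roughly 142,000 bases long.
    for first, last in segment_ranges(142000):
        print(f"{first}..{last}")
```

Under these assumptions, a sequence of about 142,000 bases yields three segments, the last one shorter than the others.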

The alignment file

The alignment file found in the lower pane of the results window is the heart of the GeneSeqer@PlantGDB output. This text shows the base-to-base alignment of the expressed sequence(s) with the genomic DNA. Predicted introns are shown as strings of periods '.'. Score statistics for the alignment quality as well as the predicted splice site quality are shown for each aligned sequence. In addition, links to the source of each sequence are provided above their respective alignments.
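
Because predicted introns appear as runs of periods, they can be located in an aligned transcript line with a simple pattern search. The sketch below assumes the aligned transcript has already been joined into a single string; the real output is wrapped across lines and carries coordinates, which this illustration does not parse. The function name and the toy input are hypothetical.

```python
import re


def intron_gaps(aligned_transcript: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each run of '.' in the aligned transcript.

    Offsets are 0-based with an exclusive end and are relative to the
    alignment string, not to genomic coordinates.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\.+", aligned_transcript)]


if __name__ == "__main__":
    toy = "ACGT....GGA..T"  # invented example, not real GeneSeqer output
    for start, end in intron_gaps(toy):
        print(f"gap at {start}..{end} ({end - start} columns)")
```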

The predicted gene structures and ORFs

The culmination of the GeneSeqer@PlantGDB analysis is the prediction of an accurate gene structure. The quality of this prediction can be assessed by predicting a probable open reading frame (ORF) and comparing it to known proteins. Predicted ORFs are shown as red arrows in the summary graphic. Additionally, the longest ORF and its translation frame are displayed in the alignment file. The NCBI blastp link following the translated ORF sequence in the alignment file makes it easier to find putative homologs for this putative gene.
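
The idea of a "longest ORF" and its frame can be illustrated with a generic textbook scan. The tutorial does not describe how GeneSeqer itself selects ORFs, so the sketch below should not be read as its method: it assumes an ORF runs from an ATG to the next in-frame stop codon, looks only at the strand given, and uses a hypothetical function name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna: str):
    """Return (frame, start, end) of the longest ATG..stop ORF, or None.

    frame is 0, 1, or 2; start is 0-based; end is exclusive and includes
    the stop codon. Only the given strand is scanned.
    """
    dna = dna.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[2] - best[1]):
                    best = (frame, start, i + 3)
                start = None  # close the ORF and keep scanning this frame
    return best


if __name__ == "__main__":
    toy = "CCATGAAATAGATGCCCGGGTTTTGA"  # invented sequence
    hit = longest_orf(toy)
    if hit:
        frame, start, end = hit
        print(frame, toy[start:end])
```

For a spliced gene structure, such a scan would be run on the concatenated exons rather than on the raw genomic sequence.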

Further investigation & refined analysis

Detailed (Refined) annotation using GeneSeqer@PlantGDB

Interesting gene regions found through the process described above can be further refined through various methods. One such method, demonstrated here, involves a detailed look at the evidence (ESTs and cDNAs) supporting a given gene structure. Spliced alignment of "All Plants" ESTs and cDNAs to the restricted region provides insight into possible alternative gene structures, polymorphisms, and differential transcription. To demonstrate this concept, we have chosen the 15 kb region extending from base 7500 to base 22500 of the Sorghum bicolor BAC analyzed above. The results are available here. This analysis was done in the same manner as above, with the exceptions that the 7500 to 22500 range was input in step 2 and the "All Plants" EST and cDNA options were chosen in step 3.
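
Restricting the analysis to a range corresponds to taking a subsequence of the genomic input, optionally on the reverse strand. The sketch below shows one way to reproduce such a restriction locally. It assumes that the "From position" and "to position" fields are 1-based and inclusive, which the form does not state, and the function name is hypothetical.

```python
# Complement table for plain bases; other ambiguity codes are left unchanged.
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def subregion(seq: str, start: int, end: int, strand: str = "original") -> str:
    """Return bases start..end (1-based, inclusive; an assumption of this sketch).

    With strand="reverse" the reverse complement of that range is returned.
    """
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("range must satisfy 1 <= start <= end <= len(seq)")
    piece = seq[start - 1:end]
    if strand == "reverse":
        return piece.translate(_COMPLEMENT)[::-1]
    return piece


if __name__ == "__main__":
    print(subregion("AACCGGTT", 3, 6))             # forward strand
    print(subregion("AAACCC", 1, 4, "reverse"))    # reverse complement
```

Under the inclusive convention, the range 7500 to 22500 spans 15,001 bases.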


As shown by the summary graphic, three distinct gene regions have been characterized. Based on BlastP queries against the NCBI non-redundant database, as described in the next section, these three gene regions putatively represent a mitochondrial carrier protein, subunit 1 of a cleavage stimulation factor, and a serine/threonine kinase. Interestingly, spliced alignment of non-native (non-Sorghum) transcripts alone is responsible for the characterization of the mitochondrial carrier protein in the 7800 to 11800 region shown to the left. Also noteworthy is the apparent alternative gene structure represented by an exon in the 9438 to 9477 region of this gene. Because of the low local alignment similarity of the homologous sequence alignments, the native transcript presumably encoded by this gene region is assumed either to lack this exon or to express it as an alternatively spliced product. Investigation of the origin of the transcripts corresponding to each gene structure reveals two (2) transcripts arising from monocotyledons (Secale cereale (rye) gi:10093099; Oryza sativa (rice) gi:27547342) and two (2) transcripts arising from dicotyledons (Solanum tuberosum (potato) gi:17074557; Lycopersicon esculentum (tomato) gi:18260535). In this example, the gene structure lacking the exon in question is supported by spliced alignment of the monocot homologs and thus, as assumed above, most likely represents the native Sorghum gene transcript.

Homologous protein alignment using GeneSeqer@PlantGDB

Determining the complete gene structure, representing the entire coding region, of a gene is in some cases not possible using the alignment of transcribed sequences alone. As mentioned above, inclusion of homologous transcripts can increase the coverage of these alignments but is not always sufficient to produce a complete gene structure. For this reason, GeneSeqer@PlantGDB includes an interface allowing the alignment of homologous proteins. These homologs may be determined through the use of the NCBI blastp link provided in the ORF section of the web service results. The results shown here represent such alignments in the 7800 to 11800 region of the Sorghum BAC used throughout this demonstration.

The 10 putatively homologous proteins aligned in this example were obtained using the external BlastP link provided in the GeneSeqer@PlantGDB text results. Each ORF prediction found in the text results is followed by this link to facilitate searches against the NCBI non-redundant database. In this example, all 10 proteins had E-values of at most 8e-66. As was shown in the preceding section, homologous alignments of two (2) putative Arabidopsis thaliana proteins suggest an alternative gene structure, while alignment of the other eight (8) protein sequences supports the predicted native gene structure.

