
GeneSeqer@AtGDB - Guided Tutorial


This step-by-step tutorial covers the basic concepts behind setting up and using the GeneSeqer@AtGDB web service. It walks through the individual steps needed to use the service and offers tips to make its use more efficient and rewarding.


Setting up GeneSeqer@AtGDB

Interpreting the results

What's Next?




Setting up GeneSeqer@AtGDB

Getting started

The GeneSeqer@AtGDB web service is intended primarily for spliced alignment of query sequences (sequences representing transcribed genes, i.e., ESTs, cDNAs, and proteins) with a target region of the Arabidopsis genome. The results of such an alignment are useful in many studies. Because spliced alignment can address a number of different questions, the way you use GeneSeqer@AtGDB may vary. For instance, you may choose input sequences based on their similarity to other sequences (both genomic and transcribed), or you may wish to discover more about a specific gene region. In this example, the latter case applies. The form elements found in this tutorial are identical to those of the standard GeneSeqer@AtGDB web service. You may wish to open GeneSeqer@AtGDB in a new window and follow along in parallel with this tutorial; however, this is not required.



Step 1. Selecting genomic (target) sequence range

To demonstrate a typical use of this service, we have chosen to characterize a previously annotated segment of genomic sequence from Arabidopsis thaliana (chromosome 5). This 10 kilobase (kb) region of genomic sequence has been annotated as containing the gene models At5g62590 and At5g62600 in Version 4 of the Arabidopsis Genome Initiative (AGI) chromosome annotation. Visualization of this region at AtGDB, however, suggests that a single gene occupies the region. GeneSeqer@AtGDB is used here to investigate further.

While the genomic segment in question could certainly be parsed from the whole chromosome sequence and then pasted into the standard GeneSeqer@PlantGDB service, a more direct route is provided by GeneSeqer@AtGDB as shown below. For this example, simply selecting chromosome 5 and entering the region boundary is sufficient.
TIP: On every page of the AtGDB website, the 'run GeneSeqer' and 'GeneSeqer@AtGDB' links open the GeneSeqer web service with the chromosome and region currently being viewed already selected.

  • Select chromosome

  • Enter appropriate range
    (i.e., region boundary)
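
For readers who prefer the manual route mentioned above (parsing the segment out of the whole chromosome sequence), the following is a minimal Python sketch. It is not part of GeneSeqer@AtGDB. It assumes 1-based, inclusive coordinates and that the "reverse" strand option means the reverse complement of the selected segment; both are assumptions, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(chrom_seq, start, end, strand="original"):
    """Extract positions start..end (1-based, inclusive) from a chromosome.

    strand is "original" or "reverse" (assumed to mean reverse complement).
    """
    if not (1 <= start <= end <= len(chrom_seq)):
        raise ValueError("region lies outside the chromosome sequence")
    if strand not in ("original", "reverse"):
        raise ValueError("strand must be 'original' or 'reverse'")
    segment = chrom_seq[start - 1:end]
    return reverse_complement(segment) if strand == "reverse" else segment


if __name__ == "__main__":
    toy_chromosome = "AACCGGTT"  # toy data, not a real chromosome
    print(extract_region(toy_chromosome, 3, 6))
    print(extract_region(toy_chromosome, 1, 5, strand="reverse"))
```

The extracted segment could then be pasted into a service that accepts raw genomic sequence, although selecting the chromosome and range directly in the form remains the simpler path.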


STEP 1:    Select Arabidopsis genomic DNA segment

Arabidopsis thaliana chromosome: 1 | 2 | 3 | 4 | 5
From position: [ ] to position: [ ]   Strand: original | reverse | both


 




Step 2. Entering transcribed (query) sequence(s)

In this situation, an exhaustive alignment against the "All Plants" ESTs and cDNAs is feasible given the reasonable size of our genomic sequence, so these selections are chosen.

The pre-gathered sequence collection choices (such as 'All Plants' and 'All Monocots') represent sequence groups made available by PlantGDB. You may also align your own sequences by entering them in the appropriate text boxes or by uploading files containing them in FASTA format. A demonstration of this feature using homologous proteins aligned with our example region is shown below.
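
Before pasting or uploading your own sequences, it can help to confirm that they parse as FASTA. The sketch below is a minimal reader written for illustration; it assumes the common convention of a '>' header line followed by one or more sequence lines, and the function name is hypothetical rather than anything provided by GeneSeqer@AtGDB.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before first '>' header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">query1 toy EST\nACGTAC\nGGTA\n>query2\nTTGACA\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```

A quick check of record count and sequence lengths catches the most common problems, such as a missing header line or two records accidentally merged.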

  • Select "All Plants" EST & cDNA databases


STEP 2:    Select or input cDNA/EST sequences

Spliced Alignment: The output will show an optimal threading of a significantly matching cDNA/EST sequence into the genomic DNA by aligning putative exons only and displaying putative introns as (long) gaps in the cDNA/EST. You can supply your own EST/cDNA sequence as well as your own protein sequence to do the spliced alignment. However, your own EST/cDNA sequence can NOT be combined with our pre-processed databases (EST, TUG, or cDNA).

Database EST TUG cDNA
Shortcut
All Plants
All Monocots
All Dicots
All Grasses
Arabidopsis thaliana
Avena sativa
Beta vulgaris subsp. vulgaris
Glycine max
Gossypium arboreum
Gossypium hirsutum
Hordeum vulgare
Lotus japonicus
Lycopersicon esculentum
Lycopersicon hirsutum
Lycopersicon pennellii
Marchantia polymorpha
Medicago sativa
Medicago truncatula
Mesembryanthemum crystallinum
Oryza sativa
Pinus taeda
Populus tremula X Populus tremuloides
Secale cereale
Solanum tuberosum
Sorghum bicolor
Sorghum propinquum
Triticum aestivum
Zea mays
OR paste your own EST/cDNA sequence(s) here in FASTA format
... and/or paste your own Protein sequence(s) here in FASTA format
... and/or upload your EST/cDNA sequence file (specify file name):
... and/or upload your Protein sequence file (specify file name):



Step 3. Submission

Finally, our sequences are submitted. For a sequence such as this, analysis generally takes only 10 minutes. For regions requiring longer processing, the results can also be delivered by email. The results of this demonstration have been cached and are therefore immediately available by clicking the submit button.
TIP: For large in-house analyses, the stand-alone GeneSeqer application can be downloaded here

  • Submit for processing


STEP 3:    Submit job

Select "Submit" to send the job to the server. By default, output will be posted to your browser. You may choose to have the output sent to you by email instead. This option is advised in the rare case that output posting is slow due to server overload.

Click here to send the output to this email address:
HTML formatted output [default: simple text]. This option will not work if your mailer wraps long lines.

     




Interpreting the results

If necessary, click here to open the results of the GeneSeqer@AtGDB analysis described above.

NOTE: For large files, most browsers will take a while to correctly process all location tags. This may cause links on the graphic to appear non-functional. Once the browser has completely loaded the page, however, all links will be fully functional.


The summary graphic

Large genomic sequences are broken into fragments of 60,000 bases for visualization. Each of these segments can be viewed by selecting it in the drop-down menu on the left of the results window. The corresponding graphic summary for each segment is displayed in the upper pane of the results window. The summary graphic is clickable; selecting a structure (colored arrow) within the graphic scrolls the alignment file in the lower pane to the section dealing with the element you selected. Colored arrows represent aligned sequences and predicted gene structures, distinguished by color. In this example, red arrows represent predicted open reading frames; green arrows represent possible gene structures (possibly alternative structures); and blue arrows represent the alignment of EST or cDNA sequences. In all arrow drawings, exons are represented as colored rectangles connected by thin lines, which depict introns. A legend for the color scheme is shown when you move your cursor over the "PREDICTION SUMMARY" title above the graphic.
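
To work out which 60,000-base fragment to pick from the drop-down menu for a given coordinate, a small calculation suffices. The Python sketch below assumes the fragments are consecutive, non-overlapping, and numbered from 1 starting at the first base of the submitted region; the page does not state these details, so treat them (and the function name) as assumptions.

```python
FRAGMENT_SIZE = 60000  # fragment length stated in the tutorial


def fragment_for_position(pos, region_start=1, fragment_size=FRAGMENT_SIZE):
    """Return (fragment_number, fragment_start, fragment_end) for a position.

    Coordinates are 1-based and inclusive. Fragment numbering from 1 and
    non-overlapping tiling are assumptions made for this sketch.
    """
    offset = pos - region_start
    if offset < 0:
        raise ValueError("position lies before the start of the region")
    number = offset // fragment_size + 1
    frag_start = region_start + (number - 1) * fragment_size
    frag_end = frag_start + fragment_size - 1
    return number, frag_start, frag_end


if __name__ == "__main__":
    print(fragment_for_position(60000))   # last base of the first fragment
    print(fragment_for_position(60001))   # first base of the second fragment
```

Note that the final fragment of a region may be shorter than the computed end coordinate suggests, since the region itself may not be a multiple of 60,000 bases.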


The alignment file

The alignment file found in the lower pane of the results window is the heart of the GeneSeqer@AtGDB output. This text shows the base-to-base alignment of the expressed sequence(s) with the genomic DNA. Predicted introns are shown as strings of periods '.'. Score statistics for the alignment quality as well as the predicted splice site quality are shown for each aligned sequence. In addition, links to the source of each sequence are provided above their respective alignments.
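
Because predicted introns appear as runs of periods in the aligned expressed sequence, they can be located programmatically. The following Python sketch works on a single alignment row supplied as one string; this is a simplification for illustration and does not parse the actual multi-line GeneSeqer output layout, which is not reproduced here. The function name is hypothetical.

```python
import re


def intron_runs(aligned_row):
    """Return (start_index, length) for each run of '.' in an alignment row.

    start_index is 0-based within the row string.
    """
    return [(m.start(), len(m.group())) for m in re.finditer(r"\.+", aligned_row)]


if __name__ == "__main__":
    toy_row = "ACGT....GGA..T"  # toy example, not real output
    for start, length in intron_runs(toy_row):
        print(f"gap of {length} columns starting at column {start}")
```

In real output the row would first have to be reassembled from the wrapped alignment blocks before a scan like this is meaningful.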


The predicted gene structures and ORFs

The culmination of the GeneSeqer@AtGDB analysis is the prediction of an accurate gene structure. The quality of this prediction can be assessed by predicting a probable open reading frame (ORF) and comparing it to known proteins. Predicted ORFs are shown as red arrows in the summary graphic. Additionally, the longest ORF and its translation frame are displayed in the alignment file. The NCBI blastp link following the translated ORF sequence in the alignment file makes it easier to find putative homologs for this putative gene.
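
To make the idea of a "longest ORF and its translation frame" concrete, here is a generic Python sketch that scans the three forward frames of a spliced (intron-free) sequence for the longest stretch from an ATG to the next in-frame stop codon. This is a textbook approach chosen for illustration; the tutorial does not describe how GeneSeqer itself selects ORFs, so this should not be read as its method.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (frame, start, end) of the longest ATG..stop ORF, or None.

    frame is 0, 1, or 2; start and end are 0-based, end exclusive, and the
    stop codon is included. Only the forward strand is scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[2] - best[1]:
                    best = (frame, start, end)
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    toy = "CCATGAAATGATAGCC"  # toy sequence, not from the example region
    hit = longest_orf(toy)
    if hit is not None:
        frame, start, end = hit
        print(frame, toy[start:end])
```

Scanning the reverse complement as well would cover the opposite strand; that is omitted here to keep the sketch short.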



What's Next?

Testing annotation results by homologous protein alignment


Determining the complete gene structure, representing the entire coding region, of a gene is in some cases not possible using the alignment of transcribed sequences alone. As mentioned above, inclusion of homologous transcripts can increase the coverage of these alignments but is not always sufficient to produce a complete gene structure. For this reason, GeneSeqer@AtGDB includes the ability to align homologous proteins to your region of interest. These homologs may be determined through the use of the NCBI blastp link provided in the ORF section of the web service results. The results shown here represent such alignments with the chromosome 5 region of the Arabidopsis thaliana genome used throughout this demonstration.

The 8 putatively homologous proteins aligned in this example were obtained using the external BlastP link provided in the GeneSeqer@AtGDB text results. Each ORF prediction found in the text results is followed by this link to facilitate searches against the NCBI non-redundant database. In this example, all 8 proteins had E-values of at most 4e-54 and covered at least 900 of the 962 predicted amino acids of the longest ORF. These 8 putatively homologous proteins were then downloaded from NCBI in FASTA format. The resulting file of FASTA-formatted sequences was subsequently uploaded in step 2 of the GeneSeqer@AtGDB job submission, so that the proteins served as query sequences in the GeneSeqer analysis referenced above. As was shown in the preceding section, the homologous alignments of all 8 proteins suggest a single gene spanning this genomic region.
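
The selection criteria described above (E-value of at most 4e-54 and coverage of at least 900 of the 962 predicted residues) can be applied mechanically to a list of search hits. The Python sketch below uses a hypothetical Hit record and made-up placeholder entries; it does not read real BLAST output, and the accession strings are not real identifiers.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str         # placeholder label, not a real accession
    evalue: float          # E-value reported for the hit
    aligned_residues: int  # query residues covered by the alignment


def select_homologs(hits, max_evalue=4e-54, min_aligned=900):
    """Keep hits meeting both thresholds used in the tutorial example."""
    return [
        h for h in hits
        if h.evalue <= max_evalue and h.aligned_residues >= min_aligned
    ]


if __name__ == "__main__":
    toy_hits = [
        Hit("hit_A", 1e-120, 950),  # passes both thresholds
        Hit("hit_B", 1e-30, 940),   # E-value too large
        Hit("hit_C", 1e-80, 400),   # covers too little of the ORF
    ]
    for hit in select_homologs(toy_hits):
        print(hit.accession)
```

The thresholds are defaults taken from this example only; other regions may call for different cutoffs.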



Submitting refined annotation


Static gene annotation, such as that provided for the Arabidopsis thaliana genome sequence, is inherently unreliable with respect to updated sequence collections. In particular, matching ESTs, cDNAs, or proteins precisely supporting a given gene structure may not have been available at the time the annotation was last frozen for submission. This problem, referred to as annotation lag, is overcome only by dynamic assessment of gene annotation in light of all current evidence. To facilitate timely collection of such assessments, a User Contributed Annotation (UCA) service has been established at AtGDB. Community-contributed (and community-accessible) entry of annotation corrections (as demonstrated in this tutorial), weighted ranking of annotation accuracy, cross-referencing of static annotation versions corresponding to UCAs, and news-forum-style commentary on individual gene annotations are among the project goals of the AtGDB UCA system. Instructions on how to contribute new annotations and annotation refinements such as the one described herein are available at the main AtGDB tutorial page.





© 2006 Shannon D. Schlueter

AtGDB

PlantGDB

MaizeGDB

NSF Plant Genome Research

Brendel Group

Plant Sciences Institute

Iowa State University