Sorghum GSS Assembly and Annotation

A total of 534,052 Sorghum bicolor GSSs (cultivar ATx623) were downloaded from GenBank in November 2004. More than 98% of those sequences were generated by Orion Genomics using the methylation filtration approach (Bedell et al., 2005). The GSSs were assembled with the PCAP program, following the same procedure as our maize GSS assembly.

Assembly Result

The assembly produced a total of 79,343 contigs built from 476,807 member GSSs. The total size of the contigs is 120,440,790 bp. The table below shows their length distribution.

Size                     No. of contigs
<= 1 kb                  26,336
1 kb < size <= 2 kb      36,421
2 kb < size <= 3 kb      10,884
3 kb < size <= 4 kb      3,641
4 kb < size <= 5 kb      1,263
5 kb < size <= 6 kb      483
6 kb < size <= 7 kb      187
7 kb < size <= 8 kb      75
8 kb < size <= 9 kb      25
9 kb < size <= 10 kb     12
> 10 kb                  16
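
The figures reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the assembly pipeline itself; the variable names are illustrative. It confirms that the size classes in the table sum to the stated contig total, and derives the mean contig length and the share of downloaded GSSs that ended up in contigs.

```python
# Contig counts per size class, copied from the table above.
size_classes = [
    ("<= 1 kb", 26_336),
    ("1-2 kb", 36_421),
    ("2-3 kb", 10_884),
    ("3-4 kb", 3_641),
    ("4-5 kb", 1_263),
    ("5-6 kb", 483),
    ("6-7 kb", 187),
    ("7-8 kb", 75),
    ("8-9 kb", 25),
    ("9-10 kb", 12),
    ("> 10 kb", 16),
]

# Totals as stated in the text.
TOTAL_CONTIGS = 79_343
TOTAL_CONTIG_BP = 120_440_790
DOWNLOADED_GSS = 534_052
MEMBER_GSS = 476_807

# The size classes should account for every contig.
table_total = sum(count for _, count in size_classes)

# Derived summary values (not stated in the text, computed from it).
mean_contig_bp = TOTAL_CONTIG_BP / TOTAL_CONTIGS
fraction_in_contigs = MEMBER_GSS / DOWNLOADED_GSS
fraction_up_to_2kb = (size_classes[0][1] + size_classes[1][1]) / TOTAL_CONTIGS

print(f"Contigs in table:        {table_total:,}")
print(f"Mean contig length:      {mean_contig_bp:,.0f} bp")
print(f"GSSs placed in contigs:  {fraction_in_contigs:.1%}")
print(f"Contigs of 2 kb or less: {fraction_up_to_2kb:.1%}")
```

The table total matches the 79,343 contigs reported for the assembly, and the mean contig length works out to roughly 1.5 kb, consistent with most contigs falling in the two smallest size classes.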


All contig sequences are available for:

Full-Length Sorghum Genes

The PlantGDB assembly implements a bottom-up annotation protocol, which seeks to identify contigs containing full-length or near full-length sorghum genes with accurately annotated exon-intron gene structures. Here, full-length is meant with respect to the encoded translation product, not necessarily including all the untranslated transcript regions or the promoter and terminator regions. Rather than relying upon ab initio or BLAST-like similarity searches to assign gene structure and putative function to GSS contigs, the PlantGDB annotation pipeline is based upon accurate spliced alignment of contigs to homologous protein sequences using the GeneSeqer suite of programs (Usuka and Brendel, 2000; Brendel et al., 2004). In the derived set of 79,343 contigs, we identified 1,561 genes that contain a full-length or near full-length protein coding region. A total of 903 Arabidopsis and 1,199 rice proteins match both the maize and sorghum full-length gene sets, resulting in a putative ortholog set of 77 genes conserved across all four species.
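
To illustrate what "full-length with respect to the translation product" could mean in practice, the sketch below scores a spliced alignment by the fraction of the reference protein it covers. This is an assumption-laden illustration only: the text does not state the actual criteria used by the pipeline, the `SplicedAlignment` record and its fields are hypothetical and do not reflect the GeneSeqer output format, and the 0.95 coverage cutoff is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass
class SplicedAlignment:
    """Hypothetical summary of one contig-to-protein spliced alignment."""
    contig_id: str
    protein_id: str
    protein_length: int  # residues in the reference protein
    aligned_start: int   # first aligned protein residue (1-based)
    aligned_end: int     # last aligned protein residue (inclusive)


def protein_coverage(aln: SplicedAlignment) -> float:
    """Fraction of the reference protein spanned by the alignment."""
    aligned_residues = aln.aligned_end - aln.aligned_start + 1
    return aligned_residues / aln.protein_length


def is_near_full_length(aln: SplicedAlignment, min_coverage: float = 0.95) -> bool:
    """Flag alignments spanning at least min_coverage of the protein.

    The 0.95 default is an illustrative assumption, not the cutoff
    used for the PlantGDB annotation.
    """
    return protein_coverage(aln) >= min_coverage


# Made-up example records: one complete match, one truncated at the N-terminus.
complete = SplicedAlignment("contig_A", "protein_1", 400, 1, 400)
truncated = SplicedAlignment("contig_B", "protein_2", 400, 120, 400)

for aln in (complete, truncated):
    print(aln.contig_id, f"{protein_coverage(aln):.2f}", is_near_full_length(aln))
```

A coverage measure of this kind ignores untranslated regions by construction, which is in line with the definition of full-length given above; a real filter would presumably also consider alignment quality and splice-site scores.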

