Maize GSS Assembly and Annotation
A total of 2,681,812 maize (cultivar B73) GSS entries available in November 2004 were downloaded from GenBank (Benson et al., 2005). The set includes reads from a genome filtration project (Whitelaw et al., 2003) based on methylation filtration (recovery of sequences cloned in bacterial hosts that degrade methylated DNA; Palmer et al., 2004) and high-Cot libraries (depleted of repetitive DNA by excluding rapidly re-associating fractions after denaturation; Yuan et al., 2003), reads from a BAC-end sequencing project (http://pgir.rutgers.edu/), reads from a random whole genome shotgun sequencing project (http://www.jgi.doe.gov/), and a few other B73 GSS entries. Phred quality scores (http://www.phrap.org/) were available from the GenBank Trace Archive (ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/zea_mays/) for 2,396,370 of the sequences. The cumulative size of all the input sequences for our GSS assembly was more than two billion bp.
The GSSs were assembled using the PCAP program. A Phred quality score of 25 was assigned to every base of each read for which no experimentally determined quality scores were available. Choosing the relatively low quality score of 25 allowed PCAP to tolerate polymorphic differences in true overlaps. A distance range of 100 to 3,500 bp was selected for every forward-reverse read pair, based on our evaluation of the read-pair distances in an initial assembly. A large distance range of 50,000 bp to 300,000 bp was selected for every BAC-end read pair. The assembly was performed on an SGI Altix computer with sixteen 900 MHz processors and 30 GB of main memory, with at most eight processors used at any time. The default values were used for all the PCAP parameters.
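The default-quality step described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the assembly: the function names, the in-memory dictionaries, and the FASTA-style quality record layout are assumptions made for the example. The only values taken from the text are the default score of 25 and the standard Phred relation P = 10^(-Q/10).

```python
from typing import Dict, List

DEFAULT_QUALITY = 25  # score the text assigns to bases without measured quality


def phred_error_probability(q: int) -> float:
    """Convert a Phred score Q to a base-call error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fill_missing_qualities(
    reads: Dict[str, str],
    known_quals: Dict[str, List[int]],
    default: int = DEFAULT_QUALITY,
) -> Dict[str, List[int]]:
    """Return one quality list per read (hypothetical helper).

    Reads with measured scores keep them; reads without any scores get
    `default` for every base.
    """
    filled: Dict[str, List[int]] = {}
    for read_id, seq in reads.items():
        quals = known_quals.get(read_id)
        if quals is None:
            quals = [default] * len(seq)
        elif len(quals) != len(seq):
            raise ValueError(f"{read_id}: {len(quals)} scores for {len(seq)} bases")
        filled[read_id] = quals
    return filled


def format_qual_record(read_id: str, quals: List[int], width: int = 20) -> str:
    """Format scores as a FASTA-style quality record (assumed layout)."""
    lines = [f">{read_id}"]
    for i in range(0, len(quals), width):
        lines.append(" ".join(str(q) for q in quals[i:i + width]))
    return "\n".join(lines)
```

A score of 25 corresponds to an error probability of about 0.0032 per base, noticeably less strict than the scores typically reported for high-quality bases, which is consistent with the stated aim of tolerating polymorphic differences in true overlaps.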
The assembly produced a total of 294,425 contigs built from 1,661,712 member GSSs. The total size of the contigs is 503,497,339 bp. The table below shows the distribution of contig lengths.
|Size|No. of contigs|
|1 kb < size <= 2 kb|160,428|
|2 kb < size <= 3 kb|38,460|
|3 kb < size <= 4 kb|14,588|
|4 kb < size <= 5 kb|6,856|
|5 kb < size <= 6 kb|3,454|
|6 kb < size <= 7 kb|1,864|
|7 kb < size <= 8 kb|1,018|
|8 kb < size <= 9 kb|576|
|9 kb < size <= 10 kb|335|
|> 10 kb|438|
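A size distribution like the one tabulated above can be produced by binning contig lengths into half-open 1 kb intervals. The sketch below is illustrative only. It assumes 1 kb = 1,000 bp and that each bin excludes its lower bound and includes its upper bound; the function names are hypothetical. Note that the rows of the table sum to 228,017 contigs, so the remaining 66,408 of the 294,425 contigs fall outside the listed bins, presumably at 1 kb or shorter.

```python
from collections import Counter
from typing import Iterable, Optional

KB = 1000  # assumption: 1 kb is taken as 1,000 bp


def size_bin(length_bp: int) -> Optional[str]:
    """Return the table row label for a contig length, or None if <= 1 kb."""
    if length_bp <= KB:
        return None  # not covered by any row of the table
    if length_bp > 10 * KB:
        return "> 10 kb"
    # (length - 1) // KB gives the lower bound in kb for a bin that
    # excludes its lower edge and includes its upper edge.
    lower = (length_bp - 1) // KB
    return f"{lower} kb < size <= {lower + 1} kb"


def size_distribution(lengths: Iterable[int]) -> Counter:
    """Count contigs per size bin, skipping contigs of 1 kb or less."""
    counts: Counter = Counter()
    for length in lengths:
        label = size_bin(length)
        if label is not None:
            counts[label] += 1
    return counts
```

For example, a 2,000 bp contig is counted in the "1 kb < size <= 2 kb" row, while a 2,001 bp contig is counted in the "2 kb < size <= 3 kb" row.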
Availability. All contig sequences are available for:
You can access the original, independent assembly of methylation-filtered and high-Cot selected GSSs (which does not include RescueMu-derived sequences) at The Institute for Genomic Research (TIGR), a member of the Consortium for Maize Genomics. Another independent assembly is available from the Maize Genome Assembly Project.
Full-Length Maize Genes
The PlantGDB assembly implements a bottom-up annotation protocol, which seeks to identify contigs that contain full-length or near full-length maize genes and to annotate them with accurate exon-intron structures. Here, full-length is meant with respect to the encoded translation product; it does not necessarily include all the untranslated transcript regions or the promoter and terminator regions. Rather than relying upon ab initio predictions or BLAST-like similarity searches to assign gene structure and putative function to GSS contigs, the PlantGDB annotation pipeline is based upon accurate spliced alignment of contigs to homologous protein sequences using the GeneSeqer suite of programs (Usuka and Brendel, 2000; Brendel et al., 2004). So far, we have confidently derived 4,062 maize genes that contain a full-length or near full-length protein coding region, based on high-quality alignment with 5,116 annotated Arabidopsis and 8,016 annotated rice proteins. These identified maize genes belong to 32 super-families and 252 gene families and provide a significant addition to our current knowledge of the maize gene space.
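One simple way to make "full-length or near full-length" concrete is to measure what fraction of the reference protein is covered by the spliced alignment. The sketch below is a hypothetical illustration only: the record layout (aligned segments given as 1-based inclusive protein coordinates), the function names, and the 0.9 coverage threshold are all assumptions, and none of them are taken from GeneSeqer output or from the criteria actually used by the PlantGDB pipeline.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) on the protein, 1-based inclusive


def protein_coverage(protein_len: int, segments: List[Segment]) -> float:
    """Fraction of protein residues covered by at least one aligned segment."""
    if protein_len <= 0:
        raise ValueError("protein_len must be positive")
    covered = 0
    last_end = 0  # highest residue already counted
    for start, end in sorted(segments):
        start = max(start, last_end + 1)  # skip residues counted earlier
        end = min(end, protein_len)       # ignore coordinates past the end
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / protein_len


def is_near_full_length(
    protein_len: int,
    segments: List[Segment],
    min_coverage: float = 0.9,  # hypothetical threshold, not from the source
) -> bool:
    """True if the alignment covers at least `min_coverage` of the protein."""
    return protein_coverage(protein_len, segments) >= min_coverage
```

A contig whose exons align to residues 1-50 and 40-100 of a 100-residue protein covers the whole protein, whereas one aligning only to residues 1-50 and 101-150 of a 200-residue protein covers half of it and would not pass the assumed threshold.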