SRGD - Data and Methods:
SRGD was initially developed by Wang & Brendel (The ASRG database). Using their set of 395 splicing-related genes in Arabidopsis (ASRG395) as a starting point, Chen & Brendel (Identification and Survey of Splicing-related Proteins in 10 Plant Species (unsubmitted)) surveyed 10 plant genomes for splicing-related genes. This page provides links to the data sets, software, and scripts used in that work.
Our study covered 10 plant species: six dicots (Arabidopsis thaliana (At), Glycine max (Gm), Lotus japonicus (Lj), Medicago truncatula (Mt), Populus trichocarpa (Pt), and Vitis vinifera (Vv)), three monocots (Oryza sativa (Os), Sorghum bicolor (Sb), and Zea mays (Zm)), and one moss (Physcomitrella patens (Pp)). Of these 10 species, one is a non-flowering plant (Pp) and three are legumes (Gm, Lj, and Mt).
The following table lists the data sources and download links. It has five columns: species; data source and version; genomic or gene sequences needed to generate the CIWOG information files; protein sequences (the complete set of annotated protein sequences); and annotation files (GFF and XML files).
|Species||Source||Genomic/Gene Sequences||Protein Sequences||Annotation Files|
|At||TAIR (TAIR9_blastsets)||At(TAIR9_seq_20090619)||At_aa (TAIR9_pep_20090619)||TAIR9_3_utr_20090619|
|Mt||Medicago.org (Mt2.0)||Mt(Mt2.0_pseudomolecule.tar.gz)||Mt_aa (20080227_imgag_protMAPPED_NO_OVERLAP.fa.tar.gz)||MT2.0_medicago_chrX_20080103_NoOverlap.xml.tar.gz|
|Os||Plantbiology.msu (version_6.1)||Os (all.seq)||Os_aa (all.pep)||all.gff3|
|Pt||JGI(v1.1)||Pt (poplar.unmasked.fasta.gz)||Pt_aa (Proteins.Poptr1_1.JamboreeModels.fasta.gz)||Poptr1_1.JamboreeModels.gff|
|Vv||genoscope.cns.fr(Unmarked)||Vv (unmasked)||Vv_aa (unmasked)||Vitis_vinifera_annotation_v1.gff|
Tools (Software used):
Common software used in this work was obtained from the respective public distribution sites:
- Blast+ (blast-2.2.18) : ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.18
- OrthoMCL (v1.4) (Ortholog Groups for Eukaryotic Genomes) : http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
- CIWOG (Common Introns Within Orthologous Genes): http://ciwog.gdcb.iastate.edu/
- Figtree (v1.3.1) (Graphical Viewer of Phylogenetic Trees): http://tree.bio.ed.ac.uk/software/figtree/
- Muscle (v3.8.31) (Multiple Sequence Comparison by Log-expectation): http://www.drive5.com/muscle/downloads.htm
- PHYLIP (v3.69) (Phylogeny Inference Package): http://phylip.com//
- ClustalX/ClustalW (v2.1) (Multiple Sequence Alignment): http://www.clustal.org/#Download
Scripts and Pipeline:
A three-round BLASTp search was used to identify pre-mRNA splicing-related proteins in 10 plant species as follows:
Initially, a comprehensive set of 395 pre-mRNA splicing-related proteins in Arabidopsis was downloaded from the ASRG database. This set is referred to as AtSRP (ASRG395). Complete sets of predicted protein sequences, derived from the respective genome annotations, were obtained for each of the 10 plant species from the data sources listed above.
AtSRP was then used as the query in a local BLASTp search against each of the annotated protein sets. All hits with an e-value of less than 10^-20 were retained for further analysis. The BLASTp results for each species are downloadable from the following table, where they are referred to as At_** (BLASTp result of AtSRP against **), with ** standing for one of the 10 plant species.
- The command lines are:
- $formatdb -i hugefasta -p T
- $blastall -i infile -d hugefasta -p blastp -o out -m 8
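The blastall command above writes tabular output (-m 8) and sets no e-value threshold, so the 10^-20 cutoff has to be applied to the output afterwards. The following is a minimal sketch of such a filter, not the script used in the study. The function name `filter_hits` and the example identifiers are hypothetical; the 12-column layout is the standard one for -m 8 output.

```python
import csv
import io

# Column order of legacy BLAST tabular output (blastall -m 8).
M8_FIELDS = ["query", "subject", "pident", "length", "mismatch", "gapopen",
             "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(handle, max_evalue=1e-20):
    """Yield (query, subject, evalue) for hits with e-value below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank, malformed, or comment lines.
        if len(row) != len(M8_FIELDS) or row[0].startswith("#"):
            continue
        evalue = float(row[M8_FIELDS.index("evalue")])
        if evalue < max_evalue:
            yield row[0], row[1], evalue

# Hypothetical two-hit example; the identifiers are made up for illustration.
example = io.StringIO(
    "AT1G01010.1\tOs01g0001\t55.2\t300\t120\t4\t1\t300\t5\t298\t3e-45\t180\n"
    "AT1G01010.1\tOs02g0002\t30.1\t90\t60\t2\t10\t99\t20\t108\t2e-05\t48.1\n"
)
kept = list(filter_hits(example))
```

Only the first example hit passes the cutoff. For a real file, pass an open file handle for the BLAST output instead of the in-memory example.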
To identify potential additional homologs not identified in the initial search, all hits from the first stage were retrieved, pooled, and used as the query in a second local BLASTp search against the combined set of all annotated proteins from all 10 species. New hits at an e-value cutoff of 10^-20 were added to the set of candidate plant pre-mRNA splicing-related proteins. The BLASTp result appears in the following table as all_all.
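The pooling of first-round hits and the selection of new second-round hits amount to simple set operations. The sketch below illustrates that bookkeeping under the assumption that hits are tracked by protein identifier; the function names and identifiers are hypothetical and are not taken from the original scripts.

```python
def pool_hits(first_round):
    """Union of subject IDs hit in any first-round search.

    first_round maps a species code (e.g. "Os") to an iterable of hit IDs.
    """
    pooled = set()
    for ids in first_round.values():
        pooled |= set(ids)
    return pooled

def new_candidates(pooled, second_round_subjects):
    """Second-round subjects that were not already in the first-round pool."""
    return set(second_round_subjects) - pooled

# Hypothetical identifiers, for illustration only.
round_one = {"Os": ["Os01g0001", "Os01g0002"], "Zm": ["Zm0001"]}
pooled = pool_hits(round_one)
added = new_candidates(pooled, ["Os01g0001", "Zm0002", "Pp0001"])
candidates = pooled | added
```

The final candidate set is the union of the first-round pool and the newly added second-round hits.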
The following table provides links to Blast output files for each species:
All candidate splicing-related proteins were then searched against themselves with BLASTp to obtain pairwise sequence similarities as input for OrthoMCL. The BLASTp output from step 3 (which is also the OrthoMCL input) and the OrthoMCL output are provided via the following links:
- All against all BLASTp result: allblastall
- The clusters of interest are saved in separate files as follows:
- At395_orthoMCL (csv format) contains clusters with at least one of the ASRG395 genes
- Novel_orthoMCL.out contains clusters with newly identified splicing-related proteins
- Command used to run OrthoMCL:
- %orthomcl.pl --mode 3 --blast_file allblastall_20 --gg_file id2010.gg
- The full result, all_orthoMCL, is in the OrthoMCL directory and covers all genes in the all2all_blastp.out result. It is also available in CSV format: all_orthoMCL.csv
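Separating the clusters of interest from the full OrthoMCL result can be sketched as follows. This is an illustration only and rests on two assumptions: that the output uses the OrthoMCL v1.4 line layout `ORTHOMCL<n>(<x> genes,<y> taxa):` followed by `gene(taxon)` entries, and that a "novel" cluster is one with no ASRG395 member. The function names and identifiers are hypothetical.

```python
import re

# Assumed OrthoMCL v1.4 line layout:
#   ORTHOMCL0(3 genes,2 taxa):   geneA(At) geneB(Os) geneC(Os)
CLUSTER_RE = re.compile(r"^(ORTHOMCL\d+)\(\d+ genes,\s*\d+ taxa\):\s*(.*)$")
MEMBER_RE = re.compile(r"(\S+)\((\w+)\)")

def parse_clusters(lines):
    """Map cluster name to the list of its gene IDs."""
    clusters = {}
    for line in lines:
        match = CLUSTER_RE.match(line.strip())
        if match:
            members = MEMBER_RE.findall(match.group(2))
            clusters[match.group(1)] = [gene for gene, _taxon in members]
    return clusters

def split_clusters(clusters, known_ids):
    """Split clusters into those with and without a known (ASRG395) gene."""
    with_known, novel = {}, {}
    for name, genes in clusters.items():
        target = with_known if known_ids.intersection(genes) else novel
        target[name] = genes
    return with_known, novel

# Hypothetical cluster lines and IDs, for illustration only.
lines = [
    "ORTHOMCL0(3 genes,2 taxa):\t AT1G01010.1(At) Os01g0001(Os) Os01g0002(Os)",
    "ORTHOMCL1(2 genes,2 taxa):\t Pp0001(Pp) Zm0001(Zm)",
]
with_known, novel = split_clusters(parse_clusters(lines), {"AT1G01010.1"})
```

If the actual output differs from the assumed layout, only the two regular expressions need to change.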
For each gene cluster, CIWOG was used to identify the common intron positions and types. Two input files were built per cluster for processing with the CIWOG software. One file contained the Muscle alignment of the proteins in the cluster; the other contained the information CIWOG requires, in its own format: gene names, gene structures, transcription start and stop sites, translation start and stop codons, and genome sequences.
- Perl scripts were written to process the annotation file, the genome file, and the gff file to generate the CIWOG information file (for Mt, the genome sequences are already included in the gff file). To run a Perl script, download these files from the data set section and place them together in one folder.
- Because Lj has different gene names in the annotation file and in PlantGDB, only 9 plants are used in the CIWOG result.
- CIWOG scripts and output files:
|Species||At||Gm||Lj||Mt||Os||Pp||Pt||Sb||Vv||Zm|
|Perl Scripts||ciwog_at.perl||ciwog_gm.perl||ciwog_lj.perl||ciwog_mt.perl||ciwog_os.perl||ciwog_pp.perl||ciwog_pt.perl||ciwog_sb.perl||ciwog_vv.perl||ciwog_zm.perl|
|CIWOG Result||At.ciwog||Gm.ciwog||Lj.ciwog||Mt.ciwog||Os.ciwog||Pp.ciwog||Pt.ciwog||Sb.ciwog||Vv.ciwog||Zm.ciwog|
- (1) The gff and xml files were used to generate the CIWOG information file, which was then formatted for each cluster based on the information in PlantGDB.
- (2) Muscle was used to generate the alignment file for each cluster.
- (3) For each cluster, the alignment and the CIWOG information file were saved in a single directory named after the cluster number. Both files carry the same name as the directory but have different suffixes. For example, the directory for cluster 1 is named 1 and contains two files: 1.aln (alignment file) and 1.ciwog (CIWOG information file), which can also be downloaded here.
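The per-cluster layout described in step (3) can be sketched as below. This is a hedged illustration, not the original pipeline. The helper names and the input file name `1.fa` are hypothetical. The Muscle call uses the `-in` and `-out` options of Muscle v3.8 and is only assembled here, not executed.

```python
from pathlib import Path
import tempfile

def cluster_paths(root, cluster_id):
    """Create <root>/<id>/ and return the paths <id>.aln and <id>.ciwog in it."""
    cluster_dir = Path(root) / str(cluster_id)
    cluster_dir.mkdir(parents=True, exist_ok=True)
    return cluster_dir / f"{cluster_id}.aln", cluster_dir / f"{cluster_id}.ciwog"

def muscle_command(fasta_in, aln_out):
    """Argument list for Muscle v3.8 (-in/-out); built but not run here."""
    return ["muscle", "-in", str(fasta_in), "-out", str(aln_out)]

# Demonstration in a temporary directory; "1.fa" is a hypothetical input name.
with tempfile.TemporaryDirectory() as root:
    aln, info = cluster_paths(root, 1)
    layout = (aln.parent.name, aln.name, info.name)
    cmd = muscle_command(Path(root) / "1.fa", aln)
```

With Muscle installed, the returned list could be passed to `subprocess.run(cmd, check=True)` to produce the alignment file.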