About TIGR Plant Transcript Assemblies
Plant transcript assemblies (TAs) are constructed in a manner similar to the TIGR gene indices. The sequences used to build the plant TAs are expressed transcripts collected from dbEST (ESTs) and the NCBI GenBank nucleotide database (full-length and partial cDNAs). "Virtual" transcript sequences derived from whole-genome annotation projects are not included. All plant species for which more than 1,000 EST or cDNA sequences are available are included in this project. TAs are clustered and assembled using the TGICL tool (Pertea et al., 2003), a wrapper script that invokes Megablast (Zhang et al., 2000) and the CAP3 assembler (Huang and Madan, 1999). Sequences are first clustered based on all-against-all comparisons using Megablast. The initial clusters are then assembled with CAP3 to generate consensus sequences. Assembly criteria include a 50 bp minimum match, 95% minimum identity in the overlap region, and 20 bp maximum unmatched overhangs.
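The three assembly criteria can be restated as a simple predicate. The following Python sketch is illustrative only: it is not how TGICL or CAP3 implement these checks, and the `Overlap` class, its field names, and the function name are hypothetical. Only the three threshold values come from the text above.

```python
from dataclasses import dataclass

# Thresholds taken from the assembly criteria described above.
MIN_MATCH_BP = 50          # minimum length of the matching overlap
MIN_IDENTITY_PCT = 95.0    # minimum percent identity within the overlap
MAX_OVERHANG_BP = 20       # maximum unmatched overhang


@dataclass
class Overlap:
    """Hypothetical summary of a pairwise overlap between two sequences."""
    length_bp: int        # length of the aligned overlap region
    identity_pct: float   # percent identity within the overlap region
    overhang_bp: int      # longest unmatched overhang at either end


def passes_assembly_criteria(ov: Overlap) -> bool:
    """Return True if the overlap satisfies all three stated criteria."""
    return (
        ov.length_bp >= MIN_MATCH_BP
        and ov.identity_pct >= MIN_IDENTITY_PCT
        and ov.overhang_bp <= MAX_OVERHANG_BP
    )
```

For example, an overlap of 120 bp at 97.5% identity with a 5 bp overhang would pass, while the same overlap with a 21 bp overhang would not.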
Any EST/cDNA sequences that are not assembled into TAs are included as singletons. All singletons retain their GenBank accession numbers as identifiers. Plant TA identifiers are of the form TAnumber_taxonID, where number is a unique numerical identifier of the transcript assembly and taxonID represents the NCBI taxon id.
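As an illustration of the identifier scheme, the sketch below splits a TA identifier into its two parts. It assumes that both number and taxonID are written as plain decimal digits, which the text does not state explicitly. The function and class names are hypothetical, and the example number 1234 is made up (3702 is the NCBI taxon id for Arabidopsis thaliana).

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "TA", decimal assembly number, underscore, decimal taxon id.
TA_ID_PATTERN = re.compile(r"^TA(\d+)_(\d+)$")


class TAIdentifier(NamedTuple):
    number: int     # unique numerical identifier of the transcript assembly
    taxon_id: int   # NCBI taxon id


def parse_ta_identifier(identifier: str) -> Optional[TAIdentifier]:
    """Parse a TA identifier, or return None if the string is not one.

    Singletons keep their GenBank accession numbers, so they do not
    match this pattern and yield None.
    """
    match = TA_ID_PATTERN.match(identifier)
    if match is None:
        return None
    return TAIdentifier(int(match.group(1)), int(match.group(2)))


if __name__ == "__main__":
    print(parse_ta_identifier("TA1234_3702"))  # made-up assembly number
    print(parse_ta_identifier("AB012345"))     # accession-style string
```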
To provide annotation for the TAs, each TA or singleton was aligned to the UniProt UniRef database. For release 1 TAs, a masked version of the UniRef90 database was used; for release 2 onwards, a masked version of the UniRef100 database is used. Alignments were required to have at least 20% identity and 20% coverage. The annotation of the protein with the best alignment to each TA or singleton was used as the annotation for that sequence. The orientation of each TA or singleton was likewise assigned from its orientation relative to the best-matching protein sequence. Sequences with no protein alignment meeting these quality criteria have neither annotation nor orientation assignments.
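The annotation rule can be sketched as a filter followed by a best-hit selection. This Python example is a minimal illustration under stated assumptions: the text does not say how the "best" alignment is ranked, so ranking by an alignment score is an assumption here, and the `ProteinHit` class, its fields, and the function name are all hypothetical. Only the 20% identity and 20% coverage thresholds come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Quality thresholds stated in the text.
MIN_IDENTITY_PCT = 20.0
MIN_COVERAGE_PCT = 20.0


@dataclass
class ProteinHit:
    """Hypothetical record for one alignment of a TA/singleton to a protein."""
    protein_id: str
    description: str      # annotation text of the protein
    identity_pct: float
    coverage_pct: float
    score: float          # assumed ranking key for "best" alignment
    strand: str           # "+" or "-": orientation relative to the protein


def annotate(hits: Iterable[ProteinHit]) -> Optional[Tuple[str, str]]:
    """Return (annotation, orientation) from the best passing hit.

    Returns None when no hit meets the thresholds, in which case the
    sequence receives neither annotation nor orientation.
    """
    passing = [
        h for h in hits
        if h.identity_pct >= MIN_IDENTITY_PCT
        and h.coverage_pct >= MIN_COVERAGE_PCT
    ]
    if not passing:
        return None
    best = max(passing, key=lambda h: h.score)  # assumption: highest score wins
    return best.description, best.strand
```

Note that a hit failing either threshold is discarded before ranking, so a high-scoring but low-coverage alignment never supplies the annotation.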
The release number for the plant TAs refers to the release version for a particular species. For the initial build, all TA sets are version 1. A TA set will be updated to a new release when the combined EST and cDNA count has grown by more than 10% relative to the previous release and the increase comprises more than 1,000 new sequences. New releases will also add plant species for which more than 1,000 EST or cDNA sequences have become publicly available.
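The update rule combines two conditions that must both hold. The following Python sketch restates it; the function name and parameters are hypothetical, and it assumes the counts are the combined EST and cDNA totals for one species at the previous release and at the time of checking.

```python
def needs_new_release(previous_count: int, current_count: int) -> bool:
    """Return True if a species' TA set qualifies for a new release.

    Both conditions from the text must hold:
      1. the increase exceeds 10% of the previous release's count, and
      2. the increase contains more than 1,000 new sequences.
    """
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    new_sequences = current_count - previous_count
    # Integer form of: new_sequences / previous_count > 0.10
    exceeds_ten_percent = new_sequences * 10 > previous_count
    return exceeds_ten_percent and new_sequences > 1000


if __name__ == "__main__":
    print(needs_new_release(20000, 22500))  # 2,500 new sequences, 12.5% growth
    print(needs_new_release(5000, 5800))    # 16% growth but only 800 new
    print(needs_new_release(50000, 52000))  # 2,000 new but only 4% growth
```

The second and third calls show why both conditions matter: a small collection can grow by a large percentage with few new sequences, and a large collection can gain many sequences while growing only slightly.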
Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP. 2007. The TIGR Plant Transcript Assemblies database. Nucleic Acids Res. 2006 Nov 6 [Epub ahead of print]