TIGRFAMs Terms

TIGRFAMs: TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see below), where achievable, complements classification by ortholog group, superfamily, domain or motif. Equivalog classification provides the information best suited for automatic assignment of specific functions to proteins from large-scale genome sequencing projects.

HMM: A Hidden Markov Model, or HMM, is a statistical model for any system that can be represented as a succession of transitions between discrete states. In this case, the discrete states correspond to the successive columns of a protein multiple sequence alignment. In principle, HMMs can be developed from unaligned sequences by successive rounds of optimization, but in practice, protein profile HMMs are simply built from curated multiple sequence alignments. HMM searches resemble later-round PSI-BLAST searches (although based on curated alignments), with position-specific scores for each amino acid, insertion, and deletion along the length of the sequence. Scores are reported both in bits of information and as an E-value.
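The idea of position-specific scoring in bits can be illustrated with a deliberately simplified sketch. It is not the algorithm used by any HMM search program: it ignores insertion and deletion states entirely, assumes an ungapped alignment, and assumes a uniform background frequency for the 20 amino acids. The names `column_probabilities`, `bit_score`, and the pseudocount value are hypothetical choices made for this illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption for this sketch: every amino acid is equally likely by chance.
BACKGROUND = 1.0 / len(AMINO_ACIDS)


def column_probabilities(alignment, pseudocount=1.0):
    """Estimate per-column residue probabilities from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        # Pseudocounts keep unseen residues from getting probability zero.
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def bit_score(sequence, profile):
    """Sum the log2 odds (model vs. background) over all positions, in bits."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(
        math.log2(profile[i][aa] / BACKGROUND) for i, aa in enumerate(sequence)
    )


profile = column_probabilities(["ACD", "ACD", "ACE"])
print(bit_score("ACD", profile))  # positive: resembles the alignment
print(bit_score("WWW", profile))  # negative: less likely than background
```

A sequence that matches the alignment columns scores above zero bits, while an unrelated sequence scores below zero. Real profile HMMs extend this with position-specific insertion and deletion penalties and convert the bit score into an E-value.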

Equivalog: Equivalogs are members of a set of homologous proteins that are conserved with respect to function since their last common ancestor. Related proteins are grouped into equivalog families where possible, and otherwise into protein families with other hierarchically defined homology types. Equivalogs are constructed to be full-length, or as nearly so as possible. Homologous regions of conserved function that are generally substrings of longer proteins are classified as "equivalog domain" (see domain, below). In certain cases a family of proteins has the phylogenetic characteristics of an equivalog, but its specific function has not yet been experimentally determined. Models of these families are classified as "hypothetical equivalog" (or "hypothetical equivalog domain").

Orthologs: Proteins related to each other by descent from a common ancestral sequence through speciation. Orthologs may differ in function.

Superfamily: The complete set of proteins having sequence homology over essentially their full length.

Subfamily: Whereas superfamilies are presumed to be complete, a subfamily is an incomplete set of homologous proteins that nonetheless encompasses proteins of diverse function. Because superfamilies are often impractical to construct, subfamilies are far more common in TIGRFAMs. Subfamilies fulfill a number of useful roles in the annotation process. Because of the limits of current experimental characterization, an equivalog model may not identify all homologous sequences of conserved function. A subfamily may encompass all of the sequences included within one (or more) equivalog(s) as well as related sequences that fall outside of an equivalog's scope. Subfamilies, then, are often a hierarchical level above equivalogs, and their associated comment fields may provide information that assists in naming genes that are not members of any current equivalog. In certain situations equivalogs cannot be constructed because function (for example, substrate specificity) changes rapidly within a related group of sequences relative to the slower drift of evolutionary changes to which HMMs are sensitive. In these cases, subfamily models will be the most specific models available, and will provide warnings that assignment of specific function based on pairwise alignments may result in errors of annotation. Like equivalogs and superfamilies, subfamilies are presumed to be full-length models. Models of portions of longer proteins that encompass more than one function are classified as "subfamily domain" (see domain, below).

Domain: A region of sequence homology among sets of proteins that are not all full-length homologs. Homology domains often, but not always, correspond to recognizable protein folding domains.

Motif: Generally, a small region of sequence similarity (not necessarily homology) characterized by distinct patterns of amino acids at specific positions. An example of a motif is the N-glycosylation site motif N{P}[ST] (Asn, anything but Pro, choice of Ser or Thr).
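As a minimal sketch, the N{P}[ST] motif above can be written as a regular expression and used to scan a protein sequence. The function name `find_n_glyc_sites` and the example sequence are hypothetical, chosen only for illustration. A lookahead is used so that overlapping occurrences are all reported.

```python
import re

# N{P}[ST]: Asn, then any residue except Pro, then Ser or Thr.
# Wrapping the pattern in a lookahead lets overlapping sites be found.
N_GLYC_MOTIF = re.compile(r"(?=N[^P][ST])")


def find_n_glyc_sites(sequence):
    """Return the 1-based position of the Asn in each N{P}[ST] match."""
    return [m.start() + 1 for m in N_GLYC_MOTIF.finditer(sequence.upper())]


# NGS matches at position 2; NPT is rejected because of the Pro;
# NNT (position 9) and NTS (position 10) overlap and both match.
print(find_n_glyc_sites("MNGSANPTNNTS"))  # [2, 9, 10]
```

A match only shows that the pattern of residues is present. As the definition notes, a motif indicates sequence similarity and does not by itself establish homology.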

EGAD: A database used to store gene, protein and TIGRFAMs/HMM information.

Noise Cutoff: The HMM score below which hits to the HMM are considered uninteresting.

Trusted Cutoff: The HMM score above which there should be no false positive hits.
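The two cutoffs divide the HMM scores for a model into three ranges. The sketch below shows one way a hit might be sorted into those ranges. The function name, the category labels, and the cutoff values in the usage lines are hypothetical. Treating a score exactly equal to a cutoff as meeting that cutoff is also an assumption, since the definitions above only say "above" and "below".

```python
def classify_hit(score, trusted_cutoff, noise_cutoff):
    """Place an HMM score relative to a model's trusted and noise cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        # At or above the trusted cutoff: no false positives are expected.
        return "trusted"
    if score >= noise_cutoff:
        # Between the cutoffs: the hit may be real but needs review.
        return "uncertain"
    # Below the noise cutoff: considered uninteresting.
    return "below_noise"


# Hypothetical cutoffs for an illustrative model.
print(classify_hit(250.0, trusted_cutoff=200.0, noise_cutoff=100.0))  # trusted
print(classify_hit(150.0, trusted_cutoff=200.0, noise_cutoff=100.0))  # uncertain
print(classify_hit(40.0, trusted_cutoff=200.0, noise_cutoff=100.0))   # below_noise
```

Hits that fall between the two cutoffs are the ones where manual inspection or additional evidence is most useful.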

PFAM: A collection of HMMs for protein families that is complementary to TIGRFAMs. While TIGRFAMs models (particularly equivalogs) strive to provide functionally accurate names that can be applied to the genes they describe, PFAM models tend to use names that may represent only a subset (or even just one) of the functions of the genes they describe. PFAM models are constrained to be non-overlapping with one another and thus are more likely to describe domains than full-length proteins. Some PFAM models have the properties of equivalogs, and many of these have been classified as "PFAM equivalog".