
9.0 H-InvDB glossary

BodyMap-EST is a dataset from the BodyMap database. BodyMap is a human and mouse gene expression database that has been maintained since 1993. It is based on site-directed 3'-ESTs collected from non-biased cDNA libraries constructed at Osaka University and contains >270,000 sequences from 60 human and 38 mouse tissues. The site-directed nature of the sequence tags allows unequivocal grouping of tags representing the same transcript and provides abundant information for each transcript in different parts of the body.
CAGE tag
Cap Analysis Gene Expression (CAGE) tags are experimentally detected tag sequences that indicate the 5'-end sequences of transcripts. CAGE tags are available at http://genomenetwork.nig.ac.jp/public/contents/description.html.
DiseaseInfo Viewer
DiseaseInfo Viewer displays the disease-related information in H-InvDB. It provides two kinds of disease information for an H-Inv locus: known disease-related genes, and orphan disease-related regions that co-localize with the locus when 1,000 kb is added on each side.
E-value
Expectation value. This value indicates the statistical significance of the result of a sequence comparison (the significance of the alignment score). For details, please refer to "http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head2".
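For orientation, the standard Karlin-Altschul form of the expectation value used by BLAST-style searches can be written as below. The symbols are not defined in this glossary and are introduced here only for illustration: m and n are the lengths of the two sequences being compared (or of the query and the database), S is the alignment score, and K and lambda are statistical parameters that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

A smaller E means that fewer alignments with a score of at least S are expected by chance, so the hit is more significant.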
Ensembl is the name of a joint project between the EMBL-EBI and the Wellcome Trust Sanger Institute. It has developed a system that maintains automatic annotation of large eukaryotic genomes. (For details, please refer to "http://www.ebi.ac.uk/ensembl/")
Evola is one of the largest ortholog databases of human genes in the world. This database provides orthologs between human and 13 vertebrates (chimpanzee, mouse, dog, chicken, fugu, etc.), featuring computational analysis and manual curation.
Functional ncRNAs
Non-coding RNAs (ncRNAs) play important roles in a variety of biological processes. There are two kinds of ncRNA: mRNA-like ncRNAs, and short ncRNAs such as miRNA, piRNA, snRNA, snoRNA and scaRNA. However, miRNAs are not targets in H-InvDB because they are shorter than 30 bp.
"G-integra" is a genome browser designed mainly to show the gene structures defined by "H-Inv transcripts". G-integra also contains many other genome annotation data, so several types of genome annotation information can be visualized at a specific genomic location.
GIIP project
This project aims to develop a database of the complete set of human genes with various annotations, and to make the database public. It was initiated by JBiC and JBIRC, AIST, and started on June 15, 2005, with an end date of March 31, 2008. It was publicly offered by the Ministry of Economy, Trade and Industry. This project succeeds the "H-Invitational project".
GeneChip array U95Av2
GeneChip is a DNA oligomer microarray and a trademark of Affymetrix. The probes are oligonucleotides of 25 nucleotides, and a group of probes, each specifically matching a different position in a sequence, makes up the probe set for a gene. There are several versions of GeneChip microarrays, and the U95 series is among the earliest.
H-ANGEL is the abbreviation for Human Anatomic Gene Expression Library. It is a gene expression profile database for normal adult human tissues, and the data have been collected from data sets taken on multiple platforms. The comparability of the expression patterns of a gene between platforms is achieved by links to H-inv genes and the anatomical categorization of the samples.
H-Inv cDNA
"H-Inv cDNA" is the term for the cDNA clones collected in the H-Invitational project, or for their cDNA sequences. In total, 56,419 cDNAs were provided by six institutes and DDBJ by September 1, 2003.
H-Invitational project
A project aiming to construct high-quality databases of human cDNAs with manual annotation, initiated by JBIRC (AIST, JBiC) and DDBJ, and started in 2002. To discuss annotation strategy and conduct the actual annotation, about 120 people from forty-four organizations in twelve countries gathered at the "Human Full-Length cDNA Annotation Invitational" (H-Invitational meeting), held from August 25 to September 3, 2002. Later, to handle the increase in cDNA data and to refine the annotation, the second meeting, "H-Invitational 2 Functional Annotation", was held on November 10-15, 2003. As a result, the database H-InvDB was opened on April 16, 2004, at http://www.h-invitational.jp/. This was followed by the subsequent project, the "Genome information integration project".
HIP (H-Invitational protein) ID: Prefix HIP plus 9 digit number plus version_number; e.g. HIP000000001.1. We defined an HIP ID for each unique translation, which is a stable and unique identifier for each H-Invitational protein.
HIT (H-Invitational transcript) ID: Prefix HIT plus 9 digit number plus version_number; e.g. HIT000000001.1. We defined an HIT ID for each H-Inv cDNA, mRNA or RNA entry, which is a stable and unique identifier for each H-Invitational transcript. In order to identify modifications in the sequence or annotation of an H-Inv transcript entry, an HIT version is assigned to each HIT ID and is always stated with the HIT ID. A transcript that is located at multiple positions on the genome is assigned HIT IDs with an additional multi-location_number plus version_number; e.g. HIT000000001_01.1.
HIX (H-Invitational cluster) ID: Prefix HIX plus 7 digit number plus version_number; e.g. HIX0000001.1. We defined an HIX ID for each H-Inv cluster, which is a stable and unique identifier for each H-Inv cluster. A unique HIX ID is assigned to each H-Inv transcript entry identifying the location in the human genome or the unmapped cluster. In order to identify the modification in location in the human genome or annotation of the H-Inv cluster entry, an HIX version is assigned to each HIX ID and always stated with the HIX ID.
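The three identifier formats above can be checked mechanically. The following is a minimal sketch, assuming only the formats stated in this glossary (HIP and HIT with 9 digits, HIX with 7 digits, an optional multi-location number on HIT IDs, and a trailing version number). The names `HInvId` and `parse_hinv_id` are hypothetical and are not part of any H-InvDB tool.

```python
import re
from typing import NamedTuple, Optional


class HInvId(NamedTuple):
    prefix: str              # "HIP", "HIT" or "HIX"
    number: str              # zero-padded digits, kept as text
    location: Optional[str]  # multi-location number (HIT only), else None
    version: int             # version number after the dot


# Prefix, digits, optional "_NN" multi-location number, then ".version".
_ID_RE = re.compile(r"^(HIP|HIT|HIX)(\d+)(?:_(\d+))?\.(\d+)$")
# Digit counts as stated in the glossary entries.
_DIGITS = {"HIP": 9, "HIT": 9, "HIX": 7}


def parse_hinv_id(text: str) -> HInvId:
    """Split an H-InvDB style ID into its parts, or raise ValueError."""
    match = _ID_RE.match(text)
    if match is None:
        raise ValueError(f"not an H-InvDB style ID: {text!r}")
    prefix, number, location, version = match.groups()
    if len(number) != _DIGITS[prefix]:
        raise ValueError(f"{prefix} IDs need {_DIGITS[prefix]} digits: {text!r}")
    if location is not None and prefix != "HIT":
        raise ValueError(f"multi-location number only applies to HIT IDs: {text!r}")
    return HInvId(prefix, number, location, int(version))


if __name__ == "__main__":
    print(parse_hinv_id("HIT000000001_01.1"))
    print(parse_hinv_id("HIX0000001.1"))
```

Keeping the digits as text preserves the zero padding, so the original ID can be rebuilt from the parsed parts.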
Locus view
In Locus view, H-InvDB provides annotation items for each HIX (H-Invitational cluster), such as genome mapping, gene structure, alternative splicing variants, gene-expression profiles, disease-related information, etc.
Long oligomer array
This is a dataset taken with a custom-made microarray platform developed by a laboratory participating in a project of NEDO. The probes are oligonucleotides of between 50 and 60 nt in length.
Low Complexity
This means a stretch of sequence dominated by a few specific nucleotides or residues, such as the amino acid sequence of a proline-rich protein. This kind of sequence often causes problems in similarity searches by generating a huge number of biologically insignificant hits.
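One simple way to picture low complexity is the Shannon entropy of a sliding window: a window dominated by one or two residues has low entropy. The sketch below is only an illustration. The window size and threshold are arbitrary assumptions, and this is not the masking method used by H-InvDB or by any particular search program.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbols in seq."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 1.5):
    """Return start positions of windows whose entropy is below threshold.

    window and threshold are illustrative values, not published defaults.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            hits.append(start)
    return hits


if __name__ == "__main__":
    protein = "MKTAYIAKQRPPPPPPPPPPPPQISFVKSHFSRQ"
    print(low_complexity_windows(protein))
```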
MutationView is a graphical database of disease-causing mutations in humans, containing relevant information such as the organ and population in which a disease occurs, genome position, gene structure, frequency of literature, estimation of splicing alteration caused by mutations in splice sites, etc. Gene information has links to other DBs such as DNA sequence DBs, OMIM, GDB, and H-InvDB. Collection and human curation of data and construction of the database software were conducted by Keio University School of Medicine and Photon Medical Research Center, Hamamatsu University School of Medicine.
NJ method
Neighbor-joining method. One of the most useful methods for constructing phylogenetic trees. This method can efficiently construct reliable phylogenetic trees without requiring massive computation time (Saitou and Nei 1987).
NJML+ method
A hybrid of the NJ (neighbor-joining) and ML (maximum likelihood) methods. This method can efficiently construct reliable phylogenetic trees (Ota and Li 2000, 2001). The NJML+ method is an extended version of the original NJML method for handling amino acid sequences.
Navi is a page for searching cDNAs (or loci) in H-InvDB, intended to guide users, especially beginners. Users can search H-InvDB by selecting their purpose in using H-InvDB and the items and/or keywords of interest.
PPI view
The PPI view displays human protein-protein interaction (PPI) information. We collected PPI data from five major public PPI databases and integrated them with the H-InvDB proteins.
Pattern of alternative splicing isoforms
There are five alternative splicing (AS) patterns, namely 1) cassette (exon skipping), 2) internal acceptor (alternative 3'-end), 3) internal donor (alternative 5'-end), 4) mutually exclusive and 5) retained intron (intron retention).
Position of alternative splicing isoforms
There are three alternative splicing (AS) positions, namely 1) 5'-end, 2) internal and 3) 3'-end.
RefSeq
A comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. It is provided by NCBI. For details, please refer to "http://www.ncbi.nlm.nih.gov/RefSeq/".
Repeat Mask Viewer
Repeat Mask Viewer is a tool that provides information on repetitive sequences detected by RepeatMasker.
Transcript view
In Transcript view, H-InvDB provides annotation items for each HIT (H-Invitational transcript), such as gene function, predicted CDS, InterProScan, GO, subcellular localization, protein 3D structure (GTOP), evolutionary analysis, etc. The previous name for this view was cDNA view (mRNA view).
UCSC
The Genome Bioinformatics Group at the University of California Santa Cruz. This group provides the latest genome assemblies of various species. Their genome browser is one of the most famous and useful browsers worldwide (http://genome.ucsc.edu/).
UniProt
Universal Protein Resource database. This database provides the world's most comprehensive dataset of proteins of various species (http://www.uniprot.org/).
Alignment is the adjustment of an object in relation with other objects, or a static orientation of some object or set of objects in relation to others. In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
canonical splice site
This corresponds to the GU-AG splice site, which accounts for more than 99% of all splice sites.
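As a small illustration of this definition, an intron sequence can be tested for the canonical dinucleotides at its two ends. The sketch below assumes the input is the intron only, written 5' to 3' on the transcribed strand. GU-AG in RNA corresponds to GT-AG in genomic DNA, so U is treated as T. The function name is hypothetical.

```python
def is_canonical_intron(intron: str) -> bool:
    """True if the intron starts with GT (GU) and ends with AG."""
    seq = intron.upper().replace("U", "T")  # accept RNA or DNA spelling
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


if __name__ == "__main__":
    for intron in ("GTAAGTCCTTAG", "GUAAGUAG", "GCAAGTCCTTAG"):
        print(intron, is_canonical_intron(intron))
```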
cDNA microarray (CNRS)
A custom-made cDNA (I.M.A.G.E clones collected by CNRS) array. Probes on glass slides were prepared from total and poly(A)+ RNAs using a direct labeling protocol with a reference design experiment. Double color hybridization with dye swap was performed on the microarrays.
In H-InvDB, the term "cluster" is often used to represent a group of transcripts that belong to the same locus.
cluster member
"Cluster member" often represents a transcript that belongs to a locus.
In H-InvDB, "clustering" often represents positional grouping of genes that belong to the same locus.
dN/dS
Nonsynonymous substitution rate (dN) per synonymous substitution rate (dS). A widely used index of natural selection acting on protein-coding genes in evolutionary analyses. Also described as "Ka/Ks". Refer to positive selection and negative selection.
est2genome
Splice alignment tool implemented in the EMBOSS package. In H-InvDB, the tool is mainly used for mapping transcript sequences onto the genome.
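The ratio and its usual reading can be sketched as follows. This is only an illustration with hypothetical names. The cutoff at exactly 1 is a simplification, and real analyses normally apply a statistical test rather than comparing raw estimates.

```python
def classify_selection(dn: float, ds: float):
    """Return (dN/dS, label) for given substitution-rate estimates.

    The labels follow the rough rule of thumb: ratio > 1 suggests positive
    selection, ratio < 1 suggests negative (purifying) selection.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    ratio = dn / ds
    if ratio > 1:
        label = "positive"
    elif ratio < 1:
        label = "negative"
    else:
        label = "neutral"
    return ratio, label


if __name__ == "__main__":
    print(classify_selection(0.02, 0.40))  # strongly conserved gene
    print(classify_selection(0.30, 0.10))  # candidate for positive selection
```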
gene locus
The position of a gene on a chromosome.
homolog
A gene whose nucleotide sequence or amino acid sequence is similar to that of a specific gene. Usually, homologs are produced by gene duplication in the evolutionary lineage. They can be orthologs or paralogs of the specific gene.
modified Nei-Gojobori method
A modified version of the Nei-Gojobori method. One of the most useful methods for estimating synonymous and nonsynonymous substitution sites (Nei and Gojobori 1986, Zhang et al. 1998).
natural selection
A process by which mutations in nucleotide sequences are fixed or eliminated in a population. It can be either positive selection or negative (purifying) selection. In the process, mutations are classified as deleterious, neutral, or advantageous for the survival of individuals.
negative selection (purifying selection)
Natural selection that causes elimination of mutations and results in highly conserved sequence regions between species. This is because mutations in the regions are not tolerable for the survival of individuals. In protein coding sequences, the synonymous substitution rate tends to overwhelm the nonsynonymous substitution rate (dN/dS << 1).
non-protein-coding transcript
These are transcripts which are not translated to proteins. These include functional ncRNAs, as described above.
nonsynonymous substitution
A nucleotide substitution that changes an amino acid.
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Orthologs usually represent the most relevant relationship in comparing gene functions and structures between species. For example, human alpha-hemoglobin and mouse alpha-hemoglobin are orthologs. Not only one-to-one orthologs, but also many-to-many orthologs appear in the case of species-specific gene duplication or loss.
Paralogs are genes that evolved by duplication within a species. They usually have similar nucleotide or amino acid sequences, and constitute a duplicated gene family. For example, human alpha-hemoglobin and beta-hemoglobin are paralogs.
positive selection
Natural selection that causes accumulation of mutations and results in divergent sequence regions between species. This is because mutations in the regions are advantageous for the survival of individuals. In protein-coding sequences, the nonsynonymous substitution rate tends to be larger than the synonymous substitution rate (dN/dS > 1). Such genes could have acquired new functions or new structures.
predicted gene
Predicted genes in H-InvDB are protein-coding genes predicted solely from genomic sequences by gene prediction programs. These programs take into account the statistical features of coding-region boundaries (start codons, stop codons, and splice sites) and of coding/non-coding sequences. They predict the start and end bases of coding sequences (CDSs) in the genome, together with their reading frames, and combine CDSs into genes. The programs used are GENSCAN, FGENESH, and HMMgene. Their predictions are further combined by the JIGSAW program, which integrates multiple sources of information (predictions) to predict more accurate CDSs, typically reducing false positives.
protein-protein interaction
PPI is an abbreviation for protein-protein interaction, meaning an interaction among protein molecules. All proteins have their interacting partners, and protein interactions make it possible for them to function in vivo. Therefore, PPI information is essential for understanding the protein functions required for processes, networks and pathways in cells.
representative alternative splicing variant (RASV)
A representative Alternative Splicing Variant (RASV) is a selected AS variant within an AS variant group whose members have the same AS structure.
synonymous substitution
A nucleotide substitution that does not change an amino acid.
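The two definitions (synonymous and nonsynonymous substitution) can be illustrated by translating a codon before and after a single-base change. This is a minimal sketch assuming the standard genetic code and DNA codons. The function name is hypothetical, and a change to or from a stop codon is simply reported as nonsynonymous.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Classify a single-nucleotide change within one codon."""
    before, after = codon_before.upper(), codon_after.upper()
    if len(before) != 3 or len(after) != 3:
        raise ValueError("codons must be three nucleotides long")
    if sum(x != y for x, y in zip(before, after)) != 1:
        raise ValueError("codons must differ at exactly one position")
    same = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same else "nonsynonymous"


if __name__ == "__main__":
    print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
    print(classify_substitution("GAA", "GTA"))  # Glu -> Val
```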
syntenic region
A similar genomic region between species in which genes are located in the same order and orientation. This refers to a genomic segment that has remained conserved through large-scale genomic rearrangements.
transcribed pseudogene
Most pseudogenes are neither translated nor transcribed. However, some of them retain their transcriptional ability. Such pseudogenes are called "transcribed pseudogenes".
Revised: December 18, 2008