NCBI resources

NCBI (National Center for Biotechnology Information) is a resource for molecular biology information. NCBI creates and maintains public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information. The NCBI site is updated continually; changes include new databases and new tools for data mining.

NCBI offers several searchable literature, molecular, and genomic databases, along with many bioinformatics tools. An up-to-date list of databases and tools can be found on the NCBI Sitemap and in the Entrez Data Model.

Location: www.ncbi.nlm.nih.gov

Literature Databases:

  • Bookshelf: A collection of searchable biomedical books linked to PubMed.
  • PubMed: Allows searching by author name or journal title, and offers a new Preview/Index option. The PubMed database provides access to over 18 million MEDLINE citations dating back to the mid-1950s. Its History and Clipboard options can help you manage a search session. NCBI provides a simple PubMed tutorial.
  • PubMed Central: The U.S. National Library of Medicine's digital archive of life science journal literature. Access is completely free and unrestricted.
  • OMIM: Online Mendelian Inheritance in Man is a database of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and colleagues at Johns Hopkins and elsewhere. NCBI provides a short tutorial for searching the OMIM database.
  • OMIA: Online Mendelian Inheritance in Animals is a database of genes, inherited disorders and traits in animal species other than human and mouse.
  • Journals: Search the journals database for links to journals in the Entrez system, including the genetic database.

A selection of Molecular Databases:

  • Nucleotide Database: The nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite, international collaboration of sequence databases. Nucleotide allows the user to retrieve nucleotide sequences in both GenBank and FASTA formats.
  • Protein Database: The protein database contains sequences translated from the coding regions of DNA sequences in GenBank, EMBL and DDBJ, as well as protein sequences submitted to UniProt (a collaboration between PIR, EBI and SIB) and to the Protein Data Bank (PDB), which supplies sequences from solved structures.
  • Structure Database: The structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. NCBI provides a structure database tutorial (requires installation of Cn3D).
  • PopSet Database: A PopSet is a set of DNA sequences that have been collected to analyse the evolutionary relatedness of a population. The population could originate from different members of the same species, or from organisms from different species. The PopSet database contains aligned sequences submitted as a set resulting from a population genetic, phylogenetic, or mutation study describing evolutionary events and population variation. The PopSet database contains both nucleotide and protein sequence data.
  • Taxonomy Database: The taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence. You can search for nucleotide, protein, and structure data from specific taxonomic groupings, from the domain level (archaea, bacteria, eukaryota) down to the species level.
  • Gene Expression Database: A repository of gene expression and molecular abundance data, and a curated online resource for browsing, querying, and retrieving gene expression data.
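
The FASTA format mentioned under the Nucleotide Database is simple enough to read by hand: each record starts with a ">" header line, followed by one or more lines of sequence. The following is a minimal sketch in Python, not part of NCBI's own tooling. The function name parse_fasta and the two sample records are hypothetical and made up for illustration.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {identifier: sequence} dict.

    The identifier is the first whitespace-delimited token after '>'.
    Sequence lines belonging to one record are joined into one string.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            fields = line[1:].split(None, 1)
            current_id = fields[0] if fields else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Made-up example records (not real GenBank entries).
EXAMPLE = """>seq1 hypothetical example record
ATGGCGTACG
TTAGC
>seq2 another hypothetical record
GGGCCCAAATTT
"""

records = parse_fasta(EXAMPLE.splitlines())
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

For real work, an established parser such as the one in Biopython is likely a better choice, since it also handles the richer GenBank flat-file format.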

A selection of Genome Databases:

  • Genomes: The Genomes database provides views of a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps. It holds the whole genomes of over 4500 organisms. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses and organelles.
  • Clusters of Orthologous Groups (COG) Database: Phylogenetic classification of proteins encoded in completed genomes. COGs were identified by comparison of protein sequences from 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three different lineages. Assuming that orthologs have similar functions, the COG grouping allows transfer of functional information from one member to the entire COG.
  • HomoloGene: A database of calculated and curated orthologs. The calculated orthologs are the result of nucleotide sequence comparisons between organisms.
  • Entrez Gene: A single query interface to curated sequence information. Includes information on official nomenclature, aliases, sequence accessions, phenotypes, homology, map locations and related web sites.
  • TaxPlot: This feature enables the user to compare the similarity of a query genome to different species.
  • SKY/M-FISH & CGH Database: The Spectral Karyotyping (SKY), Multiplex Fluorescence In Situ Hybridization (M-FISH) and Comparative Genomic Hybridization (CGH) database provides a public platform for investigators to share and compare molecular cytogenetic data.

A selection of Tools:

  • Entrez: Entrez is a retrieval system designed for searching several linked databases, including Nucleotide, Protein, Genome, Structure and PopSet. Entrez categories can be searched by subject, author, or unique identifier (such as an accession number), and queries can use phrases, truncated terms, and combined sets. There is also a simple Entrez tutorial.
  • BLAST: BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. For a better understanding of BLAST you can refer to the BLAST Course which explains the basics of the BLAST algorithm, or to the NCBI BLAST tutorial.
  • Map Viewer: Integrated views of chromosome maps for 78 organisms. Useful for the identification and localization of genes.
  • ORF Finder: A graphical tool that finds all open reading frames (ORFs) in a sequence based on a set of criteria. It can be used with standard and alternative genetic codes.
  • VecScreen: A tool for identifying segments of a nucleic acid sequence that may be of vector, linker, or adapter origin prior to sequence analysis or submission. VecScreen was developed to combat the problem of vector contamination in public sequence databases.
  • Spidey: Aligns an mRNA sequence to a genomic sequence. Can determine the intron/exon structure, returning one or more models of genomic structure.
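
To make the idea behind ORF Finder concrete, the sketch below scans all six reading frames of a DNA sequence for open reading frames. It is an illustration under simplifying assumptions, not the algorithm NCBI uses: it handles only the standard genetic code's stop codons, requires an ATG start, and uses a single criterion (a minimum length in codons). The names find_orfs, reverse_complement, and min_codons are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, orf_sequence) for each ORF found.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon (inclusive). min_codons counts the start and stop codons.
    ORFs still open at the end of the sequence are not reported.
    """
    seq = seq.upper()
    orfs: List[Tuple[str, int, str]] = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through the frame one codon at a time.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None
    return orfs


# Made-up 13-base example with one short ORF on the forward strand.
print(find_orfs("CCATGAAATAGGG"))  # [('+', 2, 'ATGAAATAG')]
```

Supporting an alternative genetic code would mean swapping in that code's start and stop codons, which is why they are kept as module-level constants here.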

Other resources are available to assist with downloading large amounts of specific data in batches (to create so-called 'batch files') from sequence databases such as GenBank. These resources include scripting toolkits such as BioPerl and Biopython.
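
As a rough illustration of batching, the sketch below splits a list of identifiers into groups and builds one request URL per group for NCBI's E-utilities efetch service. Using E-utilities here is an assumption, since the text above names only BioPerl and Biopython (Biopython's Bio.Entrez module wraps the same service). The helper names, the batch size, and the accession-like identifiers are hypothetical placeholders, and the code only constructs URLs without sending any request.

```python
from typing import Iterator, List
from urllib.parse import urlencode

# Base URL of the E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_efetch_urls(
    ids: List[str],
    db: str = "nucleotide",
    rettype: str = "fasta",
    batch_size: int = 2,
) -> List[str]:
    """Build one efetch URL per batch of identifiers (no network access)."""
    urls = []
    for batch in chunked(ids, batch_size):
        params = {
            "db": db,
            "id": ",".join(batch),  # efetch accepts a comma-separated ID list
            "rettype": rettype,
            "retmode": "text",
        }
        # safe="," keeps the commas readable instead of encoding them as %2C.
        urls.append(f"{EFETCH_URL}?{urlencode(params, safe=',')}")
    return urls


# Placeholder identifiers, not real accession numbers.
urls = build_efetch_urls(["ACC0001", "ACC0002", "ACC0003"])
for url in urls:
    print(url)
```

Each URL could then be fetched in turn and the responses appended to a single file. Before sending many requests, check NCBI's current usage guidelines for request rates.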