Chapter 6: Navigating Biological Databases¶
6.1 The Data Deluge¶
Biology has transformed into a data-rich science. Experiments now generate terabytes of data daily. This data is stored in massive, centralized repositories. As a bioinformatician, knowing how to navigate these databases is as important as knowing how to code.
6.2 The Big Three¶
While there are thousands of specialized databases, three major organizations host the primary data for the world:
- NCBI (National Center for Biotechnology Information): Based in the USA. Home to GenBank (nucleotide sequences) and PubMed (scientific literature).
- EMBL-EBI (European Bioinformatics Institute): Based in the UK. Home to Ensembl (genome browser).
- DDBJ (DNA Data Bank of Japan): Based in Japan.
These three exchange data daily, so a sequence submitted to one will eventually appear in all three.
6.3 Key Databases¶
GenBank (Nucleotides)¶
An annotated collection of all publicly available DNA sequences.
UniProt (Proteins)¶
The gold standard for protein data. * Swiss-Prot: Manually annotated and reviewed (High quality). * TrEMBL: Automatically annotated (High quantity, lower confidence).
Ensembl¶
A genome browser focused on chordates (humans, mice, etc.). It is excellent for visualizing genes in the context of chromosomes and seeing variants (SNPs).
6.4 Accession Numbers: The ID Cards¶
Every record in a database has a unique identifier called an Accession Number. These are stable over time.
- NM_000546: NCBI RefSeq mRNA.
- NP_000537: NCBI RefSeq Protein.
- P53_HUMAN: UniProt ID.
6.5 Bioinformatics in Action: Fetching Data with Python¶
You don't need to visit the website to get data. You can use Biopython's Entrez module to fetch data programmatically.
Note: Always provide your email to NCBI so they can contact you if your script causes issues.
from Bio import Entrez
# 1. Tell NCBI who you are
Entrez.email = "your.email@example.com"
# 2. Fetch a record (e.g., Nucleotide database, ID = NM_000546 for TP53)
# rettype="gb" means GenBank format, retmode="text" means plain text
handle = Entrez.efetch(db="nucleotide", id="NM_000546", rettype="gb", retmode="text")
# 3. Read the data
record_data = handle.read()
handle.close()
print(record_data[:500]) # Print the first 500 characters
Output (Snippet):
LOCUS NM_000546 2591 bp mRNA linear PRI 02-APR-2023
DEFINITION Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA.
ACCESSION NM_000546
VERSION NM_000546.6
SOURCE Homo sapiens (human)
...
Summary¶
Databases are the libraries of bioinformatics. We use Accession Numbers to locate specific books (records) and tools like Entrez to check them out automatically.
6.6 Programmatic Access, APIs, and Standards¶
Beyond Entrez, programmatic access to data is critical for automation and reproducible research.
- ENA / EBI APIs: The European Nucleotide Archive provides REST endpoints for searching and fetching datasets (use
curlorrequests).
Example (fetch a metadata JSON):
curl -H "Accept: application/json" "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR1234567&result=read_run&fields=run_accession,fastq_ftp"
- GA4GH standards: Emerging standards such as
htsgetandrefgetenable secure, standardized access to sequencing reads and reference sequences across cloud providers. - Rate limiting & attribution: Always include contact information when scripting large downloads and respect provider rate limits. Use
MultiQCand logging to collect provenance information.
Programmatic access enables reproducible pipelines: combine APIs, snakemake, and containers to fetch data, process it, and archive outputs with clear provenance.