Glossary of Bioinformatics Terms

A

  • Accession Number: A unique identifier assigned to a specific record (sequence, structure, etc.) in a biological database.
  • Algorithm: A step-by-step set of instructions for a computer to solve a problem.
  • Alignment: The arrangement of two or more sequences to identify regions of similarity.
  • Alpha Diversity: A measure of species diversity within a single sample (e.g., richness).
  • AlphaFold: An AI system developed by DeepMind that predicts protein 3D structure from its amino acid sequence with high accuracy.
  • Amino Acid: The building block of proteins. There are 20 standard amino acids.
  • Annotation: The process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.
  • Assembly: The process of stitching together short DNA reads to reconstruct the original genome sequence.
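
As an illustration of the Alpha Diversity entry, the sketch below computes two common within-sample measures from a list of per-taxon counts: richness (the number of taxa observed) and the Shannon index. The function names and the example counts are hypothetical and chosen only for this sketch.

```python
import math

def richness(counts):
    """Number of taxa observed at least once in the sample."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical sample: four equally abundant taxa and one absent taxon.
sample = [10, 10, 10, 10, 0]
print(richness(sample))                 # 4
print(round(shannon_index(sample), 4))  # 1.3863, i.e. ln(4)
```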

B

  • Beta Diversity: A measure of the difference in species composition between two or more samples.
  • Bioinformatics: The application of computational tools and statistics to analyze and interpret biological data.
  • BLAST (Basic Local Alignment Search Tool): A widely used heuristic algorithm for finding regions of local similarity between a query nucleotide or protein sequence and the sequences in a database.
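
Beta diversity can be quantified in several ways. The sketch below uses Bray-Curtis dissimilarity, one common choice, on two count vectors that list the same taxa in the same order. The function name and example values are assumptions made for illustration.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i).

    0 means identical composition; 1 means no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("samples must list the same taxa in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total

print(bray_curtis([6, 4], [4, 6]))  # 0.2
print(bray_curtis([5, 0], [0, 5]))  # 1.0
```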

C

  • Central Dogma: The framework describing the flow of genetic information: DNA \(\rightarrow\) RNA \(\rightarrow\) Protein.
  • Chromosomes: Long DNA molecules with part or all of the genetic material of an organism.
  • Codon: A sequence of three nucleotides that corresponds to a specific amino acid or stop signal during protein synthesis.
  • Contig: A continuous sequence of DNA that has been assembled from overlapping reads.
  • Coverage (Depth): The average number of sequenced reads that overlap a given nucleotide position.
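
A rough way to estimate expected coverage before sequencing is to divide the total number of sequenced bases by the genome size. The sketch below assumes that every read has the same length and contributes all of its bases, which real data only approximates; the function name and the numbers are hypothetical.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Expected average depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size

# Hypothetical run: 1,000,000 reads of 150 bases over a 5 Mb genome.
print(mean_coverage(1_000_000, 150, 5_000_000))  # 30.0
```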

D

  • DEG (Differentially Expressed Gene): A gene that shows statistically significant differences in expression levels between two conditions.
  • DNA (Deoxyribonucleic Acid): The molecule that carries genetic instructions. Composed of Adenine (A), Thymine (T), Cytosine (C), and Guanine (G).
  • Docking: A computational method to predict the preferred orientation of one molecule to a second when bound to each other.

E

  • E-value (Expect Value): In BLAST, the number of hits one can "expect" to see by chance when searching a database of a particular size. Lower E-values indicate more significant matches.
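  • Exon: A segment of a eukaryotic gene that is retained in the mature RNA after splicing. Exons contain the protein-coding sequence and may also include untranslated regions.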
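
To make the E-value entry concrete, the sketch below uses the relationship between a normalized bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n is the database length. BLAST itself uses effective (edge-corrected) lengths, so this simplified version with raw lengths is an assumption and will not reproduce BLAST output exactly. The function name is hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)

# The same bit score is less significant in a larger database.
small_db = expect_value(40, 100, 1_000_000)
large_db = expect_value(40, 100, 2_000_000)
print(small_db < large_db)  # True
```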

F

  • FASTA: A text-based format for representing nucleotide sequences or peptide sequences, using single-letter codes.
  • FASTQ: A text-based format for storing both a biological sequence and its corresponding quality scores.
  • Frameshift Mutation: A genetic mutation caused by an insertion or deletion (indel) whose length is not a multiple of three nucleotides, which shifts the reading frame of all downstream codons.
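
The FASTA layout is simple enough to read with a few lines of code: each record starts with a ">" header line, followed by one or more lines of sequence. The parser below is a minimal sketch that keeps only the first word of each header as the record name. For real work an established library is usually a better choice; the function name here is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

example = ">seq1 example record\nACGT\nTTGA\n>seq2\nMKV\n"
print(parse_fasta(example))  # {'seq1': 'ACGTTTGA', 'seq2': 'MKV'}
```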

G

  • Gene: A distinct sequence of nucleotides on a chromosome whose order determines the order of monomers in a polypeptide or functional RNA molecule.
  • Genome: The complete set of genes or genetic material present in a cell or organism.
  • Global Alignment: An alignment of two sequences over their entire length (e.g., Needleman-Wunsch algorithm).
  • GWAS (Genome-Wide Association Study): An observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait.
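
The Global Alignment entry mentions the Needleman-Wunsch algorithm. The sketch below computes only the optimal global alignment score (not the alignment itself) with dynamic programming and a linear gap penalty. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not a recommended scheme.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches and one gap
```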

H

  • Homology: The existence of shared ancestry between a pair of structures or genes.
  • Hub: A node in a network that has a significantly higher number of links than other nodes.

I

  • Indel: A term for an Insertion or Deletion of bases in the genome.
  • Intron: A non-coding segment of a eukaryotic gene that is spliced out before translation.

L

  • Local Alignment: An alignment that identifies the most similar region within two sequences (e.g., Smith-Waterman algorithm).
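
For the Smith-Waterman algorithm named above, the key difference from global alignment is that cell scores are never allowed to drop below zero, so the best-scoring region can start and end anywhere. The sketch below returns only the best local score; the scoring values (match +2, mismatch -1, gap -1) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets a new local alignment start at any cell.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

# The shared region "ACG" contributes three matches.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))  # 6
```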

M

  • Mass Spectrometry (Mass Spec): An analytical technique that measures the mass-to-charge ratio of ions, used to identify proteins and metabolites.
  • Metabolomics: The scientific study of chemical processes involving metabolites, the small molecule substrates, intermediates, and products of cell metabolism.
  • Microbiomics: The study of microbial communities (microbiomes).
  • MSA (Multiple Sequence Alignment): An alignment of three or more biological sequences.

N

  • N50: A statistic used to evaluate the contiguity of a genome assembly. With contigs sorted from longest to shortest, it is the length of the shortest contig in the smallest set of the longest contigs that together contain at least 50% of the total assembly length.
  • NGS (Next-Generation Sequencing): High-throughput sequencing technologies that allow for sequencing of DNA and RNA much more quickly and cheaply than Sanger sequencing.
  • Node: A connection point in a network graph, representing a biological entity like a gene or protein.
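
The N50 definition translates directly into a short calculation: sort the contig lengths from longest to shortest and accumulate them until half of the total assembly length is reached. The function name and example lengths below are hypothetical.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Total is 290, so half is 145; 80 + 70 = 150 reaches it at the 70-base contig.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```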

O

  • ORF (Open Reading Frame): A continuous stretch of codons that has the potential to be translated into a protein, beginning with a start codon (usually ATG) and ending at a stop codon.
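
A simple ORF scan can be sketched as follows: walk each of the three forward reading frames codon by codon, remember the first ATG seen, and report an ORF when a stop codon is reached. This sketch assumes the standard stop codons, searches only the forward strand, and ignores ORFs that run off the end of the sequence. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq):
    """Return (start, end, sequence) for ORFs in the three forward frames.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs

print(find_orfs("CCATGAAATGACC"))  # [(2, 11, 'ATGAAATGA')]
```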

P

  • PDB (Protein Data Bank): A database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids.
  • Pharmacogenomics: The study of how genes affect a person's response to drugs.
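  • Phred Quality Score: A measure of base-calling confidence in automated DNA sequencing, defined as \(Q = -10 \log_{10} P\), where \(P\) is the estimated probability that the base call is wrong (e.g., Q30 corresponds to an error rate of 1 in 1,000).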
  • Phylogenetics: The study of the evolutionary history and relationships among individuals or groups of organisms.
  • Protein: A large biomolecule made up of one or more long chains of amino acid residues.
  • Proteomics: The large-scale study of proteins.
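
Phred quality scores are related to base-calling error probability by Q = -10 * log10(P). The sketch below converts in both directions and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding (older files may use a different offset). The function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]

print(decode_quality("!I5"))  # [0, 40, 20]
```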

R

  • Read: The sequence of bases inferred from a single DNA (or cDNA) fragment by a sequencing instrument.
  • RNA (Ribonucleic Acid): A nucleic acid polymer with roles in coding, decoding, regulation, and expression of genes. Uses Uracil (U) instead of Thymine (T).
  • RNA-Seq: A technique used to analyze the transcriptome (gene expression) using NGS.

S

  • Scaffold: A portion of the genome assembly consisting of ordered and oriented contigs joined across gaps of estimated length.
  • SNP (Single Nucleotide Polymorphism): A variation of a single nucleotide at a specific position in the genome that differs among individuals in a population.
  • Structural Variant (SV): A large-scale structural difference in genomic DNA, such as a deletion, duplication, insertion, inversion, or translocation.
  • Systems Biology: An approach to biology that focuses on complex interactions within biological systems (holism).

T

  • Transcriptome: The set of all RNA molecules in one cell or a population of cells.
  • Transcription: The process of copying a segment of DNA into RNA.
  • Translation: The process in which ribosomes, free in the cytoplasm or bound to the endoplasmic reticulum (ER), synthesize a protein by reading the codons of a messenger RNA (mRNA) produced by transcription.
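
Transcription and translation can be mimicked in a few lines for teaching purposes. The sketch below treats the input DNA as the coding strand (so transcription is just T to U), builds the standard genetic code from a compact 64-letter string ordered by the bases T, C, A, G, and translates until the first stop codon. It does not handle ambiguous bases such as N, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon (as DNA) -> one-letter amino acid, "*" for stop.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_dna):
    """Return the mRNA corresponding to a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA
```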

V

  • VCF (Variant Call Format): A tab-delimited text file format for storing sequence variants (such as SNPs, indels, and structural variants) relative to a reference genome.
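
Each VCF data line has eight fixed, tab-separated columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO. The sketch below parses one such line into a dictionary. It ignores header lines and per-sample genotype columns, and the function name and dictionary keys are hypothetical choices for this example.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                info_dict[key] = value
            else:
                # INFO entries without a value are boolean flags.
                info_dict[item] = True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }

record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB")
print(record["pos"], record["ref"], record["alt"])  # 14370 G ['A']
```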