Glossary of Bioinformatics Terms

A

Accession Number: A unique identifier assigned to a specific record (sequence, structure, etc.) in a biological database.
Algorithm: A step-by-step set of instructions for a computer to solve a problem.
Alignment: The arrangement of two or more sequences to identify regions of similarity.
Alpha Diversity: A measure of species diversity within a single sample (e.g., richness).
AlphaFold: An AI system developed by DeepMind that predicts protein 3D structure from its amino acid sequence with high accuracy.
Amino Acid: The building block of proteins. There are 20 standard amino acids.
Annotation: The process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.
Assembly: The process of stitching together short DNA reads to reconstruct the original genome sequence.
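To make the Alpha Diversity entry above more concrete, here is a minimal sketch of two common within-sample measures: richness (the number of observed species) and the Shannon index. The function names and the example counts are made up for illustration.

```python
import math

def richness(counts):
    """Number of species with at least one observation in the sample."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over the observed species."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical counts for five species in one sample.
sample = [10, 10, 10, 10, 0]
print(richness(sample))                  # 4
print(round(shannon_index(sample), 3))   # 1.386, i.e. ln(4)
```

Richness ignores abundance, while the Shannon index also rewards evenness, so four equally abundant species give the maximum value for a richness of four.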

Beta Diversity: A measure of the difference in species composition between two or more samples.
Bioinformatics: The application of computational tools and statistics to analyze and interpret biological data.
BLAST (Basic Local Alignment Search Tool): A widely used algorithm and program for finding regions of local similarity between a query sequence (nucleotide or protein) and the sequences in a database.
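One common way to quantify the Beta Diversity entry above is the Bray-Curtis dissimilarity. The sketch below is only one of several possible measures, and the function name and example counts are illustrative assumptions.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two count vectors of equal length.

    Returns 0.0 for identical samples and 1.0 for samples sharing no species.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples must list counts for the same species")
    total = sum(sample_a) + sum(sample_b)
    if total == 0:
        raise ValueError("at least one sample must contain observations")
    difference = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return difference / total

# Hypothetical counts for the same two species in two samples.
print(bray_curtis([6, 4], [2, 8]))  # 0.4
```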

Central Dogma: The framework describing the flow of genetic information: DNA \(\rightarrow\) RNA \(\rightarrow\) Protein.
Chromosome: A long DNA molecule that contains part or all of the genetic material of an organism.
Codon: A sequence of three nucleotides that corresponds to a specific amino acid or stop signal during protein synthesis.
Contig: A continuous sequence of DNA that has been assembled from overlapping reads.
Coverage (Depth): The average number of times a nucleotide is represented by a sequenced read.
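For the Coverage (Depth) entry above, a quick estimate of the expected average depth is the total number of sequenced bases divided by the genome size. The sketch assumes reads of uniform length; the function name and numbers are hypothetical.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Expected average depth: (number of reads * read length) / genome size."""
    return num_reads * read_length / genome_size

# One million 150 bp reads over a 5 Mb genome (made-up values).
print(mean_coverage(1_000_000, 150, 5_000_000))  # 30.0
```

Real per-position depth varies around this average, so the figure is best read as a planning estimate rather than a guarantee for every base.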

DEG (Differentially Expressed Gene): A gene that shows statistically significant differences in expression levels between two conditions.
DNA (Deoxyribonucleic Acid): The molecule that carries genetic instructions. Composed of Adenine (A), Thymine (T), Cytosine (C), and Guanine (G).
Docking: A computational method that predicts the preferred orientation of one molecule relative to a second (for example, a ligand relative to a protein) when the two bind to each other.

E-value (Expect Value): In BLAST, the number of hits one can "expect" to see by chance when searching a database of a particular size. Lower E-values indicate more significant matches.
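As a rough illustration of the E-value entry above, the expected number of chance hits can be approximated from a bit score as E = m × n × 2^(−S'), where m is the query length, n the database length, and S' the bit score. This is a simplified sketch: BLAST itself uses corrected "effective" lengths, so reported values will differ somewhat. The function name is hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Each additional bit of score halves the expected number of chance hits,
# while a larger database raises it proportionally.
print(expected_hits(40, 300, 1_000_000))
print(expected_hits(41, 300, 1_000_000))
```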
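Exon: A segment of a gene that is retained in the mature RNA after splicing. Exons contain the protein-coding sequence as well as the untranslated regions (UTRs).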

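FASTA: A text-based format for representing nucleotide or peptide sequences using single-letter codes. Each record starts with a header line beginning with ">", followed by one or more lines of sequence.
FASTQ: A text-based format for storing both a biological sequence and its corresponding per-base quality scores, which are encoded as ASCII characters.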
Frameshift Mutation: A genetic mutation caused by indels (insertions or deletions) of a number of nucleotides in a DNA sequence that is not divisible by three.
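The FASTA entry above can be illustrated with a small parser. This is a minimal sketch for well-formed input, not a replacement for an established library; the function name and the example records are made up.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}

example = """>seq1 hypothetical example
ACGTAC
GTTA
>seq2
MKV
"""
print(parse_fasta(example))
```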

Gene: A distinct sequence of nucleotides forming part of a chromosome, the order of which determines the order of monomers in a polypeptide or nucleic acid molecule.
Genome: The complete set of genes or genetic material present in a cell or organism.
Global Alignment: An alignment of two sequences over their entire length (e.g., Needleman-Wunsch algorithm).
GWAS (Genome-Wide Association Study): An observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait.
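The Global Alignment entry above mentions the Needleman-Wunsch algorithm. The sketch below computes only the optimal global alignment score with dynamic programming; the scoring values (match +1, mismatch −1, linear gap −1) are assumptions chosen for illustration, and no traceback is performed.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch_score("ACGT", "ACT"))  # 2: three matches and one gap
```

Because both sequences must be aligned end to end, length differences always cost gap penalties, which is the main contrast with local alignment.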

Homology: The existence of shared ancestry between a pair of structures or genes.
Hub: A node in a network that has a significantly higher number of links than other nodes.
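To illustrate the Hub entry above, the sketch below counts the links (degree) of each node in an undirected edge list and reports nodes at or above a chosen degree. The threshold, the function names, and the toy network are all assumptions; what counts as "significantly higher" depends on the network being studied.

```python
from collections import Counter

def node_degrees(edges):
    """Count how many links touch each node in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees

def find_hubs(edges, min_degree):
    """Return the nodes whose degree is at least min_degree, sorted by name."""
    return sorted(n for n, d in node_degrees(edges).items() if d >= min_degree)

# Toy interaction network with placeholder node names.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
print(find_hubs(edges, min_degree=3))  # ['A']
```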

Indel: A term for an Insertion or Deletion of bases in the genome.
Intron: A non-coding segment of a eukaryotic gene that is spliced out before translation.

Local Alignment: An alignment that identifies the most similar region within two sequences (e.g., Smith-Waterman algorithm).
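The Local Alignment entry above mentions the Smith-Waterman algorithm. The sketch below returns only the best local alignment score; cell values are floored at zero so an alignment can start anywhere. The scoring values (match +2, mismatch −1, linear gap −1) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment begin at this cell.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

# The shared region "ACGT" is found despite unrelated flanking sequence.
print(smith_waterman_score("CCCCACGT", "ACGTGGGG"))  # 8
```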

Mass Spectrometry (Mass Spec): An analytical technique that measures the mass-to-charge ratio of ions, used to identify proteins and metabolites.
Metabolomics: The scientific study of chemical processes involving metabolites, the small molecule substrates, intermediates, and products of cell metabolism.
Microbiomics: The study of microbial communities (microbiomes).
MSA (Multiple Sequence Alignment): An alignment of three or more biological sequences.

N50: A statistic used to evaluate the contiguity of a genome assembly. With contigs sorted from longest to shortest, it is the length of the shortest contig in the smallest set of contigs that together contain at least 50% of the total assembly length.
NGS (Next-Generation Sequencing): High-throughput sequencing technologies that allow for sequencing of DNA and RNA much more quickly and cheaply than Sanger sequencing.
Node: A connection point in a network graph, representing a biological entity like a gene or protein.
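The N50 entry above can be computed directly from a list of contig lengths. The sketch below sorts the contigs from longest to shortest and returns the length at which the running total first reaches half of the assembly; the function name and example lengths are illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Total length is 290; 80 + 70 = 150 is the first running sum >= 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```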

ORF (Open Reading Frame): A continuous stretch of codons that has the potential to be translated into a protein. It begins with a start codon and ends with a stop codon.
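A simple ORF scan, as described in the entry above, can be sketched as follows. This version assumes ATG as the only start codon, the three standard stop codons, and the forward strand only; a fuller search would also scan the reverse complement. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) tuples for forward-strand ORFs.

    Positions are 0-based, end is exclusive, and the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs

print(find_orfs("CCATGAAATAGCC"))  # [(2, 11, 'ATGAAATAG')]
```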

PDB (Protein Data Bank): A database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids.
Pharmacogenomics: The study of how genes affect a person's response to drugs.
Phred Quality Score: A measure of the confidence of each base call generated by automated DNA sequencing. It is defined as \(Q = -10 \log_{10} P\), where \(P\) is the estimated probability that the base call is wrong.
Phylogenetics: The study of the evolutionary history and relationships among individuals or groups of organisms.
Protein: A large biomolecule composed of one or more long chains of amino acid residues.
Proteomics: The large-scale study of proteins.
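For the Phred Quality Score entry above, the sketch below converts between quality scores and error probabilities using the standard logarithmic relationship, and decodes a FASTQ quality string. The ASCII offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption about the input file; older files may use an offset of 64. Function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

print(phred_to_error_prob(20))   # Q20 corresponds to a 1 in 100 error rate
print(decode_quality("!I"))      # [0, 40]
```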

Read: The sequence of bases inferred from a single DNA (or cDNA) fragment by a sequencing instrument.
RNA (Ribonucleic Acid): A polymeric molecule with essential roles in the coding, decoding, regulation, and expression of genes. Uses Uracil (U) instead of Thymine (T).
RNA-Seq: A technique used to analyze the transcriptome (gene expression) using NGS.

Scaffold: A portion of a genome assembly consisting of ordered and oriented contigs separated by gaps of estimated length.
SNP (Single Nucleotide Polymorphism): A substitution of a single nucleotide at a specific position in the genome.
Structural Variant (SV): A large-scale difference in genomic DNA, typically 50 bp or longer (e.g., deletions, duplications, inversions, translocations).
Systems Biology: An approach to biology that focuses on complex interactions within biological systems (holism).

Transcriptome: The set of all RNA molecules in one cell or a population of cells.
Transcription: The process of copying a segment of DNA into RNA.
Translation: The process in which ribosomes, in the cytoplasm or on the endoplasmic reticulum (ER), synthesize a protein using a messenger RNA (mRNA) as the template. It follows the transcription of DNA into RNA.
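The Transcription and Translation entries above can be sketched in a few lines. The example assumes the input is the coding (sense) strand, so transcription reduces to replacing T with U, and it uses the standard genetic code. Translation here simply starts at the first base rather than searching for a start codon; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """Return the mRNA corresponding to a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA
```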

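VCF (Variant Call Format): A tab-delimited text file format used in bioinformatics for storing sequence variants (such as SNPs, indels, and structural variants), optionally with per-sample genotype information.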
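As an illustration of the VCF entry above, the sketch below splits one data line into the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It ignores header lines and any genotype columns, and the example line is made up; for real files an established parser is a safer choice.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue  # "." marks a missing INFO field
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # flags carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }

# Hypothetical record with two alternate alleles.
example = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=20;DB"
record = parse_vcf_line(example)
print(record["pos"], record["alt"], record["info"])
```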