Chapter 4: Introduction to the Command Line for Biologists¶
4.1 Breaking the GUI Habit¶
4.3 Practical CLI Best Practices¶
A bioinformatics environment is built on reliable command-line workflows. Key recommendations:
- Environments: Use conda/mamba to manage packages and isolate environments.
- Containers: Ship reproducible analysis with Docker or Singularity/Apptainer to freeze software stacks.
- Workflow managers: Combine CLI tools with Snakemake or Nextflow for reproducibility and scalability.
- Useful utilities: jq for JSON, csvkit for CSV, htop for processes, and tmux for session management.
Example: view a BAM header, then sort and index the file:
# View header
samtools view -H sample.bam
# Sort and index (recommended post-alignment)
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam
Use containers when distributing pipelines, and include the Dockerfile or Singularity/Apptainer definition file in the repository.
Up until now, you have likely interacted with computers using a Graphical User Interface (GUI)—clicking icons, dragging folders, and using menus. While intuitive, GUIs have limits. They are hard to automate, struggle with massive files (try opening a 50GB genome file in Excel!), and are often unavailable on the powerful remote servers where actual bioinformatics work happens.
Enter the Command Line Interface (CLI), also known as the terminal or shell. It might look intimidating—a black screen with blinking text—but it is the most powerful tool in your arsenal.
4.2 Navigation: Finding Your Way¶
When you open a terminal, you are "standing" in a specific folder on your computer.
Where am I? (pwd)¶
pwd stands for Print Working Directory.
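A minimal sketch of pwd in action. The scratch folder created with mktemp is only a stand-in for a real project folder, and the variable names are illustrative.

```shell
# Create a throwaway folder and step into it (stand-in for a project folder).
workdir="$(mktemp -d)"
cd "$workdir"

# Print the full path of the folder you are currently "standing" in.
pwd

# The same path can be captured in a variable for later use.
current_dir="$(pwd)"
echo "You are here: $current_dir"
```

The path printed on your machine will differ, because mktemp picks a new folder name each time.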
What is here? (ls)¶
ls lists the files and folders in your current directory.
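As a small illustration, the sketch below builds a scratch folder with a few empty files and lists it. The file names (genes.fasta, reads.fastq, notes.txt) are hypothetical placeholders.

```shell
# Build a small scratch folder with hypothetical files to look at.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
touch genes.fasta reads.fastq notes.txt
mkdir results

# Plain listing: names only.
ls

# Long listing (-l) with human-readable sizes (-h).
ls -lh

# Capture the listing so it can be inspected or counted.
listing="$(ls)"
entry_count="$(ls | wc -l | tr -d ' ')"
echo "This folder contains $entry_count entries"
```

The long listing adds permissions, owner, size, and modification time for each entry, which is often the quickest way to spot an unexpectedly empty or enormous file.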
Go somewhere else (cd)¶
cd stands for Change Directory.
cd .. moves you "up" one folder.
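The sketch below walks down into a nested folder and back up one level. The data/raw layout is an assumed example, not a required structure.

```shell
# Create a nested folder layout (hypothetical) inside a scratch folder.
project_dir="$(mktemp -d)"
mkdir -p "$project_dir/data/raw"

# Move down into the innermost folder and confirm the location.
cd "$project_dir/data/raw"
pwd

# Move "up" one folder: from data/raw to data.
cd ..
after_up="$(pwd)"
echo "Now in: $after_up"
```

Running pwd after each cd is a cheap habit that prevents running commands in the wrong folder.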
4.3 Handling Files: The Basics¶
Bioinformatics involves moving, renaming, and organizing thousands of files.
- mkdir analysis: Make a directory (folder) named "analysis".
- cp gene.txt gene_backup.txt: Copy a file.
- mv gene.txt analysis/: Move a file into a folder (also used to rename files).
- rm junk.txt: Remove (delete) a file. Warning: There is no Trash Can in the terminal. Deleted files are gone forever.
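The four commands above can be tried safely in a scratch folder. This is a sketch: the files are created on the spot with placeholder content, and gene_copy.txt is a made-up name used to show renaming.

```shell
# Work in a throwaway folder so nothing important can be deleted.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Create two small placeholder files to practice on.
echo "ATGCGT" > gene.txt
echo "scratch" > junk.txt

mkdir analysis                     # make a folder
cp gene.txt gene_backup.txt        # copy a file
mv gene.txt analysis/              # move a file into the folder
mv gene_backup.txt gene_copy.txt   # mv also renames (hypothetical new name)
rm junk.txt                        # delete a file; this cannot be undone

# Show the resulting layout, including the contents of subfolders.
ls -R
```

While learning, rm -i junk.txt is a gentler variant: it asks for confirmation before deleting each file.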
4.4 Inspecting Biological Data¶
Biological data files (like FASTA or FASTQ) are often massive text files. You don't want to open them in an ordinary text editor: many editors try to load the whole file into memory and can freeze or crash. Instead, we peek at them.
head and tail¶
head shows the first 10 lines of a file and tail shows the last 10. The -n option changes how many lines are shown.
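A quick sketch of both commands follows. The file numbers.txt is a generated stand-in (the numbers 1 to 100, one per line) so that it is obvious which lines are being shown.

```shell
# Generate a 100-line stand-in file: one number per line.
cd "$(mktemp -d)"
seq 1 100 > numbers.txt

# Default behavior: the first 10 and the last 10 lines.
head numbers.txt
tail numbers.txt

# -n sets the number of lines to show.
first_four="$(head -n 4 numbers.txt)"
last_line="$(tail -n 1 numbers.txt)"
echo "Last line: $last_line"
```

Because each FASTQ record spans four lines, head -n 4 on a FASTQ file shows exactly the first read.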
less¶
less lets you scroll through a huge file page by page without loading the whole thing into memory. Press q to exit.
wc¶
wc stands for Word Count. It counts the lines, words, and bytes in a file; the -l option prints only the line count.
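As a sketch, the example below counts the lines of a generated stand-in file. The name reads.txt and its contents (the numbers 1 to 50000) are assumptions made purely for illustration.

```shell
# Generate a stand-in file with 50,000 lines (hypothetical name).
cd "$(mktemp -d)"
seq 1 50000 > reads.txt

# Full report: lines, words, bytes, then the file name.
wc reads.txt

# Lines only.
wc -l reads.txt

# Reading from standard input gives the bare number, handy in scripts.
line_count="$(wc -l < reads.txt | tr -d ' ')"
echo "reads.txt has $line_count lines"
```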
(For example, if wc -l reports 50000, the file has 50,000 lines.)
4.5 The Power Tools: grep and Pipes¶
This is where the magic happens.
grep: Search¶
grep searches for a specific pattern in a file.
Imagine you want to find a specific gene ID in a massive annotation file.
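A minimal sketch of that search is shown below. The file annotation.tsv, its tab-separated layout, and the gene IDs are all invented for the example; a real annotation file (such as GFF or GTF) has more columns.

```shell
# Build a tiny, made-up annotation file (tab-separated).
cd "$(mktemp -d)"
printf 'chr1\tgene\t100\t900\tID=GENE0001\n'  > annotation.tsv
printf 'chr1\tgene\t1500\t2400\tID=GENE0002\n' >> annotation.tsv
printf 'chr1\texon\t1500\t1800\tID=GENE0002\n' >> annotation.tsv
printf 'chr2\tgene\t300\t1200\tID=GENE0003\n'  >> annotation.tsv

# Print every line that contains the gene ID.
grep "GENE0002" annotation.tsv

# -n prefixes each match with its line number in the file.
grep -n "GENE0002" annotation.tsv

# -c prints how many lines matched instead of the lines themselves.
match_count="$(grep -c "GENE0002" annotation.tsv)"
echo "Lines mentioning GENE0002: $match_count"
```

grep reads the file line by line, so the same commands work on a multi-gigabyte annotation file.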
The Pipe (|)¶
The pipe takes the output of one command and passes it as input to the next. It allows you to chain tools together.
Scenario: How many sequences are in my FASTA file?
In a FASTA file, every sequence header starts with a >. We can find all headers with grep, and then count them with wc.
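The scenario can be sketched as follows. The file sequences.fasta is a three-sequence toy file created for the example.

```shell
# Create a toy FASTA file with three sequences (made-up content).
cd "$(mktemp -d)"
printf '>seq1\nATGC\n>seq2\nGGCA\nTTAA\n>seq3\nCCGT\n' > sequences.fasta

# Find the header lines, then pipe them into wc to count them.
# The quotes matter: an unquoted > would be read by the shell as
# output redirection and could overwrite the file.
grep "^>" sequences.fasta | wc -l

# Capture the result for later use.
sequence_count="$(grep "^>" sequences.fasta | wc -l | tr -d ' ')"
echo "Number of sequences: $sequence_count"
```

The ^ anchors the match to the start of the line, so only true header lines are counted. grep -c "^>" sequences.fasta gives the same number without a pipe, but the piped form shows the general pattern of chaining one tool's output into the next.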