Essential Shell Scripting Tips and Tricks for Bioinformatics Workflows
Batch File Processing Example
#!/bin/bash
python preprocess_annotation.py -i wheat_annotation.gff3 -o wheat_annotation_filtered.gff4
base_fasta="wheat_transcripts.fasta"
annotation_gff="wheat_annotation_filtered.gff4"
motif_types=("G4" "C4" "A4" "T4")
for type in "${motif_types[@]}"
do
    python extract_motifs.py -f1 "$base_fasta" -f2 "motifs_${type}_raw.fasta"
    Rscript calculate_score.R -i "motifs_${type}_raw.fasta" -o "motifs_${type}_scored.fasta"
    python annotate_results.py -g "$annotation_gff" -f "motifs_${type}_scored.fasta" -o "motifs_${type}_annotated.fasta"
    # Split annotated records by region; grep reads the file directly, no pager needed
    grep -E 'Id|five' "motifs_${type}_annotated.fasta" > "five_${type}_total.txt"
    grep -E 'Id|CDS' "motifs_${type}_annotated.fasta" > "cds_${type}_total.txt"
    grep -E 'Id|three' "motifs_${type}_annotated.fasta" > "three_${type}_total.txt"
    cp "motifs_${type}_annotated.fasta" "mrna_${type}_total.txt"
    python filter_nonoverlap.py -f1 "five_${type}_total.txt" -f2 "five_${type}_filtered.txt"
    python filter_nonoverlap.py -f1 "cds_${type}_total.txt" -f2 "cds_${type}_filtered.txt"
    python filter_nonoverlap.py -f1 "three_${type}_total.txt" -f2 "three_${type}_filtered.txt"
    python filter_nonoverlap.py -f1 "mrna_${type}_total.txt" -f2 "mrna_${type}_filtered.txt"
done
# Replace each unfiltered file with its filtered version and remove intermediates
for type in "${motif_types[@]}"
do
    mv "five_${type}_filtered.txt" "five_${type}_total.txt"
    mv "cds_${type}_filtered.txt" "cds_${type}_total.txt"
    mv "three_${type}_filtered.txt" "three_${type}_total.txt"
    mv "mrna_${type}_filtered.txt" "mrna_${type}_total.txt"
    rm "motifs_${type}_raw.fasta"
    rm "motifs_${type}_scored.fasta"
    rm "motifs_${type}_annotated.fasta"
done
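The three region-specific grep calls in the script differ only in the keyword and the output prefix, so they can be driven by a small loop. The following is a minimal sketch, not part of the original pipeline: split_by_region is a hypothetical helper, and the demo records are made up purely to show the behaviour.

```shell
# Hypothetical helper: write one file per region for a given motif type.
split_by_region() {
    local annotated="$1" motif="$2"
    local pair region keyword
    # Each pair is output-prefix:keyword, mirroring the three grep calls above
    for pair in five:five cds:CDS three:three; do
        region="${pair%%:*}"
        keyword="${pair##*:}"
        # grep exits with 1 when nothing matches; "|| true" tolerates an empty result
        grep -E "Id|${keyword}" "$annotated" > "${region}_${motif}_total.txt" || true
    done
}

# Demonstration on made-up records inside a scratch directory
demo_dir="$(mktemp -d)"
(
    cd "$demo_dir" || exit 1
    printf 'Id\tregion\nm1\tfive\nm2\tCDS\nm3\tCDS\nm4\tthree\n' > motifs_G4_annotated.fasta
    split_by_region motifs_G4_annotated.fasta G4
    wc -l five_G4_total.txt cds_G4_total.txt three_G4_total.txt
)
```

Adding a region then means adding one prefix:keyword pair instead of another copy of the grep line.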
Process Management
Terminating Background Jobs
To terminate a running background script, first identify its process ID, then send a signal to that PID (2852911 below is an example value):
ps aux | grep batch_processing_pipeline.sh
kill -SIGINT 2852911
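SIGINT does not always take effect. For example, jobs started with & from a non-interactive shell ignore it. A minimal sketch of a fallback follows, assuming it is acceptable to escalate to SIGTERM after a short wait; stop_process is a hypothetical helper, and the demo targets a throwaway sleep job rather than a real pipeline.

```shell
# Hypothetical helper: try SIGINT first, then fall back to SIGTERM.
stop_process() {
    local pid="$1"
    # kill fails if the process is already gone; nothing more to do then
    kill -INT "$pid" 2>/dev/null || return 0
    sleep 1
    # kill -0 sends no signal; it only checks that the process still exists
    if kill -0 "$pid" 2>/dev/null; then
        kill -TERM "$pid"
    fi
}

# Demonstration on a throwaway background job
sleep 300 &
demo_pid=$!
stop_process "$demo_pid"
wait "$demo_pid" 2>/dev/null
echo "exit status of stopped job: $?"
```

Where pgrep is available, pgrep -f batch_processing_pipeline.sh lists matching PIDs without also matching the grep process itself, which ps aux | grep tends to do.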
Monitoring Thread Usage
Check the number of threads (lightweight processes, the nlwp column) used by each process with a given command name, here python:
ps -o nlwp,pid,cmd -C python
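To inspect a single known PID rather than every process with a given name, the same count can be read from /proc. This is a Linux-only sketch; thread_count is a hypothetical helper, and the demo simply queries the current shell.

```shell
# Hypothetical helper: print the thread count of one PID (Linux only, reads /proc)
thread_count() {
    # The "Threads:" line of /proc/<pid>/status holds the number of threads
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Demonstration: thread count of the current shell
thread_count "$$"
```

With procps installed, ps -o nlwp= -p followed by the PID reports the same number without a header line.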
File Manipulation
Removing Carriage Returns
Convert Windows-style line endings to Unix format:
sed -i 's/\r$//' analysis_script.R
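A quick way to confirm the conversion worked is to count the lines containing a carriage return before and after. The sketch below assumes GNU sed (for \r and for -i without a suffix) and uses a scratch file with made-up contents. Anchoring the pattern with $ limits the edit to carriage returns at the end of a line.

```shell
tmpfile="$(mktemp)"
# Two lines with Windows-style CRLF endings
printf 'x <- 1\r\ny <- 2\r\n' > "$tmpfile"

cr="$(printf '\r')"
# grep -c counts the lines that contain a carriage return
before="$(grep -c "$cr" "$tmpfile" || true)"
sed -i 's/\r$//' "$tmpfile"
after="$(grep -c "$cr" "$tmpfile" || true)"

echo "lines with CR before: $before, after: $after"
```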
In-place String Replacement
Replace every occurrence of a substring throughout a file, here deleting "Chr" by replacing it with nothing:
sed -i 's/Chr//g' coordinates.bed
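Because -i overwrites the file, it can help to preview the change and keep a backup. A minimal sketch follows, assuming GNU sed and a scratch BED file with made-up coordinates:

```shell
workdir="$(mktemp -d)"
bed="$workdir/coordinates.bed"
printf 'Chr1A\t100\t200\nChr2B\t300\t400\n' > "$bed"

# Preview: -n with the p flag prints only the lines the substitution would change
sed -n 's/Chr//gp' "$bed"

# Apply in place; -i.bak keeps the untouched original as coordinates.bed.bak
sed -i.bak 's/Chr//g' "$bed"
first_field="$(head -n 1 "$bed" | cut -f 1)"
echo "first chromosome label is now: $first_field"
```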
Directory Operations
Safe Directory Creation
Recreate a directory from scratch: remove it if it already exists (this deletes everything inside it), then create it:
DIRECTORY="analysis_output"
if [ -d "$DIRECTORY" ]; then
    rm -rf "$DIRECTORY"
    echo "Directory $DIRECTORY removed."
fi
mkdir "$DIRECTORY"
echo "Directory $DIRECTORY created."
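The same idea can be wrapped in a function that refuses to run when the directory name is empty, which guards against an unset variable reaching rm -rf. This is a sketch only; recreate_dir is a hypothetical helper, and the demo runs inside a scratch directory.

```shell
# Hypothetical helper: remove a directory if present, then create it empty.
recreate_dir() {
    local dir="$1"
    # ${dir:?} aborts with an error if dir is empty, so rm -rf never runs on ""
    rm -rf -- "${dir:?directory name must not be empty}"
    mkdir -p -- "$dir"
}

# Demonstration: a directory holding a stale file is replaced by an empty one
scratch="$(mktemp -d)"
mkdir "$scratch/analysis_output"
touch "$scratch/analysis_output/stale_result.txt"
recreate_dir "$scratch/analysis_output"
ls -A "$scratch/analysis_output"
```

Since rm -rf does not fail on a missing path, no separate existence check is needed here.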
Shell Navigation
Return to Previous Directory
Jump back to the last working directory (not the parent directory):
cd -
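The shell tracks the previous directory in the OLDPWD variable, and cd - switches to it. A small sketch using two scratch directories:

```shell
first_dir="$(mktemp -d)"
second_dir="$(mktemp -d)"

cd "$first_dir" || exit 1
cd "$second_dir" || exit 1

# cd - switches to $OLDPWD and prints the directory it lands in
cd - > /dev/null || exit 1

echo "now in: $PWD"
echo "another cd - would return to: $OLDPWD"
```

Repeating cd - toggles between the two most recent directories.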
Clear Command Line
Clear text after the cursor position:
Ctrl + K
Clear text before the cursor position:
Ctrl + U
Text Processing
Deduplication by Column
Remove lines whose value in a given column (here, column 2) has already appeared, keeping the first occurrence and the original line order, without sorting:
awk '!seen[$2]++' data.txt
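The expression works because seen[$2]++ returns the count before incrementing: 0 (false) the first time a value appears, so the negation is true and the line is printed. A sketch on made-up tab-separated data, with the field separator set explicitly:

```shell
# Three records; the first and third share the same value in column 2
deduped="$(printf 'gene1\tchr1A\ngene2\tchr1B\ngene3\tchr1A\n' | awk -F '\t' '!seen[$2]++')"

# Only the first record for each distinct column-2 value survives
printf '%s\n' "$deduped"
```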
Bioinformatics Utilities
Quick Sequence Length Retrieval
Fetch a specific transcript from a FASTA file and display its length:
seqkit grep -p "TraesCS5D03G0974400.1" transcripts.fa | seqkit fx2tab -l -n
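On a machine without seqkit, a rough equivalent can be sketched in awk. This is an assumption-laden stand-in rather than a replacement: seq_length is a hypothetical helper, it matches the ID as the first word of the header line, and the FASTA records below are made up.

```shell
# Hypothetical helper: print "<id><TAB><length>" for one record of a FASTA file.
seq_length() {
    # $1 = sequence ID (first word of the header, without ">"), $2 = FASTA path
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id); next }   # header: decide whether to count
        keep { len += length($0) }                     # sequence line of the wanted record
        END  { print id "\t" (len + 0) }
    ' "$2"
}

# Demonstration on a made-up two-record FASTA file
fasta="$(mktemp)"
printf '>tx1 hypothetical transcript\nACGTACGT\nACGT\n>tx2\nGGCC\n' > "$fasta"
seq_length tx1 "$fasta"
```

The helper sums the length over wrapped sequence lines, so multi-line records are handled.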
Vim Text Editing
Bulk Commenting
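Comment out all lines containing a specific pattern in every file of the argument list, saving each file that changed (the e flag suppresses the error in files with no match):
:argdo %s/^.*pattern.*$/# &/e | update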
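Outside Vim, the same edit can be sketched with sed by using the pattern as an address. This assumes GNU sed; the file contents and the pattern "debug" are made up for illustration.

```shell
script="$(mktemp)"
printf 'library(ggplot2)\nprint("debug")\nplot(x)\n' > "$script"

# On lines matching the pattern, prepend "# "; all other lines are left untouched
sed -i '/debug/s/^/# /' "$script"
cat "$script"
```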
Directory Statistics
Count Directories in Current Path
ls -l | grep '^d' | wc -l
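A find-based variant avoids parsing ls output. Unlike ls -l, it also counts hidden directories, which is a difference in behaviour worth knowing about. This is a sketch: count_dirs is a hypothetical helper, and the demo tree is made up.

```shell
# Hypothetical helper: count immediate subdirectories of a path (default: current directory)
count_dirs() {
    # -mindepth 1 skips the path itself; -maxdepth 1 stops find from descending
    find "${1:-.}" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Demonstration: two visible directories, one hidden directory, one regular file
demo="$(mktemp -d)"
mkdir "$demo/raw" "$demo/results" "$demo/.cache"
touch "$demo/notes.txt"
count_dirs "$demo"
```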