MOCAT2: Installation, Configuration, and Usage Guide for Metagenomic and Metatranscriptomic Analysis
Common Modules and Output Files in MOCAT2
1. mocat_preprocessing Module:
- Output Files:
  - clean_reads_1.fastq, clean_reads_2.fastq: Sequencing data after quality control and preprocessing.
  - summary_statistics.txt: Statistical information about the quality control steps, such as sequence counts and quality score statistics.
2. mocat_assembly Module:
- Output Files:
  - contigs.fasta: Assembled contig sequences.
  - assembly_stats.txt: Statistics on assembly quality and performance, including N50 and maximum/minimum contig lengths.
3. mocat_analysis Module:
- Output Files:
  - blast_results.txt: Results from BLAST annotation, showing sequence similarity to reference databases.
  - gene_catalog.fasta: Gene catalog sequences generated based on alignment results.
  - functional_annotation.txt: Functional annotation results, including gene or sequence functional descriptions and KEGG or COG annotations.
  - classification_results.txt: Classification results, displaying taxonomic information for sequences or genes, such as strain-, genus-, or phylum-level classifications.
4. mocat_metaquant Module (Optional, for quantitative analysis):
- Output Files:
  - gene_abundance_table.txt: Gene abundance table, showing the estimated abundance of each gene in each sample.
  - transcript_abundance_table.txt: Transcript abundance table, showing the estimated abundance of each transcript in each sample.
  - Other files may include sample abundance information.
Notes:
- The format and content of output files generated by each module may vary depending on applied parameters and experimental design.
- Information in the result files helps researchers understand data quality, sequence annotation, assembly quality, and functional annotation.
- Data in output files are typically presented in text or FASTA formats and can be viewed and further analyzed using text editors or specialized bioinformatics software.
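The N50 value mentioned for assembly_stats.txt can be cross-checked directly from a contig FASTA file. The following is a minimal sketch, assuming a standard (possibly multi-line) FASTA file; `fasta_n50` is a hypothetical helper name and is not part of MOCAT2.

```shell
# fasta_n50: hypothetical helper, not part of MOCAT2.
# Prints the N50 of a FASTA file: the length L such that contigs of
# length >= L together cover at least half of the total assembly size.
fasta_n50() {
  awk '
    /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous contig length
         { gsub(/[ \t\r]/, ""); len += length($0) }  # sequence line: accumulate length
    END  { if (len > 0) print len }                  # emit the last contig
  ' "$1" |
  sort -rn |
  awk '
    { lens[NR] = $1; total += $1 }
    END {
      if (NR == 0) exit 1                            # no sequences found
      half = total / 2
      for (i = 1; i <= NR; i++) {
        running += lens[i]
        if (running >= half) { print lens[i]; exit }
      }
    }
  '
}
```

With the file names used in this guide, it could be called as `fasta_n50 assembly_output/contigs.fasta`.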
MOCAT2 Usage Workflow
Data Preparation:
- Obtain metagenomic/metatranscriptomic sequencing data in FASTQ format.
- Prepare reference databases, such as genome databases or functional annotation databases.
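Before starting a run, it can help to confirm that paired FASTQ files are intact. The following is a minimal sketch, assuming uncompressed FASTQ files with four lines per record; `fastq_count` and `check_pair` are hypothetical helper names.

```shell
# fastq_count: print the number of reads in an uncompressed FASTQ file,
# or fail if the line count is not a multiple of 4 (likely truncated).
fastq_count() {
  local lines
  lines=$(wc -l < "$1") || return 1
  if (( lines % 4 != 0 )); then
    echo "error: $1 has $lines lines, not a multiple of 4" >&2
    return 1
  fi
  echo $(( lines / 4 ))
}

# check_pair: succeed only if both mate files contain the same number of reads.
check_pair() {
  local n1 n2
  n1=$(fastq_count "$1") || return 1
  n2=$(fastq_count "$2") || return 1
  if (( n1 != n2 )); then
    echo "error: read counts differ ($n1 vs $n2)" >&2
    return 1
  fi
  echo "ok: $n1 read pairs"
}
```

For example, `check_pair reads_1.fastq reads_2.fastq` reports the number of read pairs or exits with an error.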
Running MOCAT2:
Main modules and example commands for MOCAT2 are as follows:
mocat_preprocessing: Perform quality control and preprocessing.
mocat_preprocessing -t 4 -o output_directory --input-files reads_1.fastq,reads_2.fastq
mocat_assembly: Execute sequence assembly.
mocat_assembly -t 4 -o output_directory --input-files reads_1.fastq,reads_2.fastq
mocat_analysis: Conduct functional annotation and classification analysis.
mocat_analysis -t 4 -o output_directory --input-files contigs.fasta
Here, the -t option specifies the number of threads, -o specifies the output directory, and --input-files specifies the input files.
Result Interpretation and Analysis:
Output files generated by MOCAT2 include assembled sequences, annotation results, and classification information. These results can be further interpreted and analyzed using other tools or analysis pipelines.
Example Code
Below is a simple shell script demonstrating a basic analysis workflow using MOCAT2:
# Quality control and preprocessing
mocat_preprocessing -t 4 -o preprocessing_output --input-files reads_1.fastq,reads_2.fastq
# Sequence assembly
mocat_assembly -t 4 -o assembly_output --input-files preprocessing_output/clean_reads_1.fastq,preprocessing_output/clean_reads_2.fastq
# Functional annotation and classification analysis
mocat_analysis -t 4 -o analysis_output --input-files assembly_output/contigs.fasta
Notes:
- MOCAT2 offers a wide range of features and modules; specific usage methods and parameter settings should be adjusted based on data type and experimental design.
- The analysis process may require significant time and computational resources, especially for large-scale metagenomic/metatranscriptomic data.
- Depending on data type and analysis needs, further downstream analysis and interpretation may be necessary.
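The example script above runs each stage unconditionally. A slightly more defensive variant is sketched below: it stops at the first failing command and checks that each stage produced a non-empty output before the next stage starts. The `require_file` helper and the `RUN_PIPELINE` switch are illustrative assumptions; the commands and file names are the ones used in the example above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# require_file: fail with a message unless the file exists and is non-empty.
require_file() {
  if [[ ! -s "$1" ]]; then
    echo "error: expected output missing or empty: $1" >&2
    return 1
  fi
}

run_pipeline() {
  local threads=4

  # Quality control and preprocessing
  mocat_preprocessing -t "$threads" -o preprocessing_output \
    --input-files reads_1.fastq,reads_2.fastq
  require_file preprocessing_output/clean_reads_1.fastq
  require_file preprocessing_output/clean_reads_2.fastq

  # Sequence assembly
  mocat_assembly -t "$threads" -o assembly_output \
    --input-files preprocessing_output/clean_reads_1.fastq,preprocessing_output/clean_reads_2.fastq
  require_file assembly_output/contigs.fasta

  # Functional annotation and classification analysis
  mocat_analysis -t "$threads" -o analysis_output \
    --input-files assembly_output/contigs.fasta
}

# Run only when explicitly requested, so the helper can be reused on its own.
if [[ "${RUN_PIPELINE:-0}" == 1 ]]; then
  run_pipeline
fi
```

Saved as, say, `run_mocat.sh` (a hypothetical name), it would be started with `RUN_PIPELINE=1 bash run_mocat.sh`.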
Full Parameter Help Information for MOCAT.pl
MOCAT.pl --help
===============================================================================
MOCAT - Metagenomics Analysis Toolkit v2.1.3
by Jens Roat Kultima, Luis Pedro Coelho, Shinichi Sunagawa @ Bork Group, EMBL
===============================================================================
Full manual & FAQ: MOCAT.pl -man
How to cite MOCAT: MOCAT.pl -cite
Have you tried the wrapper runMOCAT.sh? Try it!
Usage: MOCAT.pl -sf|sample_file 'FILE' [Pipeline, Statistics, & Additional Options]
'FILE'
  Contains the list of folder names (sample names), one per line,
  in which the raw sample data is located
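Since the sample file is simply a list of folder names, one per line, it can be generated rather than typed by hand. The following is a minimal sketch; `make_sample_file` is a hypothetical helper, and the file-name patterns used to recognize raw read files (`*.fq`, `*.fastq`, optionally gzipped) are an assumption that may need adjusting to the local naming scheme.

```shell
# make_sample_file: hypothetical helper, not part of MOCAT.
# Writes the name of every sub-directory of PROJECT_DIR that contains at
# least one FASTQ-like file to OUT, one name per line.
# The *.fq / *.fastq (optionally .gz) patterns are an assumption.
make_sample_file() {
  local project_dir=$1 out=$2 dir
  : > "$out"                                   # start with an empty sample file
  for dir in "$project_dir"/*/; do
    [[ -d "$dir" ]] || continue                # glob matched nothing
    if [[ -n "$(find "$dir" -maxdepth 1 -type f \
          \( -name '*.fq' -o -name '*.fq.gz' \
             -o -name '*.fastq' -o -name '*.fastq.gz' \))" ]]; then
      basename "$dir" >> "$out"                # folder name only, no path
    fi
  done
}
```

Run from the project directory, `make_sample_file . my.samples` would produce a file that can be passed to `MOCAT.pl -sf my.samples`.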
Examples
Process, assemble, revise assembly, predict genes, cluster genes into gene catalog, annotate gene catalog, profile against gene catalog:
  MOCAT.pl -sf my.samples -rtf
  MOCAT.pl -sf my.samples -a
  MOCAT.pl -sf my.samples -gp assembly
  MOCAT.pl -sf my.samples -make_gene_catalog -assembly_type assembly
  MOCAT.pl -sf my.samples -annotate_gene_catalog
  MOCAT.pl -sf my.samples -s my.samples.padded -identity 95
  MOCAT.pl -sf my.samples -f my.samples.padded -identity 95
  MOCAT.pl -sf my.samples -p my.samples.padded -identity 95 -mode functional
Assemble and predict genes (no screen), then fetch marker genes:
  MOCAT.pl -sf my.samples -rtf
  MOCAT.pl -sf my.samples -a
  MOCAT.pl -sf my.samples -gp assembly
  MOCAT.pl -sf my.samples -fmg assembly
  MOCAT.pl -sf my.samples -ss
Assemble and predict genes (DB screen):
  MOCAT.pl -sf my.samples -rtf
  MOCAT.pl -sf my.samples -s hg19 -screened_files -identity 90
  MOCAT.pl -sf my.samples -a -r hg19
  MOCAT.pl -sf my.samples -gp assembly -r hg19
  MOCAT.pl -sf my.samples -ss
Assemble and predict genes (remove e.g. adapters and then DB screen):
  MOCAT.pl -sf my.samples -rtf
  MOCAT.pl -sf my.samples -sff adapters.fa -screened_files
  MOCAT.pl -sf my.samples -bwa hg19 -r adapters.fa -screened_files
  MOCAT.pl -sf my.samples -a -r screened.adapters.fa.on.hg19
  MOCAT.pl -sf my.samples -gp assembly -r screened.adapters.fa.on.hg19
  MOCAT.pl -sf my.samples -ss
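Each example above is a sequence of separate MOCAT.pl calls, where a later step only makes sense if the earlier one succeeded. The following sketch chains the "no screen" example, including the marker gene step, and stops at the first failure. The function name `run_no_screen` and the `MOCAT_BIN` override variable are assumptions introduced for illustration; the option strings are copied from the help text.

```shell
# run_no_screen: run the "no screen" example steps in order and stop at
# the first step that fails.
# MOCAT_BIN is an assumed override hook (useful for dry runs); it
# defaults to MOCAT.pl.
run_no_screen() {
  local sample_file=$1 step
  local mocat=${MOCAT_BIN:-MOCAT.pl}
  local steps=("-rtf" "-a" "-gp assembly" "-fmg assembly" "-ss")
  for step in "${steps[@]}"; do
    echo "running: $mocat -sf $sample_file $step" >&2
    # $step is intentionally unquoted so "-gp assembly" becomes two arguments.
    "$mocat" -sf "$sample_file" $step || {
      echo "error: step '$step' failed" >&2
      return 1
    }
  done
}
```

Calling `run_no_screen my.samples` would then run the five steps in order. Setting `MOCAT_BIN=echo` prints the argument lists instead of executing anything, which is a quick way to review the plan.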
Pipeline Options
-r|reads ['reads.processed', 'DATABASE' or 'FASTA FILE']
  Required for all pipeline options, except rtf|read_trim_filter
  Specify whether processing trimmed & filtered, or screened reads.
  A default value for this setting can also be specified in the config file
-e|extracted
  Optional for all pipeline options, except rtf|read_trim_filter, see full manual
-rtf|read_trim_filter
  Performs trimming and filtering of reads
-a|assembly
  Performs assembly of reads
-ar|assembly_revision
  Further improves assemblies
-gp|gene_prediction ['assembly', 'assembly.revised']
  Predicts protein coding genes on assemblies
-fmg|fetch_mg ['assembly', 'assembly.revised']
  Extracts marker genes among the predicted genes
-soap|bwa ['DB1 DB2 ...',s,c,f,r]
  Screen, extract and map reads against a reference database (hg19 is provided) or (s)caftigs,
  (c)ontigs, sca(f)folds from an assembly, or scaftigs from a (r)evised assembly.
  This mapping step uses SOAPaligner2 (soap) or BWA (bwa).
  Additional options:
    -screened_files : If set, screened read files are generated, these are reads not matching the DB
    -extracted_files : If set, extracted read files are generated, these are reads matching the DB
    -use_mem : If set, copies the DB into memory for faster loading
-sff|screen_fastafile 'FASTA FILE'
  Same as 's|screen' above, but uses USearch, rather than SOAPaligner2.
-fsoap ['DB1 DB2 ...',s,c,f,r]
  Filter screened reads, (s)caftigs, (c)ontigs, sca(f)folds or (r)evised assembly scaftigs
  at higher %ID and length cutoff. This step has to be run before calculating profiles if the option soap was used
  Additional options:
    -shm : If set, faster, but saves data for the filtering step in /dev/shm/<USER>
-psoap|pbwa ['DB1 DB2 ...',s,c,f,r] -m|mode [gene, NCBI, mOTU, functional] -o [OUTPUT FOLDER]
  Generate gene, mOTU, NCBI or functional profiles on filtered reads,
  (s)caftigs, (c)ontigs, sca(f)folds or (r)evised assembly scaftigs.
  If -mode is set to either NCBI or mOTU, it is expected that the
  reads have been correctly mapped to the corresponding databases.
  Specify 'psoap' if you used the command 'soap' previously, and 'pbwa' if you used 'bwa'.
  Additional options:
    -no_horizontal : Do not calculate horizontal gene & functional coverages
    -verbose : Prints extra information about status of profiling steps
    -shm : Faster, but saves 2-5 GB of data for the profiling step in /dev/shm/<USER>
    -uniq : Specify this flag if you find duplicated row names
            (e.g. if you have mapped to a DB where the same reference appears multiple times)
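Because -shm stores several gigabytes under /dev/shm/<USER>, it may be worth checking the available space before enabling it. The following is a minimal sketch using the POSIX output format of `df`; `shm_has_room` is a hypothetical helper, and the 5 GB default simply mirrors the upper bound quoted above for the profiling step.

```shell
# shm_has_room: succeed if the filesystem holding DIR (default /dev/shm)
# has at least MIN_GB gigabytes available (default 5).
shm_has_room() {
  local min_gb=${1:-5} dir=${2:-/dev/shm} avail_kb
  # df -Pk prints one header line, then: filesystem, size, used, available, ...
  avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }') || return 1
  [[ -n "$avail_kb" ]] || return 1            # directory missing or df failed
  (( avail_kb >= min_gb * 1024 * 1024 ))
}

# Only add the flag when there is room for it.
shm_flag=""
if shm_has_room 5; then
  shm_flag="-shm"
fi
```

The resulting `$shm_flag` can then be appended (unquoted, so that an empty value disappears) to a filtering or profiling command.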
Available modules
These are installed in the folder /nfs/data/Downloads/mocat2/stable/2.1.3/mod
Each module requires a NAME.sh and NAME.cfg file inside the NAME folder
-annotate_gene_catalog [leave empty to use the catalog generated from the sample file, or enter the full path to a catalog; use an amino acid sequence file]
  Required options:
    -blasttype [should normally be "blastp" for amino acid sequences, but can be set to "blastx"]
-make_gene_catalog [samples specified in the sample file will be used to generate the catalog]
  Required options:
    -assembly_type [assembly or assembly.revised]
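The layout rule stated above (a NAME.sh and a NAME.cfg file inside the NAME folder) is easy to verify before running a module. The following is a minimal sketch; `check_module` is a hypothetical helper, and it only checks that the two files exist, not that their contents are valid.

```shell
# check_module: succeed if MOD_DIR/NAME contains both NAME.sh and NAME.cfg;
# otherwise report every missing file and fail.
check_module() {
  local mod_dir=$1 name=$2 ext status=0
  for ext in sh cfg; do
    if [[ ! -f "$mod_dir/$name/$name.$ext" ]]; then
      echo "missing: $mod_dir/$name/$name.$ext" >&2
      status=1
    fi
  done
  return "$status"
}
```

For instance, `check_module /nfs/data/Downloads/mocat2/stable/2.1.3/mod make_gene_catalog` would check the module folder shown above, assuming the folder name matches the option name.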
Statistics Options
-sfq|stats_fastqc
  Produces statistics for each lane with raw reads using the FastQC toolkit
-ss|sample_status
  Prints a simple view of the processing status of each sample,
  and stores this in <sample_file>.status
Additional Options
-cfg|config [file]
  Specify a config file other than MOCAT.cfg
-x|no_execute
  Only create job scripts, but don't execute them
-nt|no_temp
  Overrides any temp folders specified in the config file
-cpus [integer]
  Not recommended, but specifies a fixed number of cores for each job,
  please read the full manual using MOCAT.pl -man
-host [hostname]
  Runs the jobs on a different host machine
-identity [integer]
  Overrides any percentage cutoff setting in the cfg file
-length [integer]
  Overrides any length cutoff setting in the cfg file