A tool for analyzing sequencing run outputs, primarily from adaptive sampling experiments on Oxford Nanopore Technologies (ONT) sequencers.
- Introduction
- Dependencies
- Installation
- Workflow Example
- Use-case Example
- Usage
- Quick Start
- Outputs
- Citation
- Legal
- Contact
Analyzing and interpreting sequencing data is a fundamental task in bioinformatics. With the advent of ONT adaptive sampling, specialized tools are needed to visualize and assess how effectively a sequencing run enriched or depleted its targets. To address this, we have developed a bioinformatics pipeline consisting of three modules: analyze, plot, and filter_ONT. The pipeline aims to give researchers a fast and intuitive workflow for processing and analyzing sequencing data, especially from ONT adaptive sampling runs, so that they can gain interpretable insights into their datasets with minimal upfront effort.
The analyze module is the core component of the pipeline. It takes an input FASTQ file, a reference FASTA file, and an optional sequencing summary file from ONT sequencers or base callers. Leveraging fastp, minimap2, pysam, and mash, the module filters the input FASTQ file, maps it to the reference FASTA file, and generates a sequence manifest txt file and a summary sequence manifest txt file. These files include key sequencing statistics such as read length, read quality (Q score), mapping efficiency, and coverage depth. For an in-depth explanation of all statistics provided, please refer to the Outputs section below.
The plot module complements the analyze module by using its output to render interactive plots. It takes as input a "test" and a "control" directory, representing two testing conditions, each containing the manifest and manifest summary txt files generated by the analyze module. From these files, the plot module generates visualizations that aid in interpreting the sequencing data. Please note: this module is designed for comparative analysis, where two testing conditions are present and can be compared.
The filter_ONT module is designed for filtering and subsetting raw ONT reads. It uses a sequencing summary file to let researchers filter reads precisely by customized criteria, including channel, sequencing decision, and other parameters.
Our pipeline supports researchers working with ONT sequencing data. Whether you are exploring metagenomic sample composition, investigating adaptive sampling for your project, or conducting a comparative analysis of different methods in your lab, the pipeline can streamline your analyses and provide insights into your genomic datasets through visual aids and easy-to-understand outputs.
- Python: >=3.7.12, <4
- fastp: >=0.22.0
- mash: >=2.3
- minimap2: >=2.26
- seqtk: >=1.4
- samtools: >=1.6
- pysam: >=0.16.0
- plotly: >=5.16.1
Install the latest released version from conda:
conda install -c bioconda sequenoscope
Coming soon
Install using pip:
pip install sequenoscope
Coming soon
If you wish to install sequenoscope from source, please first ensure these dependencies are installed and configured on your system:
python>=3.7.12,<4
fastp >=0.22.0
mash >=2.3
minimap2 >=2.26
seqtk >=1.4
samtools >=1.6
pysam >=0.16.0
plotly >=5.16.1
Install the latest commit from the master branch directly from GitHub:
pip install git+https://github.com/phac-nml/sequenoscope.git
In this section, we will walk through a simple workflow using mock data to demonstrate how to use each module of sequenoscope. The mock data directory contains the following files:
mock_data/
├── mock_adaptive_sampling.fastq
├── mock_control.fastq
├── mock_sequencing_summary.txt
├── mock.fastq
└── mock_reference.fasta
Our goal is to:
- Use the filter_ONT module to subset raw FASTQ reads into two sets representing different channel ranges.
- Run the analyze module on both sets (treated as control and adaptive sampling datasets).
- Use the plot module to visualize and compare the results.
This workflow is meant to provide a hands-on example that you can easily follow with your own data.
First, we will create a dataset that simulates an adaptive sampling scenario by filtering reads by channel. Let's start by extracting reads from channels 1 to 256 from our mock.fastq dataset using the filter_ONT module. This will give us a subset similar to mock_adaptive_sampling.fastq.
Command:
sequenoscope filter_ONT --input_fastq mock.fastq \
--input_summary mock_sequencing_summary.txt \
-o mock_filter_ONT \
-min_ch 1 \
-max_ch 256
What this does:
- Takes reads from mock.fastq that come from channels 1 to 256.
- Outputs a filtered subset in mock_filter_ONT/sample_filtered_fastq_subset.fastq, which should be identical to mock_adaptive_sampling.fastq.
If desired, you could similarly generate the control dataset by adjusting the channel range (e.g., -min_ch 257 -max_ch 512) to create a mock_control.fastq. However, since we already have mock_control.fastq available, we'll skip that step for now to keep things simple.
Output Directory Structure:
mock_filter_ONT/
├── filter.log
├── sample_filtered_fastq_subset.fastq
└── sample_read_id_list.csv
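To make the channel filter concrete, the sketch below selects read IDs whose channel falls within a range, using a tab-separated sequencing summary. It is an illustration of the idea only, not sequenoscope's implementation: the helper name `read_ids_in_channel_range` is hypothetical, the inclusive bounds are an assumption, and the example rows are made up. The `read_id` and `channel` column names are those used in ONT sequencing summary files.

```python
import csv
import io


def read_ids_in_channel_range(summary_handle, min_ch, max_ch):
    """Return read IDs whose channel lies in [min_ch, max_ch].

    Assumption: both bounds are inclusive, matching "channels 1 to 256".
    """
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return [
        row["read_id"]
        for row in reader
        if min_ch <= int(row["channel"]) <= max_ch
    ]


# Tiny in-memory stand-in for a sequencing summary (made-up values).
example_summary = io.StringIO(
    "read_id\tchannel\n"
    "read_a\t12\n"
    "read_b\t300\n"
    "read_c\t256\n"
)

selected_ids = read_ids_in_channel_range(example_summary, 1, 256)
print(selected_ids)
```

With a real run, the same function could be given an open handle to the sequencing summary file instead of the in-memory example.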
Next, we run the analyze module on both the control and adaptive sampling datasets. This step will generate various output files including manifest files, BAM alignments, and summary statistics.
Command for Control Dataset:
sequenoscope analyze --input_fastq mock_control.fastq \
--input_reference mock_reference.fasta \
-seq_sum mock_sequencing_summary.txt \
-o mock_control_results \
-seq_type SE \
-op control
Explanation:
- --input_fastq mock_control.fastq: The control dataset FASTQ file.
- --input_reference mock_reference.fasta: Reference genome or sequence.
- -seq_sum mock_sequencing_summary.txt: The sequencing summary file from ONT.
- -o mock_control_results: Output directory.
- -seq_type SE: Single-end sequencing.
- -op control: A prefix for output files.
Control Output Directory Structure:
mock_control_results/
├── analyze.log
├── control_fastp_output.fastp.fastq
├── control_fastp_output.html
├── control_fastp_output.json
├── control_manifest_summary.txt
├── control_manifest.txt
├── control_mapped_bam.bam
├── control_mapped_bam.bam.bai
├── control_mapped_fastq.fastq
├── control_mapped_sam.sam
├── control_mash_hash.msh
└── control_read_list.txt
Command for Adaptive Sampling Dataset:
sequenoscope analyze --input_fastq mock_adaptive_sampling.fastq \
--input_reference mock_reference.fasta \
-seq_sum mock_sequencing_summary.txt \
-o mock_adaptive_sampling_results \
-seq_type SE \
-op adaptive_sampling
Explanation:
- mock_adaptive_sampling.fastq represents the dataset filtered by filter_ONT (or provided).
- The rest of the parameters are analogous to the control dataset.
- -op adaptive_sampling tags output files with "adaptive_sampling" for clarity.
Adaptive Sampling Output Directory Structure:
mock_adaptive_sampling_results/
├── adaptive_sampling_fastp_output.fastp.fastq
├── adaptive_sampling_fastp_output.html
├── adaptive_sampling_fastp_output.json
├── adaptive_sampling_manifest_summary.txt
├── adaptive_sampling_manifest.txt
├── adaptive_sampling_mapped_bam.bam
├── adaptive_sampling_mapped_bam.bam.bai
├── adaptive_sampling_mapped_fastq.fastq
├── adaptive_sampling_mapped_sam.sam
├── adaptive_sampling_mash_hash.msh
├── adaptive_sampling_read_list.txt
└── analyze.log
The plot module is used to compare control and adaptive sampling datasets. In this example, we use hours as the time bin due to truncated data in the mock dataset.
sequenoscope plot -T mock_adaptive_sampling_results/ \
-C mock_control_results/ \
-o mock_comparison_plots \
-op mock \
-AS \
-bin hours
- -T mock_adaptive_sampling_results/: Test (adaptive sampling) directory.
- -C mock_control_results/: Control directory.
- -o mock_comparison_plots: Output directory for plots.
- -op mock: Prefix for output files.
- -AS: Enable adaptive sampling decision charts.
- -bin hours: Use hourly bins for time-based decision charts.
After running the command, the output directory (mock_comparison_plots/) will contain standard plots plus a dedicated subdirectory for decision bar charts that reflects the chosen time bin unit.
Example structure:
mock_comparison_plots/
├── mock_source_file_taxon_covered_bar_chart.html
├── mock_summary_table.csv
├── mock_taxon_mean_read_length_comparison.html
├── mock_taxon_mean_coverage_comparison.html
├── mock_read_len_violin_comparison_plot.html
├── mock_read_qscore_violin_comparison_plot.html
├── plot.log
└── decision_bar_charts_hours/
    ├── mock_test_independent_decision_bar_chart.html
    ├── mock_control_independent_decision_bar_chart.html
    ├── mock_test_cumulative_decision_bar_chart.html
    └── mock_control_cumulative_decision_bar_chart.html
All decision bar charts (both independent and cumulative) are grouped into the decision_bar_charts_hours/ subdirectory, where the folder name reflects the selected time bin unit.
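As a rough illustration of how a time bin unit shapes the decision charts, the sketch below groups read start times (in seconds) into bins and derives per-bin and running totals. It rests on assumptions: that "independent" means a count per bin and "cumulative" means a running total, and that bins are aligned to time zero. The helper names are hypothetical and the start times are made up; sequenoscope's own binning may differ.

```python
from collections import Counter
from itertools import accumulate

# Width of each time bin in seconds, keyed by the plot module's -bin choices.
BIN_WIDTH_SECONDS = {
    "seconds": 1,
    "minutes": 60,
    "5m": 300,
    "15m": 900,
    "hours": 3600,
}


def independent_counts(start_times, bin_unit):
    """Count reads per time bin (hypothetical helper, not sequenoscope's code)."""
    if not start_times:
        return []
    width = BIN_WIDTH_SECONDS[bin_unit]
    per_bin = Counter(int(start // width) for start in start_times)
    # Fill empty bins with zero so the list index equals the bin number.
    return [per_bin.get(index, 0) for index in range(max(per_bin) + 1)]


def cumulative_counts(start_times, bin_unit):
    """Running total of the per-bin counts."""
    return list(accumulate(independent_counts(start_times, bin_unit)))


# Made-up read start times in seconds since the start of the run.
example_start_times = [10.0, 3500.0, 3700.0, 7300.0, 7400.0]
hourly = independent_counts(example_start_times, "hours")
running = cumulative_counts(example_start_times, "hours")
print(hourly, running)
```

A coarser unit such as hours yields fewer, fuller bins, which is why it suits short or truncated runs like the mock dataset.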
In this workflow example, we:
- Used filter_ONT to subset reads from a mock dataset by channel number.
- Applied analyze to both the control and adaptive sampling datasets, generating manifest files and alignment statistics.
- Visualized and compared the results using plot, focusing on adaptive sampling decisions and coverage metrics.
By following these steps, you can quickly get started with sequenoscope and adapt the workflow to suit your own data and research needs.
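For repeated runs, the workflow above can be scripted. The sketch below only assembles the same command lines shown in this section as argument lists and prints them; the three builder functions are hypothetical helpers, and only flags that appear in the commands above are used.

```python
import shlex


def filter_command(fastq, summary, out_dir, min_ch, max_ch):
    """Build the filter_ONT command used in the workflow example."""
    return [
        "sequenoscope", "filter_ONT",
        "--input_fastq", fastq,
        "--input_summary", summary,
        "-o", out_dir,
        "-min_ch", str(min_ch),
        "-max_ch", str(max_ch),
    ]


def analyze_command(fastq, reference, summary, out_dir, prefix):
    """Build the analyze command for a single-end (SE) dataset."""
    return [
        "sequenoscope", "analyze",
        "--input_fastq", fastq,
        "--input_reference", reference,
        "-seq_sum", summary,
        "-o", out_dir,
        "-seq_type", "SE",
        "-op", prefix,
    ]


def plot_command(test_dir, control_dir, out_dir, prefix, time_bin):
    """Build the plot command with adaptive sampling charts enabled."""
    return [
        "sequenoscope", "plot",
        "-T", test_dir,
        "-C", control_dir,
        "-o", out_dir,
        "-op", prefix,
        "-AS",
        "-bin", time_bin,
    ]


workflow = [
    filter_command("mock.fastq", "mock_sequencing_summary.txt",
                   "mock_filter_ONT", 1, 256),
    analyze_command("mock_control.fastq", "mock_reference.fasta",
                    "mock_sequencing_summary.txt",
                    "mock_control_results", "control"),
    analyze_command("mock_adaptive_sampling.fastq", "mock_reference.fasta",
                    "mock_sequencing_summary.txt",
                    "mock_adaptive_sampling_results", "adaptive_sampling"),
    plot_command("mock_adaptive_sampling_results/", "mock_control_results/",
                 "mock_comparison_plots", "mock", "hours"),
]

for command in workflow:
    print(" ".join(shlex.quote(part) for part in command))
```

To actually execute the steps, each list could be passed to `subprocess.run(command, check=True)` on a system where sequenoscope is installed.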
To demonstrate the practical application of our pipeline, consider a scenario where a researcher conducts adaptive sampling using an ONT sequencer. In this example, the researcher divides the sequencer channels into two sets: one half for adaptive sampling enrichment and the other half for regular sequencing as a control.
- Using the filter_ONT module, the researcher can create two distinct FASTQ files, one containing reads from channels 1-256 and one containing reads from channels 257-512, by setting the minimum and maximum channel for each subset.
- These files are then processed separately through the analyze module, generating two datasets: one for the test (adaptive sampling) and one for the control (regular sequencing).
- Finally, using the plot module, the researcher can visually assess the effectiveness of adaptive sampling in the experiment. This example shows how sequenoscope facilitates data processing and analysis, enhancing the researcher's ability to draw meaningful conclusions from their ONT sequencing data.
If you run sequenoscope, you should see the following usage statement:
Usage: sequenoscope <command> <required arguments>
To get full help for a command use one of:
sequenoscope <command> -h
sequenoscope <command> --help
Available commands:
analyze map reads to a target and produce a report with sequencing statistics
plot generate plots based on directories with seq manifest files
filter_ONT filter reads from a FASTQ file based on a sequencing summary file
If you run sequenoscope analyze -h or sequenoscope analyze --help, you should see the following options and usage guidelines:
usage: sequenoscope analyze --input_fastq <file.fq> --input_reference <ref.fasta> -o <out> -seq_type <sr>[options]
For help use: sequenoscope analyze -h or sequenoscope analyze --help
sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
Arguments:
-h, --help show this help message and exit
--input_fastq [ ...]
[REQUIRED] Path to ***EITHER 1 or 2*** fastq files to process.
--input_reference [REQUIRED] Path to a single reference FASTA file to process. the single FASTA file may contain several sequences.
-seq_sum , --sequencing_summary
Path to sequencing summary for manifest creation
-start , --start_time
Start time when no seq summary is provided
-end , --end_time End time when no seq summary is provided
-o , --output [REQUIRED] Output directory designation
-op , --output_prefix
Output file prefix designation. default is [sample]
-seq_type , --sequencing_type
[REQUIRED] A designation of the type of sequencing utilized for the input fastq files. SE = single-end reads and PE = paired-end reads.
-t , --threads A designation of the number of threads to use
-min_len , --minimum_read_length
A designation of the minimum read length. reads shorter than the integer specified required will be discarded, default is 15
-max_len , --maximum_read_length
A designation of the maximum read length. reads longer than the integer specified required will be discarded, default is 0 meaning no limitation
-trm_fr , --trim_front_bp
A designation of the how many bases to trim from the front of the sequence, default is 0.
-trm_tail , --trim_tail_bp
A designation of the how many bases to trim from the tail of the sequence, default is 0
-q , --quality_threshold
Quality score threshold for filtering reads. Reads with an average quality score below this threshold will be discarded. If not specified, no quality filtering will be performed.
-min_cov , --minimum_coverage
A designation of the minimum coverage for each taxon. Only bases equal to or higher then the designated value will be considered. default is 1
--minimap2_kmer A designation of the kmer size when running minimap2
--force Force overwite of existing results directory
If you run sequenoscope filter_ONT -h or sequenoscope filter_ONT --help, you should see the following options and usage guidelines:
usage: sequenoscope filter_ONT --input_fastq <file.fq> --input_summary <seq_summary.txt> -o <out.fastq> [options]
For help use: sequenoscope filter_ONT -h or sequenoscope filter_ONT --help
sequenoscope version 0.0.5: a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
Arguments:
-h, --help show this help message and exit
--input_fastq [ ...]
Path to adaptive sequencing fastq files to process. Not required when using --summarize.
--input_summary [REQUIRED] Path to ONT sequencing summary file.
-o , --output [REQUIRED] Output directory designation
-op , --output_prefix
Output file prefix designation. default is [sample]
-cls , --classification
a designation of the adaptive-sampling sequencing decision classification ['unblocked', 'stop_receiving', or 'no_decision']
-min_ch , --minimum_channel
a designation of the minimum channel/pore number for filtering reads
-max_ch , --maximum_channel
a designation of the maximum channel/pore number for filtering reads
-min_dur , --minimum_duration
a designation of the minimum duration of the sequencing run in SECONDS for filtering reads
-max_dur , --maximum_duration
a designation of the maximum duration of the sequencing run in SECONDS for filtering reads
-min_start , --minimum_start_time
a designation of the minimum start time of the sequencing run in SECONDS for filtering reads
-max_start , --maximum_start_time
a designation of the maximum start time of the sequencing run in SECONDS for filtering reads
-min_q , --minimum_q_score
a designation of the minimum q score for filtering reads
-max_q , --maximum_q_score
a designation of the maximum q score for filtering reads
-min_len , --minimum_length
a designation of the minimum read length for filtering reads
-max_len , --maximum_length
a designation of the maximum read length for filtering reads
--force Force overwite of existing results directory
--summarize Generate barcode statistics. Must specify an input summary and output directory
-v, --version show program's version number and exit
If you run sequenoscope plot -h or sequenoscope plot --help, you should see the following options and usage guidelines:
usage: sequenoscope plot --test_dir <test_dir_path> --control_dir <control_dir_path> --output_dir <out_path>
For help use: sequenoscope plot -h or sequenoscope plot --help
sequenoscope version : a flexible tool for processing multiplatform sequencing data: analyze, subset/filter, compare and visualize.
Optional Arguments:
-h, --help show this help message and exit
Required Paths:
-T TEST_DIR, --test_dir TEST_DIR Path to test directory.
-C CONTROL_DIR, --control_dir CONTROL_DIR Path to control directory.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR Output directory designation.
--force Force overwrite of existing results directory.
Plotting Options:
-op OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX Output prefix added before plot names. Default is 'sample'.
-AS, --adaptive_sampling Generate decision bar charts for adaptive sampling if utilized during sequencing.
-VP VIOLIN_DATA_PERCENT, --violin_data_percent VIOLIN_DATA_PERCENT Fraction of the data to use for the violin plot.
-bin {seconds,minutes,5m,15m,hours}, --time_bin_unit {seconds,minutes,5m,15m,hours} Time bin used for decision bar charts.
-legend, --taxon_chart_legend Include a legend in the source file taxon covered bar chart.
Note: The options --single_charts and --comparison_metric have been removed in this version. The module now automatically generates default comparison charts for both taxon mean read length and taxon mean coverage, and the summary table now includes only the columns: Parameter, Test_Value, Control_Value, and taxon_id.
Typically, ONT sequencing runs produce multiple FASTQ files for each barcode after base calling. Use the following steps to combine them.
To concatenate multiple FASTQ files into a single FASTQ file, you can use the following command:
cat file1.fastq file2.fastq > combined.fastq
To concatenate multiple gzip-compressed FASTQ files and uncompress them into a single FASTQ file, you can use the following commands:
concatenate (gzip files can be joined directly with cat):
cat file1.fastq.gz file2.fastq.gz > combined.fastq.gz
uncompress:
gzip -d combined.fastq.gz
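Where a shell is not available, the same concatenation can be done from Python. The following is a minimal sketch using only the standard library; the function name `concatenate_gz_fastq` is hypothetical, and the two records written to a temporary directory are made-up demonstration data.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_gz_fastq(gz_paths, combined_path):
    """Decompress each .fastq.gz in order and append it to one plain FASTQ file."""
    with open(combined_path, "wb") as combined:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as part:
                shutil.copyfileobj(part, combined)


# Demonstration with two tiny made-up FASTQ records in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
records = [b"@read_1\nACGT\n+\nIIII\n", b"@read_2\nTTGA\n+\nIIII\n"]
gz_files = []
for index, record in enumerate(records, start=1):
    gz_file = work_dir / f"file{index}.fastq.gz"
    with gzip.open(gz_file, "wb") as handle:
        handle.write(record)
    gz_files.append(gz_file)

combined_fastq = work_dir / "combined.fastq"
concatenate_gz_fastq(gz_files, combined_fastq)
combined_text = combined_fastq.read_text()
print(combined_text)
```

Streaming with `shutil.copyfileobj` avoids loading whole files into memory, which matters for large sequencing runs.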
Typically, paired-end read sets consist of a forward and a reverse FASTQ file, both compressed. If the files are compressed, you can uncompress them as follows:
gzip -d Illumina_file_R1.fastq.gz
and
gzip -d Illumina_file_R2.fastq.gz
You should end up with two FASTQ files such as Illumina_file_R1.fastq and Illumina_file_R2.fastq, which can then be run through the sequenoscope analyze module like this:
sequenoscope analyze --input_fastq Illumina_file_R1.fastq Illumina_file_R2.fastq --input_reference ref.fasta -o output -seq_type PE
The analyze module provides specific sequencing statistics based on the reference FASTA file provided. Refer to the outputs section below for more details.
To quickly get started with the analyze module:
- Ensure that you have the necessary input files and reference database prepared:
  - Input FASTQ files: Provide the path to the FASTQ files you want to process using the --input_fastq option.
  - Reference database: Specify the path to the reference database in FASTA format using the --input_reference option.
- Choose an output directory for the results:
  - Specify the output directory path using the --output option.
- Specify the sequencing type:
  - Set -seq_type to either PE (paired-end) or SE (single-end).
- Run the module with the minimally required options:
sequenoscope analyze --input_fastq <file.fq> --input_reference <ref.FASTA> -o <output_directory> -seq_type <sr>
This command will initiate the analysis module using the default settings. The input FASTQ file(s) will be processed, and the results will be saved in the specified output directory.
Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For additional customization options and more detailed information on available options, please run sequenoscope analyze -h or sequenoscope analyze --help.
Note: Remember to replace <file.fq> with the actual path to your FASTQ file, <ref.FASTA> with the path to your reference database, <output_directory> with the desired location for the output files, and <sr> with your sequencing type (SE for single-end and PE for paired-end).
Note: Taxon IDs are used as a naming convention, reflecting the sequence name in the FASTA file. The pipeline can process genes, subspecies, and other identifiers; it doesn't have to be a taxon.
To quickly get started with the filter_ONT module:
- Ensure that you have the necessary input files prepared:
  - Input FASTQ files: Provide the path to the adaptive sampling FASTQ files from the ONT sequencer that you want to process using the --input_fastq option.
  - ONT sequencing summary file: Specify the path to the ONT sequencing summary file, generated either by MinKNOW or by a base calling tool such as Guppy or Dorado, using the --input_summary option.
- Choose an output directory for the filtered reads:
  - Specify the output directory using the --output option. The filtered reads are written there as <prefix>_filtered_fastq_subset.fastq.
- Set the desired filtering criteria. You can optionally apply various filters to the reads based on the following criteria:
  - Read classification status*: Use the -cls or --classification option to designate the adaptive-sampling sequencing decision classification. Valid options are 'unblocked', 'stop_receiving', or 'no_decision'.
  - Channel range/Pore number: Set the minimum and maximum channel/pore number range for filtering using the -min_ch and -max_ch options.
  - Duration: Define the minimum and maximum duration of the read sequencing time in seconds using the -min_dur and -max_dur options.
  - Run time range: Specify the minimum and maximum start time within the sequencing run, in seconds, using the -min_start and -max_start options.
  - Q score: Set the minimum and maximum Q score for filtering using the -min_q and -max_q options.
  - Read length range: Set the minimum and maximum read length for filtering using the -min_len and -max_len options.
*Note: Some sequencing summary files lack the field specifying read classification status. A warning will be raised if this is the case.
- Run the command with the basic required options:
sequenoscope filter_ONT --input_fastq <file.fq> --input_summary <seq_summary.txt> -o <output_directory>
This command will initiate the filtering process based on the specified criteria and save the filtered reads to a FASTQ file in the output directory.
Please note that this is a simplified quick start guide, and additional options are available for advanced usage. For more detailed information on available options, you can run sequenoscope filter_ONT -h or sequenoscope filter_ONT --help.
Note: Remember to replace <file.fq> with the actual path to your ONT sequencing FASTQ file, <seq_summary.txt> with the path to your ONT sequencing summary file, and <output_directory> with the desired location for the filtered reads.
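The three classification values accepted by -cls correspond to end reasons recorded by the sequencer, as described for the decision column in the Outputs section. The sketch below encodes that correspondence as a lookup table and tallies a made-up list of end reasons. The helper name `decision_for` is hypothetical, and how sequenoscope itself performs this mapping is not shown here.

```python
from collections import Counter

# End reason -> decision label, following the description of the manifest's
# "decision" column in the Outputs section.
END_REASON_TO_DECISION = {
    "signal_positive": "stop_receiving",
    "data_service_unblock_mux_change": "unblocked",
    "signal_negative": "no_decision",
    "unblock_mux_change": "no_decision",
}


def decision_for(end_reason):
    """Return the decision label for an end reason, or None if it is not listed."""
    return END_REASON_TO_DECISION.get(end_reason)


# Made-up end reasons for five reads.
example_end_reasons = [
    "signal_positive",
    "data_service_unblock_mux_change",
    "signal_negative",
    "unblock_mux_change",
    "signal_positive",
]

decision_tally = Counter(decision_for(reason) for reason in example_end_reasons)
print(dict(decision_tally))
```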
The plot module is designed for comparative analysis of two conditions (test and control) using outputs from the analyze module.
- Generates an interactive Source File Taxon Covered Bar Chart displaying the percentage of taxon covered bases from source files.
- Creates a Summary Table (CSV) listing key parameters (Parameter, Test_Value, Control_Value, taxon_id) from the manifest summary files.
- Produces Violin Plots comparing read quality scores and read lengths between test and control datasets.
- Automatically generates two default comparison charts:
  - Taxon Mean Read Length Comparison: A bar chart comparing taxon_mean_read_length values.
  - Taxon Mean Coverage Comparison: A bar chart comparing taxon_mean_coverage values.
- If adaptive sampling is enabled (-AS flag), the module produces decision bar charts (independent and cumulative), saved in a subdirectory named decision_bar_charts_<time_bin_unit>, where <time_bin_unit> reflects the user-selected time bin (e.g., minutes, 5m, etc.).
- Test Directory (-T/--test_dir): Contains manifest and manifest summary files for the test condition.
- Control Directory (-C/--control_dir): Contains manifest and manifest summary files for the control condition.
- Output Directory (-o/--output_dir): Directory where plots and summary files will be saved.
- Output Prefix (-op/--output_prefix): Prefix added to output filenames (default: sample).
- Adaptive Sampling (-AS/--adaptive_sampling): Enable decision bar charts (default: False).
- Violin Data Fraction (-VP/--violin_data_percent): Fraction of data used for violin plots (default: 0.1).
- Time Bin Unit (-bin/--time_bin_unit): Time bin used for decision bar charts; choices: seconds, minutes, 5m, 15m, hours (default: minutes).
- Taxon Chart Legend (-legend/--taxon_chart_legend): Include legend in the source file taxon covered bar chart (default: False).
sequenoscope plot --test_dir <test_dir_path> --control_dir <control_dir_path> --output_dir <out_path>
Use --force to overwrite an existing output directory if needed.
Note: Replace <test_dir_path>, <control_dir_path>, and <out_path> with actual paths.
File | Description |
---|---|
<prefix>_fastp_output.fastp.fastq | The output FASTQ file after processing with fastp. It includes filtered and trimmed sequencing reads. |
<prefix>_fastp_output.html | An HTML report generated by fastp summarizing the filtering and quality control results. |
<prefix>_fastp_output.json | A JSON formatted report with detailed fastp quality control statistics. |
<prefix>_manifest.txt | A sequence manifest file containing various sequencing statistics post-analysis. |
<prefix>_manifest_summary.txt | A summary of the sequence manifest with key statistics for a quick overview. |
<prefix>_mapped_bam.bam | The BAM file output from minimap2, containing sequences aligned to the reference FASTA. |
<prefix>_mapped_bam.bam.bai | An index file for the BAM file to enable quick read access. |
<prefix>_mapped_fastq.fastq | The FASTQ file containing reads that have been mapped to the reference. |
<prefix>_mapped_sam.sam | The SAM file equivalent of the BAM file, containing human-readable alignment data. |
<prefix>_mash_hash.msh | A MASH sketch file used for rapid genome distance estimation. |
<prefix>_read_list.txt | A text file list of reads, potentially used for further downstream analysis. |
Note: Replace <prefix> with the user-specified prefix that precedes all output filenames. The file names above follow the output listing in the workflow example.
Column ID | Description |
---|---|
sample_id | Identifier for the sample to which the read belongs. |
read_id | Unique identifier for the sequencing read. |
read_len | Length of the sequencing read in base pairs. |
read_qscore | Quality score of the sequencing read. |
channel | The channel on the sequencing device from which the read was recorded. |
start_time | Time when the sequencing of the read started. |
end_time | Time when the sequencing of the read ended. |
decision | Indicates the final decision on the sequencing read. Decisions are categorized into three main types: stop_receiving (the sequencing is allowed to continue, represented by signal_positive), unblocked (the read is ejected from sequencing, indicated by data_service_unblock_mux_change), and no_decision (no definitive action was taken, denoted by either signal_negative or unblock_mux_change). Each term explains the action taken or not taken based on the read's signal detection and processing status. |
fastp_status | Indicates whether the read passed the filtering and trimming process by fastp. |
is_mapped | Indicates whether the read is mapped to any sequence in the provided multi-sequence FASTA reference file (TRUE if mapped, also see note 1 below). |
is_uniq | Indicates whether the read is unique within the sample manifest file (TRUE if unique, also see note 2 below). |
contig_id | Identifier for the contig to which the read is mapped, if applicable. |
Notes:
1. is_mapped refers to whether or not a read is mapped to any sequence in the multi-sequence FASTA reference file provided by the user. If true, the contig_id is provided.
2. is_uniq refers to whether or not a read is unique throughout the sample manifest file. In ONT sequencing, a read may be processed multiple times if the decision is labelled as signal_negative or no_decision before a final decision is made on whether to allow the read to continue sequencing or not.
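The manifest columns above lend themselves to quick ad hoc summaries. The sketch below computes the mapped fraction and mean read length from a manifest. It assumes the file is tab-delimited with a header row and that is_mapped holds TRUE or FALSE; neither detail is confirmed here, so check your own file first. The helper name `summarize_manifest` is hypothetical and the example rows are made up.

```python
import csv
import io


def summarize_manifest(handle):
    """Summarize a manifest: read count, mapped fraction, mean read length.

    Assumptions: tab-delimited text with a header row, and is_mapped
    stored as TRUE/FALSE.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    total_reads = 0
    mapped_reads = 0
    total_length = 0
    for row in reader:
        total_reads += 1
        total_length += int(row["read_len"])
        if row["is_mapped"].strip().upper() == "TRUE":
            mapped_reads += 1
    if total_reads == 0:
        return {"reads": 0, "mapped_fraction": 0.0, "mean_read_len": 0.0}
    return {
        "reads": total_reads,
        "mapped_fraction": mapped_reads / total_reads,
        "mean_read_len": total_length / total_reads,
    }


# Made-up manifest excerpt using a subset of the documented columns.
example_manifest = io.StringIO(
    "read_id\tread_len\tis_mapped\n"
    "read_a\t1000\tTRUE\n"
    "read_b\t2000\tFALSE\n"
    "read_c\t500\tTRUE\n"
    "read_d\t500\tTRUE\n"
)

manifest_summary = summarize_manifest(example_manifest)
print(manifest_summary)
```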
Column ID | Description |
---|---|
sample_id | Identifier for the sample. |
est_genome_size | Estimated size of the genome. |
est_coverage | Estimated coverage of the genome. |
total_bases | Total number of bases in the sample. |
total_fastp_bases | Total number of bases after processing with fastp. |
mean_read_length | Mean read length of the sequencing reads. |
taxon_id | Identifier for the taxon. Obtained from the user-provided FASTA file. |
taxon_length | Length of the taxon's genome. |
taxon_mean_coverage | Mean coverage across the taxon's genome. |
taxon_covered_bases_<prefix>X | Number of bases in the taxon's genome covered at the user-specified coverage threshold. |
taxon_%_covered_bases | Percentage of the taxon's genome that is covered by reads at the user-specified coverage threshold. |
total_taxon_mapped_bases | Total number of bases mapped to the taxon. |
taxon_mean_read_length | Mean read length of the reads mapped to the taxon. |
Note: Replace <prefix> with the user-specified threshold coverage.
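To clarify how the coverage columns relate to each other, the sketch below derives them from a list of per-base depths for one taxon. It treats a base as covered when its depth is at or above the threshold, in line with the -min_cov description. The function name `coverage_summary` is hypothetical, the depths are made up, and this is not sequenoscope's own calculation.

```python
def coverage_summary(per_base_depth, min_cov=1):
    """Derive coverage statistics for one taxon from per-base read depths.

    A base counts as covered when its depth is >= min_cov.
    """
    taxon_length = len(per_base_depth)
    if taxon_length == 0:
        return {
            "taxon_length": 0,
            "taxon_mean_coverage": 0.0,
            "covered_bases": 0,
            "percent_covered": 0.0,
        }
    covered_bases = sum(1 for depth in per_base_depth if depth >= min_cov)
    return {
        "taxon_length": taxon_length,
        "taxon_mean_coverage": sum(per_base_depth) / taxon_length,
        "covered_bases": covered_bases,
        "percent_covered": 100.0 * covered_bases / taxon_length,
    }


# Made-up depths for a five-base "genome".
example_depths = [0, 1, 3, 0, 2]
at_1x = coverage_summary(example_depths, min_cov=1)
at_2x = coverage_summary(example_depths, min_cov=2)
print(at_1x)
print(at_2x)
```

Raising the threshold can only lower the covered-base count, while the mean coverage is unaffected by it.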
File | Description |
---|---|
<prefix>_filtered_fastq_subset.fastq | The subset of FASTQ reads that have been filtered based on the user-defined criteria within the filter_ONT module. |
<prefix>_read_id_list.csv | A CSV file containing the list of read identifiers that correspond to the filtered subset. This may be used for further reference or analysis. |
Note: Replace <prefix> with the user-specified prefix that precedes all output filenames.
File | Description | Triggered by Command |
---|---|---|
<prefix>_source_file_taxon_covered_bar_chart.html | Interactive bar chart showing taxon covered percentages from source files. | Default behavior (optional legend via --taxon_chart_legend) |
<prefix>_summary_table.csv | CSV summary table listing: Parameter, Test_Value, Control_Value, taxon_id (derived from manifest summary files). | Default behavior |
<prefix>_taxon_mean_read_length_comparison.html | Interactive bar chart comparing taxon mean read length between test and control datasets. | Default behavior |
<prefix>_taxon_mean_coverage_comparison.html | Interactive bar chart comparing taxon mean coverage between test and control datasets. | Default behavior |
<prefix>_test_independent_decision_bar_chart.html, <prefix>_control_independent_decision_bar_chart.html | Interactive bar charts showing independent decision metrics over time, one each for test and control. | Adaptive sampling enabled (-AS); saved in decision_bar_charts_<time_bin_unit> |
<prefix>_test_cumulative_decision_bar_chart.html, <prefix>_control_cumulative_decision_bar_chart.html | Interactive bar charts showing cumulative decision metrics over time, one each for test and control. | Adaptive sampling enabled (-AS); saved in decision_bar_charts_<time_bin_unit> |
<prefix>_read_len_violin_comparison_plot.html | Violin plot comparing log-transformed read lengths between test and control datasets. | Default behavior |
<prefix>_read_qscore_violin_comparison_plot.html | Violin plot comparing read quality score distributions between test and control datasets. | Default behavior |
Note: Replace <prefix> with your output prefix (set via --output_prefix). The file names above follow the output listing in the workflow example.
For adaptive sampling plots, two files (for test and control) are generated for each decision chart type and saved in a subdirectory named decision_bar_charts_<time_bin_unit>, where <time_bin_unit> is your selected bin (e.g., minutes, 5m, etc.).
A manuscript is currently in preparation. This section will be updated with the publication reference once it is available.
Copyright Government of Canada 2023
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Abdallah Meknas: [email protected]