Jr dev #11

Merged
merged 55 commits into from
May 2, 2024
55 commits
a4cda91
Merge pull request #3 from phac-nml/dev
ChristyPeterson Mar 20, 2024
57dd015
Merge pull request #4 from phac-nml/dev
ChristyPeterson Mar 20, 2024
0bd4c64
Word smithing introduction
jrober84 Mar 21, 2024
5797705
Word smithing introduction
jrober84 Mar 21, 2024
7f01244
corrected behaviour of extract for overlaping loci
jrober84 Mar 26, 2024
cc557c9
added new report mode
jrober84 Mar 27, 2024
096cc72
Update version.py
mattheww95 Apr 4, 2024
47aae9a
corrected paralog handelling
jrober84 Apr 11, 2024
fab0272
paralog work
jrober84 Apr 11, 2024
4d08bed
removed testing code
jrober84 Apr 11, 2024
0977382
updated paralog handelling
jrober84 Apr 12, 2024
5b8b71b
added int stop filter
jrober84 Apr 15, 2024
2e2621a
added int stop filter
jrober84 Apr 15, 2024
8e2040d
updated report description
jrober84 Apr 17, 2024
3227c61
Merge branch 'tests' of https://github.com/phac-nml/locidex into tests
jrober84 Apr 21, 2024
c48fc95
updated report format to include db info and sequence data
jrober84 Apr 22, 2024
7f85c69
updated report to use the same template as search module
jrober84 Apr 22, 2024
2f346e7
updated report to use the same template as search module
jrober84 Apr 22, 2024
a9b0514
added in seq_data
jrober84 Apr 22, 2024
38875fb
merge updated template and connection to new report format
jrober84 Apr 22, 2024
42dd4b3
merge included mafft alignment and production of concatenated alignment
jrober84 Apr 22, 2024
0471762
added missing input file protection
jrober84 Apr 22, 2024
b4bf401
added db_version validation
jrober84 Apr 22, 2024
1dae7cf
updated search to use manifest, created manifest module
jrober84 Apr 22, 2024
5a68dd6
removed print
jrober84 Apr 22, 2024
12a3b01
removed blank space
jrober84 Apr 23, 2024
367fbdd
added manifest module
jrober84 Apr 23, 2024
814f1de
removed print
jrober84 Apr 23, 2024
4ebee55
changed date format
jrober84 Apr 23, 2024
693d7dd
updated db config to respect which db type is enabled
jrober84 Apr 23, 2024
2ea2ba7
corrected path issue with output seq_store
jrober84 Apr 23, 2024
0b1aeed
removed print
jrober84 Apr 23, 2024
5c5a371
updated test_db.py tests to pass
mattheww95 Apr 25, 2024
d7e76df
updated manifest tests
mattheww95 Apr 25, 2024
a40b913
updated db config to dataclass
mattheww95 Apr 29, 2024
b268bc4
Added refactor for database config class
mattheww95 Apr 29, 2024
156208f
refactored merge module
mattheww95 Apr 30, 2024
b1f6814
refactored manifest module
mattheww95 Apr 30, 2024
1bd9973
refactored manifest module and tests
mattheww95 Apr 30, 2024
bef005a
refactored manifest module and tests
mattheww95 Apr 30, 2024
b976d92
updated manifest to allow for multiple versions of different dbs
mattheww95 Apr 30, 2024
69131cc
updated extract, build, search classes
mattheww95 Apr 30, 2024
03cbdd6
updated tests
mattheww95 Apr 30, 2024
47837a7
fixed merge conflicts
mattheww95 Apr 30, 2024
593bd36
updated tests and and search workflow
mattheww95 Apr 30, 2024
51cfd6a
updated test data for workflows
mattheww95 Apr 30, 2024
376a0e0
updated multi db selection for data using a manifest
mattheww95 Apr 30, 2024
3b416ae
added complete workflow tests and updated CI Scripts
mattheww95 May 2, 2024
3f8ce62
updated CI
mattheww95 May 2, 2024
d6e80a0
updated CI
mattheww95 May 2, 2024
e254fa6
Added pytest-workflow to CI
mattheww95 May 2, 2024
3591976
Added pytest-workflow to CI
mattheww95 May 2, 2024
a3b5fd7
Added pytest-workflow to CI
mattheww95 May 2, 2024
672567a
altered CI workflow due to tmp dir issues
mattheww95 May 2, 2024
44bd63c
updated manifest test to remove dependency on order
mattheww95 May 2, 2024
8 changes: 5 additions & 3 deletions .github/workflows/locidex-ci-pytest-workflow.yaml
@@ -5,9 +5,9 @@ name: Python application

 on:
   push:
-    branches: [ "main", "tests" ]
+    branches: [ "main", "tests", "dev" ]
   pull_request:
-    branches: [ "main", "tests" ]
+    branches: [ "main", "tests", "dev" ]

 permissions:
   contents: read
@@ -29,6 +29,8 @@ jobs:
         python -m pip install --upgrade pip
         pip install flake8 pytest
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        pip install pytest-workflow==2.0.1
+        pip install -e .
     - name: Lint with flake8
       run: |
         # stop the build if there are Python syntax errors or undefined names
@@ -37,4 +39,4 @@ jobs:
         flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
     - name: Test with pytest
       run: |
-        pytest -o log_cli=true --basetemp=tmp-pytest
+        pytest -o log_cli=true --git-aware
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,2 +1,3 @@
 __pycache__
-*.egg*
+*.egg*
+.vscode
28 changes: 22 additions & 6 deletions README.md
@@ -34,15 +34,15 @@
 <small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>

 # Introduction
-A common function for many tools in bacterial typing is performing similarity searching using NCBI [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi). Blast provides a robust command line interface for constructing and using databases for similarity searching and is ubiquitous. There are many typing applications where custom code is written around the blast command line interface to perform searches for a variety of downstream applications. For instance, identification of specific target sequences within an assembly to perform gene-by-gene phylogenetic analysis (MLST, cgMLST, wgMLST), antimicrobial resistance gene detection, virulence gene detection, and in silico predictions of phenotypes such as serotype is a major application within public health. The typical approach is to bundle the search-based logic with additional specialized logic for performing the desired analysis.
+A common function for many tools in bacterial typing is performing similarity searching using NCBI [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi). Blast provides a robust command line interface for constructing and using databases for similarity searching and is ubiquitous. There are many typing applications where custom code is written around the blast command line interface to perform searches for various downstream applications. For instance, the identification of specific target sequences within an assembly to perform gene-by-gene phylogenetic analysis (MLST, cgMLST, wgMLST), antimicrobial resistance gene detection, virulence gene detection, and in silico predictions of phenotypes such as serotype is an important application within public health. The typical approach is bundling the search-based logic with additional specialized logic to perform the desired analysis.

-Decentralized allele calling has become a pressing concern by public health laboratories due to the increased use of whole genome sequencing (WGS) as part of outbreak detection and surveillance of a variety of pathogens. Gene-by-gene approaches have a variety of benefits for species typing which include a standardized set of loci for estimating genetic similarity between samples. This standardization allows for interoperability between different groups and also has the benefit of compression, simplifying genetic comparisons to use a simple hamming distance based on allele identifiers instead of a whole sequence. However, a limitation of this approach is the requirement of a centralized authority to issue unique allele identifiers and this poses multiple problems for operationalization such as privacy and connectivity. Despite this limitation PulseNet International has adopted gene-by-gene analysis as its preferred analytical approach for estimating genetic similarity between samples for routine operations with the limitation that comparing between jurisdictions requires the sharing of the primary sequence data rather than the allele identifiers.
+Decentralized allele calling has become a pressing concern for public health laboratories due to the increased use of whole genome sequencing (WGS) for outbreak detection and surveillance of various pathogens. Gene-by-gene approaches have a variety of benefits for species typing, including a standardized set of loci for estimating genetic similarity between samples. This standardization allows for interoperability between different groups. It also has the benefit of compression, simplifying genetic comparisons by using a simple Hamming distance based on allele identifiers instead of a whole sequence. However, a limitation of this approach is the requirement of a centralized authority to issue unique allele identifiers, which poses multiple operational problems, such as privacy and connectivity. Despite this limitation, PulseNet International has adopted gene-by-gene analysis as its preferred analytical approach for estimating genetic similarity between samples for routine operations, with the limitation that comparing between jurisdictions requires the sharing of the primary sequence data rather than the allele identifiers.

-In recent years, the concept of using cryptographic hashes of the allele sequence itself have gained traction in a variety of different allele calling software, such as [Chewbbaca](https://github.com/B-UMMI/chewBBACA), to provide decentralized allele identifiers. Hashing the sequence yields a determinist and fixed-size hash value which can be compared in the same manner as integers. There are numerous hash functions with different strengths and weaknesses but MD5 digests have broad adoption in the software community and are routinely used to provide some assurance that a transferred file has arrived intact. The choice of md5 hash provides 16^32, possible hashes. There is a theoretical chance of hash collisions, i.e., different sequences resulting in the same hash, but as the number of allele sequences for each gene in databases is relatively low, this should be an uncommon occurrence. Collisions in this case would result in profiles appearing more similar than they truly are at the sequence level. In addition, the chances of multiple occurrences of collisions within a profile would be infinitely small.
+In recent years, the concept of using cryptographic hashes of the allele sequence itself has gained traction in various allele-calling software, such as [Chewbbaca](https://github.com/B-UMMI/chewBBACA), to provide decentralized allele identifiers. Hashing the sequence yields a deterministic, fixed-size hash value, which can be compared in the same manner as integers. There are numerous hash functions with different strengths and weaknesses, but MD5 digests have been broadly adopted in the software community. They are routinely used to verify that a transferred file has arrived intact. The choice of MD5 provides 16^32 (2^128) possible hashes. There is a theoretical chance of hash collisions, i.e., different sequences resulting in the same hash. However, as the number of allele sequences for each gene in databases is relatively low, this should be uncommon. In this case, collisions would result in profiles appearing more similar than they are at the sequence level. In addition, the chance of multiple collisions occurring within a single profile would be vanishingly small.
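As a rough illustration of the hashing idea in the paragraph above, the following sketch derives an MD5-based allele identifier from a sequence and compares two allele profiles with a Hamming distance. It is not taken from locidex: the function names `allele_hash` and `hamming_distance`, the example loci, and the choice to upper-case and strip sequences before hashing are all assumptions made for the example.

```python
import hashlib


def allele_hash(seq: str) -> str:
    """Return the MD5 hex digest of an allele sequence.

    Assumption: sequences are stripped and upper-cased before hashing so that
    the same allele always yields the same identifier.
    """
    normalised = seq.strip().upper()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def hamming_distance(profile_a: dict, profile_b: dict) -> int:
    """Count the loci shared by both profiles whose allele identifiers differ."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


# Hypothetical two-locus profiles for two samples.
sample_1 = {"locusA": allele_hash("ATGAAATAA"), "locusB": allele_hash("ATGCCCTAA")}
sample_2 = {"locusA": allele_hash("atgaaataa"), "locusB": allele_hash("ATGCCGTAA")}

# locusA is identical after normalisation; locusB differs by one base.
print(hamming_distance(sample_1, sample_2))
```

Because each laboratory can compute the identifier locally from the sequence alone, no central authority is needed to assign allele numbers, which is the property the paragraph relies on.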

-The motivation for developing locidex is the need a common searching engine for various loci based typing applications such as: gene-by-gene (mlst, cgMLST, wgMLST, rmlst), in silico serotyping, gene-based phenotype predictions (amr, virulence, pathotype, toxin typing), marker-based typing (16S). The tool must provide custom criteria filtering by loci, and produce multiple formats for downstream applications. It must be compatible with an HSP environment and not encounter any locking issues where multiple processes may try to change the data at the same time. [THIS SECTION WILL NEED EDITING]The logic for allele calling is greatly simplified by leveraging existing annotations from tools such as [prodigal](https://github.com/hyattpd/Prodigal), [prokka](https://github.com/tseemann/prokka), [bakta](https://github.com/oschwengers/bakta) to delineate the boundaries of the sequences to be queried and hashed to produce allele identifiers. A common issue in matching applications is that ranges of identity and coverage for a match will vary by locus and so locidex builds into its database structure control over these attributes at a locus level allowing for high variability databases to be used without building custom logic downstream. This is particularly important when lengths of loci can exhibit considerable variability as is the case for genes of interest for typing applications. This provides greater flexibility for the designation of ideal thresholds for a given application. However, these values can be overridden using the report module filtering parameters as well as by modifying the values within the database. [END]
+The motivation for developing locidex is the need for a common search engine for various loci-based typing applications such as gene-by-gene (MLST, cgMLST, wgMLST, rMLST), in silico serotyping, gene-based phenotype predictions (AMR, virulence, pathotype, toxin typing), and marker-based typing (16S). The tool must provide custom criteria filtering by loci and produce multiple formats for downstream applications. It must be compatible with an HPC environment and not encounter locking issues where multiple processes may try to change the data simultaneously. It should give the user flexibility in input sequence data, including support for 1) existing sequence annotations, 2) de novo annotations based on contig input, and 3) extraction of sequence regions of interest. The logic for allele calling is greatly simplified by leveraging existing annotations from tools such as [prodigal](https://github.com/hyattpd/Prodigal), [prokka](https://github.com/tseemann/prokka), [bakta](https://github.com/oschwengers/bakta) to delineate the boundaries of the sequences to be queried and hashed to produce allele identifiers. However, some loci are not protein coding, have inconsistent annotations, or are not a complete ORF, so Locidex has built-in support for extracting regions of interest from a query genome. A common issue in matching applications is that ranges of identity and coverage for a match will vary by locus. So, locidex builds control over these attributes at a locus level into its database structure, allowing for high variability databases to be used without custom logic being built downstream. This is particularly important when lengths of loci can exhibit considerable variability, as is the case for genes of interest in typing applications. This provides greater flexibility for the designation of ideal thresholds for a given application.
+However, these values can be overridden using the report module filtering parameters and by modifying the values within the database.

-[Chewbbaca](https://github.com/B-UMMI/chewBBACA) is an excellent choice for an open source allele caller and provides many advanced features for developing, curating and using gene-by-gene schemes. It provides a great deat of additional information regarding partial gene sequences. For R&D applications, this functionality can be extremely useful. However, for some operational contexts, the design of [Chewbbaca](https://github.com/B-UMMI/chewBBACA) provides undesirable information and at present it has issues with multiple instances using the same database at once with novel allele detection enabled ([B-UMMI/chewBBACA#168](https://github.com/B-UMMI/chewBBACA/issues/168)). Locidex is meant to be optimized for routine operation level searching where it is useful to have default parameters that are set for the user to have reproducibility combined with flexibility to apply multiple filtering parameters on the sequence store after the fact. This allows exploring different thresholds without the need to recompute blast searches. In addition, there is often a desire to include additional information about a given locus such as different identifiers, functional properties, and phenotypic effects. The database format of locidex allows inclusion of any number of fields bundled into a search result object for users to describe their data conveniently during downstream analysis. This functionality allows for different use cases of data from a common data store. Locidex does not have the full features for a gene-by-gene software package like [Chewbbaca](https://github.com/B-UMMI/chewBBACA) but can be used to achieve similar results while being a more generic tool kit for blast searches, similar to [abricate](https://github.com/tseemann/abricate).
+[Chewbbaca](https://github.com/B-UMMI/chewBBACA) is an excellent choice for an open-source allele caller and provides many advanced features for developing, curating and using gene-by-gene schemes. It provides a great deal of additional information regarding partial gene sequences. For R&D applications, this functionality can be extremely useful. However, for some operational contexts, the design of [Chewbbaca](https://github.com/B-UMMI/chewBBACA) provides undesirable information, and at present, it has issues with multiple instances using the same database at once with novel allele detection enabled ([B-UMMI/chewBBACA#168](https://github.com/B-UMMI/chewBBACA/issues/168)). Locidex is meant to be optimized for routine operation-level searching. It is helpful to set default parameters for the user to have reproducibility and flexibility when applying multiple filtering parameters on the sequence store after the fact. This allows exploring different thresholds without the need to recompute blast searches. In addition, there is often a desire to include additional information about a given locus, such as different identifiers, functional properties, and phenotypic effects. The database format of locidex allows the inclusion of any number of fields bundled into a search result object for users to describe their data conveniently during downstream analysis. This functionality allows for different data use cases from a common data store. Locidex does not have the full features for a gene-by-gene software package like [Chewbbaca](https://github.com/B-UMMI/chewBBACA) but can be used to achieve similar results while being a more generic tool kit for blast searches, similar to [abricate](https://github.com/tseemann/abricate).

## Citation

@@ -187,7 +187,23 @@ Produce loci hash profiles in multiple formats (json, tsv)
 - Filter results based on user criteria
 - Multi-copy loci handling

-**Optional:** (Not required for MVP) Produce concatenated fasta sequences based on allele profiles
+QA Modes:
+
+Conservative:
+A locus is reported with an allele call only if all of the following are true. (Only works with protein coding schemes)
+1) Match identity >= threshold
+2) Match coverage >= threshold
+3) Valid start codon present
+4) Valid stop codon present
+5) No internal stop codons
+6) Only a single hit meets the criteria above
+
+Normal
+A locus is reported with an allele call only if all of the following are true.
+1) Match identity >= threshold
+2) Match coverage >= threshold
+3) Multiple matches to a single locus are hashed to produce an allele call which is the hash of the (n) match hashes found
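
The two QA modes listed above could be sketched roughly as follows. This is only an illustration of the stated rules, not the code in this PR: the `Hit` class, `call_allele`, the set of accepted start codons, the requirement that a coding sequence length be a multiple of three, and the choice to sort and concatenate the per-match hashes before re-hashing are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons


@dataclass
class Hit:
    """Hypothetical container for one match of a query against a locus."""
    seq: str
    identity: float
    coverage: float


def _md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_valid_cds(seq: str) -> bool:
    """Check start codon, stop codon, and absence of internal stop codons."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:  # assumption: whole codons only
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] in START_CODONS
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )


def call_allele(hits: List[Hit], min_identity: float, min_coverage: float,
                mode: str = "normal") -> Optional[str]:
    """Return an allele hash for one locus, or None if no call is made."""
    kept = [h for h in hits
            if h.identity >= min_identity and h.coverage >= min_coverage]
    if mode == "conservative":
        kept = [h for h in kept if is_valid_cds(h.seq)]
        # Conservative mode requires exactly one hit passing every check.
        return _md5(kept[0].seq.upper()) if len(kept) == 1 else None
    if not kept:
        return None
    hashes = sorted(_md5(h.seq.upper()) for h in kept)
    # Assumption: several matches are combined by hashing the sorted,
    # concatenated per-match hashes.
    return hashes[0] if len(hashes) == 1 else _md5("".join(hashes))
```

Sorting the per-match hashes before combining them makes the multi-copy call independent of the order in which matches were found; whether locidex does the same is not stated in this diff.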

#### Input

A Sequence store (`seq_store.json`) object produced by the 'search' function.