phac-nml · mattheww95 · May 2, 2024 · Mar 20, 2024 · Mar 20, 2024 · Mar 21, 2024
diff --git a/.github/workflows/locidex-ci-pytest-workflow.yaml b/.github/workflows/locidex-ci-pytest-workflow.yaml
@@ -5,9 +5,9 @@ name: Python application
 
 on:
   push:
-    branches: [ "main", "tests" ]
+    branches: [ "main", "tests", "dev" ]
   pull_request:
-    branches: [ "main", "tests" ]
+    branches: [ "main", "tests", "dev" ]
 
 permissions:
   contents: read
@@ -29,6 +29,8 @@ jobs:
         python -m pip install --upgrade pip
         pip install flake8 pytest
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        pip install pytest-workflow==2.0.1
+        pip install -e .
     - name: Lint with flake8
       run: |
         # stop the build if there are Python syntax errors or undefined names
@@ -37,4 +39,4 @@ jobs:
         flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
     - name: Test with pytest
       run: |
-        pytest -o log_cli=true --basetemp=tmp-pytest
+        pytest -o log_cli=true --git-aware
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,3 @@
 __pycache__
-*.egg*
+*.egg*
+.vscode
diff --git a/README.md b/README.md
@@ -34,15 +34,15 @@
 <small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>
 
 # Introduction
-A common function for many tools in bacterial typing is performing similarity searching using NCBI [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi). Blast provides a robust command line interface for constructing and using databases for similarity searching and is ubiquitous. There are many typing applications where custom code is written around the blast command line interface to perform searches for a variety of downstream applications. For instance, identification of specific target sequences within an assembly to perform gene-by-gene phylogenetic analysis (MLST, cgMLST, wgMLST), antimicrobial resistance gene detection, virulence gene detection, and in silico predictions of phenotypes such as serotype is a major application within public health. The typical approach is to bundle the search-based logic with additional specialized logic for performing the desired analysis.
+A common function for many tools in bacterial typing is performing similarity searching using NCBI [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi). Blast provides a robust command line interface for constructing and using databases for similarity searching and is ubiquitous. There are many typing applications where custom code is written around the blast command line interface to perform searches for various downstream applications. For instance, the identification of specific target sequences within an assembly to perform gene-by-gene phylogenetic analysis (MLST, cgMLST, wgMLST), antimicrobial resistance gene detection, virulence gene detection, and in silico predictions of phenotypes such as serotype is an important application within public health. The typical approach is bundling the search-based logic with additional specialized logic to perform the desired analysis.
 
-Decentralized allele calling has become a pressing concern by public health laboratories due to the increased use of whole genome sequencing (WGS) as part of outbreak detection and surveillance of a variety of pathogens. Gene-by-gene approaches have a variety of benefits for species typing which include a standardized set of loci for estimating genetic similarity between samples. This standardization allows for interoperability between different groups and also has the benefit of compression, simplifying genetic comparisons to use a simple hamming distance based on allele identifiers instead of a whole sequence. However, a limitation of this approach is the requirement of a centralized authority to issue unique allele identifiers and this poses multiple problems for operationalization such as privacy and connectivity. Despite this limitation PulseNet International has adopted gene-by-gene analysis as its preferred analytical approach for estimating genetic similarity between samples for routine operations with the limitation that comparing between jurisdictions requires the sharing of the primary sequence data rather than the allele identifiers.
+Decentralized allele calling has become a pressing concern for public health laboratories due to the increased use of whole genome sequencing (WGS) for outbreak detection and surveillance of various pathogens. Gene-by-gene approaches have a variety of benefits for species typing, including a standardized set of loci for estimating genetic similarity between samples. This standardization allows for interoperability between different groups. It also has the benefit of compression, simplifying genetic comparisons to a Hamming distance based on allele identifiers instead of whole sequences. However, a limitation of this approach is the requirement of a centralized authority to issue unique allele identifiers, which poses multiple operational problems, such as privacy and connectivity. Despite this limitation, PulseNet International has adopted gene-by-gene analysis as its preferred analytical approach for estimating genetic similarity between samples for routine operations, with the caveat that comparing between jurisdictions requires sharing the primary sequence data rather than the allele identifiers.
 
-In recent years, the concept of using cryptographic hashes of the allele sequence itself have gained traction in a variety of different allele calling software, such as [Chewbbaca](https://github.com/B-UMMI/chewBBACA), to provide decentralized allele identifiers. Hashing the sequence yields a determinist and fixed-size hash value which can be compared in the same manner as integers. There are numerous hash functions with different strengths and weaknesses but MD5 digests have broad adoption in the software community and are routinely used to provide some assurance that a transferred file has arrived intact. The choice of md5 hash provides 16^32, possible hashes. There is a theoretical chance of hash collisions, i.e., different sequences resulting in the same hash, but as the number of allele sequences for each gene in databases is relatively low, this should be an uncommon occurrence. Collisions in this case would result in profiles appearing more similar than they truly are at the sequence level. In addition, the chances of multiple occurrences of collisions within a profile would be infinitely small.
+In recent years, the concept of using cryptographic hashes of the allele sequence itself has gained traction in various allele-calling software, such as [Chewbbaca](https://github.com/B-UMMI/chewBBACA), to provide decentralized allele identifiers. Hashing the sequence yields a deterministic, fixed-size hash value, which can be compared in the same manner as integers. There are numerous hash functions with different strengths and weaknesses, but MD5 digests have been broadly adopted in the software community. They are routinely used to verify that a transferred file has arrived intact. The choice of MD5 provides 16^32 (2^128) possible hashes. There is a theoretical chance of hash collisions, i.e., different sequences resulting in the same hash. However, as the number of allele sequences for each gene in databases is relatively low, this should be uncommon. In this case, collisions would result in profiles appearing more similar than they are at the sequence level. In addition, the chance of multiple collisions occurring within a single profile would be vanishingly small.
 
-The motivation for developing locidex is the need a common searching engine for various loci based typing applications such as: gene-by-gene (mlst, cgMLST, wgMLST, rmlst), in silico serotyping, gene-based phenotype predictions (amr, virulence, pathotype, toxin typing), marker-based typing (16S). The tool must provide custom criteria filtering by loci, and produce multiple formats for downstream applications. It must be compatible with an HSP environment and not encounter any locking issues where multiple processes may try to change the data at the same time. [THIS SECTION WILL NEED EDITING]The logic for allele calling is greatly simplified by leveraging existing annotations from tools such as [prodigal](https://github.com/hyattpd/Prodigal), [prokka](https://github.com/tseemann/prokka), [bakta](https://github.com/oschwengers/bakta) to delineate the boundaries of the sequences to be queried and hashed to produce allele identifiers. A common issue in matching applications is that ranges of identity and coverage for a match will vary by locus and so locidex builds into its database structure control over these attributes at a locus level allowing for high variability databases to be used without building custom logic downstream. This is particularly important when lengths of loci can exhibit considerable variability as is the case for genes of interest for typing applications. This provides greater flexibility for the designation of ideal thresholds for a given application. However, these values can be overridden using the report module filtering parameters as well as by modifying the values within the database. [END]
+The motivation for developing locidex is the need for a common search engine for various loci-based typing applications such as gene-by-gene (MLST, cgMLST, wgMLST, rMLST), in silico serotyping, gene-based phenotype predictions (AMR, virulence, pathotype, toxin typing), and marker-based typing (16S). The tool must provide custom criteria filtering by loci and produce multiple formats for downstream applications. It must be compatible with an HPC environment and not encounter locking issues where multiple processes may try to change the data simultaneously. It should give the user flexibility in input sequence data, including support for 1) existing sequence annotations, 2) de novo annotations based on contig input, and 3) extraction of sequence regions of interest. The logic for allele calling is greatly simplified by leveraging existing annotations from tools such as [prodigal](https://github.com/hyattpd/Prodigal), [prokka](https://github.com/tseemann/prokka), [bakta](https://github.com/oschwengers/bakta) to delineate the boundaries of the sequences to be queried and hashed to produce allele identifiers. However, some loci are not protein coding, have inconsistent annotations, or are not a complete ORF, so Locidex has built-in support for extracting regions of interest from a query genome. A common issue in matching applications is that ranges of identity and coverage for a match will vary by locus. Locidex therefore builds control over these attributes at a locus level into its database structure, allowing high-variability databases to be used without custom logic being built downstream. This is particularly important when lengths of loci can exhibit considerable variability, as is the case for genes of interest in typing applications. This provides greater flexibility for the designation of ideal thresholds for a given application. These values can also be overridden using the report module filtering parameters or by modifying the values within the database.
 
-[Chewbbaca](https://github.com/B-UMMI/chewBBACA) is an excellent choice for an open source allele caller and provides many advanced features for developing, curating and using gene-by-gene schemes. It provides a great deat of additional information regarding partial gene sequences. For R&D applications, this functionality can be extremely useful. However, for some operational contexts, the design of [Chewbbaca](https://github.com/B-UMMI/chewBBACA) provides undesirable information and at present it has issues with multiple instances using the same database at once with novel allele detection enabled ([B-UMMI/chewBBACA#168](https://github.com/B-UMMI/chewBBACA/issues/168)). Locidex is meant to be optimized for routine operation level searching where it is useful to have default parameters that are set for the user to have reproducibility combined with flexibility to apply multiple filtering parameters on the sequence store after the fact. This allows exploring different thresholds without the need to recompute blast searches. In addition, there is often a desire to include additional information about a given locus such as different identifiers, functional properties, and phenotypic effects. The database format of locidex allows inclusion of any number of fields bundled into a search result object for users to describe their data conveniently during downstream analysis. This functionality allows for different use cases of data from a common data store. Locidex does not have the full features for a gene-by-gene software package like [Chewbbaca](https://github.com/B-UMMI/chewBBACA) but can be used to achieve similar results while being a more generic tool kit for blast searches, similar to [abricate](https://github.com/tseemann/abricate).
+[Chewbbaca](https://github.com/B-UMMI/chewBBACA) is an excellent choice for an open-source allele caller and provides many advanced features for developing, curating and using gene-by-gene schemes. It provides a great deal of additional information regarding partial gene sequences. For R&D applications, this functionality can be extremely useful. However, for some operational contexts, the design of [Chewbbaca](https://github.com/B-UMMI/chewBBACA) provides undesirable information, and at present, it has issues with multiple instances using the same database at once with novel allele detection enabled ([B-UMMI/chewBBACA#168](https://github.com/B-UMMI/chewBBACA/issues/168)). Locidex is meant to be optimized for routine operation-level searching, where preset default parameters give the user reproducibility while still allowing multiple filtering parameters to be applied to the sequence store after the fact. This allows exploring different thresholds without the need to recompute blast searches. In addition, there is often a desire to include additional information about a given locus, such as different identifiers, functional properties, and phenotypic effects. The database format of locidex allows the inclusion of any number of fields bundled into a search result object for users to describe their data conveniently during downstream analysis. This functionality allows for different data use cases from a common data store. Locidex does not have the full feature set of a gene-by-gene software package like [Chewbbaca](https://github.com/B-UMMI/chewBBACA) but can be used to achieve similar results while being a more generic tool kit for blast searches, similar to [abricate](https://github.com/tseemann/abricate).
 
 ## Citation
 
@@ -187,7 +187,23 @@ Produce loci hash profiles in multiple formats (json, tsv)
 - Filter results based on user criteria
 - Multi-copy loci handling
 
-**Optional:** (Not required for MVP) Produce concatenated fasta sequences based on allele profiles
+QA Modes:
+
+Conservative:
+A locus is reported with an allele call only if all of the following are true. (Only works with protein coding schemes)
+1) Match identity >= threshold
+2) Match coverage >= threshold
+3) Valid start codon present
+4) Valid stop codon present
+5) No internal stop codons
+6) Only a single hit meets the criteria above
+
+Normal:
+A locus is reported with an allele call only if all of the following are true. 
+1) Match identity >= threshold
+2) Match coverage >= threshold
+3) Multiple matches to a single locus are hashed to produce an allele call which is the hash of the (n) match hashes found
+
 #### Input
 
 A Sequence store (`seq_store.json`) object produced by the 'search' function.
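
Reviewer note on the QA modes added in the README hunk above: the two modes can be sketched in Python roughly as follows. This is a minimal illustration and not Locidex's actual implementation. The `Hit` class, the function names, the set of accepted start codons, and the choice to sort and concatenate the per-match hashes before re-hashing are all assumptions made for the example; the README only states that the call is "the hash of the (n) match hashes found".

```python
import hashlib
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: common bacterial start codons; the README does not list them.
START_CODONS = {"ATG", "GTG", "TTG"}


@dataclass
class Hit:
    """Hypothetical container for one match of a query sequence to a locus."""
    seq: str
    identity: float
    coverage: float


def allele_hash(seq: str) -> str:
    """MD5 hex digest (32 hex characters) of the upper-cased sequence."""
    return hashlib.md5(seq.upper().encode()).hexdigest()


def is_valid_cds(seq: str) -> bool:
    """Check start codon, terminal stop codon, and absence of internal stops."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    return not any(c in STOP_CODONS for c in codons[1:-1])


def call_allele(hits, min_identity, min_coverage, mode="normal"):
    """Return an allele identifier for one locus, or None if no call is made."""
    passing = [h for h in hits
               if h.identity >= min_identity and h.coverage >= min_coverage]
    if mode == "conservative":
        # Criteria 3-5: the hit must look like a complete coding sequence.
        passing = [h for h in passing if is_valid_cds(h.seq)]
        # Criterion 6: exactly one hit may satisfy all criteria.
        return allele_hash(passing[0].seq) if len(passing) == 1 else None
    if not passing:
        return None
    hashes = sorted(allele_hash(h.seq) for h in passing)
    if len(hashes) == 1:
        return hashes[0]  # assumption: a single match keeps its own hash
    # Assumption: sorting makes the combined hash independent of hit order.
    return hashlib.md5("".join(hashes).encode()).hexdigest()
```

For example, `call_allele([Hit("ATGAAATAA", 99.0, 100.0)], 90, 90, mode="conservative")` returns the MD5 digest of that sequence, while passing two qualifying hits in conservative mode returns `None`.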