Commit 7e6b55a

Merge pull request #39 from phac-nml/integrate/input_assure
Added in input_asure.py to locidex merge
2 parents 4b2e4e0 + f03228c commit 7e6b55a

10 files changed

+338
-135
lines changed

CHANGELOG.md

+5
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,11 @@
33
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
44
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
55

6+
## v0.3.0 - [2025-03-11]
7+
8+
### `Added`
9+
10+
- Option for `locidex merge` to override the profile name used in the output, replacing the one found in the MLST profile key, via the argument `--profile_ref`/`-p`
611

712
## v0.2.3 - [2024-08-20]
813

README.md

+45-31
Original file line numberDiff line numberDiff line change
@@ -99,7 +99,7 @@ The below figure shows a general workflow for each of the locidex commands:
9999

100100
### Search
101101

102-
The search module is meant to use locidex formatted database directories.
102+
The search module is meant to use locidex formatted database directories.
103103

104104
- DNA and protein blast searches
105105
- Md5 hashing of alleles
@@ -125,13 +125,13 @@ Accepted input Data Formats: GenBank, Fasta (of individual loci sequences)
125125
{out folder name}
126126
├── blast
127127
├── nucleotide
128-
├── hsps.txt
128+
├── hsps.txt
129129
└── queries.fasta
130130
├── protein
131-
├── hsps.txt
132-
└── queries.fasta
131+
├── hsps.txt
132+
└── queries.fasta
133133
├── seq_store.json
134-
└── results.json
134+
└── results.json
135135
```
136136

137137
See "Sequence Store" for description of the seq_store.json output file
@@ -141,8 +141,8 @@ See "Sequence Store" for description of the seq_store.json output file
141141
The extract module uses locidex formatted database directories to get the sequences of individual loci. It operates in four different modes:
142142

143143
1) raw: sequences are directly extracted from the assembly with no further processing.
144-
2) trim: any leading or trailing bases which are not present in the db match are trimmed from the sequence.
145-
3) snp: This will apply only nucleotide variants to the reference allele which can be very useful for nanopore assemblies where indels are common and unlikely to be real.
144+
2) trim: any leading or trailing bases which are not present in the db match are trimmed from the sequence.
145+
3) snp: This will apply only nucleotide variants to the reference allele which can be very useful for nanopore assemblies where indels are common and unlikely to be real.
146146
4) extend : This mode will fill in any terminal sequence missing from the sequence based on the matched reference allele.
147147

148148
> [!Note]
@@ -158,18 +158,18 @@ Gene annotation is notoriously inconsistent between different software, and so w
158158

159159
EXAMPLE: to extract loci sequences from an input genome, reporting just extracted sequences and skipping any post processing (mode=`raw`)
160160

161-
locidex extract --mode raw -i ./example/search/NC_003198.1.fasta -d ./example/build_db_mlst_out -o ./example/search/NC_003198_fasta -n 8
161+
locidex extract --mode raw -i ./example/search/NC_003198.1.fasta -d ./example/build_db_mlst_out -o ./example/search/NC_003198_fasta -n 8
162162

163163
#### Output
164164

165165
```
166166
{out folder name}
167167
├── blast
168-
├── hsps.txt
168+
├── hsps.txt
169169
├── blast_db
170-
├── contigs.fasta.ndb
170+
├── contigs.fasta.ndb
171171
├── contigs.fasta.nhr
172-
├── contigs.fasta.nin
172+
├── contigs.fasta.nin
173173
├── contigs.fasta.njs
174174
├── contigs.fasta.not
175175
├── contigs.fasta.nsq
@@ -178,7 +178,7 @@ EXAMPLE: to extract loci sequences from an input genome, reporting just extracte
178178
├── filtered.hsps.txt
179179
├── processed.extracted.seqs.fasta #optional sequences with trimming, gap filling and snp-only processing based on options selected
180180
├── raw.extracted.seqs.fasta #exact extracted sequences
181-
└── results.json
181+
└── results.json
182182
```
183183

184184
### Report
@@ -200,7 +200,7 @@ A locus is reported with an allele call only if all of the following are true. (
200200
6) Only a single hit meets the criteria above
201201

202202
Normal
203-
A locus is reported with an allele call only if all of the following are true.
203+
A locus is reported with an allele call only if all of the following are true.
204204
1) Match identity >= threshold
205205
2) Match coverage >= threshold
206206
3) Multiple matches to a single locus are hashed to produce an allele call which is the hash of the (n) match hashes found
@@ -211,15 +211,29 @@ A Sequence store (`seq_store.json`) object produced by the 'search' function.
211211

212212
locidex report -i ./example/search/seq_store.json -o ./example/report_out --name NC_003198
213213

214+
#### Option:
215+
216+
`-p`/`--profile_ref`: Provide a TSV file with profile references for overriding MLST profiles. Columns: [sample/sample_name, mlst_alleles]
217+
218+
The TSV should list the new profile names in a column named `sample` or `sample_name`, and the path to the associated MLST file in a column named `mlst_alleles`.
219+
```
220+
sample mlst_alleles .....
221+
SAMPLE1 sample1_mlst.json .....
222+
SAMPLE2 sample2_mlst.json .....
223+
SAMPLE3 sample3_mlst.json .....
224+
```
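As a rough illustration of what overriding a profile key could look like, the sketch below assumes the report JSON layout used in this commit's test data (`data.sample_name`, plus `data.profile` keyed by the sample name). The function name `override_profile_key` is hypothetical and is not part of locidex; the real implementation may validate or report mismatches differently.

```python
import copy


def override_profile_key(report: dict, new_name: str) -> dict:
    """Return a copy of a report whose single profile is keyed by new_name.

    Hypothetical helper: assumes the report has exactly one entry under
    data.profile, as in the example JSON files in this commit.
    """
    updated = copy.deepcopy(report)  # leave the caller's report untouched
    data = updated["data"]
    profile = data["profile"]
    if len(profile) != 1:
        raise ValueError("expected exactly one profile per report")
    # Unpack the only (sample name, alleles) pair and re-key it.
    ((_old_name, alleles),) = profile.items()
    data["profile"] = {new_name: alleles}
    data["sample_name"] = new_name
    return updated


report = {
    "data": {
        "sample_name": "sampleA",
        "profile": {"sampleA": {"l1": "1", "l2": "1", "l3": "1"}},
        "seq_data": {},
    }
}
fixed = override_profile_key(report, "SAMPLE1")
```

The allele calls are carried over unchanged; only the name under which they are stored is replaced, and the input dictionary is not modified.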
225+
214226
#### Output
215227

216228
```
217229
{out folder name}
218230
├── nucleotide.hits.txt
219231
├── profile.json
220-
└── protein.hits.txt
232+
├── protein.hits.txt
233+
└── MLST_error_report.csv (optional)
221234
```
222235

236+
223237
### Merge
224238

225239
Reads and concatenates report files into an allele profile in TSV format.
@@ -234,14 +248,14 @@ EXAMPLE: merging multiple files provided on the command line to -i
234248

235249
EXAMPLE: merging files provided through a list of paths to report files
236250

237-
locidex merge -i ./example/merge_in/file_list.txt ./example/merge_out/
251+
locidex merge -i ./example/merge_in/file_list.txt ./example/merge_out/
238252

239253
#### Output
240254

241255
```
242256
{out folder name}
243-
├── profile.tsv
244-
└── results.json
257+
├── profile.tsv
258+
└── results.json
245259
```
246260

247261
### Format
@@ -252,16 +266,16 @@ Takes common formats of gene-by-gene databases and formats them for use with loc
252266

253267
Accepts two formats common with most of the major MLST databases:
254268

255-
1. a directory of fasta files with one of the extensions ["fasta","fas","fa","ffn","fna","fasta.gz","fas.gz","fa.gz","ffn.gz","fna.gz"], where the locus name is the file name and the allele id follows the locus name in the fasta header, separated by an underscore, e.g. aroC would have the file name aroC.fas and the header line would be >aroC_1.
269+
1. a directory of fasta files with one of the extensions ["fasta","fas","fa","ffn","fna","fasta.gz","fas.gz","fa.gz","ffn.gz","fna.gz"], where the locus name is the file name and the allele id follows the locus name in the fasta header, separated by an underscore, e.g. aroC would have the file name aroC.fas and the header line would be >aroC_1.
256270
2. a concatenated file of all loci in a single fasta file, with each fasta def line formatted as `>{locus name}_{allele id}`. These two formats are common to most of the major MLST databases.
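The `>{locus name}_{allele id}` convention can be illustrated with a short parsing sketch. This is not the parser locidex uses; `split_header` is a hypothetical name, and splitting on the last underscore is an assumption made here so that locus names which themselves contain underscores are kept intact.

```python
def split_header(header: str) -> tuple[str, str]:
    """Split a fasta def line of the form >{locus name}_{allele id}.

    Hypothetical helper: only the first whitespace-delimited token is
    considered, and the allele id is taken to follow the LAST underscore.
    """
    fields = header.lstrip(">").split()
    if not fields:
        raise ValueError("empty fasta header")
    locus, sep, allele = fields[0].rpartition("_")
    if not sep or not locus or not allele:
        raise ValueError(f"header does not match locus_allele: {header!r}")
    return locus, allele


locus, allele = split_header(">aroC_1")
```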
257271

258-
locidex format -i ./example/format_db_mlst_in/ -o ./example/mlst_out/
272+
locidex format -i ./example/format_db_mlst_in/ -o ./example/mlst_out/
259273

260274
#### Output
261275

262276
```
263277
{out folder name}
264-
├── results.json
278+
├── results.json
265279
└── locidex.txt
266280
```
267281

@@ -277,7 +291,7 @@ Builds locidex db folder structure
277291

278292
Takes the output of **locidex format** (which may or may not have additional columns added). The file must contain the fields "dna_seq" and/or "aa_seq"; either or both are required depending on the type of db being built.
279293

280-
locidex build -i ./example/build_db_mlst_in/senterica.mlst.txt -o ./example/mlst_out_db/
294+
locidex build -i ./example/build_db_mlst_in/senterica.mlst.txt -o ./example/mlst_out_db/
281295

282296
#### Output
283297

@@ -286,7 +300,7 @@ See - [Database structure](/README.md#Database) for further information.
286300

287301
### Manifest
288302

289-
Takes a directory containing multiple locidex databases and creates a manifest file that can be passed to a locidex command (extract or search), along with the name and version of a specific database to use.
303+
Takes a directory containing multiple locidex databases and creates a manifest file that can be passed to a locidex command (extract or search), along with the name and version of a specific database to use.
290304

291305
#### Input
292306

@@ -355,12 +369,12 @@ The output is a `manifest.json` file in the base directory of the database folde
355369

356370
## Example workflow
357371

358-
MLST Example: The 7-gene MLST scheme targets from [https://pubmlst.org/organisms/salmonella-spp](https://pubmlst.org/organisms/salmonella-spp) were used as targets to extract the full length CDS annotations from NC_003198.1 (Salmonella Typhi CT18). Sequences were separated into individual fasta files for each gene, though a concatenated version would also work as long as the fasta header began with the locus identifier.
372+
MLST Example: The 7-gene MLST scheme targets from [https://pubmlst.org/organisms/salmonella-spp](https://pubmlst.org/organisms/salmonella-spp) were used as targets to extract the full length CDS annotations from NC_003198.1 (Salmonella Typhi CT18). Sequences were separated into individual fasta files for each gene, though a concatenated version would also work as long as the fasta header began with the locus identifier.
359373

360374
> [!Note]
361-
> The extracted CDS annotations are not just the MLST target sequences but the full ORF, so the results will differ from normal MLST results. If you want to use the traditional subsection of each locus, you will need to extract these using another method.
375+
> The extracted CDS annotations are not just the MLST target sequences but the full ORF, so the results will differ from normal MLST results. If you want to use the traditional subsection of each locus, you will need to extract these using another method.
362376
363-
`locidex format` is used to create a TSV file containing the sequence of each of the targets and individual match thresholds for each query. These can be modified by the user before building the database.
377+
`locidex format` is used to create a TSV file containing the sequence of each of the targets and individual match thresholds for each query. These can be modified by the user before building the database.
364378

365379
locidex format -i ~/example/format_db_mlst_in/ -o ~/example/format_db_mlst_out/ --force
366380

@@ -370,7 +384,7 @@ The `locidex build` converts that TSV into a form that `locidex search` can use.
370384

371385
`locidex search` is used to query against the database to produce a sequence store (two examples are provided here to show the use of genbank annotations or prodigal results).
372386

373-
locidex search -q ~/example/search/NC_003198.1.gbk -d ~/example/build_db_mlst_out/ -o ./mlst_ncbi_annotated --force
387+
locidex search -q ~/example/search/NC_003198.1.gbk -d ~/example/build_db_mlst_out/ -o ./mlst_ncbi_annotated --force
374388

375389
locidex search -q ~/example/search/NC_003198.1.fasta -d ~/example/build_db_mlst_out/ -o ./mlst_prodigal --force --annotate
376390

@@ -401,7 +415,7 @@ Similar to [abricate](https://github.com/tseemann/abricate), Locidex uses a fixe
401415
├──nucleotide.njs
402416
├──nucleotide.nsq
403417
├──nucleotide.ntf
404-
└──nucleotide.nto
418+
└──nucleotide.nto
405419
└──protein #optional but >= 1 must be present
406420
├── protein.fasta
407421
├── protein.pdb
@@ -492,10 +506,10 @@ No, the benefit of having dual searching with protein and dna is that you can ha
492506

493507
Ideally a gene-by-gene scheme consists of only single copy genes, but bacterial genomes are dynamic and genuine duplications can occur, in addition to assembly artifacts and contamination. There are a variety of approaches available to manage these cases. Within the 7-gene [mlst](https://github.com/tseemann/mlst) tool, multiple alleles for a given locus are reported with a comma delimiting each allele. However, this poses an issue for calculating genetic distances since it is unclear how to treat the multiple alleles. There are several common methods for how to treat multiple alleles:
494508

495-
1) treat the combination as a novel allele
496-
2) blank the column
497-
3) select the earliest allele in the database
498-
4) Use a similarity score to rate which is the best allele to include.
509+
1) treat the combination as a novel allele
510+
2) blank the column
511+
3) select the earliest allele in the database
512+
4) Use a similarity score to rate which is the best allele to include.
499513

500514
The most conservative approach is to blank that column in distance calculations, which blunts resolution; locidex implements this as the conservative mode. Selecting only one of the alleles (options 3 and 4) has mixed effects and can produce inconsistencies where some isolates appear more similar or dissimilar than they are. The preferred method, which locidex implements as its [DEFAULT] (normal) mode, is to combine the result into a new "allele" hash derived by calculating the md5 hash of the concatenated allele md5 hashes, sorted alphabetically. The benefit is that the same combination of alleles always produces the same hash code, so two samples with the same combination will match. Conversely, a difference is counted even when individual component alleles match between two samples.
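The combined-hash idea described for normal mode can be sketched as follows. This is a minimal illustration, not the locidex implementation: the function names are hypothetical, and joining the sorted hashes with no separator is an assumption.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hex md5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def combined_allele_hash(allele_hashes: list[str]) -> str:
    """Hash the alphabetically sorted, concatenated allele hashes.

    Assumption: hashes are joined without a separator; the real tool may
    concatenate them differently.
    """
    return md5_hex("".join(sorted(allele_hashes)))


# Two hypothetical allele sequences found for the same locus.
h1 = md5_hex("ATGAAACCC")
h2 = md5_hex("ATGAAACCG")
combo = combined_allele_hash([h1, h2])
```

Because the component hashes are sorted before concatenation, the order in which the matches were found does not change the resulting call, while the combined value still differs from either component on its own.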
501515

Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
{
2+
"db_info": {},
3+
"parameters": {
4+
"mode": "normal",
5+
"min_match_ident": 100,
6+
"min_match_cov": 100,
7+
"max_ambiguous": 0,
8+
"max_internal_stops": 0
9+
},
10+
"data": {
11+
"sample_name": "sampleA",
12+
"profile": {
13+
"sampleA": {
14+
"l1": "1",
15+
"l2": "1",
16+
"l3": "1"
17+
}
18+
},
19+
"seq_data": {}
20+
}
21+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
{
2+
"db_info": {},
3+
"parameters": {
4+
"mode": "normal",
5+
"min_match_ident": 100,
6+
"min_match_cov": 100,
7+
"max_ambiguous": 0,
8+
"max_internal_stops": 0
9+
},
10+
"data": {
11+
"sample_name": "sampleB",
12+
"profile": {
13+
"sampleB": {
14+
"l1": "1",
15+
"l2": "1",
16+
"l3": "1"
17+
}
18+
},
19+
"seq_data": {}
20+
}
21+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
{
2+
"db_info": {},
3+
"parameters": {
4+
"mode": "normal",
5+
"min_match_ident": 100,
6+
"min_match_cov": 100,
7+
"max_ambiguous": 0,
8+
"max_internal_stops": 0
9+
},
10+
"data": {
11+
"sample_name": "sampleC",
12+
"profile": {
13+
"sampleC": {
14+
"l1": "1",
15+
"l2": "1",
16+
"l3": "2"
17+
}
18+
},
19+
"seq_data": {}
20+
}
21+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
{
2+
"db_info": {},
3+
"parameters": {
4+
"mode": "normal",
5+
"min_match_ident": 100,
6+
"min_match_cov": 100,
7+
"max_ambiguous": 0,
8+
"max_internal_stops": 0
9+
},
10+
"data": {
11+
"sample_name": "sampleQ",
12+
"profile": {
13+
"sampleQ": {
14+
"l1": "1",
15+
"l2": "2",
16+
"l3": "1"
17+
}
18+
},
19+
"seq_data": {}
20+
}
21+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
sample,mlst_alleles,address
2+
sampleQ,sampleQ.mlst.json,1.1.1
3+
sample1,sample1.mlst.json,
4+
sample2,sample2.mlst.json,1.1.1
5+
sample3,sample3.mlst.json,
