CHANGELOG.md
+5
@@ -3,6 +3,11 @@
3
3
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
4
4
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
5
5
6
+
## v0.3.0 - [2025-03-11]
7
+
8
+
### `Added`
9
+
10
+
- Argument `--profile_ref`/`-p` for `locidex merge`, which corrects or changes the profile used in the output, replacing the one found in the MLST profile key
README.md
+45-31
@@ -99,7 +99,7 @@ The below figure shows a general workflow for each of the locidex commands:
99
99
100
100
### Search
101
101
102
-
The search module is meant to use locidex formatted database directories.
102
+
The search module is meant to use locidex formatted database directories.
103
103
104
104
- DNA and protein blast searches
105
105
- Md5 hashing of alleles
@@ -125,13 +125,13 @@ Accepted input Data Formats: GenBank, Fasta (of individual loci sequences)
125
125
{out folder name}
126
126
├── blast
127
127
├── nucleotide
128
-
├── hsps.txt
128
+
├── hsps.txt
129
129
└── queries.fasta
130
130
├── protein
131
-
├── hsps.txt
132
-
└── queries.fasta
131
+
├── hsps.txt
132
+
└── queries.fasta
133
133
├── seq_store.json
134
-
└── results.json
134
+
└── results.json
135
135
```
136
136
137
137
See "Sequence Store" for description of the seq_store.json output file
@@ -141,8 +141,8 @@ See "Sequence Store" for description of the seq_store.json output file
141
141
The extract module is meant to use locidex formatted database directories to get sequences of individual loci based on a locidex formatted database. The extract module operates in four different modes:
142
142
143
143
1) raw: sequences are directly extracted from the assembly with no further processing.
144
-
2) trim: any leading or trailing bases which are not present in the db match are trimmed from the sequence.
145
-
3) snp: This will apply only nucleotide variants to the reference allele which can be very useful for nanopore assemblies where indels are common and unlikely to be real.
144
+
2) trim: any leading or trailing bases which are not present in the db match are trimmed from the sequence.
145
+
3) snp: This will apply only nucleotide variants to the reference allele which can be very useful for nanopore assemblies where indels are common and unlikely to be real.
146
146
4) extend: This mode will fill in any terminal sequence missing from the sequence based on the matched reference allele.
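
The four modes above can be illustrated with a minimal sketch. This is one possible reading of the mode descriptions, not locidex's implementation: it assumes the query and the matched reference allele are already available as two equal-length gapped alignment strings (`-` for gaps), and the function name `apply_mode` is hypothetical.

```python
def apply_mode(query_aln: str, ref_aln: str, mode: str) -> str:
    """Illustrative only: derive an output sequence from a gapped pairwise alignment.

    query_aln / ref_aln are equal-length strings using '-' for gaps (an assumption;
    locidex itself works from BLAST hits, not from strings like these).
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have the same length")
    cols = list(zip(query_aln, ref_aln))
    ref_cols = [i for i, (_, r) in enumerate(cols) if r != "-"]
    qry_cols = [i for i, (q, _) in enumerate(cols) if q != "-"]

    if mode == "raw":
        # Sequence exactly as found in the assembly, gaps removed.
        return "".join(q for q, _ in cols if q != "-")
    if mode == "trim":
        # Drop leading/trailing query bases that lie outside the reference match.
        first, last = ref_cols[0], ref_cols[-1]
        return "".join(q for q, _ in cols[first:last + 1] if q != "-")
    if mode == "snp":
        # Start from the reference and apply substitutions only; indels are ignored.
        return "".join(q if q != "-" else r for q, r in cols if r != "-")
    if mode == "extend":
        # Fill terminal positions missing from the query with reference bases.
        first, last = qry_cols[0], qry_cols[-1]
        out = []
        for i, (q, r) in enumerate(cols):
            base = r if (i < first or i > last) else q
            if base != "-":
                out.append(base)
        return "".join(out)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    ref = "--ACGAACGTCC"
    qry = "TTACGTA-GT--"
    for m in ("raw", "trim", "snp", "extend"):
        print(m, apply_mode(qry, ref, m))
```

Working from a pre-aligned pair keeps the sketch short; how locidex actually derives coordinates from its database matches is not shown in this section.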
147
147
148
148
> [!Note]
@@ -158,18 +158,18 @@ Gene annotation is notoriously inconsistent between different software, and so w
158
158
159
159
EXAMPLE: to extract loci sequences from an input genome, reporting just extracted sequences and skipping any post processing (mode=`raw`)
@@ -252,16 +266,16 @@ Takes common formats of gene-by-gene databases and formats them for use with loc
252
266
253
267
Accepts two formats common with most of the major MLST databases:
254
268
255
-
1. a directory of fasta files: ["fasta","fas","fa","ffn","fna","fasta.gz","fas.gz","fa.gz","ffn.gz","fna.gz"] with "locus name" as the file name and allele id's are present in the fasta header separated by an underscore. ie. aroC would have the file name aroC.fas and the header line would be >aroC_1.
269
+
1. a directory of fasta files: ["fasta","fas","fa","ffn","fna","fasta.gz","fas.gz","fa.gz","ffn.gz","fna.gz"] with the "locus name" as the file name and the allele IDs present in the fasta header, separated from the locus name by an underscore, i.e. aroC would have the file name aroC.fas and the header line would be >aroC_1.
256
270
2. a concatenated file of all loci in a single fasta file which has the fasta def line as `>{locus name}_{allele id}`.
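
The naming conventions in the two formats above can be sketched as follows. This is a hedged illustration rather than locidex's own parser: the helper names `locus_from_filename` and `parse_header` are hypothetical, and splitting on the last underscore (so that locus names containing underscores still work) is an assumption.

```python
# File extensions listed in the README for the directory-of-fasta input format.
FASTA_EXTENSIONS = ["fasta", "fas", "fa", "ffn", "fna",
                    "fasta.gz", "fas.gz", "fa.gz", "ffn.gz", "fna.gz"]


def locus_from_filename(filename: str) -> str:
    """Return the locus name encoded in a file name such as 'aroC.fas'."""
    # Check longer extensions first so 'x.fasta.gz' is not matched as '.gz' leftovers.
    for ext in sorted(FASTA_EXTENSIONS, key=len, reverse=True):
        suffix = "." + ext
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    raise ValueError(f"unrecognised fasta extension: {filename}")


def parse_header(header: str) -> tuple[str, str]:
    """Split a def line such as '>aroC_1' into (locus name, allele id)."""
    name = header.lstrip(">").split()[0]
    # Assumption: the allele id is everything after the LAST underscore.
    locus, sep, allele = name.rpartition("_")
    if not sep or not locus or not allele:
        raise ValueError(f"header does not match '>{{locus}}_{{allele}}': {header}")
    return locus, allele


if __name__ == "__main__":
    print(locus_from_filename("aroC.fas"))
    print(parse_header(">aroC_1"))
```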
257
271
258
-
locidex format -i ./example/format_db_mlst_in/ -o ./example/mlst_out/
272
+
locidex format -i ./example/format_db_mlst_in/ -o ./example/mlst_out/
259
273
260
274
#### Output
261
275
262
276
```
263
277
{out folder name}
264
-
├── results.json
278
+
├── results.json
265
279
└── locidex.txt
266
280
```
267
281
@@ -277,7 +291,7 @@ Builds locidex db folder structure
277
291
278
292
Takes the output of **locidex format** (may or may not have additional columns added). The file must contain the field "dna_seq", "aa_seq", or both, depending on the type of db being built.
@@ -286,7 +300,7 @@ See - [Database structure](/README.md#Database) for further information.
286
300
287
301
### Manifest
288
302
289
-
Takes a directory containing multiple locidex databases and creates a manifest file that can be passed to locidex command (extract or search), along with a name and version of a specifec database to use.
303
+
Takes a directory containing multiple locidex databases and creates a manifest file that can be passed to a locidex command (extract or search), along with the name and version of a specific database to use.
290
304
291
305
#### Input
292
306
@@ -355,12 +369,12 @@ The output is a `manifest.json` file in the base directory of the database folde
355
369
356
370
## Example workflow
357
371
358
-
MLST Example: The 7-gene MLST scheme targets from [https://pubmlst.org/organisms/salmonella-spp](https://pubmlst.org/organisms/salmonella-spp) were used as targets to extract the full length CDS annotations from NC_003198.1 (Salmonella Typhi CT18). Sequences were separated into individual fasta files for each gene, though a concatonated version would also work as long as the fasta header began with the locus identifier.
372
+
MLST Example: The 7-gene MLST scheme targets from [https://pubmlst.org/organisms/salmonella-spp](https://pubmlst.org/organisms/salmonella-spp) were used as targets to extract the full length CDS annotations from NC_003198.1 (Salmonella Typhi CT18). Sequences were separated into individual fasta files for each gene, though a concatenated version would also work as long as the fasta header began with the locus identifier.
359
373
360
374
> [!Note]
361
-
> The extracted CDS annotations are not just the MLST target sequences but the full orf and so this will differ from normal MLST results. If you want to use the traditional subsections of each loci, you will need to extract these using another method.
375
+
> The extracted CDS annotations are not just the MLST target sequences but the full ORF, so the results will differ from normal MLST results. If you want to use the traditional subsections of each locus, you will need to extract these using another method.
362
376
363
-
`locidex format` is used to create a TSV file containing the sequence of each of the targets and individual match thresholds for each query. These can be modified by the user before building the database.
377
+
`locidex format` is used to create a TSV file containing the sequence of each of the targets and individual match thresholds for each query. These can be modified by the user before building the database.
364
378
365
379
locidex format -i ~/example/format_db_mlst_in/ -o ~/example/format_db_mlst_out/ --force
366
380
@@ -370,7 +384,7 @@ The `locidex build` converts that TSV into a form that `locidex search` can use.
370
384
371
385
`locidex search` is used to query against the database to produce a sequence store (two examples are provided here to show the use of genbank annotations or prodigal results).
@@ -401,7 +415,7 @@ Similar to [abricate](https://github.com/tseemann/abricate), Locidex uses a fixe
401
415
├──nucleotide.njs
402
416
├──nucleotide.nsq
403
417
├──nucleotide.ntf
404
-
└──nucleotide.nto
418
+
└──nucleotide.nto
405
419
└──protein #optional but >= 1 must be present
406
420
├── protein.fasta
407
421
├── protein.pdb
@@ -492,10 +506,10 @@ No, the benefit of having dual searching with protein and dna is that you can ha
492
506
493
507
Ideally a gene-by-gene scheme consists of only single copy genes, but bacterial genomes are dynamic and genuine duplications can occur, in addition to assembly artifacts and contamination. There are a variety of approaches available to manage these cases. Within the 7-gene [mlst](https://github.com/tseemann/mlst) tool, multiple alleles for a given locus are reported with a comma delimiting each allele. However, this poses an issue for calculating genetic distances since it is unclear how to treat the multiple alleles. There are several common methods for how to treat multiple alleles:
494
508
495
-
1) treat the combination as a novel allele
496
-
2) blank the column
497
-
3) select the earliest allele in the database
498
-
4) Use a similarity score to rate which is the best allele to include.
509
+
1) treat the combination as a novel allele
510
+
2) blank the column
511
+
3) select the earliest allele in the database
512
+
4) Use a similarity score to rate which is the best allele to include.
499
513
500
514
The most conservative approach is to blank that column in distance calculations rather than interpret it, at the cost of blunted resolution; locidex implements this as the conservative mode. Selecting only one of the alleles to match (options 3 and 4) has mixed effects and can result in inconsistencies where some isolates appear more similar or dissimilar than they are. The preferred method, which locidex implements as its [DEFAULT] (normal) mode, is to combine the result into a new "allele" hash calculated as the md5 hash of the concatenated allele md5 hashes, sorted alphabetically. The benefit is that the same combination of alleles always produces the same hash code and so will match whenever it occurs. Conversely, a difference is counted even when individual component alleles match between two samples.
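
The combined hash described for the normal mode can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the locidex source: the function names are hypothetical, and the details of how sequences are normalised before hashing (case, encoding) and how the sorted hashes are joined (here, with no separator) are assumptions.

```python
import hashlib
from typing import Iterable


def allele_md5(sequence: str) -> str:
    """md5 hex digest of a single allele sequence (normalisation is an assumption)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def combined_allele_hash(allele_hashes: Iterable[str]) -> str:
    """md5 of the concatenated allele md5 hashes, sorted alphabetically."""
    joined = "".join(sorted(allele_hashes))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    copy_a = allele_md5("ATGACCGGT")
    copy_b = allele_md5("ATGACCGGA")
    # The same pair of alleles gives the same combined hash in either order.
    print(combined_allele_hash([copy_a, copy_b]) == combined_allele_hash([copy_b, copy_a]))
    # The combined hash differs from each component, so a sample carrying only
    # copy_a is counted as different at this locus.
    print(combined_allele_hash([copy_a, copy_b]) != copy_a)
```

Sorting before concatenation is what makes the result independent of the order in which the copies were found.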