Added in input_asure.py to locidex merge #39

sgsutcliffe · 2025-02-27T16:54:53Z

STRY0017229

Add method to provide a sample name associated with an MLST JSON file in "locidex merge"

Description

We want to be able to provide a sample name/identifier alongside each MLST JSON file in "locidex merge" so that users can override the sample name stored within the JSON file without writing a new JSON file prior to "locidex merge".

This feature will also allow pipelines like gasnomenclature to be more efficient, by integrating the check step into the same pass in which the MLST files are already being read and written together.

Goals

The "locidex merge" command will be modified to override the sample name stored within the MLST JSON file.
- One option is to use a syntax like "SAMPLE=FILE" when passing on the command line.
- Another option is to pass a CSV file as input with two columns "sample,file"
When a sample name is provided to "locidex merge", this will override the sample name stored in the MLST JSON file.
The output "profiles.csv" file will use the overridden sample names as identifiers if they were provided, otherwise it will use sample identifiers in the MLST JSON file.
The "locidex merge" command will be modified to output an "error.csv" file with any warnings or errors in input MLST JSON files, one sample/file per row. This will match the format of the existing error file produced by "INPUT_ASSURE" in the gasnomenclature pipeline.
The updated "locidex merge" command will be backwards compatible with the previous functionality (where the sample name is derived from the contents of the MLST JSON file).
Unit/integration tests should be developed to confirm new functionality (overriding sample names in MLST JSON files) and existing functionality.
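
To make the override goal concrete, here is a minimal sketch of the "SAMPLE=FILE" option. It is not the code in this PR: `parse_sample_arg` and `resolve_sample_name` are hypothetical helpers, and the sketch assumes the MLST JSON stores its profile under `data.profile`, keyed by the sample name.

```python
from typing import Optional, Tuple


def parse_sample_arg(arg: str) -> Tuple[Optional[str], str]:
    """Split a hypothetical "SAMPLE=FILE" argument.

    A bare "FILE" argument carries no override, so None is returned
    for the sample name and the previous behaviour is preserved.
    """
    if "=" in arg:
        sample, _, path = arg.partition("=")
        return sample, path
    return None, arg


def resolve_sample_name(mlst: dict, override: Optional[str]) -> Tuple[str, dict]:
    """Return (sample name, allele profile), preferring the override.

    Assumes (not confirmed by the source) that data.profile is a dict
    with the sample name as its first key.
    """
    profile = mlst.get("data", {}).get("profile", {})
    if not profile:
        raise KeyError("MLST JSON is missing the 'profile' section")
    original_name, alleles = next(iter(profile.items()))
    return (override or original_name), alleles


# Usage: one overridden sample and one that keeps its stored name.
mlst = {"data": {"profile": {"s1": {"locus1": "1"}}}}
override, path = parse_sample_arg("sampleA=s1.mlst.json")
renamed = resolve_sample_name(mlst, override)
unchanged = resolve_sample_name(mlst, parse_sample_arg("s1.mlst.json")[0])
```

Falling back to the stored name when no override is given is what keeps the command backwards compatible.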

mattheww95

Thanks so much for all the work @sgsutcliffe! I think I added too many comments, but they are just centered around removing magic numbers and assigning strings to variables.

mattheww95 · 2025-02-28T16:28:16Z

locidex/merge.py

+        sampledict = {}
+        for line in f:
+            line = line.rstrip().split(",")
+            if len(line) != 2:


Can you remove the magic number? For example, assign the numbers to variables above with descriptive names. I assume the "2" refers to the expected number of columns.
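
As an illustration of the request (a sketch only; the constant and function names are made up, and the PR ultimately resolved this by switching to pandas), the column count and column positions could be given names like this:

```python
import os

# Hypothetical names for the values that appeared as bare literals.
DELIMITER = ","
EXPECTED_COLUMNS = 2   # sample name + path to the MLST JSON file
SAMPLE_COLUMN = 0
MLST_FILE_COLUMN = 1


def parse_samplesheet_line(line: str) -> tuple:
    """Parse one "sample,file" row into (sample, MLST file basename)."""
    fields = line.rstrip().split(DELIMITER)
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(
            f"Expected {EXPECTED_COLUMNS} columns [sample,mlst_alleles], got {len(fields)}"
        )
    return fields[SAMPLE_COLUMN], os.path.basename(fields[MLST_FILE_COLUMN])
```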

poof went away (thanks to a magical panda) 468f76e

mattheww95 · 2025-02-28T16:28:50Z

locidex/merge.py

+                logger.critical("File should be tsv with two columns [sample,mlst_alleles]")
+                raise_file_not_found_e(logger)
+            sample = line[0]
+            mlst_file = os.path.basename(line[1])


Same as above, I understand the intention of the zero but what does the "1" refer to?

magical panda strikes again 468f76e

mattheww95 · 2025-02-28T16:30:38Z

locidex/merge.py

+
+def compare_profiles(mlst, sample_id, file_name):
+    # Extract the profile from the json_data
+    profile = mlst.get("data", {}).get("profile", {})


If a string is repeated, can you assign it to a variable?

This makes future refactoring easier and minimizes errors due to spelling mistakes.
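
A small sketch of what this could look like for the JSON keys in the snippet above. `DATA_KEY`, `PROFILE_KEY` and `get_profile` are illustrative names, not identifiers taken from locidex:

```python
# Hypothetical module-level constants for the repeated JSON keys.
DATA_KEY = "data"
PROFILE_KEY = "profile"


def get_profile(mlst: dict) -> dict:
    """Return the profile section of an MLST JSON, or {} if it is absent."""
    return mlst.get(DATA_KEY, {}).get(PROFILE_KEY, {})
```

With the keys defined once, a misspelling becomes a NameError at the point of use instead of a silently empty lookup.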

mattheww95 · 2025-02-28T16:34:13Z

locidex/merge.py

+        error_message = (
+            f"{file_name} is missing the 'profile' section or is completely empty!"
+        )
+        print(error_message)


For the error message, could you use the logger and raise the correct error? There is an error handler in the main loop that will handle your error; for example, you could add `raise KeyError(error_message)`.

One reason for using the logger over print is that print is buffered to stdout, which may not always be collected, whereas the log messages write to stderr, which is not buffered. Additionally, we can format the log messages in a pretty way. Simply replace the print statement with logger.error(error_message).
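
A minimal sketch of the suggested change, assuming a standard library logger; the logger name and the `require_profile` helper are hypothetical and stand in for however merge.py obtains its logger:

```python
import logging

logger = logging.getLogger("merge_example")  # hypothetical logger name


def require_profile(mlst: dict, file_name: str) -> dict:
    """Return the profile section, logging and raising if it is missing."""
    profile = mlst.get("data", {}).get("profile", {})
    if not profile:
        error_message = (
            f"{file_name} is missing the 'profile' section or is completely empty!"
        )
        logger.error(error_message)    # goes through logging instead of print()
        raise KeyError(error_message)  # left for the caller's error handler
    return profile
```

No sys.exit call is needed here: the exception propagates up to whichever handler wraps the command.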

If you are curious about the error handler you can see how the error handler is implemented here:

locidex/locidex/main.py

Lines 42 to 54 in eedba57

    try:
        utils.check_utilities(logger, constants.UTILITIES_CHECK)
        logger.info("Running {}".format(args.command))
        tasks[args.command][module_idx].run(args)
        logger.info("Finished: {}".format(args.command))
    except Exception as e:
        with open(error_file, "w") as f:
            f.write(traceback.format_exc())
        error_number = e.errno if hasattr(e, "errno") else -1
        logger.critical("Program exited with errors, please review logs. For the full traceback please see file: {}".format(error_file))
        SystemExit(error_number)
    else:
        logger.info("Program finished without errors.")

and for how to call the logger

Good catch, and that makes sense to me! This is in line with some of the comments Aaron made about adapting what was a single input_assure script, used per process in the Nextflow pipeline, into a proper integration with locidex. 4631d27

mattheww95 · 2025-02-28T16:34:35Z

locidex/merge.py

+            f"{file_name} is missing the 'profile' section or is completely empty!"
+        )
+        print(error_message)
+        sys.exit(1)


Remove the sys.exit call, the error handler will handle the outputs.

mattheww95 · 2025-02-28T16:38:04Z

locidex/merge.py

+
+    #Write error messages for profile mismatch (compare_profiles())
+    for error_message in compare_error:
+            if error_message[2]:


magic numbers 🧙

poof disappeared with c9b0d94

mattheww95 · 2025-02-28T16:39:09Z

locidex/merge.py

+                with open(output_error_file, "w", newline="") as f:
+                    writer = csv.writer(f)
+                    writer.writerow(["sample", "JSON_key", "error_message"])
+                    writer.writerow([error_message[0], error_message[1], error_message[2]])


magic numbers 🧙

poof disappeared with c9b0d94

mattheww95 · 2025-02-28T16:40:58Z

locidex/merge.py

-
+
+    #Write error messages for profile mismatch (compare_profiles())
+    for error_message in compare_error:


Should the program raise a non-zero exit status if errors are encountered, so the pipeline notifies the user that something went wrong?

It may be useful for pipelines, as they will not necessarily die since all outputs will be present, but Nextflow will notify the user of a non-zero exit status.

I think I might rename these, as "error" is not really what they are (i.e. a non-zero exit would not be right here). We're collapsing input_assure (a process from the gasnomenclature pipeline) into locidex. input_assure was a way to override the MLST profile locidex merge uses in the output file. If someone supplies a renaming file, we have to assume they are aware that the output will be modified; we simply want to provide a file that explains how it was modified. We called this an error report, but it's more of a warning or friendly notification.

Changed (c9b0d94) the variable to modified_MLST_file_list, which sounds less urgent or severe than "error" would suggest:

    #Write report of all the MLST files with profile mismatch and how MLST profiles with mismatch were modified
    df = pd.DataFrame(modified_MLST_file_list)
    df.to_csv(f'{outdir}/MLST_error_report.csv', index=False, header=False)

locidex/merge.py

tests/test_merge.py

apetkau

This looks great Steven. Thanks so much for your amazing work 😄. I've added my initial comments below.

locidex/merge.py

sgsutcliffe · 2025-03-05T16:55:57Z

Thanks @mattheww95 and @apetkau for the extremely helpful code review. It looks like simply using pandas cleared up a lot of the magic numbers, and I moved things around and renamed variables to hopefully improve the readability of the code.

mattheww95

Much better, thank you for removing the magic @sgsutcliffe 😄

I added some small requests to fix up, but I think I will have no further complaints after those changes are made.

tests/test_merge.py

mattheww95 · 2025-03-07T19:58:39Z

locidex/merge.py

@@ -65,28 +67,33 @@ def get_file_list(input_files):
                    file_list.append(line)
    return file_list

-def validate_input_file(data_in: dict, db_version: str, db_name: str, perform_validation: bool) -> tuple[ReportData, str, str]:
+def validate_input_file(data_in: dict, filename: str, db_version: str, db_name: str, perform_db_validation: bool, perform_profile_validation: bool, profile_refs_dict=dict) -> tuple[ReportData, str, str, list]:


Can you add a type annotation for the list?
For longer, more complex types you can also create a TypeAlias. You don't need to implement this; it's just a fun fact.

Following up on the solution we discussed elsewhere: a79103b

locidex/merge.py

apetkau

This looks great @sgsutcliffe. Thanks so much for addressing my comments 😄 .

I have a few additional in-line comments for you below.

locidex/merge.py

tests/test_merge.py

apetkau

Everything looks great to me. Thanks so much @sgsutcliffe 😄

mattheww95

LGTM, Thanks so much for all the work!!!!! 🧙

locidex/merge.py

sgsutcliffe added 5 commits February 27, 2025 11:45

Added in input_asure.py to locidex merge

78745d2

Revert change of error message made my mistake

0373ae8

Add unit test for comparing sample name and profiles

43ecb6e

Fix the output directory for input assure error report

ba2752e

Add check error report test unit test

1f8ce81

sgsutcliffe requested review from mattheww95 and apetkau February 28, 2025 15:33

mattheww95 requested changes Feb 28, 2025

apetkau requested changes Mar 3, 2025

sgsutcliffe added 11 commits March 3, 2025 14:28

Remove commented-out block of code

06306ac

Move input assure critical errors to logger

4631d27

Added modified MLST json files as output of input_assure integration

1d7ae8f

Renamed variables related to profile mismatch

580c272

Revert "Renamed variables related to profile mismatch"

93d983b

This reverts commit 580c272.

Revert "Added modified MLST json files as output of input_assure inte…

445a008

…gration" This reverts commit 1d7ae8f.

Swapped csv for pandas for output MLST report csv

c9b0d94

Bundled the input_assure code into the function validate_input_files

9eb964c

Replaced csv with pandas for read_samplesheet

468f76e

Replace string with variable

23452cb

Added an end-to-end test for locidex merge

e78f153

sgsutcliffe requested review from apetkau and mattheww95 March 5, 2025 16:56

mattheww95 requested changes Mar 7, 2025

apetkau requested changes Mar 10, 2025

sgsutcliffe added 5 commits March 10, 2025 10:55

remove import of unused module

477790b

Create a class OutputRow for mslt_report i.e. rows in the outputfile

a79103b

Add filepath to error output

d66af08

remove pop() from profile[orginial_key]

12e2775

Additional verbosity to function name

bee3413

sgsutcliffe added 2 commits March 10, 2025 13:25

Fix and add test for no renaming

b4f51d8

Add documentation to locidex merge and improve help message

59b1d8c

sgsutcliffe requested review from mattheww95 and apetkau March 10, 2025 18:15

apetkau approved these changes Mar 10, 2025

mattheww95 approved these changes Mar 10, 2025

prepare for minor release for feature

f03228c

sgsutcliffe merged commit 7e6b55a into dev Mar 11, 2025
1 check passed

sgsutcliffe deleted the integrate/input_assure branch March 11, 2025 20:38
