Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added in input_asure.py to locidex merge #39

Merged
merged 24 commits into from
Mar 11, 2025
Merged

Conversation

sgsutcliffe
Copy link
Collaborator

@sgsutcliffe sgsutcliffe commented Feb 27, 2025

STRY0017229

Add method to provide a sample name associated with an MLST JSON file in "locidex merge"

Description

We want to be able to provide a sample name/identifier alongside each MLST JSON file in "locidex merge" so I can override the sample name stored within the JSON file without writing a new JSON file prior to "locidex merge".

The subtext of the added feature is it will allow pipelines like gasnomenclature to be more efficient, by integrating the check step at the same time the MLST file is already being read, and written together.

Goals

  • The "locidex merge" command will be modified to override the sample name stored within the MLST JSON file.
    - One option is to use a syntax like "SAMPLE=FILE" when passing on the command line.
    - Another option is to pass a CSV file as input with two columns "sample,file"
  • When a sample name is provided to "locidex merge", this will override the sample name stored in the MLST JSON file.
  • The output "profiles.csv" file will use the overridden sample names as identifiers if they were provided, otherwise it will use sample identifiers in the MLST JSON file.
  • The "locidex merge" command will be modified to output an "error.csv" file with any warnings or errors in input MLST JSON files, one sample/file per row. This will match format of existing error file produced by "INPUT_ASSURE" in gasnomenclature pipeline.
  • The updated "locidex merge" command will be backwards compatible with the previous functionality (where the sample name is derived from the contents of the MLST JSON file).
  • Unit/integration tests should be developed to confirm new functionality and existing functionality (overriding sample names in MLST JSON files).

Copy link
Collaborator

@mattheww95 mattheww95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for all the work @sgsutcliffe! I think I added too many comments, but they are just centered around removing magic numbers and assigning strings to variables.

locidex/merge.py Outdated
sampledict = {}
for line in f:
line = line.rstrip().split(",")
if len(line) != 2:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the magic number? e.g. Assign the numbers to variables above with a descriptive name. I assume the "2" is supposed to refer to the expected columns length

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poof went away (thanks to a magical panda) 468f76e

locidex/merge.py Outdated
logger.critical("File should be tsv with two columns [sample,mlst_alleles]")
raise_file_not_found_e(logger)
sample = line[0]
mlst_file = os.path.basename(line[1])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I understand the intention of the zero but what does the "1" refer to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

magical panda strikes again 468f76e

locidex/merge.py Outdated

def compare_profiles(mlst, sample_id, file_name):
# Extract the profile from the json_data
profile = mlst.get("data", {}).get("profile", {})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a string is repeated can you assign it to a variable?

Just to make refactoring easier in the future and minimize errors due to spelling mistakes in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locidex/merge.py Outdated
error_message = (
f"{file_name} is missing the 'profile' section or is completely empty!"
)
print(error_message)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the error message could you use the logger and raise the correct error? There is an error handler in the main loop that will handle your error for example you could add `raise KeyError(error_message).

One reason for using the logger over print is that print is buffered to stdout which may not always be collected, where as the log messages write to stderr which is not buffered. Additionally we can format the log messages in a pretty way. simply replace the print statement with logger.error(error_message)

If you are curious about the error handler you can see how the error handler is implemented here:

locidex/locidex/main.py

Lines 42 to 54 in eedba57

try:
utils.check_utilities(logger, constants.UTILITIES_CHECK)
logger.info("Running {}".format(args.command))
tasks[args.command][module_idx].run(args)
logger.info("Finished: {}".format(args.command))
except Exception as e:
with open(error_file, "w") as f:
f.write(traceback.format_exc())
error_number = e.errno if hasattr(e, "errno") else -1
logger.critical("Program exited with errors, please review logs. For the full traceback please see file: {}".format(error_file))
SystemExit(error_number)
else:
logger.info("Program finished without errors.")
and for how to call the logger

Copy link
Collaborator Author

@sgsutcliffe sgsutcliffe Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, and that makes sense to me! In line with some of the comments Aaron made which is adapted what was a single input_assure script that was used per process in the nextflow pipeline, to proper integration with the locidex. 4631d27

locidex/merge.py Outdated
f"{file_name} is missing the 'profile' section or is completely empty!"
)
print(error_message)
sys.exit(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the sys.exit call, the error handler will handle the outputs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locidex/merge.py Outdated

#Write error messages for profile mismatch (compare_profiles())
for error_message in compare_error:
if error_message[2]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

magic numbers 🧙

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poof dissappeared with c9b0d94

locidex/merge.py Outdated
with open(output_error_file, "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["sample", "JSON_key", "error_message"])
writer.writerow([error_message[0], error_message[1], error_message[2]])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

magic numbers 🧙

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poof disappeared with c9b0d94

locidex/merge.py Outdated


#Write error messages for profile mismatch (compare_profiles())
for error_message in compare_error:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the program raise a non-zero exit status if errors are encountred so the pipeline notifies the user that something went wrong?

it may be useful for pipelines as they will not die nessecarily as all outputs will be present, but Nextflow will notify the user of a non-zero exit status.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might rename these, as "error" is not really what they are (i.e. non-zero exit would not be right here). We're collapsing input_assure (a process from gasnomenclature pipeline) into locidex. input_assure was a way to override the mlst profile locidex merge uses in the output file. If someone supplies a renaming file we have to assume they will be aware that the output would be modified, we simply want to provide a file that explains how the file was modified. We called this an error report but its more of a warning, or friendly notification.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed (c9b0d94) the variable to modified_MLST_file_list which is less urgent or bad as we would expect from error:

    #Write report of all the MLST files with profile mismatch and how MLST profiles with mismatch were modified
    df = pd.DataFrame(modified_MLST_file_list)
    df.to_csv(f'{outdir}/MLST_error_report.csv', index=False, header=False)

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great Steven. Thanks so much for your amazing work 😄. I've added my initial comments below.

@sgsutcliffe
Copy link
Collaborator Author

Thanks @mattheww95 and @apetkau for the extremely helpful code review. It looked like simply using pandas cleared up a lot of the magic numbers, and I moved things around and renamed variables to hopefully improve the readibility of the code.

Copy link
Collaborator

@mattheww95 mattheww95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better thank you for removing the magic @sgsutcliffe 😄

I added some small requests to fix up, but I think I will have no further complaints after those changes are made.

locidex/merge.py Outdated
@@ -65,28 +67,33 @@ def get_file_list(input_files):
file_list.append(line)
return file_list

def validate_input_file(data_in: dict, db_version: str, db_name: str, perform_validation: bool) -> tuple[ReportData, str, str]:
def validate_input_file(data_in: dict, filename: str, db_version: str, db_name: str, perform_db_validation: bool, perform_profile_validation: bool, profile_refs_dict=dict) -> tuple[ReportData, str, str, list]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a type annotation for the list?
For longer more complex types you can also create a TypeAlias however you dont need to implement this, just a fun fact.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on the solution we discussed elsewhere: a79103b

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @sgsutcliffe. Thanks so much for addressing my comments 😄 .

I have a few additional in-line comments for you below.

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks great to me. Thanks so much @sgsutcliffe 😄

Copy link
Collaborator

@mattheww95 mattheww95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks so much for all the work!!!!! 🧙

@sgsutcliffe sgsutcliffe merged commit 7e6b55a into dev Mar 11, 2025
1 check passed
@sgsutcliffe sgsutcliffe deleted the integrate/input_assure branch March 11, 2025 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants