-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added in input_asure.py to locidex merge #39
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for all the work @sgsutcliffe! I think I added too many comments, but they are just centered around removing magic numbers and assigning strings to variables.
locidex/merge.py
Outdated
sampledict = {} | ||
for line in f: | ||
line = line.rstrip().split(",") | ||
if len(line) != 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the magic number? e.g. Assign the numbers to variables above with a descriptive name. I assume the "2" is supposed to refer to the expected columns length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poof went away (thanks to a magical panda) 468f76e
locidex/merge.py
Outdated
logger.critical("File should be tsv with two columns [sample,mlst_alleles]") | ||
raise_file_not_found_e(logger) | ||
sample = line[0] | ||
mlst_file = os.path.basename(line[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, I understand the intention of the zero but what does the "1" refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magical panda strikes again 468f76e
locidex/merge.py
Outdated
|
||
def compare_profiles(mlst, sample_id, file_name): | ||
# Extract the profile from the json_data | ||
profile = mlst.get("data", {}).get("profile", {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a string is repeated can you assign it to a variable?
Just to make refactoring easier in the future and minimize errors due to spelling mistakes in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
locidex/merge.py
Outdated
error_message = ( | ||
f"{file_name} is missing the 'profile' section or is completely empty!" | ||
) | ||
print(error_message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the error message could you use the logger and raise the correct error? There is an error handler in the main loop that will handle your error for example you could add `raise KeyError(error_message).
One reason for using the logger over print is that print is buffered to stdout which may not always be collected, where as the log messages write to stderr which is not buffered. Additionally we can format the log messages in a pretty way. simply replace the print statement with logger.error(error_message)
If you are curious about the error handler you can see how the error handler is implemented here:
Lines 42 to 54 in eedba57
try: | |
utils.check_utilities(logger, constants.UTILITIES_CHECK) | |
logger.info("Running {}".format(args.command)) | |
tasks[args.command][module_idx].run(args) | |
logger.info("Finished: {}".format(args.command)) | |
except Exception as e: | |
with open(error_file, "w") as f: | |
f.write(traceback.format_exc()) | |
error_number = e.errno if hasattr(e, "errno") else -1 | |
logger.critical("Program exited with errors, please review logs. For the full traceback please see file: {}".format(error_file)) | |
SystemExit(error_number) | |
else: | |
logger.info("Program finished without errors.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, and that makes sense to me! In line with some of the comments Aaron made which is adapted what was a single input_assure script that was used per process in the nextflow pipeline, to proper integration with the locidex. 4631d27
locidex/merge.py
Outdated
f"{file_name} is missing the 'profile' section or is completely empty!" | ||
) | ||
print(error_message) | ||
sys.exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the sys.exit call, the error handler will handle the outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
locidex/merge.py
Outdated
|
||
#Write error messages for profile mismatch (compare_profiles()) | ||
for error_message in compare_error: | ||
if error_message[2]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic numbers 🧙
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poof dissappeared with c9b0d94
locidex/merge.py
Outdated
with open(output_error_file, "w", newline="") as f: | ||
writer = csv.writer(f) | ||
writer.writerow(["sample", "JSON_key", "error_message"]) | ||
writer.writerow([error_message[0], error_message[1], error_message[2]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic numbers 🧙
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poof disappeared with c9b0d94
locidex/merge.py
Outdated
|
||
|
||
#Write error messages for profile mismatch (compare_profiles()) | ||
for error_message in compare_error: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the program raise a non-zero exit status if errors are encountred so the pipeline notifies the user that something went wrong?
it may be useful for pipelines as they will not die nessecarily as all outputs will be present, but Nextflow will notify the user of a non-zero exit status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might rename these, as "error" is not really what they are (i.e. non-zero exit would not be right here). We're collapsing input_assure
(a process from gasnomenclature pipeline) into locidex. input_assure
was a way to override the mlst profile locidex merge uses in the output file. If someone supplies a renaming file we have to assume they will be aware that the output would be modified, we simply want to provide a file that explains how the file was modified. We called this an error report but its more of a warning, or friendly notification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed (c9b0d94) the variable to modified_MLST_file_list
which is less urgent or bad as we would expect from error
:
#Write report of all the MLST files with profile mismatch and how MLST profiles with mismatch were modified
df = pd.DataFrame(modified_MLST_file_list)
df.to_csv(f'{outdir}/MLST_error_report.csv', index=False, header=False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great Steven. Thanks so much for your amazing work 😄. I've added my initial comments below.
Thanks @mattheww95 and @apetkau for the extremely helpful code review. It looked like simply using pandas cleared up a lot of the magic numbers, and I moved things around and renamed variables to hopefully improve the readibility of the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better thank you for removing the magic @sgsutcliffe 😄
I added some small requests to fix up, but I think I will have no further complaints after those changes are made.
locidex/merge.py
Outdated
@@ -65,28 +67,33 @@ def get_file_list(input_files): | |||
file_list.append(line) | |||
return file_list | |||
|
|||
def validate_input_file(data_in: dict, db_version: str, db_name: str, perform_validation: bool) -> tuple[ReportData, str, str]: | |||
def validate_input_file(data_in: dict, filename: str, db_version: str, db_name: str, perform_db_validation: bool, perform_profile_validation: bool, profile_refs_dict=dict) -> tuple[ReportData, str, str, list]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a type annotation for the list?
For longer more complex types you can also create a TypeAlias
however you dont need to implement this, just a fun fact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up on the solution we discussed elsewhere: a79103b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @sgsutcliffe. Thanks so much for addressing my comments 😄 .
I have a few additional in-line comments for you below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks great to me. Thanks so much @sgsutcliffe 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks so much for all the work!!!!! 🧙
STRY0017229
Add method to provide a sample name associated with an MLST JSON file in "locidex merge"
Description
We want to be able to provide a sample name/identifier alongside each MLST JSON file in "locidex merge" so I can override the sample name stored within the JSON file without writing a new JSON file prior to "locidex merge".
The subtext of the added feature is it will allow pipelines like
gasnomenclature
to be more efficient, by integrating the check step at the same time the MLST file is already being read, and written together.Goals
- One option is to use a syntax like "SAMPLE=FILE" when passing on the command line.
- Another option is to pass a CSV file as input with two columns "sample,file"