Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge: Support table-specific id columns and delimiters #1594

Merged
merged 1 commit into from
Aug 30, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Aug 21, 2024

When trying out augur merge for our common Nextclade metadata merge step¹, I realized it would be good to allow table-specific values for these settings.

Table-specific values avoid problems with impossible orderings, e.g. two files both have columns A and B but you want to use A as the id column in the first and B in the second. Maybe that's an edge case, but it's also just much easier to reason about table-specific values than reason about how an ordered list of N values applies to all M files. A shared list of values is still good for when you don't know what they'll be ahead of time.

¹ https://github.com/nextstrain/pathogen-repo-guide/blob/89b3c5dbc9b8f6a009f4a19c3ac56113bc5511ee/ingest/rules/nextclade.smk#L75-L95

Checklist

  • Add tests for error messages I missed (see coverage)
  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@tsibley tsibley force-pushed the trs/merge/table-specific-id-delim branch from 49e6467 to a9d821f Compare August 21, 2024 21:26
Copy link

codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.06%. Comparing base (232698b) to head (68d1a71).
Report is 105 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1594      +/-   ##
==========================================
+ Coverage   71.03%   71.06%   +0.03%     
==========================================
  Files          79       79              
  Lines        8258     8267       +9     
  Branches     2004     2007       +3     
==========================================
+ Hits         5866     5875       +9     
  Misses       2101     2101              
  Partials      291      291              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad the Nextclade and metadata merge is a useful example! Seems reasonable to me 👍

augur/merge.py Outdated
Comment on lines +361 to +380
def pairs(xs: Iterable[str]) -> Iterable[Tuple[str, str]]:
"""
Split an iterable of ``k=v`` strings into an iterable of ``(k,v)`` tuples.

>>> pairs(["abc=123", "eight nine ten=el em en"])
[('abc', '123'), ('eight nine ten', 'el em en')]

Strings missing a ``k`` and/or a ``v`` part get an empty string.

>>> pairs(["v", "=v", "k=", "=", ""])
[('', 'v'), ('', 'v'), ('k', ''), ('', ''), ('', '')]

``k`` ends at the first ``=``.

>>> pairs(["abc=123=xyz", "=v=v"])
[('abc', '123=xyz'), ('', 'v=v')]
"""
return [tuple(x.split("=", 1)) if "=" in x else ("", x) for x in xs] # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

This seems like it would be a useful custom argparse action. The k=v is such a common pattern we use for CLI options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, perhaps! I tend to write these sorts of utility functions immediately adjacent to my current need for them and then refactor/extract as desired when I reach for them again later. (Unless I concretely foresee needing them again elsewhere very soon.)

@tsibley
Copy link
Member Author

tsibley commented Aug 22, 2024

Coverage again points me at error messages I missed testing. I'll add those tomorrow or next week.

@tsibley tsibley force-pushed the trs/merge/table-specific-id-delim branch from a9d821f to f0d87e1 Compare August 27, 2024 18:21
@tsibley
Copy link
Member Author

tsibley commented Aug 27, 2024

Fixed. Will wait for #1593 to merge before this one. Merge there is pending name mangling discussion.

@tsibley tsibley force-pushed the trs/merge/id-column-overwrite branch from 40f4e2e to 2a73eff Compare August 28, 2024 18:47
When trying out `augur merge` for our common Nextclade metadata merge
step¹, I realized it would be good to allow table-specific values for
these settings.

Table-specific values avoid problems with impossible orderings, e.g. two
files both have columns A and B but you want to use A as the id column
in the first and B in the second.  Maybe that's an edge case, but it's
also just much easier to reason about table-specific values than reason
about how an ordered list of N values applies to all M files.  A shared
list of values is still good for when you don't know what they'll be
ahead of time.

¹ <https://github.com/nextstrain/pathogen-repo-guide/blob/89b3c5dbc9b8f6a009f4a19c3ac56113bc5511ee/ingest/rules/nextclade.smk#L75-L95>
@tsibley tsibley force-pushed the trs/merge/id-column-overwrite branch from 2a73eff to 232698b Compare August 28, 2024 19:01
@tsibley tsibley force-pushed the trs/merge/table-specific-id-delim branch from f0d87e1 to 68d1a71 Compare August 28, 2024 19:01
@tsibley
Copy link
Member Author

tsibley commented Aug 28, 2024

Added a changelog entry.

Base automatically changed from trs/merge/id-column-overwrite to master August 30, 2024 16:39
@tsibley tsibley merged commit d51e402 into master Aug 30, 2024
28 checks passed
@tsibley tsibley deleted the trs/merge/table-specific-id-delim branch August 30, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants