Update ingest/vendored #130

Merged (1 commit) on Mar 3, 2025
3 changes: 0 additions & 3 deletions ingest/vendored/.cramrc

This file was deleted.

17 changes: 17 additions & 0 deletions ingest/vendored/.github/dependabot.yml
@@ -0,0 +1,17 @@
+# Dependabot configuration file
+# <https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file>
+#
+# Each ecosystem is checked on a scheduled interval defined below. To trigger
+# a check manually, go to
+#
+#   https://github.com/nextstrain/ingest/network/updates
+#
+# and look for a "Check for updates" button. You may need to click around a
+# bit first.
+---
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
18 changes: 6 additions & 12 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -1,21 +1,15 @@
 name: CI

 on:
-  - push
-  - pull_request
-  - workflow_dispatch
+  push:
+    branches:
+      - main
+  pull_request:
+  workflow_dispatch:

 jobs:
   shellcheck:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
       - uses: nextstrain/.github/actions/shellcheck@master
-
-  cram:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - uses: actions/setup-python@v4
-      - run: pip install cram
-      - run: cram tests/
14 changes: 14 additions & 0 deletions ingest/vendored/.github/workflows/pre-commit.yaml
@@ -0,0 +1,14 @@
+name: pre-commit
+
+on:
+  - push
+
+jobs:
+  pre-commit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - uses: pre-commit/[email protected]
4 changes: 2 additions & 2 deletions ingest/vendored/.gitrepo
@@ -6,7 +6,7 @@
 [subrepo]
 remote = https://github.com/nextstrain/ingest
 branch = main
-commit = c97df238518171c2b1574bec0349a55855d1e7a7
-parent = 4fab4c912745a362c006a2bf893bf4530e050af0
+commit = cd6d31a3b35cd1bb7eddf830c565be6d6e69f27a
+parent = 4ed14150b4c09e72881d03375c42ce5aeeafc5e7
 method = merge
 cmdver = 0.4.6
40 changes: 40 additions & 0 deletions ingest/vendored/.pre-commit-config.yaml
@@ -0,0 +1,40 @@
+default_language_version:
+  python: python3
+repos:
+  - repo: https://github.com/pre-commit/sync-pre-commit-deps
+    rev: v0.0.1
+    hooks:
+      - id: sync-pre-commit-deps
+  - repo: https://github.com/shellcheck-py/shellcheck-py
+    rev: v0.10.0.1
+    hooks:
+      - id: shellcheck
+  - repo: https://github.com/rhysd/actionlint
+    rev: v1.6.27
+    hooks:
+      - id: actionlint
+        entry: env SHELLCHECK_OPTS='--exclude=SC2027' actionlint
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+      - id: trailing-whitespace
+      - id: check-ast
+      - id: check-case-conflict
+      - id: check-docstring-first
+      - id: check-json
+      - id: check-executables-have-shebangs
+      - id: check-merge-conflict
+      - id: check-shebang-scripts-are-executable
+      - id: check-symlinks
+      - id: check-toml
+      - id: check-yaml
+      - id: destroyed-symlinks
+      - id: detect-private-key
+      - id: end-of-file-fixer
+      - id: fix-byte-order-marker
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.4.6
+    hooks:
+      # Run the linter.
+      - id: ruff
78 changes: 57 additions & 21 deletions ingest/vendored/README.md
@@ -2,7 +2,7 @@

 Shared internal tooling for pathogen data ingest. Used by our individual
 pathogen repos which produce Nextstrain builds. Expected to be vendored by
-each pathogen repo using `git subtree`.
+each pathogen repo using `git subrepo`.

 Some tools may only live here temporarily before finding a permanent home in
 `augur curate` or Nextstrain CLI. Others may happily live out their days here.
@@ -12,6 +12,9 @@ Some tools may only live here temporarily before finding a permanent home in
 Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
 (See discussion on this decision in https://github.com/nextstrain/ingest/issues/3)

+For a list of Nextstrain repos that are currently using this method, use [this
+GitHub code search](https://github.com/search?type=code&q=org%3Anextstrain+subrepo+%22remote+%3D+https%3A%2F%2Fgithub.com%2Fnextstrain%2Fingest%22).
+
 If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
 Then add the latest ingest scripts to the pathogen repo by running:

@@ -25,18 +28,43 @@ Any future updates of ingest scripts can be pulled in with:
 git subrepo pull ingest/vendored
 ```

+If you run into merge conflicts and would like to pull in a fresh copy of the
+latest ingest scripts, pull with the `--force` flag:
+
+```
+git subrepo pull ingest/vendored --force
+```
+
+> **Warning**
+> Beware of rebasing/dropping the parent commit of a `git subrepo` update
+
+`git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file,
+which includes the hash for the parent commit in the pathogen repos.
+If this hash no longer exists in the commit history, there will be errors when
+running future `git subrepo pull` commands.
+
+If you run into an error similar to the following:
+```
+$ git subrepo pull ingest/vendored
+git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
+fatal: not a valid object name: ''
+```
+Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make
+sure the commit exists in the commit history. Update to the appropriate parent
+commit hash if needed.
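
A minimal sketch of the check described in the warning above, written in Python. It is an illustration only and not part of the vendored scripts: `read_gitrepo_field` and `commit_in_history` are hypothetical helper names, and the sketch assumes `git` is available on `PATH` and that `.gitrepo` uses the `key = value` layout shown in this PR.

```python
import re
import subprocess
from typing import Optional


def read_gitrepo_field(text: str, field: str) -> Optional[str]:
    """Return the value of `field` from .gitrepo-style text, or None if it is absent or empty."""
    # Match "field = value" on a single line; [ \t] keeps the match from spilling onto the next line.
    pattern = rf"^[ \t]*{re.escape(field)}[ \t]*=[ \t]*(\S+)[ \t]*$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


def commit_in_history(commit: Optional[str], repo_dir: str = ".") -> bool:
    """Return True if `commit` is an ancestor of HEAD in the repository at `repo_dir`."""
    if not commit:
        return False
    # `git merge-base --is-ancestor A B` exits 0 only when A is reachable from B.
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, "HEAD"],
        capture_output=True,
    )
    return result.returncode == 0
```

Run from the root of a pathogen repo, passing the contents of `ingest/vendored/.gitrepo` to `read_gitrepo_field(..., "parent")` and the result to `commit_in_history` would indicate whether the recorded parent is still reachable from `HEAD`. An ancestry check is used here instead of a plain object lookup because a commit dropped by a rebase can remain in the object database for a while.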

 ## History

 Much of this tooling originated in
 [ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
-[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
-It subsequently proliferated from [monkeypox][] to other pathogen repos
-([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
-thru copying. To [counter that
+[mpox's ingest/](https://github.com/nextstrain/mpox/tree/@/ingest/). It
+subsequently proliferated from [mpox][] to other pathogen repos ([rsv][],
+[zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily thru
+copying. To [counter that
 proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
 this repo was made.

-[monkeypox]: https://github.com/nextstrain/monkeypox
+[mpox]: https://github.com/nextstrain/mpox
 [rsv]: https://github.com/nextstrain/rsv
 [zika]: https://github.com/nextstrain/zika/pull/24
 [dengue]: https://github.com/nextstrain/dengue/pull/10
@@ -72,10 +100,9 @@ Scripts for supporting ingest workflow automation that don’t really belong in
 NCBI interaction scripts that are useful for fetching public metadata and sequences.

 - [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
-Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) or [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.
-- [fetch-from-ncbi-virus](fetch-from-ncbi-virus) - Fetch metadata and nucleotide sequences from [NCBI Virus](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) and output NDJSON records to stdout.
-- [ncbi-virus-url](ncbi-virus-url) - Generates the URL to download metadata and sequences from NCBI Virus as a single CSV file.
-- [csv-to-ndjson](csv-to-ndjson) - Converts CSV file to NDJSON file with a hard-coded 200MiB field size limit to accommodate sequences in the NCBI Virus download.
+Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.
+
+Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18.

 Potential Nextstrain CLI scripts

@@ -90,14 +117,6 @@ Potential Nextstrain CLI scripts
 - [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
 Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.
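
The skip-if-unchanged behaviour described for `download-from-s3` can be sketched in Python as follows. This is a hedged illustration of the idea, not the script's actual implementation: `sha256_of_file` and `should_skip_download` are hypothetical names, and the caller is assumed to have already read the `sha256sum` value from the S3 object's metadata by some other means.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_skip_download(local_path, remote_sha256):
    """Return True only if the local file exists and its hash equals the remote hash."""
    path = Path(local_path)
    # Without a local file or a recorded remote hash there is nothing to compare, so download.
    if not path.is_file() or not remote_sha256:
        return False
    return sha256_of_file(path) == remote_sha256.lower()
```

Treating a missing remote hash as "do not skip" keeps the sketch conservative: the file is fetched again whenever equality cannot be confirmed.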

-Potential augur curate scripts
-
-- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
-- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
-- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
-- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
-- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
-
 ## Software requirements

 Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash) which is more up to date.
@@ -106,7 +125,24 @@ Some scripts may require Bash ≥4. If you are running these scripts on macOS, t

 Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

-For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally,
+## Working on this repo
+
+This repo is configured to use [pre-commit](https://pre-commit.com),
+to help automatically catch common coding errors and syntax issues
+with changes before they are committed to the repo.
+
+If you will be writing new code or otherwise working within this repo,
+please do the following to get started:
+
+1. [install `pre-commit`](https://pre-commit.com/#install) by running
+   either `python -m pip install pre-commit` or `brew install
+   pre-commit`, depending on your preferred package management
+   solution
+2. install the local git hooks by running `pre-commit install` from
+   the root of the repo
+3. when problems are detected, correct them in your local working tree
+   before committing them.

-1. Download Cram: `pip install cram`
-2. Run the tests: `cram tests/`
+Note that these pre-commit checks are also run in a GitHub Action when
+changes are pushed to GitHub, so correcting issues locally will
+prevent extra cycles of correction.