Zarr commit date is AFTER the last-modified of new/pending Zarr entry #66

Open
yarikoptic opened this issue Feb 7, 2025 · 5 comments

yarikoptic commented Feb 7, 2025

A long story, with the conclusion summarized in the issue title:

In an attempt to provide a temporary remedy for

I reran our tool with --mode force, which ignores differences in metadata that happened without the timestamp being modified. Then I reran with --mode verify, which still failed with

a single error, for a single Zarr in dandiset 001246:
    | backups2datalad.util.UnexpectedChangeError: Dandiset 001246: Zarr 1c0e023e-150d-43a7-948b-dc46dd514ca6: entry '.zattrs' added, but Dandiset draft timestamp was not updated on server
    +------------------------------------
2025-02-06T10:06:12-0500 [ERROR   ] backups2datalad: An error occurred:
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.13/site-packages/backups2datalad/__main__.py", line 119, in wrapped
    await f(datasetter, *args, **kwargs)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.13/site-packages/backups2datalad/__main__.py", line 229, in update_from_backup
    await datasetter.update_from_backup(dandisets, exclude=exclude)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.13/site-packages/backups2datalad/datasetter.py", line 135, in update_from_backup
    raise RuntimeError(
        f"Backups for {quantify(len(report.failed), 'Dandiset')} failed"
    )
RuntimeError: Backups for 1 Dandiset failed
Logs saved to /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/2025.02.06.14.57.12Z.log
action summary:
  publish (notneeded: 2)
2025-02-06T10:06:12-0500 [INFO    ] con-duct: Summary:
Exit Code: 1
Command: flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron  --mode verify
Log files location: .duct/logs/2025.02.06T09.57.04-2226083_
Wall Clock Time: 547.832 sec
Memory Peak Usage (RSS): 1.4 GB
Memory Average Usage (RSS): 502.4 MB
Virtual Memory Peak Usage (VSZ): 7.7 TB
Virtual Memory Average Usage (VSZ): 1.0 TB
Memory Peak Percentage: 1.9000000000000001%
Memory Average Percentage: 0.6165427509293678%
CPU Peak Usage: 168.2%
Average CPU Usage: 53.58773234200743%

And that Zarr is marked as "Complete":

❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/zarr/1c0e023e-150d-43a7-948b-dc46dd514ca6/' -H 'accept: application/json' | jq .
{
  "name": "sub-I39/micr/sub-I39_ses-01_ce-intensity_sample-occipital_chunk-034_OCT.ome.zarr",
  "dandiset": "001246",
  "zarr_id": "1c0e023e-150d-43a7-948b-dc46dd514ca6",
  "status": "Complete",
  "checksum": "09639cfd30d04838239fa0519e067916-478--307209961",
  "file_count": 478,
  "size": 307209961
}

The timestamp for the draft version of the dandiset is 2024-10-29T19:39:37.231865Z

❯ curl --silent -X 'GET' 'https://api.dandiarchive.org/api/dandisets/001246/versions/' -H 'accept: application/json' | jq . | ~/proj/CON/utils/bin/show-paths -e modified -f full-lines
4    "results": [
13:       "modified": "2024-10-29T19:39:37.231865Z",
14        "dandiset": {
17:         "modified": "2024-10-29T19:17:15.901343Z",

and for that .zattrs:

❯ wget -S 'https://dandiarchive.s3.amazonaws.com/zarr/1c0e023e-150d-43a7-948b-dc46dd514ca6/.zattrs?versionId=ls2qFhNsIojrQsi_2J0Udc72EEuKzsS9' 2>&1 | grep Last
  Last-Modified: Tue, 29 Oct 2024 19:27:41 GMT

and we have

dandi@drogon:/mnt/backup/dandi/dandisets/001246$ cat .dandi/assets-state.json 
{
    "timestamp": "2024-10-29T19:39:37.231865Z"
}
dandi@drogon:/mnt/backup/dandi/dandisets/001246$ git show | grep Date
Date:   2024 Oct 29 15:39:37 -0400

and in the Zarr backup dataset:

dandi@drogon:/mnt/backup/dandi/dandizarrs/1c0e023e-150d-43a7-948b-dc46dd514ca6$ git show | grep Date
Date:   2024 Oct 29 15:28:10 -0400
dandi@drogon:/mnt/backup/dandi/dandizarrs/1c0e023e-150d-43a7-948b-dc46dd514ca6$ ls -ld .zattrs
ls: cannot access '.zattrs': No such file or directory

and that .zattrs has

dandi@drogon:/mnt/backup/dandi/dandizarrs/1c0e023e-150d-43a7-948b-dc46dd514ca6$ wget -O /dev/null -S 'https://dandiarchive.s3.amazonaws.com/zarr/1c0e023e-150d-43a7-948b-dc46dd514ca6/.zattrs?versionId=ls2qFhNsIojrQsi_2J0Udc72EEuKzsS9' 2>&1 | grep Last-Mo
  Last-Modified: Tue, 29 Oct 2024 19:27:41 GMT

So we have a commit time of :28:10 (minutes:seconds), but the Last-Modified of the .zattrs (which was not yet added in that commit) is :27:41. How could that happen? How exactly do we decide on the commit date for Zarrs - is it based on Last-Modified?

I could hypothesize/propose:

  • the upload uses the local file's last-modified time, so the user's laptop had its clock wrong and a file uploaded later ended up with a Last-Modified back in time... that would mean we cannot rely on Last-Modified
  • maybe while we were listing the Zarr, new files were still being uploaded, so depending on the listing we might catch some newer keys/versions while unfortunately missing older ones?
  • should we then add a guard: if any Last-Modified is "too close to now" (e.g. within the time it takes to list), do not act on it, sleep for that amount of time, and redo the listing from the beginning (while observing the same guard)? (a minimal sketch is below)
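
A minimal sketch of that guard idea, assuming a plain boto3 listing of the latest object versions under the Zarr's prefix (the bucket/prefix constants and helper names are only illustrative, not backups2datalad internals):

# Hypothetical sketch: do not act on a listing if any entry's Last-Modified
# falls within the window the listing itself took; instead sleep and re-list.
import time
from datetime import datetime, timedelta, timezone

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "dandiarchive"  # illustrative values taken from this issue
PREFIX = "zarr/1c0e023e-150d-43a7-948b-dc46dd514ca6/"

def list_latest_entries(s3):
    """Return (latest object versions, how long the listing took)."""
    start = time.monotonic()
    entries = []
    for page in s3.get_paginator("list_object_versions").paginate(
        Bucket=BUCKET, Prefix=PREFIX
    ):
        entries.extend(v for v in page.get("Versions", []) if v["IsLatest"])
    return entries, timedelta(seconds=time.monotonic() - start)

def list_with_guard(s3):
    while True:
        entries, took = list_latest_entries(s3)
        cutoff = datetime.now(timezone.utc) - took
        if all(v["LastModified"] <= cutoff for v in entries):
            return entries  # listing looks settled; safe to act on it
        # uploads may still be in flight; wait roughly one listing interval
        time.sleep(took.total_seconds())

entries = list_with_guard(boto3.client("s3", config=Config(signature_version=UNSIGNED)))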

wdyt @jwodder?

@jwodder jwodder changed the title zarr commit date is AFTER the last-modified of not yet added to zarr file Zarr commit date is AFTER the last-modified of new/pending Zarr entry Feb 7, 2025

jwodder commented Feb 7, 2025

@yarikoptic The commit timestamp for Zarr backups is the latest Last-Modified timestamp of all latest objects in the Zarr on S3 at the time of the backup.
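
Roughly (a boto3 approximation of that logic, not the actual backups2datalad code; the bucket/prefix are taken from this issue's example):

# Hedged approximation: the backup commit gets dated with the max Last-Modified
# over the latest object versions under the Zarr prefix.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
latest = []
for page in s3.get_paginator("list_object_versions").paginate(
    Bucket="dandiarchive",
    Prefix="zarr/1c0e023e-150d-43a7-948b-dc46dd514ca6/",
):
    latest.extend(v for v in page.get("Versions", []) if v["IsLatest"])

print(max(v["LastModified"] for v in latest))  # used as the backup commit's (author) date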

I'm not sure how this situation happened, but your theories don't seem right.

the upload uses the local file's last-modified time

Not true; no metadata about local files is sent to S3 when uploading a Zarr entry.

maybe while we were listing the Zarr, new files were still being uploaded

The actual commit date (not the author date, which we fake) of the HEAD of the Zarr backup is 2024 Nov 18 21:21:30 -0500, three weeks after the .zattrs was uploaded, so I don't think that's it.
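
(For clarity, a hedged illustration of that author-date/commit-date split; whether backups2datalad fakes the author date exactly this way is an assumption here:)

# GIT_AUTHOR_DATE is standard git; setting only it leaves the commit date at
# "now", which is why AuthorDate and CommitDate can differ by weeks in the backup.
import os
import subprocess

env = dict(os.environ, GIT_AUTHOR_DATE="2024-10-29T15:28:10-0400")
subprocess.run(["git", "commit", "-m", "[backups2datalad] ..."], env=env, check=True)
# git log --format=fuller then shows AuthorDate = the faked timestamp and
# CommitDate = the wall-clock time when the backup actually ran.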


jwodder commented Feb 7, 2025

@yarikoptic Looking into this further, I think the original backup of the Zarr errored out somehow partway through. For one thing, the actual Zarr has a bunch of files that aren't in the backup, and the HEAD^ commit has a commit message of "478 files added", yet the only files it touched were two files in .dandi/.


jwodder commented Feb 7, 2025

@yarikoptic

I reran our tool with --mode force, which ignores differences in metadata that happened without the timestamp being modified.

Note that --mode force only affects when to back up a Dandiset; to force backups of Zarrs as well, you need to also pass --zarr-mode force.

@yarikoptic
Member Author

@yarikoptic Looking into this further, I think the original backup of the Zarr errored out somehow partway through.

Why do you think so? My counter-argument is that if it had errored out, there would be no commit, right?

For one thing, the actual Zarr has a bunch of files that aren't in the backup, and the HEAD^ commit has a commit message of "478 files added", yet the only files it touched were two files in .dandi/.

Good observation; FTR:

(dandi-cli-latest2) (3) dandi@drogon:/mnt/backup/dandi/dandizarrs/1c0e023e-150d-43a7-948b-dc46dd514ca6$ git show --stat 014065e
commit 014065e
Author: DANDI User <[email protected]>
Date:   2024 Oct 29 15:28:10 -0400

    [backups2datalad] 478 files added, checksum updated

 .dandi/s3sync.json   | 5 +++++
 .dandi/zarr-checksum | 1 +
 2 files changed, 6 insertions(+)

but wouldn't that just be indicative of some bug, rather than of the backup having "errored out"?

FTR, in my mailbox I have indications of that dandiset being found dirty only starting Nov 07:

2024-11-07T10:02:48-0500 [ERROR   ] backups2datalad: Job failed on input <Dandiset 001246/draft>:
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/site-packages/backups2datalad/aioutil.py", line 177, in dowork
    outp = await func(inp)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/site-packages/backups2datalad/datasetter.py", line 192, in update_dandiset
    changed = await self.sync_dataset(dandiset, ds, dmanager)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/site-packages/backups2datalad/datasetter.py", line 233, in sync_dataset
    raise RuntimeError(f"Dirty {dandiset}; clean or save before running")
RuntimeError: Dirty Dandiset 001246/draft; clean or save before running

and that continues until Nov 18 (2024-11-18T21:02:54-0500), while the most recent commit is

(dandi-cli-latest2) (3) dandi@drogon:/mnt/backup/dandi/dandizarrs/1c0e023e-150d-43a7-948b-dc46dd514ca6$ git log --format=fuller HEAD^..HEAD
commit bfa05e0 (HEAD -> draft, github/draft)
Author:     DANDI User <[email protected]>
AuthorDate: 2024 Oct 29 15:28:10 -0400
Commit:     DANDI Team <[email protected]>
CommitDate: 2024 Nov 18 21:21:30 -0500

    [backups2datalad] 13 files added

So somehow we (I) addressed it that day, likely through git reset --hard; git clean within the dandiset... potentially also doing something similar in the Zarr.

FWIW, I do not see anything suspicious among the admin emails from drogon for Oct 29 (like running out of space, etc.).

For Oct 29 we have a single log matching that Zarr:

(dandi-cli-latest2) (3) dandi@drogon:/mnt/backup/dandi/dandizarrs/1c0e023e-150d-43a7-948b-dc46dd514ca6$ grep -l 1c0e023e-150d-43a7-948b-dc46dd514ca6 /mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/_old_/2024/10/2024.10.29*
/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad/_old_/2024/10/2024.10.29.20.02.11Z.log

and in that log we ran into issues with annex, starting with

2024-10-29T16:03:25-0400 [ERROR   ] asyncio: Exception in callback SubprocessStreamProtocol.pipe_data_received(1, b'add .dandi/... repository) ')
handle: <Handle SubprocessStreamProtocol.pipe_data_received(1, b'add .dandi/... repository) ')>
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/subprocess.py", line 72, in pipe_data_received
    reader.feed_data(data)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/streams.py", line 457, in feed_data
    assert not self._eof, 'feed_data after feed_eof'
AssertionError: feed_data after feed_eof
2024-10-29T16:03:25-0400 [DEBUG   ] backups2datalad: Command `git -c receive.autogc=0 -c gc.auto=0 annex whereis --batch-keys --json --json-error-messages` [cwd=/mnt/backup/dandi/dandizarrs/cf615248-6645-43ca-9a4e-c5087b73f1d3] exited with return code 1
2024-10-29T16:03:25-0400 [DEBUG   ] backups2datalad: Waiting for `git -c receive.autogc=0 -c gc.auto=0 annex registerurl -c annex.alwayscompact=false --batch --json --json-error-messages` [cwd=/mnt/backup/dandi/dandizarrs/cf615248-6645-43ca-9a4e-c5087b73f1d3] to terminate
2024-10-29T16:03:25-0400 [ERROR   ] asyncio: Exception in callback SubprocessStreamProtocol.pipe_data_received(1, b'ok\n')
handle: <Handle SubprocessStreamProtocol.pipe_data_received(1, b'ok\n')>
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/subprocess.py", line 72, in pipe_data_received
    reader.feed_data(data)
  File "/home/dandi/miniconda3/envs/dandisets-2/lib/python3.10/asyncio/streams.py", line 457, in feed_data
    assert not self._eof, 'feed_data after feed_eof'
AssertionError: feed_data after feed_eof

but that is long after that suspicious commit dated 2024 Oct 29 15:28:10 -0400. Please review it and see if you spot some other oddity/explanation.


jwodder commented Feb 7, 2025

@yarikoptic I don't know how that "empty" commit with an inaccurate commit message happened, but the "AssertionError: feed_data after feed_eof" is an anyio bug triggered by closing a process's stdout handle while there's more data to be read. The bug was fixed in anyio 4.5.0, released September 19, 2024. backups2datalad-cron-common should have updated anyio whenever a backup was run after that. anyio in the dandisets-2 conda env is currently 4.7.0, the next-to-latest version, with the latest version having been released on January 5, so maybe pip install --upgrade-strategy eager doesn't do what I think?
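
(For the record, a quick way to check what that env actually resolves at runtime; nothing backups2datalad-specific:)

# Report the anyio version (and location) the dandisets-2 env really imports;
# comparing it against the latest release on PyPI shows whether
# --upgrade-strategy eager actually bumps already-satisfied dependencies.
from importlib.metadata import version
import anyio

print(version("anyio"), anyio.__file__)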
