Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.io.IOException: Snapshot medusa-<....> already exists. #390

Open
kiddom-kq opened this issue Aug 16, 2021 · 2 comments
Open

java.io.IOException: Snapshot medusa-<....> already exists. #390

kiddom-kq opened this issue Aug 16, 2021 · 2 comments
Labels
complexity: low enhancement New feature or request good first issue Good for newcomers product-backlog Issues in the state 'product-backlog'

Comments

@kiddom-kq
Copy link
Contributor

kiddom-kq commented Aug 16, 2021

Project board link

I've been working on standing up Medusa the last few days. Most of the issues that I hit have workarounds documented:

But this one didn't; i'm creating this issue to document the problem for future benefit.


On an admin node, i would run:

$ medusa backup-cluster ...

Which would then come back with a failure message that appeared to be the same on each node in the cluster

When looking at the /tmp/medusa-job-<...>/stderr file on each node, I would see something like this:

<...>
[2021-08-11 21:56:17,250] DEBUG: https://s3-us-east-2.amazonaws.com:443 "HEAD /some-bucket-name-here/ClusterNameHere/ip-10-69-42-1.us-east-2.compute.internal/None/meta/differential HTTP/1.1" 404 0
[2021-08-11 21:56:17,251] DEBUG: [Storage] Getting object ClusterNameHere/ip-10-69-42-1.us-east-2.compute.internal/None/meta/incremental
[2021-08-11 21:56:17,255] DEBUG: https://s3-us-east-2.amazonaws.com:443 "HEAD /some-bucket-name-here HTTP/1.1" 200 0
[2021-08-11 21:56:17,267] DEBUG: https://s3-us-east-2.amazonaws.com:443 "HEAD /some-bucket-name-here/ClusterNameHere/ip-10-69-42-1.us-east-2.compute.internal/None/meta/incremental HTTP/1.1" 404 0
[2021-08-11 21:56:17,267] INFO: Node ip-10-69-42-1.us-east-2.compute.internal does not have latest backup
[2021-08-11 21:56:17,267] INFO: Starting backup
[2021-08-11 21:56:17,268] INFO: Creating snapshot
[2021-08-11 21:56:17,268] DEBUG: Executing: nodetool -u someUserNameHere -pw somePasswordStringHere snapshot -t medusa-202108112127
error: Snapshot medusa-202108112127 already exists.
-- StackTrace --
java.io.IOException: Snapshot medusa-202108112127 already exists.

A bizarre error because just seconds before the error showed up, the nodetool snapshot command was executed cluster-wide.
Something was wrong with the logic that figured out which stage in the backup process this cluster node was at; the cluster node was behaving like no snapshot had yet been created, but that was obviously not true.

Looking closer, it looks like None is showing up in the S3 object path.... which indicates something broke and wasn't caught. Dynamic typed languages are wonderful... right up until they're not :/.

When looking through the S3 bucket, it was clear that some data was being uploaded. There was a directory structure that Medusa had created and some of the metadata files like tokenmap.json had accurate data in them.

However, the total size of all objects in the backup path was about 1 MB which is about a thousand times smaller than the size of data on disk in the cluster.

The problem is not in S3 access; Medusa has no problem creating objects in the bucket for internal use so there must be a problem either at the stage where backup files are created or where the files are uploaded. The logs seemed to back this hunch up:

[2021-08-12 18:52:14,738] CRITICAL: FIND ME: the result of snapshot.find_dirs() is :[]
[2021-08-12 18:52:14,739] INFO: Updating backup index
[2021-08-12 18:52:15,518] INFO: Backup done
[2021-08-12 18:52:15,519] INFO: - Started: 2021-08-12 18:52:08
                        - Started extracting data: 2021-08-12 18:52:12
                        - Finished: 2021-08-12 18:52:15
[2021-08-12 18:52:15,519] INFO: - Real duration: 0:00:03.411112 (excludes time waiting for other nodes)
[2021-08-12 18:52:15,519] INFO: - 0 files, 0.00 B
[2021-08-12 18:52:15,519] INFO: - 0 files copied from host

So I started looking through the Medusa codebase to better understand where nodetool snapshot files are uploaded to S3.

The CRITICAL: FIND ME: log is something I added while debugging / tracing the code flow.
The snapshot.find_dirs() call returning an empty list explained why the snapshot data files never made it into S3.

I spent an embarrassing amount of time stepping through and isolating the code that searches through the configured Cassandra data locations looking for snapshot directories, but eventually figured out why my test code returned the correct set of files and directories while Medusa didn't: permissions.

In short, the medusa backup job was running on the Cassandra node as the medusa user. The cassandra user owns all directories:

root@some-cassandra-node-01:/usr/share/cassandra-medusa/lib/python3.8/site-packages/medusa# ls -lah /cassandra/
total 40K
drwxr-xr-x  7 root      root      4.0K Aug 11 20:15 .
drwxr-xr-x 19 root      root      4.0K Aug 11 20:14 ..
drwxr-x---  2 cassandra cassandra 4.0K Aug 12 18:24 commitlog
drwxr-x---  9 cassandra cassandra 4.0K Aug 11 20:15 data
drwxr-x---  2 cassandra cassandra 4.0K Aug 11 20:15 hints
drwx------  2 root      root       16K Aug 11 20:15 lost+found
drwxr-x---  2 cassandra cassandra 4.0K Aug 13 16:22 saved_caches

If the medusa user tries to list the content of /cassandra/data/*/*/snapshots, the ls operation would fail with some sort of Permission denied error.... except this error is silently discarded with glob().

A quick sudo chown -R cassandra:medusa later and backup operations started working:

[2021-08-13 23:18:27,186] INFO: A new backup test02 was created on all nodes.
[2021-08-13 23:18:27,187] INFO: Backup duration: 10484
[2021-08-13 23:18:27,187] INFO: Backup of the cluster done.

Hopefully that helps somebody! I am slowly making my way through the testing setup for Medusa trying to get a better idea of how it all works so I can contribute some patches to Medusa that will focus on sanity checks / logging with the intent of failing as soon as possible rather than letting silent failures propagate!

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: MED-56

@adejanovski adejanovski added the new Issues requiring triage label Aug 16, 2021
kiddom-kq added a commit to kiddom/cassandra-medusa that referenced this issue Aug 23, 2021
During node backup, the python process running as the `medusa` user was unable to locate snapshot files on disk. If no plausible directories with snapshot files exist on the host, consider this a failure condition and abort. Do not let execution proceed erroneously leading to a "backup successful cess" message.
@jsanda
Copy link
Contributor

jsanda commented Apr 19, 2022

Please add your planning poker estimate with ZenHub @adejanovski

@adejanovski adejanovski moved this to Product Backlog in K8ssandra Nov 8, 2022
@adejanovski adejanovski added the product-backlog Issues in the state 'product-backlog' label Jun 5, 2023
@rzvoncek
Copy link
Contributor

rzvoncek commented Apr 4, 2024

This is very much still an issue, excep we no longer get the CRITICAL log line. The backup passes through with uploading 0 files.

It'd be nice to print a warning and fail properly instead of pretending all is dandy.

@rzvoncek rzvoncek added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed complexity: low help-wanted Issues in the state 'help-wanted' and removed new Issues requiring triage zh:Product-Backlog product-backlog Issues in the state 'product-backlog' help wanted Extra attention is needed labels Apr 4, 2024
@adejanovski adejanovski added product-backlog Issues in the state 'product-backlog' and removed help-wanted Issues in the state 'help-wanted' labels Apr 8, 2024
@adejanovski adejanovski moved this from Product Backlog to Help Wanted in K8ssandra Apr 8, 2024
@adejanovski adejanovski added help-wanted Issues in the state 'help-wanted' and removed product-backlog Issues in the state 'product-backlog' labels Apr 8, 2024
@adejanovski adejanovski moved this from Product Backlog to Help Wanted in K8ssandra Apr 8, 2024
@adejanovski adejanovski added help-wanted Issues in the state 'help-wanted' and removed product-backlog Issues in the state 'product-backlog' labels Apr 8, 2024
@adejanovski adejanovski moved this from Help Wanted to Product Backlog in K8ssandra Apr 8, 2024
@adejanovski adejanovski added product-backlog Issues in the state 'product-backlog' and removed help-wanted Issues in the state 'help-wanted' labels Apr 8, 2024
@adejanovski adejanovski moved this from Product Backlog to Help Wanted in K8ssandra Apr 8, 2024
@adejanovski adejanovski added help-wanted Issues in the state 'help-wanted' and removed product-backlog Issues in the state 'product-backlog' labels Apr 8, 2024
@adejanovski adejanovski moved this from Help Wanted to Product Backlog in K8ssandra Apr 8, 2024
@adejanovski adejanovski added product-backlog Issues in the state 'product-backlog' and removed help-wanted Issues in the state 'help-wanted' labels Apr 8, 2024
@adejanovski adejanovski moved this from Product Backlog to Help Wanted in K8ssandra Apr 8, 2024
@adejanovski adejanovski added help-wanted Issues in the state 'help-wanted' and removed product-backlog Issues in the state 'product-backlog' labels Apr 8, 2024
@adejanovski adejanovski moved this from Help Wanted to Product Backlog in K8ssandra Apr 8, 2024
@adejanovski adejanovski added product-backlog Issues in the state 'product-backlog' and removed help-wanted Issues in the state 'help-wanted' labels Apr 8, 2024
@adejanovski adejanovski moved this from Product Backlog to Help Wanted in K8ssandra Apr 8, 2024
@adejanovski adejanovski added help-wanted Issues in the state 'help-wanted' and removed product-backlog Issues in the state 'product-backlog' labels Apr 8, 2024
@adejanovski adejanovski moved this from Help Wanted to Product Backlog in K8ssandra Apr 8, 2024
@adejanovski adejanovski added product-backlog Issues in the state 'product-backlog' and removed help-wanted Issues in the state 'help-wanted' labels Apr 8, 2024
@adejanovski adejanovski moved this from Product Backlog to Help Wanted in K8ssandra Apr 8, 2024
@adejanovski adejanovski added help-wanted Issues in the state 'help-wanted' and removed product-backlog Issues in the state 'product-backlog' labels Apr 8, 2024
@adejanovski adejanovski moved this from Help Wanted to Product Backlog in K8ssandra Apr 8, 2024
@adejanovski adejanovski added product-backlog Issues in the state 'product-backlog' and removed help-wanted Issues in the state 'help-wanted' labels Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
complexity: low enhancement New feature or request good first issue Good for newcomers product-backlog Issues in the state 'product-backlog'
Projects
None yet
Development

No branches or pull requests

4 participants