java.io.IOException: Snapshot medusa-<....> already exists. #390
Labels: complexity: low, enhancement, good first issue, product-backlog

Comments
kiddom-kq added a commit to kiddom/cassandra-medusa that referenced this issue on Aug 23, 2021:
During node backup, the python process running as the `medusa` user was unable to locate snapshot files on disk. If no plausible directories with snapshot files exist on the host, consider this a failure condition and abort. Do not let execution proceed erroneously, leading to a spurious "backup successful" message.
Please add your planning poker estimate with ZenHub @adejanovski
This is very much still an issue, except we no longer get the exception above. It'd be nice to print a warning and fail properly instead of pretending all is dandy.
Issue description
I've been working on standing up Medusa for the last few days. Most of the issues that I hit have workarounds documented (e.g. "/" on S3 doesn't seem to work correctly, #212). But this one didn't; I'm creating this issue to document the problem for future benefit.
On an admin node, I would run the backup command, which came back with a failure message that appeared to be the same on each node in the cluster.
When looking at the `/tmp/medusa-job-<...>/stderr` file on each node, I would see something like the `java.io.IOException: Snapshot medusa-<....> already exists` error in the title. A bizarre error, because just seconds before it showed up, the `nodetool snapshot` command had been executed cluster-wide. Something was wrong with the logic that figures out which stage of the backup process a cluster node is at; the node was behaving as if no snapshot had been created yet, but that was obviously not true.
Looking closer, `None` was showing up in the S3 object path, which indicates something broke and wasn't caught. Dynamically typed languages are wonderful... right up until they're not :/.

When looking through the S3 bucket, it was clear that some data was being uploaded. There was a directory structure that Medusa had created, and some of the metadata files like `tokenmap.json` had accurate data in them. However, the total size of all objects in the backup path was about 1 MB, roughly a thousand times smaller than the size of the data on disk in the cluster. The problem was not S3 access; Medusa had no trouble creating objects in the bucket for internal use, so there had to be a problem either at the stage where backup files are created or where the files are uploaded. The logs seemed to back this hunch up.
So I started looking through the Medusa codebase to better understand where `nodetool snapshot` files are uploaded to S3. (The `CRITICAL: FIND ME:` log line is something I added while debugging / tracing the code flow.) The `snapshot.find_dirs()` call returning an empty list explained why the snapshot data files never made it into S3. I spent an embarrassing amount of time stepping through and isolating the code that searches the configured Cassandra data locations for snapshot directories, but eventually figured out why my test code returned the correct set of files and directories while Medusa didn't: permissions.
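For context, such a search boils down to globbing for Cassandra's on-disk snapshot layout under each configured data directory. A simplified, hypothetical sketch of the idea (my illustration, not Medusa's actual `find_dirs()` implementation):

```python
import glob
import os

def find_snapshot_dirs(data_dir, tag):
    """Hypothetical, simplified search: Cassandra lays out snapshots as
    <data_dir>/<keyspace>/<table>/snapshots/<tag>, so glob for that."""
    pattern = os.path.join(data_dir, "*", "*", "snapshots", tag)
    return [p for p in glob.glob(pattern) if os.path.isdir(p)]

# Returns [] both when no snapshot exists AND when the keyspace/table
# directories are unreadable -- the two cases look identical to the caller.
print(find_snapshot_dirs("/cassandra/data", "medusa-backup1"))
```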
In short, the Medusa backup job was running on the Cassandra node as the `medusa` user, while the `cassandra` user owns all the data directories:

```
root@some-cassandra-node-01:/usr/share/cassandra-medusa/lib/python3.8/site-packages/medusa# ls -lah /cassandra/
total 40K
drwxr-xr-x  7 root      root      4.0K Aug 11 20:15 .
drwxr-xr-x 19 root      root      4.0K Aug 11 20:14 ..
drwxr-x---  2 cassandra cassandra 4.0K Aug 12 18:24 commitlog
drwxr-x---  9 cassandra cassandra 4.0K Aug 11 20:15 data
drwxr-x---  2 cassandra cassandra 4.0K Aug 11 20:15 hints
drwx------  2 root      root       16K Aug 11 20:15 lost+found
drwxr-x---  2 cassandra cassandra 4.0K Aug 13 16:22 saved_caches
```
If the `medusa` user tries to list the contents of `/cassandra/data/*/*/snapshots`, the `ls` operation fails with a `Permission denied` error... except that this error is silently discarded by `glob()`.
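This is easy to demonstrate with the standard library alone. Run as the `medusa` user on one of the nodes (paths assumed from the listing above):

```python
import glob
import os

# glob() catches the OSError it hits while scanning unreadable
# directories and simply returns [] -- indistinguishable from
# "no snapshots exist":
print(glob.glob("/cassandra/data/*/*/snapshots"))  # [] -- no error raised

# A direct listing of the same tree surfaces the real cause:
try:
    os.listdir("/cassandra/data")
except PermissionError as exc:
    print(f"there's the problem: {exc}")
```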
A quick `sudo chown -R cassandra:medusa` later and backup operations started working.

Hopefully that helps somebody! I am slowly making my way through the testing setup for Medusa, trying to get a better idea of how it all works so I can contribute some patches focused on sanity checks / logging, with the intent of failing as soon as possible rather than letting silent failures propagate!
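One possible shape for such a sanity check (a sketch under my own assumptions, not an actual Medusa patch): verify up front that the backup user can actually traverse the data directories, and abort loudly if it cannot:

```python
import os
import sys

def assert_data_dirs_readable(data_dir):
    """Fail fast if the backup user cannot traverse the data directory
    tree; a silent [] from glob() later is much harder to diagnose."""
    if not os.access(data_dir, os.R_OK | os.X_OK):
        sys.exit(f"FATAL: cannot read {data_dir} as uid {os.getuid()}; "
                 "check directory ownership/permissions")
    for entry in os.listdir(data_dir):
        path = os.path.join(data_dir, entry)
        if os.path.isdir(path) and not os.access(path, os.R_OK | os.X_OK):
            sys.exit(f"FATAL: cannot read {path} as uid {os.getuid()}")

assert_data_dirs_readable("/cassandra/data")
```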
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: MED-56