Some technical questions I got about the tool #734
Comments
Hi @ftarrega! I'd also like to know what your data churn rate is, meaning how much new data you foresee getting between two backups. Aside from the first backup, Medusa will only upload things that were not present in the previous backup (just as Alex mentioned). If you don't have much data coming in, nor too much compaction activity, the backups might actually not take very long. So the lifetime of the snapshots (and the disk usage associated with them) might be smaller than you think.
@rzvoncek, I really think this is a good idea. We've seen clusters running out of disk space due to snapshots being persisted as backups were running, and the change should be fairly light in our code.
Hey, guys. Hope you're all well.
First of all, thanks a lot for taking the time and effort to answer me. I'm still reviewing it all, but I'd like to answer Alex's questions right away, so:
Do you run backups through the command line and using the medusa backup-cluster command?
Through the command line, yes, but not with backup-cluster. As a company policy, nodes are not set up for remote connections between themselves. As far as I could gather from the code, backup-cluster requires pssh, and allowing that could be seen as a security risk within the company, so we are triggering medusa backup with the same backup name on each of the cluster nodes.
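For anyone landing here later, the per-node invocation boils down to something like this (a sketch; the backup name below is a placeholder, and it must be identical on every node of the cluster):

```bash
# Run on each node of the cluster, with the same backup name everywhere
medusa backup --backup-name=cluster-2024-04-05 --mode=differential
```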
Any reason why you want to use full as backup mode?
This could be from my misunderstanding of how to properly use the tool, but I learned this from an article you wrote on Medusa in 2019 (Medusa - Spotify's Apache Cassandra backup tool is now open source: https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html).
I figured that, for a first-ever run on any cluster, one should run a full backup and then keep track of the changes with differential backups as a follow-up. I thought that if we went for a differential one from the get-go, I'd be telling Medusa not to back up all the data in the cluster, and I'd be at fault for the missing data.
Which version of Medusa are you using?
I see you guys have just released v0.20.0 and v0.20.1, but when I started this endeavour v0.19.1 was the latest available, and that's the version we are currently using. Do you advise we upgrade? I suppose there's some relearning I'll have to do in that case, because I don't see the setup.py file in the repo anymore for these latest versions. I don't know much about Python, but I was able to make Medusa work from the Linux cassandra account: I used miniconda to set up a Python 3.8.18 env (couldn't figure out pipenv quickly enough :o), ran pip download and pip install, then python setup.py install, and finally made Python see all the dependencies by adding an export PYTHONPATH to the bash_profile of the cassandra user.
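For the record, that install flow boils down to something like this (a rough reconstruction of the steps as I described them; the PYTHONPATH value is elided since it depends on where the packages land):

```bash
# Isolated Python 3.8 environment via miniconda
conda create -n medusa python=3.8.18
conda activate medusa

# Fetch and install Medusa 0.19.1 and its dependencies
pip download cassandra-medusa==0.19.1 -d ./deps
pip install cassandra-medusa==0.19.1

# Run from the v0.19.1 source checkout
python setup.py install

# Let the cassandra account's shell find the installed modules
echo 'export PYTHONPATH=...' >> ~/.bash_profile
```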
What throughput do you effectively observe during backups?
Here's what I could draw from the AWS metric system.net.bytes_sent. I'd say that, on average, it clocked at no more than 50MiB/s:
[chart: system.net.bytes_sent during the backup window]
@rzvoncek, I understand the churn will be somewhat high, because the data holds B2B and B2C customers. We tend to have changes coming in every day as CRUD requests across many different tables and keyspaces.
Thanks again and best regards
Quoted reply from Alexander Dejanovski (April 4, 2024):
Is Medusa ready for PROD use? We were wondering about that because Medusa is still version 0 and hasn't reached v1 yet, and in such cases it's usually because a tool isn't mature enough, right? Is that the case?
Do you have a timeline for when you expect Medusa v1.X.X will be generally available?
It is. We never got to a point where it felt like moving to v1, although it's been used for years in production by many companies. The current thinking is that we're going to remove the "full" backup mode and keep "differential" only (although we recently realized that what Medusa does is "synthetic full backups").
The removal of a backup mode feels like a good moment to finally move to v1. This should happen within a few months.
How does Medusa handle the snapshots as the backup is getting created? Snapshots take a lot of storage space and are saved to the data folder, in practice competing for the resource with the actual data. My concern is it filling up the file system, having a compaction kick in, and the server starting to fail. So what's the flow? Will it take all snapshots and clear them only at the end of the backup, or does it do it table by table?
Snapshots are hardlinks; they only take space if the original SSTable file is deleted. As we need to take synchronized backups, if you start the backup using medusa backup-cluster, a snapshot will be taken on all nodes, and only after that will the uploads start. The snapshots are cleared only at the end of the backup, for all tables. We use nodetool to create and clear the snapshots, which is why it's done for all files at once.
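To illustrate the per-node mechanics, this is roughly what happens in terms of plain nodetool commands (a sketch of the underlying behavior, not necessarily Medusa's literal invocation; the tag is a placeholder):

```bash
# Create hardlink-based snapshots for all keyspaces under one tag
nodetool snapshot -t medusa-2024-04-05

# ...the snapshot files are uploaded to the storage backend...

# Clear the snapshots for all tables at once, by tag
nodetool clearsnapshot -t medusa-2024-04-05
```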
Your idea of deleting files after their upload is very interesting, though, and deserves to be investigated. It would limit the odds of running out of disk space on long-running backups.
There's a rule of thumb you learn from researching that you should keep 50% to 75% of storage space free for smooth Cassandra operation. Because of snapshots, and given the time a backup can take to finish, does it mean one needs more than 50% free disk space in order to not trouble the node while Medusa is running?
It depends a lot on the write/compaction flow (which will have an impact on whether or not the snapshots will be persisted) and the duration of backups. There's no general rule we can come up with here.
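To make the dependence concrete with a purely illustrative back-of-the-envelope (the 20% churn figure is an assumption, not a measurement): if compaction rewrites 20% of 1TB of node data while a day-long backup is holding its snapshot, the hardlinks keep roughly 200GB of replaced SSTables on disk until the snapshot is cleared.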
Currently we are clocking about 28h for a roughly 10TB backup on a 4-node cluster, with transfer_max_bandwidth = 1000MB/s and concurrent_transfers = 16 (it was set to 4 before; we were getting lots of "WARNING: Connection pool is full, discarding connection:" messages, so we changed it to 16, and we still get those messages, but fewer, it seems). It didn't seem to improve performance, so maybe snapshots are in fact taking the brunt of the backup time and we need to tweak Cassandra somehow? Any advice on how we can fine-tune a first-time --mode=full backup?
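For clarity, those two knobs live in the [storage] section of our medusa.ini; a minimal sketch restating the values above (assuming the stock setting names):

```ini
[storage]
; upload throttle (the stock default is 50MB/s)
transfer_max_bandwidth = 1000MB/s
; number of files uploaded in parallel
concurrent_transfers = 16
```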
You don't need to run "full" on the first backup. It's also inefficient if you switch to differential afterwards because the first differential will re-upload everything.
We recently released a new version that heals broken differential backups; broken differentials were really the only reason why full could be considered safer.
Avoiding upload of already uploaded files is really the best optimization you can get.
Do you run backups through the command line and using the medusa backup-cluster command?
Any reason why you want to use full as backup mode?
Which version of Medusa are you using?
What throughput do you effectively observe during backups?
That's a common misconception, so I clearly did a poor job at explaining this 😅 When differential backups run for the first time on a cluster, they upload everything. Subsequent backups only upload new files since the previous backup.
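Conceptually, the decision of what to upload is a diff against the previous backup's manifest. Here's a minimal sketch of the idea (illustrative only, not Medusa's actual code; names and the digest scheme are made up):

```python
def files_to_upload(snapshot_files: dict[str, str],
                    previous_manifest: dict[str, str]) -> list[str]:
    """Return paths whose content is not already in the previous backup.

    Both arguments map a file path to a content digest.
    """
    return [
        path
        for path, digest in snapshot_files.items()
        if previous_manifest.get(path) != digest
    ]

# First backup: the previous manifest is empty, so everything is uploaded.
assert files_to_upload({"a-Data.db": "x1"}, {}) == ["a-Data.db"]

# Later backups: unchanged files are skipped, only new SSTables go up.
assert files_to_upload({"a-Data.db": "x1", "b-Data.db": "y2"},
                       {"a-Data.db": "x1"}) == ["b-Data.db"]
```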
We've had some upload concurrency issues that were fixed in v0.20.x but they were affecting the grpc server I think, so it shouldn't have an impact on your setup. One interesting feature there though is the self healing of broken differential backups. We'll re-upload any file that is missing/corrupted in the storage bucket even if it was previously uploaded as part of an older backup.
Well, that would suggest your 1000MB/s throughput limit isn't being taken into account and the default throttling at 50MB/s is still applied 🤔 Could you share your medusa.ini file with us?
Yes, we removed the setup.py file.
Hey, lads. How are you? Just dropping by to share the medusa.ini file you requested:

[medusa.ini contents]
Hey lads, what's up? It's been a while. Let me take the opportunity, while this ticket is still open, to ask you some new questions rather than cluttering your ticket system with a new ticket. I've been wondering recently about encryption. Does Medusa ship with backup encryption capabilities as the backup is saved to an S3 bucket, for instance? Or, as you were designing it, did you feel this was better left to Cassandra, or even to the applications reading from and writing to the cluster? Thanks in advance.
Hey @ftarrega, well, we're all busy, you probably understand :-/ Looking at the ini file, there are 4 settings that might impact the upload speed:
To sum this up, tuning the throughput comes down to what your files look like: if you have many small files, you'll need more concurrency. And remove the […]

For encryption, there are two features we have, both only for S3:
For the other providers, you might be able to set up some encryption at rest, and it might work transparently.
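As one concrete example on the S3 side, server-side encryption with a KMS key is driven from medusa.ini. A minimal sketch (I'm assuming the kms_id setting name here; check it against the example config shipped with your version):

```ini
[storage]
; assumed setting name: S3 server-side encryption (SSE-KMS) with your key
kms_id = arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555
```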
Hi there, guys. Hope you're doing fine. My apologies, @rzvoncek, I didn't mean for my introductory comment to come across as pushy. It was just a bit of small talk really, as it had been a while since I visited the ticket too. I'm sure you guys are hard at work :-) Anyway, thanks a lot for your reply. I'm gonna close this ticket and open a fresh one, as I need some help with an issue I'm getting, if that's OK.
Hey team. I got in touch with Datastax support on their webpage chat because I couldn't find an email address or forum to ask you questions. I felt opening an issue would be frowned upon for such a request, but they told me this is indeed the place, so here I am =o)
I got a few questions about Cassandra Medusa that I'd appreciate having answered, as we are trying to implement it in our project, which will ultimately lead to using Medusa for production cluster backups and restores; the individual questions are quoted in the replies above.
Thanks in advance and best regards