af_scaletempo2: improve signal similarity metric #13737
The old formula worked well for stereo, but the results got worse with increased channel count. The taxicab distance works just as well for stereo, while not falling apart as the channel count grows. The downside is increased CPU usage. Maybe someone can try to vectorize this one like the old one was. The performance still isn't bad, so there is no pressing need for it. Fixes mpv-player#8705 (comment)
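For readers unfamiliar with the term: the taxicab (L1) distance between two audio blocks is the sum of absolute sample differences, taken over every channel. Below is a minimal Python sketch of using it to pick an overlap position. It is not the filter's C code; the names `taxicab_distance` and `best_offset` and the list-of-channels layout are assumptions made purely for illustration.

```python
def taxicab_distance(target, candidate):
    """Sum of absolute sample differences over all channels.

    Both arguments are lists of channels; each channel is a list of floats
    of equal length. Smaller values mean the blocks are more alike.
    """
    return sum(
        abs(t - c)
        for t_ch, c_ch in zip(target, candidate)
        for t, c in zip(t_ch, c_ch)
    )


def best_offset(target, search, block_len):
    """Return the offset into `search` whose block is closest to `target`."""
    num_candidates = len(search[0]) - block_len + 1
    best, best_dist = 0, float("inf")
    for off in range(num_candidates):
        # Slice the same window out of every channel of the search region.
        candidate = [ch[off:off + block_len] for ch in search]
        dist = taxicab_distance(target, candidate)
        if dist < best_dist:
            best, best_dist = off, dist
    return best
```

Because every sample of every channel is compared, the cost grows with the channel count, which fits the higher CPU usage mentioned above.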
I'm not sure this is the right approach to take. Chromium apparently was aware of this issue and chose to use resampling for small speed changes. What's the difference between this and |
iiuc you can get better results if you split the audio into several bands and extract granules (usually zero-crossings) in each band, and then stretch/multiply each granule, but i think this will mess with the stereo image in stereo and >2 ch audio.
@na-na-hi That's not the same problem. They use resampling when the speed change is within a 0.95-1.06 range, but if you test the sample in the linked comment you'll notice that it sounds bad with speeds much bigger than that (the comment mentions 1.1x, but it's also very noticeable above that).
scaletempo on master uses the dot product to determine similarity, whereas this uses the taxicab distance. Their implementations are also very different in how they deal with resets and speed changes, but I don't understand those parts of both filters well enough to explain how they're different or whether they end up behaving in exactly the same way. The CPU usage is also very different. I haven't done any benchmarks of the filters themselves, but I've kept track of how the CPU usage changed while implementing my changes by playing a section of a specific flac at 2.5x speed with the same parameters, measured with GNU time.
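To make the dot product versus taxicab distinction concrete, here is a small Python sketch. It is not taken from either filter: both function names are made up, and the dot product shown is the raw, unnormalized form, whereas a real implementation may normalize or window its input.

```python
def dot_product(target, candidate):
    """Raw dot product over all channels; a larger value means 'more similar'."""
    return sum(
        t * c
        for t_ch, c_ch in zip(target, candidate)
        for t, c in zip(t_ch, c_ch)
    )


def taxicab_distance(target, candidate):
    """Sum of absolute differences over all channels; smaller means 'more similar'."""
    return sum(
        abs(t - c)
        for t_ch, c_ch in zip(target, candidate)
        for t, c in zip(t_ch, c_ch)
    )


target = [[1.0, -1.0, 1.0, -1.0]]   # a single channel
exact = [[1.0, -1.0, 1.0, -1.0]]    # identical block
louder = [[3.0, -3.0, 3.0, -3.0]]   # same shape, three times the amplitude

# The raw dot product prefers the louder block, the taxicab distance the exact one.
prefers_louder = dot_product(target, louder) > dot_product(target, exact)
prefers_exact = taxicab_distance(target, exact) < taxicab_distance(target, louder)
```

The two measures also point in opposite directions (maximize one, minimize the other), so a search loop written for one cannot be reused for the other without flipping the comparison.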
I don't know what the initial reasoning behind it was, but since it causes problems with >2 channels we need to change something about it. You're welcome to play around with the metric yourself to hear how the result sounds :)
That's much harder than making that change here.
> @na-na-hi That's not the same problem. They use resampling when the speed change is within a 0.95-1.06 range, but if you test the sample in the linked comment you'll notice that it sounds bad with speeds much bigger than that (the comment mentions 1.1x, but it's also very noticeable above that).

If chromium's implementation doesn't have this problem at 1.1x speed, doesn't this mean that the method used here isn't implemented correctly?
> scaletempo on master uses the dot product to determine similarity, whereas this uses the taxicab distance.
> scaletempo also uses a windowing function (no idea if it has a name, but it creates a hill-like curve) while looking for the best overlap position and then uses a different window (linear) when doing the overlap. scaletempo2 uses no window when looking for the best overlap position and a Hann function when doing the overlap.

But as mentioned in #12487, you were able to get scaletempo to sound nearly the same as scaletempo2 just by changing the window function to match the latter's. And after both methods are modified to use the same similarity algorithm and search method (which brings the performance to a similar level), it raises the question of what difference remains between the two algorithms. It seems to me that the only real difference is the energy calculation here.
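As background for the window discussion, here is a minimal Python sketch of a Hann-windowed overlap-add. It assumes the periodic form of the Hann window and a hop of half the block length, neither of which is confirmed by the thread, and the names `hann` and `overlap_add` are illustrative rather than taken from either filter.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]


def overlap_add(blocks, hop):
    """Window each equal-length block with Hann and sum them `hop` samples apart."""
    size = len(blocks[0])
    window = hann(size)
    out = [0.0] * (hop * (len(blocks) - 1) + size)
    for b, block in enumerate(blocks):
        for i, sample in enumerate(block):
            out[b * hop + i] += window[i] * sample
    return out
```

With a hop of half the block length, the falling half of one window and the rising half of the next sum to one, so a constant input passes through the overlapped region unchanged. A linear crossfade has the same property; the two differ in how smoothly the fade starts and ends.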
> `scaletempo` also supports 16-bit integer, while `scaletempo2` only supports float, so integer signals need to be converted to float first. Idk how much of a difference that makes, but I don't expect it to be audible.

Agreed.
> I don't know what the initial reasoning behind it was, but since it causes problems with >2 channels we need to change something about it. You're welcome to play around with the metric yourself to hear how the result sounds :)

The per-sample correlation methods used here work well to preserve periodic patterns, but not for transients. The energy calculation here could be serving a purpose for transient preservation. The problem is that the similarity metric used here treats all channels equally, so it's natural that this scheme breaks with multichannel audio when the correlation between some channels is poor.
A better way to do this is to group channels by correlation and process each group individually. For example, in a 5.1 setup there will be 4 groups (FL/FR, CNT, SL/SR, LFE). Since audio from different groups doesn't have good correlation in the first place, using different offset values should not be a problem.
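The grouping idea could look roughly like the Python sketch below. Everything in it is hypothetical: the channel index order in `GROUPS_5_1`, the use of a plain dot product as the score, and the function names are assumptions for illustration, not an existing mpv API.

```python
# Hypothetical 5.1 layout; the channel index order is an assumption of this sketch.
GROUPS_5_1 = {
    "front": [0, 1],     # FL, FR
    "center": [2],       # CNT
    "lfe": [3],          # LFE
    "surround": [4, 5],  # SL, SR
}


def correlation(target, candidate):
    """Plain dot product over the given channels (stand-in similarity score)."""
    return sum(
        t * c
        for t_ch, c_ch in zip(target, candidate)
        for t, c in zip(t_ch, c_ch)
    )


def best_offset_per_group(target, search, block_len, groups):
    """Search for the best overlap offset separately for each channel group."""
    num_candidates = len(search[0]) - block_len + 1
    offsets = {}
    for name, channels in groups.items():
        group_target = [target[ch] for ch in channels]

        def score(off):
            candidate = [search[ch][off:off + block_len] for ch in channels]
            return correlation(group_target, candidate)

        # max() keeps the first offset on ties, so silent groups get offset 0.
        offsets[name] = max(range(num_candidates), key=score)
    return offsets
```

Each group ends up with its own offset, so a quiet surround pair no longer has to agree with a loud front pair on where to overlap.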
I wonder if this energy removal change will cause some quality degradation for transients. If this change causes quality regression for stereo sources, it needs to be reworked.
> That's much harder than making that change here.

If you think #12487 fixes the scaletempo performance and quality problem at high speed factors, I'd suggest to merge that PR and make scaletempo the default in the meantime. Then we can discuss the "proper" way to fix scaletempo2.
Does it not have that problem?
I don't know where you got this from; maybe that was the case for some sample I used back then and I wrote something in IRC? But the PR description states "The correlation change made a big difference.", and I've now tried that PR with only the window function changes and it sounds much worse than with the correlation change.
That sounds like a good idea, but I don't think that I will be the one implementing this, because it already sounds pretty good to me with this PR and I doubt that it can get noticeably better. I'd love to be proven wrong though :P
So far I haven't noticed any problems, but if you can find any examples where that's a problem, then those will be good test cases for future changes.
We could also get this in, as it doesn't cause any regressions afaik (except a bit for performance). It's not like that will lock scaletempo2 from getting any changes in the future. I'd prefer to get the scaletempo2 change in because, as was pointed out in #12487, it handles speed changes better than scaletempo.
I think there could still be a less radical way to fix this. Here's a simple diff that sounds very similar to this patch (only first impressions so far; tried around 1.10x-1.50x speed).
I simply squared the dot products before they are added, which makes louder channels more impactful in the measure. And here's another where I just removed the divide-by-sqrt part, sounding very similar as well:
I only tested with the sample provided on the issue, but will try other things soon. I'm sure these aren't mathematically perfect, but the point is that I think a simpler fix can be found in this part of the similarity measure.

Update: I've done some testing with these two changes and the immediate impression was not so good. There's a much more audible attack during dialog, which sounds pretty bad. Maybe someone else can find a way to change this formula that works better? I can confirm though that the changes from the PR do sound pretty good in the media I've tried so far, including the problematic sample. The slight performance decrease is unfortunate though.
I also suspected that the Chromium implementation has the same problem, but that they decided that using resampling for small speed changes is a reasonable tradeoff and that the artifact at 1.1x speed is acceptable. This turns out to be the case.
Maybe I misunderstood, but my point still stands: after this and that PR, the two algorithms appear to have no meaningful difference.
My suspicion is correct: this PR does degrade transient detection. As proof, here is the following stereo sample played at 0.5x speed:

Master: master.webm
This PR: PR.webm
Original sample: test.webm

The degraded transient matching performance for this PR is very noticeable for the square wave blips at the start.
That file is great for testing transients.
Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's target block energy is factored into the usual similarity measure. To prevent bias towards louder blocks, the result is divided by the total energy across all channels. This should have very little effect on very correlated channels (such as most stereo media), as the division by total energy reverses the effect of the channel-wise factorization if all channels have similar energy. See-Also: mpv-player#8705 See-Also: mpv-player#13737
Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's target block energy is factored into the usual similarity measure. This should have very little effect on very correlated channels (such as most stereo media), as the factors are very similar for all channels. See-Also: mpv-player#8705 See-Also: mpv-player#13737
Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's target block energy is factored into the usual similarity measure. This should have little effect on very correlated channels (such as most stereo media), where the factors are very similar for all channels. See-Also: mpv-player#8705 See-Also: mpv-player#13737
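The weighting described in these commit messages can be sketched in Python as follows. The thread does not spell out the "usual similarity measure", so this sketch assumes a per-channel normalized correlation as a stand-in; the function names and the `normalize` flag (which models the variant that divides by total energy) are likewise illustrative, not the filter's actual code.

```python
import math


def energy(block):
    """Sum of squared samples of one channel."""
    return sum(x * x for x in block)


def channel_similarity(target, candidate):
    """Normalized correlation of one channel (assumed stand-in measure)."""
    denom = math.sqrt(energy(target) * energy(candidate))
    if denom == 0.0:
        return 0.0
    return sum(t * c for t, c in zip(target, candidate)) / denom


def weighted_similarity(target, candidate, normalize=False):
    """Per-channel similarity weighted by each channel's target block energy."""
    weights = [energy(ch) for ch in target]
    score = sum(
        w * channel_similarity(t, c)
        for w, t, c in zip(weights, target, candidate)
    )
    if normalize:
        # Variant that divides by the total target energy across channels.
        total = sum(weights)
        return score / total if total > 0.0 else 0.0
    return score
```

With one loud channel that matches and two quiet channels that do not, an unweighted sum of the per-channel values comes out negative, while the weighted score stays close to what the loud channel alone would give. When all channels carry similar energy the weights are nearly equal, which is why little change is expected for typical stereo material.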
Closing in favor of #13748