Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

af_scaletempo2: improve signal similarity metric #13737

Conversation

christoph-heinrich
Copy link
Contributor

@christoph-heinrich christoph-heinrich commented Mar 20, 2024

Copy link

github-actions bot commented Mar 20, 2024

Download the artifacts for this pull request:

Windows
macOS

@christoph-heinrich christoph-heinrich force-pushed the scaletempo2_remove_energy branch from d33a571 to 304087a Compare March 20, 2024 23:38
@christoph-heinrich christoph-heinrich changed the title scaletempo2: remove signal energy af_scaletempo2: remove signal energy Mar 20, 2024
@christoph-heinrich christoph-heinrich marked this pull request as draft March 21, 2024 02:26
@christoph-heinrich christoph-heinrich force-pushed the scaletempo2_remove_energy branch from 304087a to 672d058 Compare March 21, 2024 05:12
@christoph-heinrich christoph-heinrich changed the title af_scaletempo2: remove signal energy af_scaletempo2: replace similarity formula with taxicab distance Mar 21, 2024
@christoph-heinrich christoph-heinrich marked this pull request as ready for review March 21, 2024 05:14
@christoph-heinrich christoph-heinrich changed the title af_scaletempo2: replace similarity formula with taxicab distance af_scaletempo2: improve signal similarity metric Mar 21, 2024
@christoph-heinrich christoph-heinrich force-pushed the scaletempo2_remove_energy branch from 672d058 to f7c69d0 Compare March 21, 2024 05:18
@christoph-heinrich christoph-heinrich force-pushed the scaletempo2_remove_energy branch from f7c69d0 to ff9c54f Compare March 21, 2024 06:01
The old formula worked well for stereo, but the results got worse with
increased channel count.

The taxicab distance works just as well for stereo, while not falling
appart as the channel count grows.

The downside is increased CPU usage. Maybe someone can try and vectorize
this one like the old one was. The performance still isn't bad, so there
is no pressing need for it.

Fixes mpv-player#8705 (comment)
@christoph-heinrich christoph-heinrich force-pushed the scaletempo2_remove_energy branch from ff9c54f to 69b2759 Compare March 21, 2024 13:34
@na-na-hi
Copy link
Contributor

na-na-hi commented Mar 21, 2024

I'm not sure this is the right approach to take. Chromium apprently was aware of this issue and chose to use resampling for small speed changes.

What's the difference between this and af_scaletempo after the energy calculation is removed? Isn't the whole point of that energy stuff for transient detection and preservation? If the current algorithm is flawed, we should implement a better algorithm instead, not this workaround.

@richardpl
Copy link
Contributor

iiuc you can get better results if you split audio into several bands and extract granules (usually zero-crossings) in each band, and than stretch/multiply each granule, but i think this will mess with stereo image in stereo and >2 ch audio.
Using resampling will change booth pitch and speed.

@christoph-heinrich
Copy link
Contributor Author

Chromium apprently was aware of this issue and chose to use resampling for small speed changes.

@na-na-hi That's not the same problem. They use resampling when the speed change is within a 0.95-1.06 range, but if you test the sample in the linked comment you'll notice that it sounds bad with speeds much bigger then that (the comment mentions 1.1x, but it's also very noticeable above that).

What's the difference between this and af_scaletempo after the energy calculation is removed?

scaletempo on master uses the dot product to determine similarity, where as this uses the taxicab distance.
scaletempo also uses a windowing function (no idea if it has a name, but it creates a hill like curve) while looking for the best overlap position and then uses a different window (linear) when doing the overlap. scaletempo2 uses no window when looking for the best overlap position and a Hann function when doing the overlap.

Their implementations are also very different in how they deal with resets and speed changes, but I don't understand those parts well enough for both filters to explain how their different or if they end up behaving in the exact same way.

The CPU usage is also very different. I haven't done any benchmarks of the filters themselves, but I've kept track of how the CPU usage changed while implementing my changes by playing a section of a specific flac at 2.5x speed with the same parameters by using gnu time

version CPU %
scaletempo master 54
scaletempo #12487 23
scaletempo2 master 10
scaletempo2 with this PR 16

scaletempo on master uses so much more because it doesn't use a decimated search, and if I change the decimation from 3 to 5 in my PR the CPU usage also goes down to 16%, same as scaletempo2 (which uses 5).

scaletempo also supports 16bit integer while scaletempo2 only supports float and integer signals need to be converted to float first. Idk how much of a difference that makes, but I don't expect it to be audible.

Isn't the whole point of that energy stuff for transient detection and preservation?

I don't know what the initial reasoning behind it was, but since it causes problems with >2 channels we need to change something about it. You're welcome to play around with the metric yourself to how the result sounds :)

If the current algorithm is flawed, we should implement a better algorithm instead, not this workaround.

That's much harder then making that change here.

@na-na-hi
Copy link
Contributor

na-na-hi commented Mar 21, 2024 via email

@christoph-heinrich
Copy link
Contributor Author

If chromium's implementation doesn't have this problem at 1.1x speed, doesn't this mean that the method used here isn't implemented correctly?

Does it not have that problem?
Yesterday I've converted the sample from that comment to flac so that it plays in chromium and I've compared it to scaletempo2 with the same parameters as they use and it sounded identical. I did this after hours of looking for a bug in our scaletempo2, just to find out that up-to-date chromium sounds the same.

But as mentioned in #12487, you were able to get scaletempo to sound nearly the same as scaletempo2 just by changing the window function to match the latter's.

I don't know where you got this from, maybe that was the case for some sample I used back then and I wrote something in IRC? But the PR description states "The correlation change made a big difference.", and I've now tried that PR with only the window function changes and it sounds much worse then with the correlation change.

A better way to do this is to process group channels by correlation, and process each group individually. For example in a 5.1 setup there will be 4 groups (FL/FR, CNT, SL/SR, LFE). Since audio from different groups don't have good correlation in the first place, using different offset values should not be a problem.

That sounds like a good idea, but I don't think that I will be the one implementing this, because it already sounds pretty good to me with this PR and I doubt that it can get noticeably better. I'd love to be proven wrong though :P

I wonder if this energy removal change will cause some quality degradation for transients. If this change causes quality regression for stereo sources, it needs to be reworked.

So far I haven't noticed any problems, but if you can find any examples where that's a problem, then those will be good test cases for future changes.

If you think #12487 fixes the scaletempo performance and quality problem at high speed factors, I'd suggest to merge that PR and make scaletempo the default in the meantime. Then we can discuss the "proper" way to fix scaletempo2.

We could also get this in as it doesn't cause any regressions afaik (except a bit for performance). It's not like that will lock scaletempo2 from getting any changes in the future.

I'd prefer to get the scaletempo2 change in because as was pointed out in #12487 it handles speed changes better then scaletempo.

@ferreum
Copy link
Contributor

ferreum commented Mar 21, 2024

I think there could still be a less radical way to fix this.
Following my hunch I described in my recent comment on the issue, I tried to make louder channels more significant in the measure.

Here's a simple diff that sounds very similar to this patch (only first impressions so far; tried around 1.10x-1.50x speed).

diff --git audio/filter/af_scaletempo2_internals.c audio/filter/af_scaletempo2_internals.c
index 534f4f672a..ce35510033 100644
--- audio/filter/af_scaletempo2_internals.c
+++ audio/filter/af_scaletempo2_internals.c
@@ -100,7 +100,7 @@ static float multi_channel_similarity_measure(
     const float epsilon = 1e-12f;
     float similarity_measure = 0.0f;
     for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n]
+        similarity_measure += dot_prod_a_b[n] * dot_prod_a_b[n]
             / sqrtf(energy_a[n] * energy_b[n] + epsilon);
     }
     return similarity_measure;

I simply squared the dot products before they are added, which makes louder channels more impactful in the measure.

And here's another where I just removed the divide-by-sqrt part, sounding very similar as well:

diff --git audio/filter/af_scaletempo2_internals.c audio/filter/af_scaletempo2_internals.c
index 534f4f672a..ee78940ba1 100644
--- audio/filter/af_scaletempo2_internals.c
+++ audio/filter/af_scaletempo2_internals.c
@@ -100,8 +100,7 @@ static float multi_channel_similarity_measure(
     const float epsilon = 1e-12f;
     float similarity_measure = 0.0f;
     for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n]
-            / sqrtf(energy_a[n] * energy_b[n] + epsilon);
+        similarity_measure += dot_prod_a_b[n];
     }
     return similarity_measure;
 }

I only tested with the sample provided on the issue, but will try other things soon.

I'm sure these aren't mathematically perfect, but the point is that I think a simpler fix can be found in this part of the similarity measure.

Update: I've done some testing with these two changes and the immediate impression was not so good. There's a much more (audible) attack during dialog, which sounds pretty bad. Maybe someone else can find a way to change this formula that works better?

I can confirm though that the changes from the PR do sound pretty good in the media I've tried so far, including the problematic sample. The slight performance decrease is unfortunate though.

@na-na-hi
Copy link
Contributor

na-na-hi commented Mar 21, 2024

Does it not have that problem?

I also suspected that the chromium implementation has the same problem but they decided that use resampling for small speed changes is a reasonable tradeoff, and the artifact at 1.1x speed is acceptable. This turns out to be the case.

But the PR description states "The correlation change made a big difference.", and I've now tried that PR with only the window function changes and it sounds much worse then with the correlation change.

Maybe I misunderstood but my point still stands that after this and that PR, the two algorithms appear to have no meaningful difference.

So far I haven't noticed any problems, but if you can find any examples where that's a problem, then those will be good test cases for future changes.

My suspicion is correct: this PR does degrade transient detection. Proof of the following stereo sample played at 0.5x speed:

Master:

master.webm

This PR:

PR.webm

Original sample:

test.webm

The degraded transient matching performance for this PR is very noticeable for the square wave blips at the start.

@christoph-heinrich
Copy link
Contributor Author

My suspicion is correct: this PR does degrade transient detection. Proof of the following stereo sample played at 0.5x speed:

That file is great for testing transients.
Everything I try that improves how it sounds in general ends up degrading transients 😐

ferreum added a commit to ferreum/mpv that referenced this pull request Mar 22, 2024
Playback with many audio channels could be distorted when using
scaletempo2. This was most noticeable when there were a lot of quiet
channels and few louder channels.

Fix this by increasing the weight of louder channels in relation to
quieter channels. Each channel's the target block energy is factored
into the usual similarity measure. To prevent bias towards louder
blocks, the result is divided by the total energy across all channels.

This should have very little effect on very correlated channels (such as
most stereo media), as the division by total energy reverses the effect
of the channel-wise factorization if all channels have similar energy.

See-Also: mpv-player#8705
See-Also: mpv-player#13737
ferreum added a commit to ferreum/mpv that referenced this pull request Mar 22, 2024
Playback with many audio channels could be distorted when using
scaletempo2. This was most noticeable when there were a lot of quiet
channels and few louder channels.

Fix this by increasing the weight of louder channels in relation to
quieter channels. Each channel's target block energy is factored into
the usual similarity measure.

This should have very little effect on very correlated channels (such as
most stereo media), as the factors are very similar for all channels.

See-Also: mpv-player#8705
See-Also: mpv-player#13737
ferreum added a commit to ferreum/mpv that referenced this pull request Mar 22, 2024
Playback with many audio channels could be distorted when using
scaletempo2. This was most noticeable when there were a lot of quiet
channels and few louder channels.

Fix this by increasing the weight of louder channels in relation to
quieter channels. Each channel's target block energy is factored into
the usual similarity measure.

This should have little effect on very correlated channels (such as most
stereo media), where the factors are very similar for all channels.

See-Also: mpv-player#8705
See-Also: mpv-player#13737
@christoph-heinrich
Copy link
Contributor Author

Closing in favor of #13748

Dudemanguy pushed a commit that referenced this pull request Apr 12, 2024
Playback with many audio channels could be distorted when using
scaletempo2. This was most noticeable when there were a lot of quiet
channels and few louder channels.

Fix this by increasing the weight of louder channels in relation to
quieter channels. Each channel's target block energy is factored into
the usual similarity measure.

This should have little effect on very correlated channels (such as most
stereo media), where the factors are very similar for all channels.

See-Also: #8705
See-Also: #13737
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants