af_scaletempo2: improve signal similarity metric #13737

christoph-heinrich · 2024-03-20T22:30:35Z

github-actions · 2024-03-20T22:42:59Z

Download the artifacts for this pull request:

Windows

macOS

audio/filter/af_scaletempo2_internals.c

The old formula worked well for stereo, but the results got worse with increased channel count. The taxicab distance works just as well for stereo, while not falling appart as the channel count grows. The downside is increased CPU usage. Maybe someone can try and vectorize this one like the old one was. The performance still isn't bad, so there is no pressing need for it. Fixes mpv-player#8705 (comment)

na-na-hi · 2024-03-21T14:59:00Z

I'm not sure this is the right approach to take. Chromium apprently was aware of this issue and chose to use resampling for small speed changes.

What's the difference between this and af_scaletempo after the energy calculation is removed? Isn't the whole point of that energy stuff for transient detection and preservation? If the current algorithm is flawed, we should implement a better algorithm instead, not this workaround.

richardpl · 2024-03-21T15:01:59Z

iiuc you can get better results if you split audio into several bands and extract granules (usually zero-crossings) in each band, and than stretch/multiply each granule, but i think this will mess with stereo image in stereo and >2 ch audio.
Using resampling will change booth pitch and speed.

christoph-heinrich · 2024-03-21T16:35:04Z

Chromium apprently was aware of this issue and chose to use resampling for small speed changes.

@na-na-hi That's not the same problem. They use resampling when the speed change is within a 0.95-1.06 range, but if you test the sample in the linked comment you'll notice that it sounds bad with speeds much bigger then that (the comment mentions 1.1x, but it's also very noticeable above that).

What's the difference between this and af_scaletempo after the energy calculation is removed?

scaletempo on master uses the dot product to determine similarity, where as this uses the taxicab distance.
scaletempo also uses a windowing function (no idea if it has a name, but it creates a hill like curve) while looking for the best overlap position and then uses a different window (linear) when doing the overlap. scaletempo2 uses no window when looking for the best overlap position and a Hann function when doing the overlap.

Their implementations are also very different in how they deal with resets and speed changes, but I don't understand those parts well enough for both filters to explain how their different or if they end up behaving in the exact same way.

The CPU usage is also very different. I haven't done any benchmarks of the filters themselves, but I've kept track of how the CPU usage changed while implementing my changes by playing a section of a specific flac at 2.5x speed with the same parameters by using gnu time

version	CPU %
scaletempo master	54
scaletempo #12487	23
scaletempo2 master	10
scaletempo2 with this PR	16

scaletempo on master uses so much more because it doesn't use a decimated search, and if I change the decimation from 3 to 5 in my PR the CPU usage also goes down to 16%, same as scaletempo2 (which uses 5).

scaletempo also supports 16bit integer while scaletempo2 only supports float and integer signals need to be converted to float first. Idk how much of a difference that makes, but I don't expect it to be audible.

Isn't the whole point of that energy stuff for transient detection and preservation?

I don't know what the initial reasoning behind it was, but since it causes problems with >2 channels we need to change something about it. You're welcome to play around with the metric yourself to how the result sounds :)

If the current algorithm is flawed, we should implement a better algorithm instead, not this workaround.

That's much harder then making that change here.

na-na-hi · 2024-03-21T17:45:59Z

@na-na-hi That's not the same problem. They use resampling when the speed change is within a 0.95-1.06 range, but if you test the sample in the linked comment you'll notice that it sounds bad with speeds much bigger then that (the comment mentions 1.1x, but it's also very noticeable above that).

If chromium's implementation doesn't have this problem at 1.1x speed, doesn't this mean that the method used here isn't implemented correctly?

scaletempo on master uses the dot product to determine similarity, where as this uses the taxicab distance. scaletempo also uses a windowing function (no idea if it has a name, but it creates a hill like curve) while looking for the best overlap position and then uses a different window (linear) when doing the overlap. scaletempo2 uses no window when looking for the best overlap position and a Hann function when doing the overlap.

But as mentioned in #12487, you were able to get scaletempo to sound nearly the same as scaletempo2 just by changing the window function to match the latter's. And after both methods are modified to use the same similarity algorithm and search method (which brings the performance to similar level), it raises the question of the remaining difference of the two algorithms. It seems to me that the only real difference is the energy calculation here.

`scaletempo` also supports 16bit integer while `scaletempo2` only supports float and integer signals need to be converted to float first. Idk how much of a difference that makes, but I don't expect it to be audible.

Agreed.

I don't know what the initial reasoning behind it was, but since it causes problems with >2 channels we need to change something about it. You're welcome to play around with the metric yourself to how the result sounds :)

The per-sample correlation methods used here work well to preserve periodic patterns, but not for transients. The energy calculation here could be serving a purpose for transient preservation. The problem is that the similarity metric used here treats all channels equally, so it's natural that this scheme breaks with multi channel audio when correlation between some channels are poor. A better way to do this is to process group channels by correlation, and process each group individually. For example in a 5.1 setup there will be 4 groups (FL/FR, CNT, SL/SR, LFE). Since audio from different groups don't have good correlation in the first place, using different offset values should not be a problem. I wonder if this energy removal change will cause some quality degradation for transients. If this change causes quality regression for stereo sources, it needs to be reworked.

That's much harder then making that change here.

If you think #12487 fixes the scaletempo performance and quality problem at high speed factors, I'd suggest to merge that PR and make scaletempo the default in the meantime. Then we can discuss the "proper" way to fix scaletempo2.

christoph-heinrich · 2024-03-21T18:29:25Z

If chromium's implementation doesn't have this problem at 1.1x speed, doesn't this mean that the method used here isn't implemented correctly?

Does it not have that problem?
Yesterday I've converted the sample from that comment to flac so that it plays in chromium and I've compared it to scaletempo2 with the same parameters as they use and it sounded identical. I did this after hours of looking for a bug in our scaletempo2, just to find out that up-to-date chromium sounds the same.

But as mentioned in #12487, you were able to get scaletempo to sound nearly the same as scaletempo2 just by changing the window function to match the latter's.

I don't know where you got this from, maybe that was the case for some sample I used back then and I wrote something in IRC? But the PR description states "The correlation change made a big difference.", and I've now tried that PR with only the window function changes and it sounds much worse then with the correlation change.

A better way to do this is to process group channels by correlation, and process each group individually. For example in a 5.1 setup there will be 4 groups (FL/FR, CNT, SL/SR, LFE). Since audio from different groups don't have good correlation in the first place, using different offset values should not be a problem.

That sounds like a good idea, but I don't think that I will be the one implementing this, because it already sounds pretty good to me with this PR and I doubt that it can get noticeably better. I'd love to be proven wrong though :P

I wonder if this energy removal change will cause some quality degradation for transients. If this change causes quality regression for stereo sources, it needs to be reworked.

So far I haven't noticed any problems, but if you can find any examples where that's a problem, then those will be good test cases for future changes.

If you think #12487 fixes the scaletempo performance and quality problem at high speed factors, I'd suggest to merge that PR and make scaletempo the default in the meantime. Then we can discuss the "proper" way to fix scaletempo2.

We could also get this in as it doesn't cause any regressions afaik (except a bit for performance). It's not like that will lock scaletempo2 from getting any changes in the future.

I'd prefer to get the scaletempo2 change in because as was pointed out in #12487 it handles speed changes better then scaletempo.

ferreum · 2024-03-21T19:27:12Z

I think there could still be a less radical way to fix this.
Following my hunch I described in my recent comment on the issue, I tried to make louder channels more significant in the measure.

Here's a simple diff that sounds very similar to this patch (only first impressions so far; tried around 1.10x-1.50x speed).

diff --git audio/filter/af_scaletempo2_internals.c audio/filter/af_scaletempo2_internals.c
index 534f4f672a..ce35510033 100644
--- audio/filter/af_scaletempo2_internals.c
+++ audio/filter/af_scaletempo2_internals.c
@@ -100,7 +100,7 @@ static float multi_channel_similarity_measure(
     const float epsilon = 1e-12f;
     float similarity_measure = 0.0f;
     for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n]
+        similarity_measure += dot_prod_a_b[n] * dot_prod_a_b[n]
             / sqrtf(energy_a[n] * energy_b[n] + epsilon);
     }
     return similarity_measure;

I simply squared the dot products before they are added, which makes louder channels more impactful in the measure.

And here's another where I just removed the divide-by-sqrt part, sounding very similar as well:

diff --git audio/filter/af_scaletempo2_internals.c audio/filter/af_scaletempo2_internals.c
index 534f4f672a..ee78940ba1 100644
--- audio/filter/af_scaletempo2_internals.c
+++ audio/filter/af_scaletempo2_internals.c
@@ -100,8 +100,7 @@ static float multi_channel_similarity_measure(
     const float epsilon = 1e-12f;
     float similarity_measure = 0.0f;
     for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n]
-            / sqrtf(energy_a[n] * energy_b[n] + epsilon);
+        similarity_measure += dot_prod_a_b[n];
     }
     return similarity_measure;
 }

I only tested with the sample provided on the issue, but will try other things soon.

I'm sure these aren't mathematically perfect, but the point is that I think a simpler fix can be found in this part of the similarity measure.

Update: I've done some testing with these two changes and the immediate impression was not so good. There's a much more (audible) attack during dialog, which sounds pretty bad. Maybe someone else can find a way to change this formula that works better?

I can confirm though that the changes from the PR do sound pretty good in the media I've tried so far, including the problematic sample. The slight performance decrease is unfortunate though.

na-na-hi · 2024-03-21T23:54:53Z

Does it not have that problem?

I also suspected that the chromium implementation has the same problem but they decided that use resampling for small speed changes is a reasonable tradeoff, and the artifact at 1.1x speed is acceptable. This turns out to be the case.

But the PR description states "The correlation change made a big difference.", and I've now tried that PR with only the window function changes and it sounds much worse then with the correlation change.

Maybe I misunderstood but my point still stands that after this and that PR, the two algorithms appear to have no meaningful difference.

So far I haven't noticed any problems, but if you can find any examples where that's a problem, then those will be good test cases for future changes.

My suspicion is correct: this PR does degrade transient detection. Proof of the following stereo sample played at 0.5x speed:

Master:

master.webm

This PR:

PR.webm

Original sample:

test.webm

The degraded transient matching performance for this PR is very noticeable for the square wave blips at the start.

christoph-heinrich · 2024-03-22T12:50:10Z

My suspicion is correct: this PR does degrade transient detection. Proof of the following stereo sample played at 0.5x speed:

That file is great for testing transients.
Everything I try that improves how it sounds in general ends up degrading transients 😐

Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's the target block energy is factored into the usual similarity measure. To prevent bias towards louder blocks, the result is divided by the total energy across all channels. This should have very little effect on very correlated channels (such as most stereo media), as the division by total energy reverses the effect of the channel-wise factorization if all channels have similar energy. See-Also: mpv-player#8705 See-Also: mpv-player#13737

Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's target block energy is factored into the usual similarity measure. This should have very little effect on very correlated channels (such as most stereo media), as the factors are very similar for all channels. See-Also: mpv-player#8705 See-Also: mpv-player#13737

Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's target block energy is factored into the usual similarity measure. This should have little effect on very correlated channels (such as most stereo media), where the factors are very similar for all channels. See-Also: mpv-player#8705 See-Also: mpv-player#13737

christoph-heinrich · 2024-03-26T18:58:05Z

Closing in favor of #13748

Playback with many audio channels could be distorted when using scaletempo2. This was most noticeable when there were a lot of quiet channels and few louder channels. Fix this by increasing the weight of louder channels in relation to quieter channels. Each channel's target block energy is factored into the usual similarity measure. This should have little effect on very correlated channels (such as most stereo media), where the factors are very similar for all channels. See-Also: #8705 See-Also: #13737

christoph-heinrich force-pushed the scaletempo2_remove_energy branch from d33a571 to 304087a Compare March 20, 2024 23:38

christoph-heinrich changed the title ~~scaletempo2: remove signal energy~~ af_scaletempo2: remove signal energy Mar 20, 2024

christoph-heinrich marked this pull request as draft March 21, 2024 02:26

christoph-heinrich force-pushed the scaletempo2_remove_energy branch from 304087a to 672d058 Compare March 21, 2024 05:12

christoph-heinrich changed the title ~~af_scaletempo2: remove signal energy~~ af_scaletempo2: replace similarity formula with taxicab distance Mar 21, 2024

christoph-heinrich marked this pull request as ready for review March 21, 2024 05:14

christoph-heinrich changed the title ~~af_scaletempo2: replace similarity formula with taxicab distance~~ af_scaletempo2: improve signal similarity metric Mar 21, 2024

christoph-heinrich force-pushed the scaletempo2_remove_energy branch from 672d058 to f7c69d0 Compare March 21, 2024 05:18

kasper93 reviewed Mar 21, 2024

View reviewed changes

audio/filter/af_scaletempo2_internals.c Show resolved Hide resolved

christoph-heinrich force-pushed the scaletempo2_remove_energy branch from f7c69d0 to ff9c54f Compare March 21, 2024 06:01

christoph-heinrich force-pushed the scaletempo2_remove_energy branch from ff9c54f to 69b2759 Compare March 21, 2024 13:34

ferreum mentioned this pull request Mar 22, 2024

af_scaletempo2: prioritize louder channels for similarity measure #13748

Merged

christoph-heinrich closed this Mar 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

af_scaletempo2: improve signal similarity metric #13737

af_scaletempo2: improve signal similarity metric #13737

christoph-heinrich commented Mar 20, 2024 •

edited

Loading

github-actions bot commented Mar 20, 2024 •

edited

Loading

na-na-hi commented Mar 21, 2024 •

edited

Loading

richardpl commented Mar 21, 2024

christoph-heinrich commented Mar 21, 2024

na-na-hi commented Mar 21, 2024 via email

christoph-heinrich commented Mar 21, 2024

ferreum commented Mar 21, 2024 •

edited

Loading

na-na-hi commented Mar 21, 2024 •

edited

Loading

christoph-heinrich commented Mar 22, 2024

christoph-heinrich commented Mar 26, 2024

af_scaletempo2: improve signal similarity metric #13737

af_scaletempo2: improve signal similarity metric #13737

Conversation

christoph-heinrich commented Mar 20, 2024 • edited Loading

github-actions bot commented Mar 20, 2024 • edited Loading

na-na-hi commented Mar 21, 2024 • edited Loading

richardpl commented Mar 21, 2024

christoph-heinrich commented Mar 21, 2024

na-na-hi commented Mar 21, 2024 via email

christoph-heinrich commented Mar 21, 2024

ferreum commented Mar 21, 2024 • edited Loading

na-na-hi commented Mar 21, 2024 • edited Loading

christoph-heinrich commented Mar 22, 2024

christoph-heinrich commented Mar 26, 2024

christoph-heinrich commented Mar 20, 2024 •

edited

Loading

github-actions bot commented Mar 20, 2024 •

edited

Loading

na-na-hi commented Mar 21, 2024 •

edited

Loading

ferreum commented Mar 21, 2024 •

edited

Loading

na-na-hi commented Mar 21, 2024 •

edited

Loading