Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

af_scaletempo2: prioritize louder channels for similarity measure #13748

Merged

Conversation

ferreum
Copy link
Contributor

@ferreum ferreum commented Mar 22, 2024

This is an attempt to fix #8705 and is the result to my ideas commented on #13737.

The idea here is to increase the weight of louder channels. Some details are in the commit message. Should have very little effect on stereo audio and none on mono.

I tested it with the sample from the issue, with the test file in the discussion, and in random other videos I've watched and my impression is pretty good so far.

Copy link

github-actions bot commented Mar 22, 2024

Download the artifacts for this pull request:

Windows
macOS

Copy link
Contributor

@christoph-heinrich christoph-heinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it a quick test and it sounds good so far 👍

I'll be using it in my mpv build.

@@ -99,11 +99,13 @@ static float multi_channel_similarity_measure(
{
const float epsilon = 1e-12f;
float similarity_measure = 0.0f;
float total_energy = epsilon;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing that shouldn't make any difference to the result.

The target block doesn't change with the offset, therefore this sum will always be the same for each iteration of the optimal index search, so all calculated similarities are changed by the same factor, thus not changing the location of the maximum.

Copy link
Contributor Author

@ferreum ferreum Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's true. I was thinking of the candidate blocks here. On the other hand, this relies on the fact that target block is always passed first, and this method currently doesn't say which one the target block is (though I did phrase my commit message that way). I'll remove it.

Copy link
Contributor Author

@ferreum ferreum Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about changing the parameters to make the intention apparent? I'm thinking of this

 static float multi_channel_similarity_measure(
-    const float* dot_prod_a_b,
-    const float* energy_a, const float* energy_b,
+    const float* dot_prod,
+    const float* energy_target, const float* energy_candidate,
     int channels)
 {
     const float epsilon = 1e-12f;
     float similarity_measure = 0.0f;
     for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n] * energy_a[n]
-            / sqrtf(energy_a[n] * energy_b[n] + epsilon);
+        similarity_measure += dot_prod[n] * energy_target[n]
+            / sqrtf(energy_target[n] * energy_candidate[n] + epsilon);
     }
     return similarity_measure;
 }

Copy link
Contributor

@christoph-heinrich christoph-heinrich Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the target and candidate aren't treated the same anymore in that function, it does make sense to give them more descriptive names. I'd be fine with those.

BTW if you start the block with ```diff it gets some colors.

@ferreum ferreum force-pushed the scaletempo2-similarity-measure branch from db72edb to 06c61d7 Compare March 22, 2024 16:07
Playback with many audio channels could be distorted when using
scaletempo2. This was most noticeable when there were a lot of quiet
channels and few louder channels.

Fix this by increasing the weight of louder channels in relation to
quieter channels. Each channel's target block energy is factored into
the usual similarity measure.

This should have little effect on very correlated channels (such as most
stereo media), where the factors are very similar for all channels.

See-Also: mpv-player#8705
See-Also: mpv-player#13737
@ferreum ferreum force-pushed the scaletempo2-similarity-measure branch from 06c61d7 to b4b1172 Compare March 22, 2024 16:23
@na-na-hi
Copy link
Contributor

I haven't found obvious regressions with a few samples so far, and the principle seems sound.

But this PR still doesn't eliminate some problems which scaletempo doesn't have. Here is a sample I have played at 1.1x speed, where scaletempo2 has some kind of pitch shifting artifact which scaletempo doesn't have:

This PR: (the same artifact also happens with master and #13737, and I think they sound worse than this PR)

PR13748.1.1.webm

scaletempo:

scaletempo.1.1.webm

Original sample:

sample.zip

Anyway, this sounds better than the status quo for a few samples I tested, so if more users can test this and also find a net improvement I think this can go through, even though this doesn't completely solve the problem.

Copy link
Contributor

@christoph-heinrich christoph-heinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why dot_prod(target, candidate) * sqrt(energy_target) / sqrt(energy_candidate) works, but it does and that's all that really matters.

@christoph-heinrich
Copy link
Contributor

But this PR still doesn't eliminate some problems which scaletempo doesn't have. Here is a sample I have played at 1.1x speed, where scaletempo2 has some kind of pitch shifting artifact which scaletempo doesn't have:

scaletempo uses very different default parameters to scaletempo2. If you use af=scaletempo2=search-interval=14:window-size=60 (which is kind of what scaletempo uses, except scaletempo2 has no way of replicating overlap=0.2) then there is no noticeable pitch shift anymore.

Whichever parameters you choose, it will always be a compromise in some regard.
Maybe dynamic parameters could work well for "all" situations, where search-interval and window-size change automatically based on playback speed.

@na-na-hi
Copy link
Contributor

I don't understand why dot_prod(target, candidate) * sqrt(energy_target) / sqrt(energy_candidate) works, but it does and that's all that really matters.

My understanding is that the original similarity measure being ununited means that the input signals are essentially normalized, so quiet channels get essentially boosted even though they shouldn't.

Whichever parameters you choose, it will always be a compromise in some regard.
Maybe dynamic parameters could work well for "all" situations, where search-interval and window-size change automatically based on playback speed.

True. The parameters can probably also be content-dependent too - lots of commercial DAW plugins use similar approaches to auto adapt parameters based on certain signal features.

@christoph-heinrich
Copy link
Contributor

My understanding is that the original similarity measure being ununited means that the input signals are essentially normalized, so quiet channels get essentially boosted even though they shouldn't.

I should have been more specific. I do understand why this change in isolation is an improvement over the status quo, but I don't understand why multiplying the dot product with sqrt(energy_target) / sqrt(energy_candidate) makes it better.
How come that if the target is louder then the candidate the similarity gets raised over when they are the same loudness,
and then if the candidate is louder then the target it gets decreased.
How does that make the result better? Evidently it does, but it makes no sense to me.

@na-na-hi
Copy link
Contributor

How come that if the target is louder then the candidate the similarity gets raised over when they are the same loudness,
and then if the candidate is louder then the target it gets decreased.

Note that the candidate's signal level is also represented in the dot_prod term, so raising it shouldn't change the similarity figure by much because that's still normalized. The similarity can still be decreased if the candidate is significantly louder than the target, but this is probably desired to reduce possible mismatches of transients of different strength.

For the former point though, I suspect that the sqrt(energy_target) term can be dropped and it would be still better than the dot product alone. (untested intuition, probably wrong)

Copy link
Member

@Dudemanguy Dudemanguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use this extensively, but if everyone agrees this has better results, I don't see why not.

@Dudemanguy Dudemanguy merged commit 096d35d into mpv-player:master Apr 12, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scaletempo2, the new default for adjusting audio playback speed, sounds noticeably worse in some situations
4 participants