Mega-groups #306

whacklezz · 2022-03-04T14:58:04Z

Background
I have found that running the scan repeatedly with decreasing % similarity gives me good results: each pass chips away the very good matches and lets me flag false positives. Down in the mid-to-high 80s % similarity, I can still snag 30-40 groups of true duplicates, mostly with only 2 members each. However, there is always one "mega-group" with tens of thousands of members. This is due to how groups are merged: a sample joins an existing group if it is similar to any one member, so the group tends to snowball. It is for all intents and purposes impossible to visually compare all n^2 pairs, so I always end up minimizing the group and waiting until all the thumbnails have been processed (which eventually takes a good while because there are so many).
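To make the chaining effect concrete, here is a minimal sketch of "merge if similar to any one member" grouping, written as a plain union-find. It is not VDF's actual code; `merge_groups` and the sample pairs are made up for illustration.

```python
def merge_groups(pairs):
    """Group items by transitive closure over similar pairs (union-find).

    Hypothetical helper: it only illustrates the merge rule described above.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # one similar pair is enough to merge

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(g) for g in groups.values())


# "a" is only similar to "b", "b" to "c", "c" to "d",
# yet all four end up in a single group.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
groups = merge_groups(chain)
```

Here `groups` holds one group containing all four items, even though "a" and "d" were never compared as similar. Every extra borderline match at a lower threshold can glue two otherwise separate groups together in the same way.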

Describe the solution you'd like
We need a way to break up these colossal groups. Surely many members of the group are only similar to one or a few others. I have a feeling that these sets barely overlap, but the merging ends up daisy-chaining everything into an all-consuming black hole. I guess it could also be one or a few "super-matchers" which somehow end up reporting a high similarity to an overwhelming number of others?

It could actually be interesting to see all the elements on a weighted graph. There probably exists theory to break up, cluster, or untangle such sets.
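One way to explore both ideas (weak links and "super-matchers") is to treat the matches as a weighted graph, drop edges below a stricter threshold, set aside nodes with suspiciously many neighbours, and recompute the groups from what is left. The sketch below assumes the matches are available as `(a, b, similarity)` tuples; `split_without_hubs`, `min_similarity` and `max_degree` are hypothetical names, not VDF options.

```python
from collections import defaultdict


def split_without_hubs(edges, min_similarity, max_degree):
    """Return (hubs, components) for a weighted similarity graph.

    edges: iterable of (a, b, similarity) tuples (assumed input format).
    Nodes with more than max_degree neighbours are treated as hubs
    and left out when the connected components are computed.
    """
    neighbours = defaultdict(set)
    for a, b, sim in edges:
        if sim >= min_similarity:  # ignore weak links
            neighbours[a].add(b)
            neighbours[b].add(a)

    hubs = {n for n, ns in neighbours.items() if len(ns) > max_degree}

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in hubs or start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first walk that never passes through a hub
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in hubs and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return hubs, components


edges = [
    ("hub", "a", 0.86), ("hub", "b", 0.86),
    ("hub", "c", 0.87), ("hub", "d", 0.86),
    ("a", "b", 0.95), ("c", "d", 0.93),
]
hubs, components = split_without_hubs(edges, min_similarity=0.85, max_degree=3)
```

With the hub set aside, the single five-member group falls apart into the two real pairs. Items that were only attached to a hub come out as one-element components, which could be hidden or reviewed separately.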

Maybe the listed % similarity could be recomputed relative to the selected element, with the list sortable by it? Then at least we could expect any true matches to be close to the top.
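As a rough sketch of that re-ranking idea: assume each file has some fixed-length fingerprint (here an integer bit pattern compared by Hamming distance, which is an assumption and not necessarily how VDF measures similarity). The functions `similarity` and `rank_against` are hypothetical.

```python
def similarity(hash_a, hash_b, bits):
    """Fraction of matching bits between two integer fingerprints."""
    differing = bin(hash_a ^ hash_b).count("1")
    return 1.0 - differing / bits


def rank_against(selected, hashes, bits=64):
    """List the other members, most similar to `selected` first."""
    reference = hashes[selected]
    scored = [
        (name, similarity(reference, h, bits))
        for name, h in hashes.items()
        if name != selected
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 8-bit fingerprints, purely illustrative.
hashes = {"a": 0b11110000, "b": 0b11110001, "c": 0b00001111}
ranking = rank_against("a", hashes, bits=8)
```

Selecting "a" puts "b" (one differing bit) ahead of "c" (all bits differ), so a true match would surface at the top of the list regardless of how the group was assembled.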

Or run a separate scan on the group elements where they are broken up into distinct groups instead of merged into one. I guess both (x, y) and (y, x) would appear as groups though. It would of course mean potentially tens of thousands of groups instead of one group of tens of thousands.
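The (x, y) / (y, x) duplication could be avoided by storing each pair in a canonical order. A small sketch, assuming the matches are plain pairs of file names (`unique_pairs` is a hypothetical helper):

```python
def unique_pairs(pairs):
    """Keep one entry per unordered pair, preserving first-seen order."""
    seen = set()
    result = []
    for a, b in pairs:
        if a == b:
            continue  # a file trivially matches itself
        key = (a, b) if a <= b else (b, a)  # canonical order
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


pairs = unique_pairs([("x", "y"), ("y", "x"), ("y", "z")])
```

This leaves the group count at one per matching pair, which is still large but at least not doubled.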

Maybe we just need further options for filtering similarity to reduce the group sizes, e.g. not allowing files with very different aspect ratios (landscape/portrait) to be considered similar.
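Such an aspect-ratio filter could be a cheap check applied before two files are compared at all. A minimal sketch, where `similar_aspect` and the 20% tolerance are illustrative assumptions:

```python
def similar_aspect(size_a, size_b, max_relative_diff=0.2):
    """True if two (width, height) sizes have roughly the same aspect ratio."""
    (width_a, height_a), (width_b, height_b) = size_a, size_b
    ratio_a = width_a / height_a
    ratio_b = width_b / height_b
    return abs(ratio_a - ratio_b) / max(ratio_a, ratio_b) <= max_relative_diff


landscape_pair = similar_aspect((1920, 1080), (1280, 720))
mixed_pair = similar_aspect((1920, 1080), (1080, 1920))
```

Two 16:9 files pass the check, while a landscape/portrait pair is rejected before any similarity score is computed.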

Idk, I'm just spitballing here, but I think it's an issue that eventually needs addressing. I feel like there are a lot of dupes hiding somewhere inside that monster set :p


floydcg · 2022-03-09T13:43:59Z

I've noticed the same thing, and if you actually take the time to look through them, there are often matches, sometimes several of them. Perhaps also a flag or something, maybe that if the group is larger than (x) # of duplicates, ignore that dupe group?

0x90d · 2022-03-12T17:23:25Z

> maybe that if the group is larger than (x) # of duplicates, ignore that dupe group?

VDF doesn't know whether duplicates are false positives or true positives. There could be a group larger than X which only contains real duplicates. One way to reduce the number of false positives is to increase the thumbnail size. Suggestion #300 could also help to reduce the number of false positives even further.

0x90d · 2022-03-30T10:30:43Z

#300 has been added

jeffward01 · 2022-10-27T18:27:34Z

I have this issue too: there will be a mega-group with 3,000 matches. Within that group there are many sets of duplicates, but also many files that are unique.

0x90d added the discussion label Mar 12, 2022

0x90d closed this as completed Mar 30, 2022

whacklezz mentioned this issue Jul 14, 2023

New feature: Blacklisting the grouping of confirmed non-matching file pairs #438

Open
