Mega-groups #306

Closed
whacklezz opened this issue Mar 4, 2022 · 4 comments
Comments

@whacklezz

Background
I have found that running the scan repeatedly with decreasing % similarity gives me good results: each pass chips away the very good matches and lets me flag false positives. Down in the mid-to-high 80s % similarity, I can still snag 30-40 groups of true duplicates, mostly with only 2 members each. However, there is always one "mega-group" with tens of thousands of members. This is because a sample is merged into an existing group if it is similar to any one member of that group, so the group tends to snowball. Visually comparing all the members is for all intents and purposes impossible (the number of pairs grows as n^2), so I always end up minimizing the group and waiting until all the thumbnails have been processed, which eventually takes a good while because there are so many.
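To make the daisy-chaining concrete, here is a minimal sketch of "merge if similar to any one member" grouping. It is an illustration only, not VDF's actual code; `merge_groups` and the pair list are hypothetical names, and a union-find structure is assumed as the merging mechanism.

```python
# Hypothetical illustration of transitive group merging; not VDF's implementation.
def merge_groups(n_items, similar_pairs):
    """Put two items in the same group whenever any pair links their groups."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Only neighbours are similar (0~1, 1~2, 2~3), yet 0 and 3 end up together.
chain = [(0, 1), (1, 2), (2, 3)]
groups = merge_groups(5, chain)
print(groups)
```

In this toy case items 0 and 3 were never compared as similar, but they still share a group because a chain of pairwise matches connects them.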

Describe the solution you'd like
We need a way to break up these colossal groups. Surely many members of the group are only similar to one or a few others. I have a feeling that the individual sets of matches barely overlap, but they end up daisy-chaining everything into an all-consuming black hole. I guess it could also be one or a few "super-matchers" which somehow end up reporting a high similarity to an overwhelming number of others?

It could actually be interesting to see all the elements on a weighted graph. There probably exists theory to break up, cluster, or untangle such sets.
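One idea from clustering theory that might apply here is complete linkage: an item joins a subgroup only if it is similar to every member already in it, not just to one. Below is a rough greedy sketch under that assumption. The `split_group` function, the `similarity` callback, and the scores are all made up for illustration and are not part of VDF.

```python
# Hypothetical sketch: greedy complete-linkage split of one large group.
def split_group(members, similarity, threshold):
    """Split members so that every pair inside a subgroup meets the threshold."""
    subgroups = []
    for item in members:
        for sub in subgroups:
            # Join only if similar to ALL current members of this subgroup.
            if all(similarity(item, other) >= threshold for other in sub):
                sub.append(item)
                break
        else:
            subgroups.append([item])  # no subgroup fits, start a new one
    return subgroups


# Made-up pairwise scores for a chain 0~1~2~3; unlisted pairs score 0.5.
scores = {
    frozenset((0, 1)): 0.95,
    frozenset((1, 2)): 0.90,
    frozenset((2, 3)): 0.95,
}

def similarity(a, b):
    return scores.get(frozenset((a, b)), 0.5)

subgroups = split_group([0, 1, 2, 3], similarity, threshold=0.88)
print(subgroups)
```

The result depends on the order in which items are visited, so this is a heuristic rather than an optimal clustering, and it needs a similarity value for every pair it checks.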

Maybe it could be possible to get the listed % similarity to change based on the element selected, and let us sort the list? Then at least we could expect any true matches to be close to the top.
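As a sketch of what that re-sorting could look like, assuming some function can report the similarity between any two files in the group (`rank_by_selected`, `similarity`, and the file names below are hypothetical):

```python
# Hypothetical sketch: rank group members by similarity to the selected one.
def rank_by_selected(selected, members, similarity):
    """Return (member, score) pairs, best match to `selected` first."""
    scored = [(m, similarity(selected, m)) for m in members if m != selected]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores relative to "a.mp4"; unlisted pairs score 0.0.
scores = {
    ("a.mp4", "b.mp4"): 0.86,
    ("a.mp4", "c.mp4"): 0.99,
    ("a.mp4", "d.mp4"): 0.40,
}

def similarity(x, y):
    return scores.get((x, y), scores.get((y, x), 0.0))

ranking = rank_by_selected("a.mp4", ["a.mp4", "b.mp4", "c.mp4", "d.mp4"], similarity)
print(ranking)
```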

Or run a separate scan on the group elements where they are broken up into distinct groups instead of merged into one. I guess both (x, y) and (y, x) would appear as groups though. It would of course mean potentially tens of thousands of groups instead of one group of tens of thousands.
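The (x, y) / (y, x) duplication can be avoided by enumerating unordered pairs only. A small sketch, again with a hypothetical `similarity` callback and made-up scores rather than anything from VDF:

```python
# Hypothetical sketch: list each matching pair once, never as both (x, y) and (y, x).
from itertools import combinations


def pairwise_groups(members, similarity, threshold):
    """Return every unordered pair of members that meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(members, 2)  # yields each pair exactly once
        if similarity(a, b) >= threshold
    ]


# Made-up scores; unlisted pairs score 0.0.
scores = {frozenset(("x", "y")): 0.91, frozenset(("y", "z")): 0.70}

def similarity(a, b):
    return scores.get(frozenset((a, b)), 0.0)

pairs = pairwise_groups(["x", "y", "z"], similarity, threshold=0.85)
print(pairs)
```

This still costs n(n-1)/2 comparisons, which is 49,995,000 for a group of 10,000 members, so it only helps with the duplication, not with the sheer number of pairs.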

Maybe we just need further options for filtering similarity to reduce the group sizes, e.g. not allowing files with very different aspect ratios (landscape vs. portrait) to be considered similar.
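An aspect-ratio pre-filter could be as simple as the following sketch. The function name, the `(width, height)` tuples, and the 20% tolerance are assumptions chosen for illustration, not an existing VDF option.

```python
# Hypothetical sketch: reject pairs whose aspect ratios differ too much.
def similar_aspect_ratio(size_a, size_b, max_relative_diff=0.2):
    """size_a and size_b are (width, height) tuples in pixels."""
    ratio_a = size_a[0] / size_a[1]
    ratio_b = size_b[0] / size_b[1]
    # Relative difference, measured against the larger of the two ratios.
    return abs(ratio_a - ratio_b) / max(ratio_a, ratio_b) <= max_relative_diff


landscape_vs_portrait = similar_aspect_ratio((1920, 1080), (1080, 1920))
same_shape = similar_aspect_ratio((1920, 1080), (1280, 720))
slightly_cropped = similar_aspect_ratio((1920, 1080), (1920, 1040))
print(landscape_vs_portrait, same_shape, slightly_cropped)
```

Only pairs that pass this check would then go on to the more expensive similarity comparison.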

Idk, I'm just spitballing here, but I think it's an issue that eventually needs addressing. I feel like there are a lot of dupes hiding somewhere inside that monster set :p

@floydcg

floydcg commented Mar 9, 2022

I've noticed the same thing, and if you actually take the time to look through them, there are often matches, sometimes several. Perhaps there could also be a flag so that if a group is larger than (x) # of duplicates, that dupe group is ignored?

@0x90d
Owner

0x90d commented Mar 12, 2022

> maybe that if the group is larger than (x) # of duplicates, Ignore that dupe group?

VDF doesn't know whether duplicates are false positives or true positives. There could be a group larger than X which contains only real duplicates. One way to reduce the number of false positives is to increase the thumbnail size. The suggestion in #300 could also help to reduce the number of false positives even further.

@0x90d
Owner

0x90d commented Mar 30, 2022

#300 has been added

@0x90d 0x90d closed this as completed Mar 30, 2022
@jeffward01

jeffward01 commented Oct 27, 2022

I have this issue too: for example, there will be a mega-group with 3,000 matches. Within that group there are many sets of duplicates, but also many unique files.

4 participants