spam tracker 3000 #4
This will also be used to calculate the efficacy of spam fighting tools, such as this proposed tool: |
A quick and dirty way to check how many spam users are being generated. It is safe to assume that PL does not legitimately gain over a thousand real users a month.
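The original query is not preserved here. As a minimal sketch of a monthly signup count, assuming a Drupal-style `users` table whose `created` column holds a Unix timestamp (the table layout and sample rows below are illustrative only):

```python
import sqlite3
from datetime import datetime, timezone


def ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: only the columns the count needs.
conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT, created INTEGER)")
conn.executemany(
    "INSERT INTO users (name, created) VALUES (?, ?)",
    [("alice", ts(2013, 10, 3)), ("bob", ts(2013, 11, 9)), ("carol", ts(2013, 11, 21))],
)

# Bucket each account by the month it was created and count per bucket.
monthly_signups = conn.execute(
    """
    SELECT strftime('%Y-%m', created, 'unixepoch') AS month, COUNT(*) AS new_users
    FROM users
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for month, new_users in monthly_signups:
    print(month, new_users)
```

In MySQL the month bucket would be built with a different date function; the grouping idea is the same.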
|
Surprisingly, the spam users appear to connect more than once. Fewer than 22% of users logged in only once (or not at all), whereas I suspect spam users represent 80% or more of all users. 70% of users logged in only once or twice.
|
Given my assumption that real users are now an outlier, an arithmetic average should tell us something about spammer behavior. The average PublicLab.org user logs in 2.3 times.
|
Some spam usernames must get recycled or something. The list of people who have connected more than 50 times is mostly spam users! Out of curiosity, I checked to see how many are not yet banned. Only 1 was banned.
|
Fun way to grab a bunch of spam-like usernames. Usually the format is {noun}{noun}{number} or {noun}{number}{noun}. Conveniently, few if any legit users have numbers in their usernames. I dropped some banhammer on 50+ users (after reading through for any legit ones) using this WHERE clause:
Maybe I banned some valid users? They must have at least one non-numeric character, then precisely one or two numerals, and then maybe some more non-numeric characters afterwards. Unsure why these usernames are so formulaic, but it helps. |
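The WHERE clause itself is missing from this copy of the thread. The pattern described in words (at least one non-numeric character, then exactly one or two numerals, then optionally more non-numeric characters) can be sketched as a regular expression. This is a Python rendering of that description, not the original clause, and the sample usernames are made up:

```python
import re

# One or more non-digits, then exactly one or two digits, then zero or more non-digits.
FORMULAIC_USERNAME = re.compile(r"[^0-9]+[0-9]{1,2}[^0-9]*")


def looks_formulaic(username: str) -> bool:
    """True when the whole username fits the {noun}{number}[{noun}] shape."""
    return FORMULAIC_USERNAME.fullmatch(username) is not None


for name in ["cheaploans42", "shoes7outlet", "warren", "user2013", "42abc"]:
    print(name, looks_formulaic(name))
```

A name such as `user2013` is not matched because it has four consecutive numerals, which is one way a legitimate user with a year in their name would escape the filter. Any match still deserves a manual read-through before banning.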
pushed spam prevention to master. did not yet deploy. current spam stats:
|
Spam prevention is deployed. Halfway through the month, we have about 25,000 new users already. Given the usual ~50,000 per month, if this anti-spam measure isn't effective, we'll see 50,000 at the end of the month as usual. If this measure is effective, I want to see at least an order of magnitude drop in new users: 5,000 instead of 50,000. Since this is a half month, I want no more than 2,500 new users for the rest of the month (added to the 25,000 already created). So if it's effective, we'll have fewer than 28,000 new users in December. A little higher means it is worth keeping for another full month; a lot higher means it doesn't really help and should be removed. |
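The December target works out as follows, using only the round figures quoted in the comment (variable names are just for illustration):

```python
usual_monthly = 50_000      # typical new users per month during the attack
already_created = 25_000    # new users in the first half of December

target_monthly = usual_monthly // 10        # an order of magnitude lower: 5,000
target_rest_of_month = target_monthly // 2  # half a month remains: 2,500
expected_total = already_created + target_rest_of_month

print(expected_total)  # 27500, which is under the 28,000 cutoff
```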
Figured out how to check new users in the old users table:
As long as the users count is the same as the rusers count for any given month, the old site is not creating users. There is a little bit of strangeness, though: starting in about August the numbers track each other pretty closely, but the users table is always shy of the rusers table (it looks like it's off by no more than 20 users). This is odd because new users are created in both rusers and users, so the users table shouldn't be falling behind. Oh well, different problem and not worth investigating for now. |
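The comparison query is not preserved. A rough sketch of the month-by-month check, assuming both `users` and `rusers` carry a Unix-timestamp `created` column (the real column names and types may differ):

```python
import sqlite3
from datetime import datetime, timezone


def ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users  (uid INTEGER PRIMARY KEY, created INTEGER);
    CREATE TABLE rusers (id  INTEGER PRIMARY KEY, created INTEGER);
    """
)
conn.executemany("INSERT INTO rusers (created) VALUES (?)",
                 [(ts(2013, 8, 1),), (ts(2013, 8, 9),), (ts(2013, 8, 20),),
                  (ts(2013, 9, 2),), (ts(2013, 9, 15),)])
conn.executemany("INSERT INTO users (created) VALUES (?)",
                 [(ts(2013, 8, 1),), (ts(2013, 8, 9),),
                  (ts(2013, 9, 2),), (ts(2013, 9, 15),)])


def monthly_counts(conn, table):
    """Map 'YYYY-MM' to the number of accounts created that month."""
    if table not in ("users", "rusers"):  # table names cannot be bound as parameters
        raise ValueError(f"unexpected table: {table}")
    rows = conn.execute(
        f"SELECT strftime('%Y-%m', created, 'unixepoch') AS month, COUNT(*) "
        f"FROM {table} GROUP BY month"
    ).fetchall()
    return dict(rows)


new_site = monthly_counts(conn, "rusers")
old_site = monthly_counts(conn, "users")
# Positive gap: rusers has accounts for that month that users lacks.
gaps = {month: new_site[month] - old_site.get(month, 0) for month in sorted(new_site)}
print(gaps)
```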
Current status after adding some spam intervention:
Halfway through December we had 25564, and by the end we had only 25801 (237 users added). January is about done, and we've had 365 users. Soooooo that's WAY less than 50,000! Two orders of magnitude! We've had two or three emails asking about the signup page, so Jeff and I have enhanced the anti-spam bits to be a little easier to understand. |
Yay!!!!!!!!! |
Into the wee early days of March, we can see Jan and Feb are fairly consistent at ~500 new users per month.
|
web@ has been receiving emails from various companies complaining of being link poisoned and having their SEO damaged by our site. The nice thing is that each company which emails us provides a very solid lead for finding and removing spam users. Today an email came in identifying 10 user profiles with spam links. Searching profiles for that URL, I found 18 users who linked to it or to a similar Facebook URL. All 18 were obviously spam users. I deleted the profiles, then I deleted the users in both user tables. I need to find a way to properly delete users so that all rows created for a user are removed. Usually this is taken care of in relational databases with foreign keys and cascades, but we aren't so lucky with either the Drupal drippings or the current Rails code... or maybe it is the MySQL tables themselves ignoring the foreign keys. Regardless, I have no doubt there are a number of tables which need to be scrubbed of content from users which no longer exist. |
Let me clarify: properly batch delete users. I'm not about to go through 18 users, profile by profile, to delete them. |
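For reference, this is roughly what the foreign-key-and-cascade approach looks like when the database enforces it. The sketch uses SQLite through Python with a made-up two-table schema and the placeholder domain `spam.example`; it is not the site's actual schema, and it only helps where the storage engine actually honors foreign keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript(
    """
    CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE profiles (
        pid INTEGER PRIMARY KEY,
        uid INTEGER NOT NULL REFERENCES users(uid) ON DELETE CASCADE,
        bio TEXT
    );
    INSERT INTO users (uid, name) VALUES (1, 'alice'), (2, 'shoes42'), (3, 'loans7');
    INSERT INTO profiles (uid, bio) VALUES
        (1, 'balloon mapping enthusiast'),
        (2, 'buy now at spam.example/shoes'),
        (3, 'spam.example/loans approved fast');
    """
)

# Step 1: collect every user whose profile links to the reported domain.
spam_uids = [row[0] for row in conn.execute(
    "SELECT DISTINCT uid FROM profiles WHERE bio LIKE ?", ("%spam.example%",)
)]

# Step 2: delete them in one batch; the cascade removes their profile rows too.
with conn:
    conn.executemany("DELETE FROM users WHERE uid = ?", [(uid,) for uid in spam_uids])

remaining_users = conn.execute("SELECT name FROM users").fetchall()
remaining_profiles = conn.execute("SELECT COUNT(*) FROM profiles").fetchone()[0]
print(remaining_users, remaining_profiles)
```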
Hi bryan, rails (active record) keeps track of dependent=>destroy
Then
All dependent records will also be destroyed. |
Thanks for the example code, Jeff. I'll have to figure out how to run a |
you should be able to run the shell command "rails console" as any user, |
Haven't run this in a while. Looks like we're fluctuating around 1000 new users a month on average now.
|
Something changed in Feb/March that moved us from ~500 new users per month to ~1000 new users per month. Possibly another bump from July to August moved us from ~1000 new users per month to ~1500 new users per month, but that trend isn't as obvious yet. |
I realized I have not been keeping track of my spam user net outside of the database. Here's the current view to detect unbanned spammers based on key URLs found in their profile:
When the table is non-empty, I will sweep up the mess with this command:
(nested subqueries required because the view reads the table being updated) Generally there are no additions to the spammer list unless the spammer list has new URLs added to it. |
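The view and the sweep command did not survive in this copy. A minimal sketch of the same idea, assuming a `users.status` column where 0 means banned and a `profiles.bio` text column, with placeholder domains (`cheap-pills.example`, `fast-loans.example`) standing in for the real URL list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT, status INTEGER NOT NULL DEFAULT 1);
    CREATE TABLE profiles (uid INTEGER, bio TEXT);

    -- Unbanned users whose profile mentions a known spam URL.
    CREATE VIEW spammers AS
        SELECT u.uid, u.name
        FROM users u JOIN profiles p ON p.uid = u.uid
        WHERE u.status <> 0
          AND (p.bio LIKE '%cheap-pills.example%' OR p.bio LIKE '%fast-loans.example%');

    INSERT INTO users (uid, name) VALUES (1, 'alice'), (2, 'pills99shop'), (3, 'loan5now');
    INSERT INTO profiles (uid, bio) VALUES
        (1, 'I map oil spills with kites'),
        (2, 'best deals at cheap-pills.example'),
        (3, 'apply at fast-loans.example today');
    """
)

caught = conn.execute("SELECT name FROM spammers ORDER BY uid").fetchall()

# The view reads users, which is also the table being updated. Wrapping the
# view in one more derived table mirrors the nesting the comment describes;
# SQLite itself does not require the extra level.
with conn:
    conn.execute(
        "UPDATE users SET status = 0 "
        "WHERE uid IN (SELECT uid FROM (SELECT uid FROM spammers) AS caught)"
    )

left_in_view = conn.execute("SELECT COUNT(*) FROM spammers").fetchone()[0]
print(caught, left_in_view)
```

Once the sweep runs, the view is empty again until a new URL is added to its WHERE clause, which matches the behavior noted above.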
I estimate we gained about 225,000 spam user accounts total over the 5 month botnet attack. We have about 250,000 migrated accounts, which would further estimate about 25,000 actual users. Of the migrated accounts, about 225,000 are not banned (compared to an estimated 25,000 legit users). Lots of work to be done still!
|
Bryan this is an incredible piece of long term work that is useful for |
Thanks @ebarry . It's certainly not a usual "fix it and close it" ticket, more like a weather forecast. That's why I named it like a weather forecast :) |
Given an estimated 7% chance of grabbing a legit user from the pool of unbanned, migrated users with a uniform draw, it should be fun to pull users at random. This cannot be done in Rails efficiently because the User and DrupalUser are not properly associated to check for both unbanned status and migrated user with a single join. I tried. How to grab a random migrated, unbanned user profile link from the database:
Odds are high that the profile returned will be spam! |
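The original query is missing here. As a sketch of a uniform random draw, assuming a single table with a `status` column (1 meaning unbanned) and a hypothetical `migrated` flag, plus an assumed profile URL pattern; the real schema spreads this across two user tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened schema; `migrated` is an illustrative flag, not a known column.
conn.execute(
    "CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT, status INTEGER, migrated INTEGER)"
)
conn.executemany(
    "INSERT INTO users (name, status, migrated) VALUES (?, ?, ?)",
    [
        ("shoesoutlet42", 1, 1),  # unbanned and migrated: a candidate
        ("loans7fast", 1, 1),     # unbanned and migrated: a candidate
        ("bannedbot9", 0, 1),     # already banned, skipped
        ("newcomer", 1, 0),       # not migrated, skipped
    ],
)


def random_profile_link(conn):
    """Return a profile link for one random unbanned, migrated user, or None."""
    row = conn.execute(
        "SELECT name FROM users WHERE status = 1 AND migrated = 1 "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    # The URL pattern is an assumption for illustration.
    return None if row is None else f"https://publiclab.org/profile/{row[0]}"


link = random_profile_link(conn)
print(link)
```

`ORDER BY RANDOM()` shuffles every qualifying row before taking one, so it is slow on a table with hundreds of thousands of accounts, but acceptable for occasional manual spot checks.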
Polling random users does not disappoint. 15 out of 15 were spam users. I updated the spam finder view with more key words and banned somewheres around 10,000 users. Statistics in previous comments have been updated. |
The spammers view is built from domains and keywords that end up in user profiles and almost certainly come from spammers. It might be worth looking into external lists of blacklisted spam domains/keywords. These were intended for email, but the same principle applies. |
The spammers view is taking a very long time to run. Text searching is fairly slow, and with so many qualifiers to search against, it runs even more slowly. It might be a good idea to start breaking the query apart using something like a priority queue:
At that point, storing it in the database as a view would be onerous due to updating which domain goes where. Instead, it might be better to run them as queries by external software and automate the process to some comfortable degree. |
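One way the external, prioritized approach could look: keep the domains in a priority queue ordered by how many spammers each has caught before, and run one small query per domain until a query budget is used up. Everything below (domains, hit counts, table layout, the `sweep` helper) is hypothetical:

```python
import heapq
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (uid INTEGER PRIMARY KEY, bio TEXT)")
conn.executemany(
    "INSERT INTO profiles (bio) VALUES (?)",
    [
        ("visit cheap-pills.example today",),
        ("cheap-pills.example best prices",),
        ("I build balloon mapping kits",),
        ("fast-loans.example apply now",),
    ],
)

# heapq is a min-heap, so past hit counts are negated to pop the most
# productive domain first.
queue = [(-120, "cheap-pills.example"), (-15, "fast-loans.example"), (0, "rare-spam.example")]
heapq.heapify(queue)


def sweep(conn, queue, max_queries):
    """Run at most max_queries single-domain searches, highest priority first."""
    found = {}
    for _ in range(min(max_queries, len(queue))):
        _, domain = heapq.heappop(queue)
        rows = conn.execute(
            "SELECT uid FROM profiles WHERE bio LIKE ? ORDER BY uid", (f"%{domain}%",)
        ).fetchall()
        found[domain] = [uid for (uid,) in rows]
    return found


found = sweep(conn, queue, max_queries=2)
print(found)
```

Domains left in the queue can be picked up on a later run, so no single pass has to pay for every text search at once.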
SQL query to grab when banned user accounts were most recently updated (by access, login, or creation timestamp), and then count the total such users for each year. Triple nested query (two subqueries) ... not sure how to make that nicer though.
Presently that looks like this:
Assuming 2014 is over with (true in a few weeks), then 41746 banned users have not accessed their accounts in any way in at least one year. That is the summation of 2011, 2012, and 2013 in the table above. That is out of a total of 41,889 banned users. That is out of an estimated 225,000 total spam users. I think these user accounts are good candidates to be removed from the database. An additional test can be added to see if any valid research notes have been posted, and skip those users. It looks like Current stats for blocked and unblocked notes:
Assuming status means the same thing in node as it does user, then about 1/3 of the notes have been blocked (33.56%) |
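For the per-year count of banned accounts by most recent activity described above, one possible flattening into a single query is sketched below. It assumes Drupal-style `access`, `login`, and `created` Unix-timestamp columns that default to 0 and a `status` of 0 for banned users; the sample rows are invented. SQLite's multi-argument `MAX()` plays the role that `GREATEST()` plays in MySQL:

```python
import sqlite3
from datetime import datetime, timezone


def ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (uid INTEGER PRIMARY KEY, status INTEGER, "
    "created INTEGER NOT NULL DEFAULT 0, access INTEGER NOT NULL DEFAULT 0, "
    "login INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO users (status, created, access, login) VALUES (?, ?, ?, ?)",
    [
        (0, ts(2011, 4, 1), 0, 0),                            # never came back
        (0, ts(2011, 5, 1), ts(2012, 3, 1), ts(2012, 2, 1)),  # last seen 2012
        (0, ts(2012, 7, 1), ts(2013, 6, 1), ts(2013, 1, 1)),  # last seen 2013
        (0, ts(2013, 2, 1), ts(2013, 2, 2), ts(2013, 2, 1)),  # last seen 2013
        (1, ts(2014, 1, 1), ts(2014, 9, 1), ts(2014, 9, 1)),  # active user, ignored
    ],
)

# Take the latest of the three timestamps per banned user, then count by year.
last_seen_by_year = conn.execute(
    """
    SELECT strftime('%Y', MAX(access, login, created), 'unixepoch') AS year,
           COUNT(*) AS banned_users
    FROM users
    WHERE status = 0
    GROUP BY year
    ORDER BY year
    """
).fetchall()
print(last_seen_by_year)
```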
Added filters to advanced search page and installed solr inside the project.
@yukiisbored What's the status on this? |
@ryzokuken ah, I think I added that by mistake. |
We can ban known spammer/scammer/proxy IPs; I think there's a REST API for those. |
@jywarren Are we planning on working on this? |
I think that with the new OAuth work https://github.com/publiclab/plots2/milestone/18 , there are lots of new spam accounts being made, but none of them are successfully getting spam posts through... |
wow blast from the past!!!! Hi everyone!
|
Hi everyone. I had not seen any of the questions or updates to this ticket until recently. My email seems to be 50/50 spam or archived folder. I see a bunch of questions from over a year ago. For those: The original intent of this ticket was to have shared notes on the process and queries used to find and eliminate spam user accounts. There was never an end goal for this ticket. I think it should be fine to close this out. The process and queries will still be available in the ticket after it is closed. |
I will never forget Issue #4
We had a good run over these years <3
Spam Tracker 3000 will live on in our collective meme-ories, and hopefully
meme-orialized on publiclab.org/spam
On Fri, Aug 3, 2018 at 4:01 PM, Jeffrey Warren wrote:
Hi, @btbonval -- good to hear from you! Miss you!
Thanks, and we have a good overall spam issue now at #974 for what it's worth. I guess I'll close this since there's not a specific to-do or end goal, and we can trace back to it from #974. Cool!
|
In order to determine how much spam is being received at any time, and how much is being handled by the community (admin/moderators), we should have some way to track historical spam even after it is deleted.
This might be as simple as something like a table that has a date that a node id and revision id were deleted, and possibly cache some information such as when the spam was created, who created it, and the last IP address of the creator. Usually this would be done relationally, but since the user, node, and revision might be deleted, this table should store copies of information and not pointers.
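A minimal sketch of such a table, assuming simplified `node` and `users` tables and a hypothetical `last_ip` column. The log stores plain copies with no foreign keys, so its rows survive after the node and the user are gone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT, last_ip TEXT);
    CREATE TABLE node  (nid INTEGER PRIMARY KEY, vid INTEGER, uid INTEGER, created INTEGER);

    -- Copies, not pointers: deliberately no REFERENCES clauses here.
    CREATE TABLE deleted_spam_log (
        id              INTEGER PRIMARY KEY,
        deleted_at      INTEGER NOT NULL,
        node_id         INTEGER NOT NULL,
        revision_id     INTEGER NOT NULL,
        spam_created_at INTEGER,
        author_name     TEXT,
        author_last_ip  TEXT
    );

    INSERT INTO users (uid, name, last_ip) VALUES (7, 'pills99shop', '203.0.113.5');
    INSERT INTO node (nid, vid, uid, created) VALUES (100, 250, 7, 1388534400);
    """
)


def delete_spam_node(conn, nid, deleted_at):
    """Record a snapshot of the node and its author, then delete the node."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            """
            INSERT INTO deleted_spam_log
                (deleted_at, node_id, revision_id, spam_created_at, author_name, author_last_ip)
            SELECT ?, n.nid, n.vid, n.created, u.name, u.last_ip
            FROM node n LEFT JOIN users u ON u.uid = n.uid
            WHERE n.nid = ?
            """,
            (deleted_at, nid),
        )
        conn.execute("DELETE FROM node WHERE nid = ?", (nid,))


delete_spam_node(conn, nid=100, deleted_at=1388620800)
with conn:
    conn.execute("DELETE FROM users WHERE uid = 7")  # the spammer is removed as well

log_rows = conn.execute(
    "SELECT deleted_at, node_id, revision_id, spam_created_at, author_name, author_last_ip "
    "FROM deleted_spam_log"
).fetchall()
print(log_rows)
```

Counting log rows per month of `deleted_at` would then give the historical view of how much spam moderators handled.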