Few not matched social handles #525

Closed
metalwarrior665 opened this issue Dec 4, 2019 · 5 comments · Fixed by #2758
Labels
feature Issues that represent new features or improvements to existing features. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@metalwarrior665
Member

metalwarrior665 commented Dec 4, 2019

https://www.linkedin.com/company/delegatus
https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/
https://www.facebook.com/pages/KinEssor-Groupe-Conseil/208264345877578

Reproduce:

// → false: the pattern only accepts linkedin.com/in/ profile paths, not /company/
(new RegExp('(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:[a-z]+\\.)?linkedin\\.com\\/in\\/)([a-z0-9\\-_%]{2,60})(?![a-z0-9\\-_%])(?:/)?')).test('https://www.linkedin.com/company/delegatus');
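The profile pattern above hard-codes the `/in/` path segment, so a `/company/` URL cannot match it. As a minimal sketch of one way to cover company pages, the snippet below mirrors the profile pattern with `/company/` in place of `/in/`. The name `LINKEDIN_COMPANY_REGEX` and the case-insensitive flag are assumptions made for illustration; nothing in this thread confirms the library exposes such a pattern.

```javascript
// Hypothetical pattern, not taken from the library: same shape as the
// profile regex above, but anchored on the /company/ path segment.
const LINKEDIN_COMPANY_REGEX = new RegExp(
    '(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:[a-z]+\\.)?linkedin\\.com\\/company\\/)([a-z0-9\\-_%]{2,60})(?![a-z0-9\\-_%])(?:/)?',
    'i', // assumption: match case-insensitively
);

const companyMatch = 'https://www.linkedin.com/company/delegatus'.match(LINKEDIN_COMPANY_REGEX);

// The first capture group holds the company handle when the URL matches.
const companyHandle = companyMatch ? companyMatch[1] : null;
console.log(companyHandle);
```

Whether company pages belong in the same result field as personal profiles is a separate design question that this sketch does not settle.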


const FACEBOOK_RESERVED_PATHS = 'rsrc\\.php|apps|groups|events|l\\.php|friends|images|photo.php|chat|ajax|dyi|common|policies|login|recover|reg|help|security|messages|marketplace|pages|live|bookmarks|games|fundraisers|saved|gaming|salesgroups|jobs|people|ads|ad_campaign|weather|offers|recommendations|crisisresponse|onthisday|developers|settings|connect|business|plugins|intern|sharer';

const FACEBOOK_REGEX_STRING = `(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:www.)?(?:facebook.com|fb.com)\\/(?!(?:${FACEBOOK_RESERVED_PATHS})(?:[\\'\\"\\?\\.\\/]|$))(profile\\.php\\?id\\=[0-9]{3,20}|(?!profile\\.php)[a-z0-9\\.]{5,51})(?![a-z0-9\\.])(?:/)?`;

// → false: "pages" is listed in FACEBOOK_RESERVED_PATHS, so the negative lookahead rejects the URL
(new RegExp(FACEBOOK_REGEX_STRING)).test('https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/');

@metalwarrior665 metalwarrior665 added bug Something isn't working. feature Issues that represent new features or improvements to existing features. and removed bug Something isn't working. labels Dec 4, 2019
@mtrunkat mtrunkat added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 18, 2023
@oklinov

oklinov commented Sep 11, 2024

Another one not matched: https://www.facebook.com/pages/U-Trumpety/394020690779011

// → null (String.prototype.match converts the pattern string to a RegExp implicitly)
'https://www.facebook.com/pages/U-Trumpety/394020690779011'.match(FACEBOOK_REGEX_STRING)
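All three Facebook URLs reported in this thread start with `/pages/`, which the reserved-path list excludes on purpose. A rough sketch of a separate pattern for that URL family follows. `FACEBOOK_PAGES_REGEX` and the returned `slug`/`id` shape are hypothetical names chosen for this example, and the two URL layouts handled (`/pages/<slug>/<id>` and `/pages/category/<category>/<slug>/`) are inferred only from the examples above.

```javascript
// Hypothetical pattern, not part of the library. It accepts an optional
// "category/<category>/" segment, then the page slug, then an optional
// numeric id segment.
const FACEBOOK_PAGES_REGEX = /(?<!\w)(?:https?:\/\/)?(?:www\.)?(?:facebook\.com|fb\.com)\/pages\/(?:category\/[a-z0-9\-_.%]+\/)?([a-z0-9\-_.%]+)(?:\/([0-9]{3,20}))?\/?/i;

// The URLs reported in this issue.
const pageUrls = [
    'https://www.facebook.com/pages/U-Trumpety/394020690779011',
    'https://www.facebook.com/pages/KinEssor-Groupe-Conseil/208264345877578',
    'https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/',
];

// Map each URL to its slug and (when present as its own segment) numeric id.
const pages = pageUrls.map((url) => {
    const m = url.match(FACEBOOK_PAGES_REGEX);
    return m ? { slug: m[1], id: m[2] } : null;
});

console.log(pages);
```

In the category layout the numeric id is fused into the slug, so `id` stays `undefined` there; splitting it off would need an extra rule that this sketch leaves out.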

@lhotanok
Contributor

New format of the Twitter / X links is also not supported:
https://x.com/apify

Not sure if we should:

  • expand the existing Twitter regex
  • create a new X regex + new socials field (x or xs?)
    • deprecate the old Twitter links and remove twitters field
    • keep the old Twitter links and store them separately in twitters field

We should investigate whether they changed the URL patterns completely or just renamed the domain from twitter to x and left the rest as it was.

@lhotanok
Contributor

lhotanok commented Nov 28, 2024

Links to YouTube channels with the @ prefix are not matched either, e.g.:

@metalwarrior665
Member Author

For now I would add X links under the Twitter regex. I think most people still refer to it as Twitter, and it will take years before the rebrand fully takes hold. This is general enough that it should be done in Crawlee. Could you make an issue/PR there, please? @lhotanok
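To illustrate the "X links under the Twitter regex" option, here is a minimal sketch that accepts both domains in one pattern. Everything in it is an assumption made for this example: the simplified handle rule, the short `RESERVED_PATHS` list, and the names `TWITTER_OR_X_REGEX` and `extractHandle`. The library's actual Twitter regex and reserved paths may differ.

```javascript
// Assumption: a short, illustrative list of non-profile paths.
const RESERVED_PATHS = 'home|explore|search|settings|i|intent|share|hashtag|login|signup';

// Hypothetical combined pattern: the only change needed for X is the
// extra domain alternative. The (?<!\w) lookbehind keeps domains that
// merely end in "x.com" (such as netflix.com) from matching.
const TWITTER_OR_X_REGEX = new RegExp(
    `(?<!\\w)(?:https?:\\/\\/)?(?:www\\.)?(?:twitter\\.com|x\\.com)\\/(?!(?:${RESERVED_PATHS})(?:['"?.\\/]|$))([a-z0-9_]{1,15})(?![a-z0-9_])\\/?`,
    'i',
);

// Returns the handle captured from a Twitter or X profile URL, or null.
function extractHandle(url) {
    const m = url.match(TWITTER_OR_X_REGEX);
    return m ? m[1] : null;
}

console.log(extractHandle('https://x.com/apify'));
console.log(extractHandle('https://twitter.com/apify'));
console.log(extractHandle('https://netflix.com/apify'));
```

With a single pattern, both old and new links would land in the existing twitters field, which matches the suggestion above and avoids adding a new socials field.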

@lhotanok
Contributor

I'll make a PR for this 👍
