Allowed characters in preferredUsername #9
Per RFC 7565 linked above (and its dependency RFC 3986):

```
acctURI  = "acct" ":" userpart "@" host
userpart = unreserved / sub-delims
           0*( unreserved / pct-encoded / sub-delims )
```

The username starts with at least 1 character from `unreserved / sub-delims` (not percent-encoded), followed by zero or more characters from `unreserved / pct-encoded / sub-delims`. For reference, RFC 3986 defines:

```
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
pct-encoded = "%" HEXDIG HEXDIG
```
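For illustration, a minimal Python sketch of a validator for the userpart grammar above, following the reading that the first character must be unreserved or a sub-delim and only later characters may be percent-encoded (the function and constant names are illustrative, not taken from any implementation):

```python
import re

# ABNF from RFC 7565 / RFC 3986 translated to a regex: the first character must
# be unreserved or a sub-delim; the rest may also include pct-encoded sequences.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
USERPART_RE = re.compile(
    rf"[{UNRESERVED}{SUB_DELIMS}]"
    rf"(?:[{UNRESERVED}{SUB_DELIMS}]|%[0-9A-Fa-f]{{2}})*"
)

def is_valid_userpart(userpart: str) -> bool:
    """Check a candidate acct: userpart against the RFC 7565 grammar."""
    return USERPART_RE.fullmatch(userpart) is not None

print(is_valid_userpart("alice"))       # True
print(is_valid_userpart("a%C3%A9ve"))   # True: pct-encoded allowed after the first character
print(is_valid_userpart("%C3%A9ve"))    # False: may not start with a pct-encoded sequence
print(is_valid_userpart("al ice"))      # False: space is not allowed
```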
HOWEVER: Mastodon further constrains usernames with its `USERNAME_RE`. Looking at `USERNAME_RE`:

```ruby
USERNAME_RE = /[a-z0-9_]+([a-z0-9_.-]+[a-z0-9_]+)?/i
```

Here we see that Mastodon only allows alphanumerics and underscores. It further allows dots and dashes in the middle of the username, but not at the beginning or end. In ABNF, this would be something like:

```
username = word *( rest )
word     = ALPHA / DIGIT / "_"
rest     = *( extended ) word
extended = word / "." / "-"
```

The username is matched case-insensitively, and case-insensitivity is further enforced by the UniqueUsernameValidator:

```ruby
normalized_username = account.username.downcase
normalized_domain = account.domain&.downcase

scope = Account.where(
  Account.arel_table[:username].lower.eq normalized_username
).where(
  Account.arel_table[:domain].lower.eq normalized_domain
)
```

Or, in other words, Mastodon will check that the downcased username and domain do not already exist in the local database. |
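To make both checks concrete, here is a rough Python re-expression of an anchored `USERNAME_RE` match and the downcased uniqueness comparison; the in-memory `existing` set is just a stand-in for the database query above:

```python
import re

# Python re-expression of Mastodon's USERNAME_RE (case-insensitive): alphanumerics
# and underscores, with dots and dashes allowed only in the middle.
MASTODON_USERNAME_RE = re.compile(r"[a-z0-9_]+([a-z0-9_.-]+[a-z0-9_]+)?", re.I)

def is_valid_local_username(username: str) -> bool:
    return MASTODON_USERNAME_RE.fullmatch(username) is not None

# Case-insensitive uniqueness, analogous to UniqueUsernameValidator: compare
# (lower(username), lower(domain)) pairs instead of the raw values.
existing = {("alice", None), ("bob", "example.com")}

def is_taken(username: str, domain: str | None) -> bool:
    return (username.lower(), domain.lower() if domain else None) in existing

print(is_valid_local_username("alice.bot"))  # True
print(is_valid_local_username(".alice"))     # False: leading dot
print(is_taken("ALICE", None))               # True: differs only by case
```

The quoted Ruby regex is unanchored as written; the sketch uses fullmatch on the assumption that the whole username must match.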
Misskey enforces a 1-20 `\w` (word characters, equivalent to ALPHA / DIGIT / "_") limit on local usernames: https://github.com/misskey-dev/misskey/blob/c81b61eb2e4e3a16f0039ebfa9c174ee216ccafb/packages/backend/src/models/User.ts#L288

There is also a limit of 128 characters on usernames, and 128 characters on host:

```typescript
@Column('varchar', {
	length: 128,
	comment: 'The username of the User.',
})
public username: string;

@Index()
@Column('varchar', {
	length: 128, nullable: true,
	comment: 'The host of the User. It will be null if the origin of the user is local.',
})
public host: string | null;
```

When searching for usernames, Misskey enforces the same downcasing behavior as Mastodon. |
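As a sketch of what that amounts to, assuming the 1-20 `\w` reading above: note that Python's `\w` is Unicode-aware by default, so `re.ASCII` is needed to mirror the ASCII word-character class, and the lookup key mimics the downcasing behavior:

```python
import re

# Misskey-style local username check: 1-20 ASCII word characters.
# re.ASCII restricts \w to [A-Za-z0-9_].
LOCAL_USERNAME_RE = re.compile(r"\w{1,20}", re.ASCII)

def misskey_like_local_check(username: str) -> bool:
    return LOCAL_USERNAME_RE.fullmatch(username) is not None

# Lookups compare the lowercased username and host, as with Mastodon.
def lookup_key(username: str, host: str | None) -> tuple[str, str | None]:
    return (username.lower(), host.lower() if host else None)

print(misskey_like_local_check("Taro_123"))    # True
print(misskey_like_local_check("tarosan" * 3)) # False: 21 characters
print(lookup_key("Taro_123", "Example.COM"))   # ('taro_123', 'example.com')
```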
a 20 character limit seems excessively small.
|
it's only 20 for local usernames. although i think if i'm reading this correctly then the database limits in misskey's schema are 128 characters for the username column and 128 for the host column (as quoted above).
|
Appears Pleroma does this for remote users: https://git.pleroma.social/pleroma/pleroma/-/blob/develop/lib/pleroma/user.ex#L519

```elixir
|> validate_format(:nickname, @email_regex)
```

The regexes used are like so:

```elixir
@email_regex ~r/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
@strict_local_nickname_regex ~r/^[a-zA-Z\d]+$/
@extended_local_nickname_regex ~r/^[a-zA-Z\d_-]+$/
```

So it appears to be more permissive than Mastodon -- alphanumeric plus more symbols (not just dots, dashes, and underscores). For local users, it appears you can have dashes and underscores at any point in the username, including at the beginning or the end. As a consequence: it appears a username of |
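For concreteness, a quick Python re-expression of the two local nickname patterns quoted above, showing that the extended pattern accepts leading and trailing dashes and underscores while neither pattern accepts dots:

```python
import re

# Python re-expressions of Pleroma's local nickname patterns quoted above.
STRICT_LOCAL = re.compile(r"^[a-zA-Z\d]+$")      # letters and digits only
EXTENDED_LOCAL = re.compile(r"^[a-zA-Z\d_-]+$")  # also "_" and "-", anywhere

for name in ("alice", "_alice_", "-alice-", "al.ice"):
    print(name, bool(STRICT_LOCAL.match(name)), bool(EXTENDED_LOCAL.match(name)))
# alice    True  True
# _alice_  False True   <- leading/trailing underscores pass the extended pattern
# -alice-  False True   <- same for dashes
# al.ice   False False  <- dots are not allowed for local users
```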
Not entirely related, but RFC 2821 (for SMTP) describes email addresses as having a maximum of 64 characters for the local-part and 255 characters for the domain. |
Pixelfed allows alphanumeric, dots, dashes, and underscores, but will skip creating or updating a profile for a username containing any other character: https://github.com/pixelfed/pixelfed/blob/dev/app/Util/ActivityPub/Helpers.php#L771-L776

```php
// skip invalid usernames
if(!ctype_alnum($res['preferredUsername'])) {
	$tmpUsername = str_replace(['_', '.', '-'], '', $res['preferredUsername']);
	if(!ctype_alnum($tmpUsername)) {
		return;
	}
}
```
|
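A rough Python equivalent of the check above; PHP's `ctype_alnum` is ASCII-only, so an explicit ASCII character class is used rather than Python's Unicode-aware `str.isalnum()`:

```python
import re

# Pixelfed-style check: strip '_', '.', '-' and require the remainder to be
# ASCII alphanumeric; position of the stripped characters doesn't matter.
def pixelfed_accepts(preferred_username: str) -> bool:
    stripped = preferred_username.replace("_", "").replace(".", "").replace("-", "")
    return re.fullmatch(r"[A-Za-z0-9]+", stripped) is not None

print(pixelfed_accepts("alice.bot"))  # True
print(pixelfed_accepts(".-alice-."))  # True: leading/trailing dots and dashes pass here
print(pixelfed_accepts("élodie"))     # False: non-ASCII is rejected
```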
Kitsune allows characters beyond those set by the RFC (although hidden behind a feature flag exactly for that reason). We technically allow all Unicode alphabetic characters and Unicode digits (because who wouldn't want their username to be written entirely in Hangul?). But again, just behind a feature flag. By default it only allows ASCII letters, numbers, underscores, dashes, and dots. So very much RFC compliant. That makes it a little complicated. |
Because WebFinger is mentioned above: I would not assume that the |
I think in some scripts it might seem unreasonably long! |
I'd like to start boiling this down to recommendations for publishers and consumers to maximize interoperability. For publishers:
For consumers:
What do we think? |
I don't see why usernames can't be internationalised. I have a demo user for this; further thoughts and comments at https://shkspr.mobi/blog/2024/02/internationalise-the-fediverse/. Similarly, lots of people have really long names. I appreciate that services like email typically limit to a set number of characters - so I agree with @trwnh that it might be better to follow existing standards rather than arbitrarily set other limits. |
It should be noted that if we are going to support internationalized usernames, the definition of "length" becomes ambiguous. And if we adopt the byte-length of the percent-encoded UTF-8 string as the definition of length, non-ASCII usernames would exhaust any length limit far sooner than ASCII ones. |
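To illustrate how much the choice of unit matters, a small Python comparison of codepoint count, UTF-8 byte length, and percent-encoded length for a few made-up usernames:

```python
from urllib.parse import quote

for username in ("alice", "éve", "예시계정"):
    encoded = quote(username, safe="")
    print(
        f"{username!r}: {len(username)} codepoints, "
        f"{len(username.encode('utf-8'))} UTF-8 bytes, "
        f"{len(encoded)} percent-encoded characters ({encoded})"
    )
# 'alice': 5 codepoints, 5 UTF-8 bytes, 5 percent-encoded characters (alice)
# 'éve': 3 codepoints, 4 UTF-8 bytes, 8 percent-encoded characters (%C3%A9ve)
# '예시계정': 4 codepoints, 12 UTF-8 bytes, 36 percent-encoded characters
```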
If any limit would be applied, we should, in my opinion, use extended grapheme clusters since those are the closest to what humans would interpret as "characters". Unicode codepoints (i.e. scalars) are a little less accurate but more accurate than bytes, and bytes are just completely unsuitable for finding the length of a string if you consider anything besides ASCII.

Edit: But this also needs some more careful implementation around the counting of characters, such as either a secondary byte limit or codepoint limit on the implementer side. While grapheme counting in itself doesn't seem too bad, it's still significant when comparing it to counting codepoints, which takes 45µs on the same payload. Especially considering the ~600k codepoints only took up ~550KB in space and body limits are much larger than that (usually 4x larger), that can technically evolve into a DoS vector if not handled carefully enough. |
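A small Python illustration of the three counts; this assumes the third-party `regex` module, whose `\X` pattern matches extended grapheme clusters (the standard `re` module has no grapheme support):

```python
import regex  # third-party; its \X pattern matches an extended grapheme cluster

def lengths(s: str) -> tuple[int, int, int]:
    """Return (UTF-8 bytes, codepoints, grapheme clusters) for a string."""
    return len(s.encode("utf-8")), len(s), len(regex.findall(r"\X", s))

print(lengths("abc"))        # (3, 3, 3)
print(lengths("e\u0301"))    # (3, 2, 1): 'e' plus a combining acute accent
print(lengths("👩‍👩‍👧‍👦")) # (25, 7, 1): one family emoji, seven codepoints
```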
The whole point of webfinger usernames is that people can use them in plain text environments to mention other users, and figuring out "which set of Unicode inputs can people easily type using their keyboard and IME" is an almost impossible problem to solve. And yes, although you're very glib about the risk in your blog post, there are very serious security concerns implied by nonprintables, uncanonicalized input, and confusables. Even URLs, which have an i18n standard, unlike webfinger, very rarely show the Unicode form to users. This is a hard-fought lesson from the browser security space -- we shouldn't be so eager to throw away the same lesson here. All accounts have a |
As I wrote in w3c/activitypub#395 (comment), Weibo is an example of such a platform. (It depends on the definition of Edit: Weibo's 用户修改昵称规则 (User Nickname Modification Rule) states the following:
English translation:
(I don't want to experiment with a real account as they, er, review profile changes for some reason, and they allow only a few username changes in a certain period.) |
Thanks. I wasn't sure if a weibo Frankly, I don't think that new systems should use webfinger at all, and I don't really relish the prospect of extending such a system to permit non-ascii identifiers before having more robust non-uniqueness guarantees in place. |
Whatever fits into an SQL varchar column of length 128? At least, that's what Misskey seems to do, and I'm not sure if any other platforms have implicit limits on length like this. I assume that means unicode codepoints since most databases are encoded as UTF-8? (Note that I am not actually sure about this.) |
Terence's opinion in the blog post (

Also, the situation could be improved with the aid of a more clever autocompletion system, although that may mean a compromise for the design goal of being able to type the username out by hand. Latin-alphabet usernames with diacritical marks can trivially be suggested from their counterparts without the diacritical marks (e.g. an input of e can suggest é). The autocompletion only works if the server knows the target user, but you can copy-and-paste the username the first time and you are not likely to need to repeat it after that, which I think isn't bad UX. (Even with ASCII Latin alphabets, it's not quite a good idea to try to type the username all by yourself instead of copy-and-pasting, if the autocompletion doesn't work.)

Admittedly, not all scripts can be trivially suggested from Latin alphabets. For example, many Japanese kanji have a many-to-many correspondence with their pronunciation, so they are not as easy to suggest from Latin alphabets as Chinese hanzi. Converting the Mandarin Chinese username

In systems that expect to process Japanese names, it's a common practice to ask users to input "phonetic" names along with their kanji names. Perhaps ActivityPub actors can likewise have secondary usernames (used in discovery, but not for display purposes), using a mechanism like
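As a sketch of the diacritics-based suggestion idea, a minimal Python helper that strips combining marks via NFD decomposition; as noted above, it only helps for scripts whose base letters are Latin:

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    """Decompose to NFD and drop combining marks, e.g. 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("éve"))    # eve
print(strip_diacritics("Grüße"))  # Gruße ('ß' has no decomposition, so it stays)
print(strip_diacritics("北京"))    # 北京  (no-op: not a Latin-plus-marks script)
```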
IIUC, "homograph attacks" are only applicable if the target server of the attack accepts sign-ups by untrusted users. So I think it's fine to have an administrator configuration to opt into local non-ASCII usernames for single-user servers, and always accept remote non-ASCII usernames on the assumption that it's the remote servers' responsibility to restrict characters that may be harmful in their setup or otherwise moderate bad actors who pretend to be other users on the same server.

And as for nonprintable characters, RFC 7565 (The 'acct' URI Scheme) warns about/forbids them:

https://datatracker.ietf.org/doc/html/rfc7565#section-5

https://datatracker.ietf.org/doc/html/rfc7565#section-6

Is that really the case? Web browsers' address bars (which I think are one of the most prominent places where end users see URLs) display the Unicode form of an IDN. Also, as a layperson, I find the punycode/percent-encoded form more confusing and harder to distinguish from similarly named labels, but that's irrelevant to this discussion. |
Also, while I agree that the Latin script is the most widely accessible input method in a technical sense, I doubt that it's always a safe way of expressing one's name. For example, romanization systems for Chinese characters are subject to debate in Taiwan (although I'm not an expert on the issue): https://en.wikipedia.org/wiki/Chinese_language_romanization_in_Taiwan |
I have an instance of Mastodon where I test IDN domain names - see e.g. @north@ꩰ.com (where I have several patches applied to fix issues). I own quite a few single-character IDNs, so I'm all too familiar with how browsers and other software render them. Mastodon, for example, won't convert https://ꩰ.com to a link; I have to instead use the canonical form xn--8r9a.com. (we'll see what GitHub does here... edit: good job GitHub!). Discord and Slack forcefully convert links to Punycode. Signal is quite happy with them.

Browsers have policies (you can see Chrome's at https://chromium.googlesource.com/chromium/src/+/main/docs/idn.md) that are fairly similar to each other. Unless it's deemed unsafe to do so, such as when Unicode "confusables" are involved, browsers will generally render IDNs as the Unicode characters. |
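For reference, the Unicode-to-ACE ("punycode") round trip looks like this in Python; note that the standard library codec implements the older IDNA 2003 mapping, so some newer labels (possibly including ꩰ.com) may need the third-party idna package:

```python
# Round-trip between the Unicode form and the ASCII-compatible (ACE) form of a domain.
ascii_form = "bücher.example".encode("idna").decode("ascii")
print(ascii_form)                                 # xn--bcher-kva.example
print(ascii_form.encode("ascii").decode("idna"))  # bücher.example
```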
Yes. Browsers have a thousand lines of code dedicated to getting this right, at least 4 different spec or pseudo-spec documents (UTS 46, UTS 39, RFC 5892, and the Google IDN documentation), an always-on reporting service (Google Safe Browsing), global analytics (a constantly updating cached list of the global top 500 sites), and on-device analytics (the Site Engagement Service) to make IDN labels safe. I don't think the CG can in good faith recommend people go down the route of providing IDN identifiers unless they're willing to implement checks at least as strict.

For example, a popular attack on Mastodon a few years ago involved a webfinger identifier of the form "…@mastodon.social(lots of empty / space unicode characters).evil.server", making an account from evil.server look like it was actually from mastodon.social (as long as the users didn't notice the ellipses). This same kind of confusability would also be able to be present in usernames if we relaxed this restriction.
|
IDN has been a thing for about... a decade, and some non-Latin worlds have started using IDN for various purposes (this even includes traditional e-mail -- fortunately, IDN helped those countries migrate from ugly DNS hacks installed on telcos' servers to proper, standardized IDN). Not allowing IDN, and rendering it only as raw punycode, isn't a great option. Safe IDN is hard (because human language is HARD), but I think it is worth it. |
Well, I feel like the IDN-specific discussion is getting off topic here. While IDNs have some similarities to internationalized usernames, they aren't quite the same problem.

As pointed out by perillamint, IDNs are already widespread and they have plenty of legitimate use cases, so just always displaying their gobbledegook forms would be troublesome. Admittedly, this argument doesn't stand for internationalized usernames, which aren't widespread yet.

And as I wrote in #9 (comment), the scope of homograph attacks is different: while the IDN homograph attack allows a threat actor to impersonate someone else's (remote) domain, the attack against usernames is only applicable to local users, and the applicability depends on the server's setup. |
There is relevant text in https://www.rfc-editor.org/rfc/rfc7565.html#section-6, as @tesaguri and @trwnh already mentioned.

P.S.: It's undefined how that relates to the webfinger local part. |
I suppose that at least we all agree that the report shouldn't just advise every server to accept arbitrary Unicode usernames (including whitespace, control characters, etc.) without any security considerations. But it's not obvious where the best compromise resides between the two extremes, i.e. accept everything vs. ASCII-only (though the latter is one of the viable options). And I think the ambiguity is getting the discussion a little out of focus, as the security implications (among other concerns) would significantly differ depending on the specifics of the approach. So I'd like to show my personal vision of the requirements to have regarding this topic (others may have different visions, of course):

Also, a guidance like the following may be a nice addition:

… which is analogous to Web browsers' behavior when an IDN doesn't pass the display algorithm. Personally, I don't think we should have such guidance, because I believe the remote server should have the discretion/responsibility to determine what's appropriate for its users. For example, I've seen a person whose handle name is intentionally a mixture of Latin script and another kind of script, which isn't likely to pass an IDN display algorithm. If they launched a single-user server and set the handle name as the username of their account, there would be no problem as far as security is concerned. (Not that I personally admire it. It's a big headache for accessibility! But so are non-camelCased usernames, leet usernames, etc., and displaying percent-encoded forms wouldn't be a solution to accessibility concerns either.)

Also, by making it the responsibility of servers producing non-ASCII usernames, consuming implementations don't need to pay the extra cost of aggressively rejecting potentially malicious usernames if they don't choose to produce non-ASCII usernames themselves (though I believe that they should implement something similar for IDNs either way), while producing servers can opt into allowing non-ASCII usernames by implementing a precaution or simply limiting user registrations, which is a significant step forward from the status quo without sacrificing security. |
isn't our case exactly what https://www.unicode.org/reports/tr31/#Default_Identifier_Syntax could be a good fit for? |
Essentially, yes. This is what Kitsune is using already (well, almost; Kitsune uses a reduced version of it). If you then put it into a column using Level 1 Unicode collation, you've got a rather robust system against breakage and impersonation through confusables (since the collation will consider "a" and "ä" and "á" to be equal).

Edit: Here is an example of how the DIS would be implemented. It seems like this could work for Mastodon and Misskey usernames, but personally I'd prefer if there was a little more openness(?) for the usernames (mainly, I'd like to see support for dots, but that's just me, I guess 🤷‍♀️): https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a2eb8118dacec801bc28f62ec3584699 |
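As a rough, dependency-free approximation of that check: Python's own identifier rules are built on XID_Start/XID_Continue (with a leading underscore additionally allowed), so `str.isidentifier()` comes close to the UAX #31 default syntax, and it also shows why dots would need a profile extension:

```python
def looks_like_uax31_identifier(name: str) -> bool:
    """Rough stand-in for the UAX #31 default identifier syntax.

    Python's identifier rules are defined in terms of XID_Start / XID_Continue
    (plus a leading '_'), so str.isidentifier() is a close approximation.
    """
    return name.isidentifier()

print(looks_like_uax31_identifier("alice"))   # True
print(looks_like_uax31_identifier("ハム太郎"))  # True: non-Latin letters are fine
print(looks_like_uax31_identifier("9lives"))  # False: identifiers cannot start with a digit
print(looks_like_uax31_identifier("al.ice"))  # False: '.' needs a profile extension
```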
what puzzles me is that quite a few comments here read as if https://datatracker.ietf.org/doc/html/rfc7565#section-6 didn't exist. If there are implementations that chose to ignore this standard (and sure there are, and other standards as well), what do we expect from inventing yet another one?

edit: wrong section, off by one :-) |
Well, that is for Webfinger only. Technically you could use something like punycode or URL-encoding within Webfinger. But yeah, the combination of

(Going off the assumption that I interpreted the standard right.) It seems like even the implementations that already support i18n usernames follow the RFC you linked somehow. Going off of how Kitsune does it, we chose the latter route by simply restricting the usernames to pretty much the regex I wrote above. That way it adheres to the standard and allows for usernames like |
Note that the Default Identifier Syntax you linked doesn't accept some portion of usernames accepted by existing implementations, such as

If we are to adopt the Unicode identifier syntax, I think we need to have a custom profile modified from the default identifier syntax. In particular, to allow what Mastodon accepts today (#9 (comment)), we would need the following extensions (as suggested by the UAX in Table 3 and 3a):
|
On reflection, the explicit requirement for Unicode normalization isn't necessary, because the PRECIS IdentifierClass already implies NFKC by disallowing characters in the HasCompat category (those with compatibility equivalents). Then, the only major security concern not covered by the WebFinger and |
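For illustration, NFKC folds exactly the kind of compatibility characters that such a rule excludes:

```python
import unicodedata

# NFKC maps compatibility characters to their canonical counterparts, which is
# the effect a rule disallowing such characters achieves outright.
for raw in ("ｆｕｌｌｗｉｄｔｈ", "ﬁle", "Ⅻ"):
    print(repr(raw), "->", repr(unicodedata.normalize("NFKC", raw)))
# 'ｆｕｌｌｗｉｄｔｈ' -> 'fullwidth'
# 'ﬁle' -> 'file'
# 'Ⅻ' -> 'XII'
```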
How does the acct: URI format constrain the structure of `preferredUsername`? https://www.rfc-editor.org/rfc/rfc7565.html#section-7