Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid XML can be generated out of bad user input #10996

Open
1 task done
jonasraoni opened this issue Feb 27, 2025 · 3 comments
Open
1 task done

Invalid XML can be generated out of bad user input #10996

jonasraoni opened this issue Feb 27, 2025 · 3 comments

Comments

@jonasraoni
Copy link
Contributor

jonasraoni commented Feb 27, 2025

Valid Title

  • I have updated the title to accurately reflect the bug description

Description

The XML has a well defined list of accepted characters, but user input might violate it, and the PHP functions are not going to help, therefore it makes sense to remove them from the user input (e.g. preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', '', $input)).

It also makes sense to fix bad UTF-8 characters on the way with:

$original = mb_substitute_character();
mb_substitute_character('none');
$data = @mb_convert_encoding((string) $input, 'UTF-8', 'UTF-8');
mb_substitute_character($original);

Steps to Reproduce

  1. Publish a submission with an invalid XML character, such as "Invalid character \x02 followed by an invalid UTF-8 sequence \x66\xFC\x72" (notice the special characters need to be encoded)
  2. Navigate to the /index.php/{$journal}/oai URL and list the records
  3. See the browser will not render the content properly

Expected Result

The character should be stripped or replaced by something acceptable, such as the 0xFFFD (https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character).

Actual Result

The browser will not render the content properly, but of course, that's not the main issue.

Environment Details

No response

Application Version

OJS 3.3.0

Logs

No response

Additional Information

No response

@asmecher
Copy link
Member

@jonasraoni, is this user input via the browser, or an invalid XML file (JATS), or something else?

We used to have some tools for this back in the early days of OJS 3.0 ("charset normalization"); they caused more problems than they solve. At some level we need to rely on the underlying platform and environment to provide sane user input; trying to work around gaps in userspace is fixing things at the wrong level, in my opinion. If there's a common source of bad input, we can look into it, but I'll be a little resistant to fixes that really should be happening upstream.

@jonasraoni
Copy link
Contributor Author

The case that I had came from the browser (see image below).

I think the input shouldn't be touched, but its our job to ensure the output fits the formats that we're implementing. For me it's just like escaping text to appear in a HTML attribute, if something isn't allowed, it needs to be transformed. At the OAI stuff, where I had the problem, there's already a OAIUtils::prepOutput().

Image


A couple of years ago, it was possible to break the JSON parser with two curious characters, which were valid in JavaScript, but not in JSON (they've updated the standard) 🙃

Image

@asmecher
Copy link
Member

Is this content that's pasted into TinyMCE? I think both input and output would be good to investigate...

  • If TinyMCE is configured for UTF8 but is letting bad content in via paste operations, there's probably a lot of discussion over there about possible fixes (hopefully by configuration, without us having to code anything)...

  • When generating XML output, I wonder if we simply need to fix how we're creating text elements... We have a few patterns in active use:

    I remember at least one of these methods wasn't reliably escaping entities, and wouldn't be surprised if it also handled binary characters/content poorly.

I would think it's possible to use a standard method to create XML text elements that's resilient -- without having to manually pass all our text through an additional filter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants