Invalid XML can be generated out of bad user input #10996

jonasraoni · 2025-02-27T17:20:09Z

Valid Title

I have updated the title to accurately reflect the bug description

Description

The XML has a well defined list of accepted characters, but user input might violate it, and the PHP functions are not going to help, therefore it makes sense to remove them from the user input (e.g. preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', '', $input)).

It also makes sense to fix bad UTF-8 characters on the way with:

$original = mb_substitute_character();
mb_substitute_character('none');
$data = @mb_convert_encoding((string) $input, 'UTF-8', 'UTF-8');
mb_substitute_character($original);

Steps to Reproduce

Publish a submission with an invalid XML character, such as "Invalid character \x02 followed by an invalid UTF-8 sequence \x66\xFC\x72" (notice the special characters need to be encoded)
Navigate to the /index.php/{$journal}/oai URL and list the records
See the browser will not render the content properly

Expected Result

The character should be stripped or replaced by something acceptable, such as the 0xFFFD (https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character).

Actual Result

The browser will not render the content properly, but of course, that's not the main issue.

Environment Details

No response

Application Version

OJS 3.3.0

Logs

No response

Additional Information

No response

The text was updated successfully, but these errors were encountered:

asmecher · 2025-02-27T18:10:51Z

@jonasraoni, is this user input via the browser, or an invalid XML file (JATS), or something else?

We used to have some tools for this back in the early days of OJS 3.0 ("charset normalization"); they caused more problems than they solve. At some level we need to rely on the underlying platform and environment to provide sane user input; trying to work around gaps in userspace is fixing things at the wrong level, in my opinion. If there's a common source of bad input, we can look into it, but I'll be a little resistant to fixes that really should be happening upstream.

jonasraoni · 2025-02-27T18:54:04Z

The case that I had came from the browser (see image below).

I think the input shouldn't be touched, but its our job to ensure the output fits the formats that we're implementing. For me it's just like escaping text to appear in a HTML attribute, if something isn't allowed, it needs to be transformed. At the OAI stuff, where I had the problem, there's already a OAIUtils::prepOutput().

A couple of years ago, it was possible to break the JSON parser with two curious characters, which were valid in JavaScript, but not in JSON (they've updated the standard) 🙃

asmecher · 2025-02-27T19:23:38Z

Is this content that's pasted into TinyMCE? I think both input and output would be good to investigate...

If TinyMCE is configured for UTF8 but is letting bad content in via paste operations, there's probably a lot of discussion over there about possible fixes (hopefully by configuration, without us having to code anything)...
When generating XML output, I wonder if we simply need to fix how we're creating text elements... We have a few patterns in active use:
- DOMDocument::createTextNode
- DOMDocument::createElement / createElementNS along with htmlspecialchars (not a fan of this one!)
- We used to use DOMNode::nodeValue but I think pretty much all of that is now removed.
I remember at least one of these methods wasn't reliably escaping entities, and wouldn't be surprised if it also handled binary characters/content poorly.

I would think it's possible to use a standard method to create XML text elements that's resilient -- without having to manually pass all our text through an additional filter.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid XML can be generated out of bad user input #10996

Invalid XML can be generated out of bad user input #10996

jonasraoni commented Feb 27, 2025 •

edited

Loading

asmecher commented Feb 27, 2025

jonasraoni commented Feb 27, 2025

asmecher commented Feb 27, 2025

Invalid XML can be generated out of bad user input #10996

Invalid XML can be generated out of bad user input #10996

Comments

jonasraoni commented Feb 27, 2025 • edited Loading

Valid Title

Description

Steps to Reproduce

Expected Result

Actual Result

Environment Details

Application Version

Logs

Additional Information

asmecher commented Feb 27, 2025

jonasraoni commented Feb 27, 2025

asmecher commented Feb 27, 2025

jonasraoni commented Feb 27, 2025 •

edited

Loading