[linkcheck] PDF anchor (`...pdf#anchor`) leads to `'utf-8' codec can't decode byte ...` #11041

goekce · 2022-12-20T13:33:15Z

Describe the bug

Related to #7694.

Note that the query symbol ? is not required when using an anchor (i.e., #fragment).

How to Reproduce

index.rst:

`link1 <https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf?#page=226>`_
`link2 <https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf#page=226>`_
`link3 <https://docs.python.org/3/whatsnew/3.11.html?#whatsnew311-pep654>`_
`link4 <https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654>`_

$ sphinx-build -b linkcheck . build

Output:

(           index: line    1) ok        https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654
(           index: line    1) redirect  https://docs.python.org/3/whatsnew/3.11.html?#whatsnew311-pep654 - with unknown code to https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep654
(           index: line    1) broken    https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf#page=226 - 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
(           index: line    1) broken    https://wci.llnl.gov/sites/wci/files/2020-08/LLNL-SM-654357.pdf?#page=226 - 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Environment Information

Platform:              linux; (Linux-6.0.12-arch1-1-x86_64-with-glibc2.36)
Python version:        3.10.8 (main, Nov  1 2022, 14:18:21) [GCC 12.2.0])
Python implementation: CPython
Sphinx version:        5.3.0
Docutils version:      0.19
Jinja2 version:        3.1.2


andy-maier · 2023-01-21T04:11:08Z

I have the same issue when linking to https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G53253.
Fails on all of our test environments, both with latest and with minimum versions: https://github.com/pywbem/nocasedict/actions/runs/3916927936

Circumvented by removing the #G53253 anchor.

Quite some irony that linking to the Unicode standard runs into a UTF-8 issue :-) :-)

jayaddison · 2024-01-15T22:47:06Z

I'm not sure what to suggest as a remedy for this, but from investigating why it occurs: this problem is due to the anchor-checking mechanism expecting to parse HTML content.

When anchor-checking is enabled and content with a binary header is retrieved from a URI with an anchor fragment (#....), decoding of that data is likely to fail, unless it happens to be a format whose bytes are also valid Unicode text.

In other words: it's not exactly the hash fragment in the URI that causes the problem, but it is a requirement for the problem to occur -- because for links without anchors, we don't need to read the HTTP response content, only the status line.
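To illustrate the decoding failure described above, here is a minimal sketch. It assumes the PDF starts with the conventional version line followed by a comment line of high-bit bytes (a common way for PDF writers to mark the file as binary); the exact header bytes of the linked documents were not inspected, and `PDF_HEADER` and `try_decode` are hypothetical names, not part of Sphinx.

```python
# Assumed header: a version line, then a comment line of high-bit bytes.
# This is a common PDF convention, not data taken from the linked files.
PDF_HEADER = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def try_decode(data: bytes) -> str:
    """Return 'ok' if data is valid UTF-8, else the decoder's error message."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return "ok"


if __name__ == "__main__":
    # HTML-like content decodes cleanly.
    print(try_decode(b"<html><a id='x'></a></html>"))
    # The binary marker starts at byte offset 10, where decoding stops.
    print(try_decode(PDF_HEADER))
```

Under that assumption the first non-ASCII byte is 0xe2 at offset 10, which lines up with the position reported in the linkcheck output.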

I think the most difficult decision for a fix is: what do we do for non-HTML formats like PDF when anchor-checking is enabled, and the hyperlink contains a fragment? Is it better to consider the result working (despite not checking for existence of a matching anchor destination), ignored/unchecked (informationally wasteful given that we have made a network request), or do something else?
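One of the options above (report the link as ignored when the body is not decodable text) could look roughly like the sketch below. This is not Sphinx's implementation: `AnchorFinder`, `check_anchor`, and the three status strings are hypothetical, and UTF-8 is assumed as the only encoding tried.

```python
from html.parser import HTMLParser


class AnchorFinder(HTMLParser):
    """Hypothetical helper: notes whether any id= or name= equals the anchor."""

    def __init__(self, anchor: str) -> None:
        super().__init__()
        self.anchor = anchor
        self.found = False

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value == self.anchor:
                self.found = True


def check_anchor(body: bytes, anchor: str) -> str:
    """Classify a response body as 'working', 'broken', or 'ignored'."""
    try:
        text = body.decode("utf-8")  # assumption: only UTF-8 is attempted
    except UnicodeDecodeError:
        # Binary content such as a PDF: the anchor cannot be verified.
        return "ignored"
    finder = AnchorFinder(anchor)
    finder.feed(text)
    return "working" if finder.found else "broken"
```

Returning a distinct "ignored" status keeps the result honest: the request succeeded, but the anchor was never actually verified.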

jayaddison · 2024-01-15T22:48:42Z

> only the status line

Self-nitpick: and sometimes response headers, I suppose - for example to handle rate-limiting.

andy-maier mentioned this issue Jan 21, 2023

Minor doc improvements pywbem/nocasedict#130

Merged

rffontenelle mentioned this issue Apr 18, 2023

GH-103484: Fix broken links reported by linkcheck python/cpython#103608

Merged

AA-Turner added this to the some future version milestone Apr 29, 2023

AA-Turner added the builder:linkcheck label Jan 15, 2024

jayaddison mentioned this issue Mar 24, 2024

linkcheck: Ignore URLs that respond with non-Unicode content #12197

Merged

This was referenced Jun 22, 2024

Command to automatically link to MCNP manual idaholab/MontePy#415

Merged

Use linktest as part of website build idaholab/MontePy#414

Closed

AA-Turner closed this as completed in #12197 Jul 14, 2024

github-actions bot locked as resolved and limited conversation to collaborators Aug 14, 2024
