Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in printing invalid char #17271

Closed
randy3k opened this issue Jul 4, 2016 · 5 comments
Closed

bug in printing invalid char #17271

randy3k opened this issue Jul 4, 2016 · 5 comments
Labels
unicode Related to unicode characters and encodings

Comments

@randy3k
Copy link
Contributor

randy3k commented Jul 4, 2016

b"\x90" here is not a valid char. However, since there is no conversion error, I guess it does make sense to print something like instead.

julia> String(b"\x90");

julia> String(b"\x90")
"Error showing value of type String:
ERROR: BoundsError: attempt to access 1-element Array{UInt8,1} at index [0]
 in endof(::String) at ./strings/string.jl:40
 in done at ./strings/basic.jl:0 [inlined]
 in escape_string(::IOContext{Base.Terminals.TTYTerminal}, ::String, ::String) at ./strings/io.jl:137
 in print_quoted at ./strings/io.jl:156 [inlined]
 in show at ./strings/io.jl:0 [inlined]
 in show at ./replutil.jl:0 [inlined]
 in display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::String) at ./REPL.jl:114
 in display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::String) at ./REPL.jl:117
 in display(::String) at ./multimedia.jl:143
 in print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:134
 in print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:121
 in (::Base.REPL.##18#19{Bool,Base.REPL.##29#38{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:631
 in run_interface(::Base.Terminals.TTYTerminal, ::Base.LineEdit.ModalInterface) at ./LineEdit.jl:1570
 in run_frontend(::Base.REPL.LineEditREPL, ::Base.REPL.REPLBackendRef) at ./REPL.jl:882
 in run_repl(::Base.REPL.LineEditREPL, ::Base.##598#599) at ./REPL.jl:167
 in _start() at ./client.jl:364

by the way, it is julia 0.5-dev

@TotalVerb
Copy link
Contributor

TotalVerb commented Jul 4, 2016

Do not use String(::Array{UInt8, 1}) unless you're sure the input data is valid UTF-8. The constructor does not do any checking.

Also, I disagree that � should be printed. It is better to raise an error early then to silently allow invalid UTF-8.

@randy3k
Copy link
Contributor Author

randy3k commented Jul 4, 2016

I suggested the above since String(b"\xce") prints

@tkelman tkelman added the unicode Related to unicode characters and encodings label Jul 4, 2016
@nalimilan
Copy link
Member

I suggested the above since String(b"\xce") prints �

@randy3k That's because it's not a valid continuation byte, so endof returns immediately the last index in the underlying array. OTOH, String(b"\x90") contains only a continuation byte, which makes endof go past the beginning of the array. I have a fix at #17276: with that change, String(b"\x90") prints as an empty string (since it contains no valid Unicode character).

Also, I disagree that � should be printed. It is better to raise an error early then to silently allow invalid UTF-8.

@TotalVerb I think the approach retained in Julia is that you're responsible for ensuring strings are valid on input (as you said). But then we do our best to accept broken strings, which is needed e.g. to work correctly with file paths (which are not necessarily valid Unicode). Anyway, checking strings everywhere would be too expensive (for example, endof would be much slower if it had to check whether String(b"\xce") contains at least one valid character).

@StefanKarpinski
Copy link
Member

Whether String objects can hold arbitrary data or not is in a bit of a limbo state since the changes in #16107 have ended up spanning the 0.5 and 0.6 releases, so for now @nalimilan's fix is the way to go.

@randy3k
Copy link
Contributor Author

randy3k commented Jul 6, 2016

Great. It solves the issue.

@randy3k randy3k closed this as completed Jul 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

5 participants