Better handling of the UnicodeDecodeError exception. #102
The function `decodeString(...)` does not handle the UnicodeDecodeError exception very well. Retrying with errors ignored might work in theory, but in practice it does NOT: some special characters (e.g. ü, ş, ç, ö) raise another exception, which fails the whole process. Instead of trying to ignore the error, we should try to fix it.

I was getting errors similar to those mentioned in #81 and #35. The fix by @faisal-hameed in #81 uses a regex to "filter out" any non-ASCII characters. The idea is good, but regex is heavy (CPU time and memory). When I tried it with Python 2.7.18 on my Windows 10 VM with an Intel Xeon E-2236 and 16 GiB of memory, the program ran for a few seconds and then crashed. I believe this is due to how the `re` regex library in Python 2.7.x works.

A relatively "better" solution is to use Python's native `join(...)` operation and reduce EVERY character in `encodedstr` to ASCII. It works by checking each character's decimal value using `ord(...)`. If the value is less than 128 (ASCII values run from 0 to 127), the character is kept. If not, we just ignore it.

This way we are manually converting from ANY Unicode `string` to an ASCII `string`. I'm sure there are better ways to handle the UnicodeDecodeError exception, but this one seemed the most trivial solution and it just Works™.

If anyone is experiencing the same error mentioned in #35 and #81, try using this patch.
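For anyone who wants to see the idea in isolation, here is a minimal sketch of the `join(...)` + `ord(...)` filter described above. It is not the exact code from the patch: the helper name `to_ascii` is made up for illustration, and it assumes `encodedstr` is a text string (iterating over a Python 3 `bytes` object would need different handling).

```python
def to_ascii(encodedstr):
    """Return encodedstr with every non-ASCII character dropped.

    Hypothetical helper for illustration only; not the project's
    actual decodeString(...) implementation.
    """
    # ord(ch) gives the character's decimal value; ASCII is 0..127,
    # so anything at 128 or above is skipped.
    return "".join(ch for ch in encodedstr if ord(ch) < 128)


# Escapes are used so the example does not depend on the file encoding:
# \u00fc is "ü" and \u0131 is the dotless "ı".
sample = u"G\u00fcnayd\u0131n"
print(to_ascii(sample))  # prints "Gnaydn"
```

Note that this drops the characters rather than transliterating them, so "ü" does not become "u"; it simply disappears.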
Hope that this helps <3