Jay Taylor's notes
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I'm having problems dealing with unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup. The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it barfs by throwing a UnicodeEncodeError (the one in the title). One of the sections of code that is causing problems is shown below:
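(The snippet didn't survive this copy; the following is an illustrative reconstruction with made-up tag, class and variable names, but the shape matches the question: BeautifulSoup hands back unicode, and str() then tries to encode it with the ASCII codec.)

    # Illustrative only (Python 2, bs4); names are placeholders
    from bs4 import BeautifulSoup

    html = u'<div class="agent_contact_number">Tel:\xa001234\xa0567890</div>'
    soup = BeautifulSoup(html, 'html.parser')

    agent_telno = soup.find('div', 'agent_contact_number').get_text()
    agent_info = str(u'Agent: ' + agent_telno).strip()   # UnicodeEncodeError raised here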
Here is a stack trace produced on SOME strings when the snippet above is run:
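(The traceback didn't survive either; it ends with the error from the title, along these lines:)

    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)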
I suspect that this is because some pages (or, more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English. Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
You need to read the Python Unicode HOWTO. This error is the very first example. Basically, stop using str() to convert from unicode to encoded text/bytes. Instead, properly use .encode() to encode the string, or work entirely in unicode.
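(The code sample for this answer is missing from the copy; a minimal sketch of the .encode() approach it describes, with placeholder variable names:)

    # Python 2 sketch: stay in unicode and encode explicitly, naming the codec,
    # instead of letting str() pick ASCII.
    agent_contact = u'Jane Smith'
    agent_telno = u'01234\xa0567890'      # contains the non-breaking space u'\xa0'

    agent_info = (agent_contact + u' ' + agent_telno).encode('utf-8').strip()
    print agent_info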
This is a classic python unicode pain point! Consider the following:
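(The interactive examples are missing from this copy; a Python 2 sketch using the non-breaking space u'\xa0' from the question's error:)

    >>> a = u'price:\xa0100'   # unicode string containing a non-breaking space
    >>> a
    u'price:\xa0100'
    >>> print a                # works on a UTF-8 terminal
    price: 100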
All good so far, but if we call str(a), let's see what happens:
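(Continuing the sketch:)

    >>> str(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)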
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
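(Sketch continued; the bytes shown are simply what UTF-8 uses to represent u'\xa0':)

    >>> a.encode('utf-8')
    'price:\xc2\xa0100'
    >>> str(a.encode('utf-8'))   # already a byte string, so str() is now a no-op
    'price:\xc2\xa0100'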
Voilà! The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8. For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
I found an elegant workaround that removes the offending symbols and keeps the value a plain string, as follows:
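(The code is missing here; a sketch of the usual ASCII round-trip with errors='ignore', where 'yourstring' is a placeholder name:)

    # Python 2: characters that can't be represented in ASCII are silently dropped.
    yourstring = u'price:\xa0100'
    yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
    print repr(yourstring)   # u'price:100'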
A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. LC_ALL set to "C". In Debian they discourage setting it: see the Debian wiki on Locale.
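(For reference, a sketch of how the misconfigured locale shows up from inside Python; sys.stdout.encoding and locale.getpreferredencoding() are standard-library attributes:)

    # With LC_ALL=C both of these typically report ASCII (ANSI_X3.4-1968),
    # which is why print falls back to the ASCII codec and fails on u'\xa0'.
    import locale
    import sys

    print sys.stdout.encoding
    print locale.getpreferredencoding()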
I've actually found that in most of my cases, just stripping out those characters is much simpler:
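(The snippet is missing; a sketch of stripping the non-ASCII characters before printing, with a placeholder variable name:)

    text = u'price:\xa0100'
    print text.encode('ascii', 'ignore')   # prints "price:100"; u'\xa0' is dropped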
Simple helper functions found here.
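(The linked helpers aren't reproduced in this copy; a sketch of what such safe-conversion helpers usually look like in Python 2. The names and exact behaviour here are assumptions, not the linked code.)

    def safe_str(obj):
        """Return a byte string for obj without raising UnicodeEncodeError."""
        try:
            return str(obj)
        except UnicodeEncodeError:
            return unicode(obj).encode('utf-8')

    def safe_unicode(obj):
        """Return a unicode object for obj without raising UnicodeDecodeError."""
        try:
            return unicode(obj)
        except UnicodeDecodeError:
            return str(obj).decode('utf-8', 'replace')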
The problem is that you're trying to print a unicode character, but your terminal doesn't support it. You can try installing the language-pack-en package, which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print). On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually. Then, when writing the code, make sure you use the right encoding in your code. For example:
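(The code sample is missing; a sketch of declaring the source-file encoding and encoding explicitly on output, which is an assumption about what the answer showed:)

    # -*- coding: utf-8 -*-
    # Keep text as unicode inside the program and encode explicitly
    # when sending it to the outside world.
    text = u'Price:\xa0\u00a3100'
    print text.encode('utf-8')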
If you still have a problem, double-check your system configuration, such as your locale and language settings.
Demonstrating the problem and the solution in a fresh VM.
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
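(The snippet didn't survive the copy; the trick usually posted with this kind of explanation is the default-encoding override below, given as a sketch rather than the answer's exact code, and note that it is widely discouraged:)

    # Python 2 only: setdefaultencoding() is removed from sys by site.py,
    # so the module has to be reloaded to get it back.
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')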
I had this idea after reading Ned's presentation. I don't claim to fully understand why this works, though. So if anyone can edit this answer or add a comment to explain it, I'd appreciate it.