Jay Taylor's notes
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup. The problem is that the error is not always reproducible; it works with some pages, and with others it barfs by throwing a UnicodeEncodeError. One of the sections of code that is causing problems is shown below:
Here is a stack trace produced on SOME strings when the snippet above is run:
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English. Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
You need to read the Python Unicode HOWTO. This error is the very first example. Basically, stop using str() to convert from unicode to encoded text. Instead, properly use .encode() to encode the string, or work entirely in unicode.
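A minimal sketch of that advice, assuming two unicode values scraped from a page. The names agent_contact and agent_telno and their contents are hypothetical stand-ins, not taken from the question's lost snippet. The pieces are combined as unicode and encoded once, explicitly, at the point where bytes are actually needed:

```python
# Hypothetical values standing in for text scraped from a page.
# u'\xa0' is the non-breaking space named in the error message.
agent_contact = u'Jane Doe'
agent_telno = u'\xa0020 7946 0123'

# Work in unicode while combining the pieces...
agent_info = u' '.join((agent_contact, agent_telno)).strip()

# ...and encode explicitly, once, where bytes are required
# (writing to a file, a socket, a byte-oriented API).
agent_info_bytes = agent_info.encode('utf-8')
```

If nothing downstream needs bytes, the last step can simply be dropped and agent_info kept as unicode throughout.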
This is a classic python unicode pain point! Consider the following:
All good so far, but if we call str(a), let's see what happens:
Oh dip, that's not gonna do anyone any good! To fix the error, encode the unicode string explicitly with .encode and tell python what codec to use:
Voil\u00E0! The issue is that when you call str(), python uses the default character encoding (ascii) to try to encode the unicode string you gave it, which in your case sometimes contains non-ASCII characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8. For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
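The behaviour described above can be sketched roughly as follows. The value assigned to a is an assumption chosen for illustration. In Python 2, str(a) behaves like a.encode('ascii'), so the sketch spells that implicit step out, which also lets it run unchanged on Python 3:

```python
# Assumed example value: any unicode string with a non-ASCII character will do.
a = u'bats\u00E0'

# What Python 2's str(a) does implicitly: encode with the ascii codec.
try:
    a.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    reason = exc.reason  # the "ordinal not in range(128)" part of the message

# Naming the codec explicitly succeeds, because utf-8 can represent u'\xe0'.
utf8_bytes = a.encode('utf-8')
```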
I found an elegant workaround that removes the offending symbols and keeps the string a string, as follows:
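One way such a workaround might look, assuming the intent is to drop every non-ASCII character and end up with a text string again. The helper name strip_non_ascii and the sample input are hypothetical:

```python
def strip_non_ascii(text):
    """Drop characters the ascii codec cannot represent, then return text again."""
    # 'ignore' silently discards anything that cannot be encoded as ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')


cleaned = strip_non_ascii(u'Flat\xa02B')
```

Note that the non-breaking space is removed, not replaced by an ordinary space, so neighbouring words can end up joined together.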
A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. LC_ALL set to "C". Debian discourages setting it: see the Debian wiki on Locale.
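To check whether the environment is the culprit, a small diagnostic along these lines may help. It is only a sketch; the function name describe_encoding_environment is hypothetical, and it reports the locale variables together with the encodings Python derived from them:

```python
import locale
import os
import sys


def describe_encoding_environment():
    """Collect the settings that influence how print encodes text."""
    return {
        'LC_ALL': os.environ.get('LC_ALL'),
        'LANG': os.environ.get('LANG'),
        # Encoding Python infers from the locale, without changing the locale.
        'preferred_encoding': locale.getpreferredencoding(False),
        # Encoding attached to stdout; may be None when output is redirected.
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
    }


info = describe_encoding_environment()
for name in sorted(info):
    print('%s = %r' % (name, info[name]))
```

If LC_ALL shows "C" and the reported encodings are ASCII-based, printing non-ASCII text is likely to fail regardless of how the string itself was built.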
I've actually found that in most of my cases, just stripping out those characters is much simpler:
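A sketch of what stripping can look like when the input is still raw bytes, assuming UTF-8 encoded input. The sample value is made up for illustration:

```python
# Made-up sample: a UTF-8 encoded price with a pound sign and a non-breaking space.
raw = u'\u00a3250\xa0pcm'.encode('utf-8')

# Decoding as ASCII with errors='ignore' discards every non-ASCII byte.
text = raw.decode('ascii', 'ignore')
```

The simplicity has a cost: here the pound sign disappears along with the non-breaking space, which may matter for UK pages that contain prices.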
For me, what worked was:
Hope this helps someone.
Simple helper functions found here.
The problem is that you're trying to print a unicode character, but your terminal doesn't support it. You can try installing
which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print). On some Linux distributions it is required to make sure that the default English locales are set up properly (so that unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually. Then, when writing the code, make sure you use the right encoding in your code. For example:
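A minimal sketch of naming the encoding explicitly when reading and writing files, so the result does not depend on the system locale. The file name and sample text are hypothetical; io.open is used because it accepts an encoding argument on both Python 2 and Python 3:

```python
import io
import os
import tempfile

# Hypothetical sample text containing a non-breaking space and a pound sign.
text = u'Rent:\xa0\u00a3250'

# Write to a throwaway file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'listing.txt')

# State the encoding explicitly instead of relying on the locale default.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(text)

with io.open(path, 'r', encoding='utf-8') as handle:
    round_tripped = handle.read()
```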
If you still have a problem, double check your system configuration, such as:
Demonstrating the problem and solution in a fresh VM.
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
I had this idea after reading Ned's presentation. I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'd appreciate it.