Jay Taylor's notes

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Original source (stackoverflow.com)
Tags: python unicode stackoverflow.com
Clipped on: 2016-06-22

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?

asked Mar 30 '12 at 12:06
Homunculus Reticulli

Accepted answer (479 votes)

You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.
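
A minimal sketch of this fix, assuming both values are already unicode text. The sample name and phone number are hypothetical (the number contains a non-breaking space, u'\xa0', like the one in the traceback), and the snippet is written so that it should run on Python 3 as well as Python 2.7.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample values; in the question they come from BeautifulSoup.
agent_contact = u'John Smith'
agent_telno = u'020\xa07946 0018'  # contains a non-breaking space (U+00A0)

# Join and strip while the data is still unicode text.
text = u' '.join((agent_contact, agent_telno)).strip()

# Encoding to ASCII fails; this is what str() attempts implicitly on Python 2.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields bytes.
agent_info = text.encode('utf-8')
```

Whether to keep the final value as bytes or as unicode depends on what consumes it; encoding only at the point of output is usually the safer choice.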

answered Mar 30 '12 at 12:21
agf
7 upvote
agreed! a good rule of thumb I was taught is to use the "unicode sandwich" idea. Your script accepts bytes from the outside world, but all processing should be done in unicode. Only when you are ready to output your data should it be mushed back into bytes! – Andbdrew Mar 30 '12 at 12:29
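
One way the "unicode sandwich" could look in practice is sketched here. The function name, the default encoding, and the sample input are all assumptions made for illustration; the point is only that decoding and encoding happen at the edges while the processing in between operates on unicode.

```python
# -*- coding: utf-8 -*-
def normalize_phone(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode on input, work on text, encode on output."""
    # Bottom slice of bread: bytes -> unicode at the input boundary.
    text = raw_bytes.decode(encoding)
    # Filling: every transformation operates on unicode text.
    text = text.replace(u'\xa0', u' ').strip()
    # Top slice of bread: unicode -> bytes at the output boundary.
    return text.encode(encoding)

result = normalize_phone(b' 020\xc2\xa07946 0018 ')
```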
114 upvote
In case someone else gets confused by this, I found a strange thing: my terminal uses utf-8, and when I print my utf-8 strings it works nicely. However, when I pipe my program's output to a file, it throws a UnicodeEncodeError. In fact, when output is redirected (to a file or a pipe), I find that sys.stdout.encoding is None! Tacking on .encode('utf-8') solves the problem. – drevicko Dec 18 '12 at 8:15
36 upvote
@drevicko: use PYTHONIOENCODING=utf-8 instead, i.e., print Unicode strings and let the environment set the expected encoding. – J.F. Sebastian Dec 21 '13 at 3:51
Quite slow, however... – maudulus Jul 31 '14 at 19:58
@J.F.Sebastian: Do you think that's a valid approach in every case? Let's say you have a tool that's exporting a report which needs to have a particular encoding, it seems less straightforward to me that the user would need to change the environment settings just for that one export. Then I would rather have the program take the encoding as a parameter with a sensible default. – steinar Nov 25 '15 at 9:59
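
The last suggestion, taking the encoding as a parameter with a sensible default, might be sketched as follows. The function name, the file name, and the choice of UTF-8 as the default are assumptions for illustration; io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

def write_report(path, lines, encoding='utf-8'):
    """Hypothetical helper: write unicode lines using a caller-chosen encoding."""
    # newline='\n' disables platform newline translation so output is predictable.
    with io.open(path, 'w', encoding=encoding, newline='\n') as handle:
        for line in lines:
            handle.write(line + u'\n')

# Write one line to a temporary file, then read the raw bytes back.
path = os.path.join(tempfile.mkdtemp(), 'report.txt')
write_report(path, [u'voil\u00e0'])
with io.open(path, 'rb') as handle:
    raw = handle.read()
```

With this shape the report does not depend on LANG, LC_ALL, or PYTHONIOENCODING, which matters when the file has to be in one particular encoding regardless of who runs the tool.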

This is a classic python unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
 => batsà

All good so far, but if we call str(a), let's see what happens:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not gonna do anyone any good! To fix the error, encode the string to bytes explicitly with .encode and tell python what codec to use:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voil\u00E0!

The issue is that when you call str() on a unicode object, Python 2 tries to encode it with the default encoding, which is ASCII, and that fails as soon as the text contains a non-ASCII character. To fix the problem, you have to tell python which encoding to use by calling .encode('whatever_unicode') explicitly. Most of the time, you should be fine using utf-8.

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
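
The same behaviour can be reproduced without str() by asking for the ASCII codec explicitly, which is roughly what the implicit conversion does on Python 2. This is a small sketch intended to run on Python 2.7 and Python 3; the variable names are arbitrary.

```python
# -*- coding: utf-8 -*-
a = u'bats\u00e0'

# Explicit ASCII encoding fails the same way the implicit conversion does.
try:
    a.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    # Record which codec failed and at which character index.
    failure = (exc.encoding, exc.start)

# Codecs that can represent the character succeed, with different byte output.
utf8_bytes = a.encode('utf-8')
latin1_bytes = a.encode('latin-1')
```

The two successful results differ (two bytes for the accented letter in UTF-8, one in Latin-1), which is why the codec has to be chosen deliberately and not left to a default.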

answered Mar 30 '12 at 12:25
Andbdrew
25 upvote
Personal note: When trying to type ".encode" don't accidentally type ".unicode" then wonder why nothing is working. – Skip Huffman Dec 24 '12 at 14:38

I found an elegant workaround that removes the offending symbols and keeps the string as a string:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
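
To make the cost of this approach visible, here is a short sketch comparing the 'ignore' error handler with 'replace' on a hypothetical sample string. Both discard the original characters; 'replace' at least leaves a marker where something was lost.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: an accented letter followed by a non-breaking space.
text = u'caf\u00e9\xa0bar'

# 'ignore' silently drops every character the codec cannot represent.
ignored = text.encode('ascii', 'ignore').decode('ascii')

# 'replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', 'replace').decode('ascii')
```

Here ignored is u'cafbar' and replaced is u'caf??bar'; in both cases the word boundary that the non-breaking space provided is gone.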
answered Aug 20 '14 at 10:13

A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. LC_ALL set to "C" as shown here. Debian discourages setting it: see the Debian wiki on Locale.

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
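
The same settings can be inspected from inside Python. The sketch below gathers the values that commonly influence how print encodes text; the function name and the choice of keys are assumptions for illustration, and what it reports depends entirely on the machine it runs on.

```python
import locale
import os
import sys

def describe_encoding_environment():
    """Collect the settings that commonly influence how print encodes text."""
    return {
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
        # May be None when output is redirected on Python 2.
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
        'preferred_encoding': locale.getpreferredencoding(),
    }

for name, value in sorted(describe_encoding_environment().items()):
    print('%s = %r' % (name, value))
```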
answered Dec 2 '13 at 17:58
maxpolk
Got exactly the same issue; too bad I didn't check it before reporting. Thanks a lot. By the way, you can replace the first two commands with env|grep -E '(LC|LANG)'. – Dmitry Verkhoturov Aug 8 '15 at 7:28

I've actually found that in most of my cases, just stripping out those characters is much simpler:

s = mystring.decode('ascii', 'ignore')
answered Nov 1 '13 at 13:44
Phil LaNasa
1 upvote
That works perfectly, thanks :) – vgoklani Jan 2 '14 at 4:46
17 upvote
"Perfectly" is not usually what it performs. It throws away stuff which you should figure out how to deal with properly. – tripleee Dec 13 '14 at 16:53
5 upvote
Just stripping out "those" (non-English) characters is not the solution, since Python must support all languages, don't you think? – alemol Jan 9 '15 at 19:47
4 upvote
Downvoted. This is not the correct solution at all. Learn how to work with Unicode: joelonsoftware.com/articles/Unicode.html – Andrew Ferrier Jan 13 '15 at 13:04
2 upvote
I burnt all my clothes, now I don't have to wash them. Hurrah! – Alastair McCormack Nov 30 '15 at 21:18
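
For the specific character in the question, a non-breaking space, a less destructive option than discarding it is to normalize it to an ordinary space. The sketch below uses NFKC normalization from the standard unicodedata module. This is a different technique from the answer above, offered only as an illustration, and the sample phone number is hypothetical.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical phone number containing a non-breaking space (U+00A0).
raw = u'020\xa07946 0018'

# NFKC maps compatibility characters to their plain equivalents,
# so the non-breaking space becomes an ordinary space.
normalized = unicodedata.normalize('NFKC', raw)
```

NFKC also rewrites other compatibility characters (ligatures, superscripts and similar), so it is not lossless either; it merely keeps something sensible in place of the character where 'ignore' would leave nothing.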

For me, what worked was:

BeautifulSoup(html_text,from_encoding="utf-8")

Hope this helps someone.

answered Jan 26 '15 at 14:53
Animesh

Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
answered Dec 31 '15 at 7:57
To get the escaped bytestring (to convert arbitrary Unicode string to bytes using ascii encoding), you could use backslashreplace error handler: u'\xa0'.encode('ascii', 'backslashreplace'). Though you should avoid such representation and configure your environment to accept non-ascii characters instead -- it is 2016! – J.F. Sebastian Jan 1 at 3:05
Happy New Year @J.F.Sebastian. I just got frustrated with the Python-Unicode issue and then finally got this solution, which was working. I didn't know about this. Anyway, thanks for the tip. – Parag Tyagi -morpheus- Jan 1 at 6:53
I've read every dang line on the web about unicode. This worked. Thanks. – mattrweaver Feb 18 at 19:38
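
The backslashreplace handler mentioned in the first comment can be sketched like this. The sample text is hypothetical; decoding with the unicode_escape codec is shown only to demonstrate that, for this input, the escaped form still carries the original characters.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: a non-breaking space followed by a trademark sign.
text = u'price\xa0\u2122'

# Characters outside ASCII become backslash escapes; nothing is discarded.
escaped = text.encode('ascii', 'backslashreplace')

# For this input the escapes can be turned back into the original text.
restored = escaped.decode('unicode_escape')
```

The round trip is not safe in general: backslashreplace leaves literal backslashes already present in the text untouched, so unicode_escape may misinterpret them on the way back.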

The problem is that you're trying to print a unicode character, but your terminal doesn't support it.

You can try installing the language-pack-en package to fix that:

sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).

On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually.

Then, when writing the code, make sure you use the right encoding.

For example:

open(foo, encoding='utf-8')
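
Note that the built-in open() accepts an encoding argument only on Python 3. On Python 2, which the rest of this page uses, io.open offers the same interface. A small self-contained sketch, with a temporary file standing in for foo:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Create a UTF-8 encoded file to read back; the name is arbitrary.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'wb') as handle:
    handle.write(b'voil\xc3\xa0')

# io.open decodes while reading, so the result is unicode text, not bytes.
with io.open(path, encoding='utf-8') as handle:
    text = handle.read()
```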

If you still have a problem, double check your system configuration, such as:

  • your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
  • value of LANG/LC_CTYPE in shell


Demonstrating the problem and solution in a fresh VM.

  1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/vivid64; vagrant up; vagrant ssh
  2. Printing unicode characters (such as the trademark sign, ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
  3. Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
  4. Now the problem is solved:

    $ python -c 'print(u"\u2122");'
    ™
answered Aug 13 '15 at 12:07
kenorb
What has language-pack-en got to do with Python or this question? AFAIK, it may provide language translations to messages but has nothing to do with encoding – Alastair McCormack Dec 26 '15 at 10:47
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly, especially when running Python script on the Terminal. It worked for me at one point. See: character encoding – kenorb Dec 26 '15 at 11:00
Ah, ok. You mean if you want to use a non-English locale? I guess the user will also have to edit /etc/locale.gen to ensure their locale is built before using it? – Alastair McCormack Dec 26 '15 at 11:04
@AlastairMcCormack Added reproducible steps of the problem in VM to make it clearer. I've tested and the solution still works. – kenorb Dec 26 '15 at 22:55
1 upvote
@AlastairMcCormack Commented out LANG from /etc/default/locale (as /etc/locale.gen doesn't exist) and ran locale-gen, but it didn't help. I'm not sure what language-pack-en exactly does, as I didn't find much documentation and listing the content of it doesn't help much. – kenorb Dec 27 '15 at 13:07

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.

answered Mar 12 at 3:14
Image (Asset 12/12) alt=
pepoluan
2,26111232
1 upvote
What is the type of value before and after this? I think this works because unic += value is the same as unic = unic + value: you are adding a string and a unicode, and python then assumes unicode for the result, i.e. the more precise type (think about a = float(1) + int(1), where a becomes a float). Then value = unic points value to the new unic object, which happens to be unicode. – busfault May 24 at 21:16
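
On Python 2, u'' + value with a byte-string value triggers an implicit decode using the ASCII codec, so the trick above works only while the bytes are pure ASCII and raises UnicodeDecodeError otherwise. An explicit equivalent might look like the sketch below; the helper name and the UTF-8 default are assumptions for illustration.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return unicode text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        # Byte strings are decoded with a codec we chose, not an implicit default.
        return value.decode(encoding)
    # Anything else is assumed to be unicode text already.
    return value

from_bytes = to_text(b'voil\xc3\xa0')
from_text = to_text(u'voil\u00e0')
```

Both calls produce the same unicode value, and the encoding assumption is visible in the code where it can be changed per data source.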

Asked 4 years ago; viewed 376281 times; active 3 months ago.

site design / logo © 2016 Stack Exchange Inc; user contributions licensed under cc by-sa 3.0 with attribution required