Jay Taylor's notes
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I'm having problems dealing with unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup. The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it barfs by throwing a UnicodeEncodeError (the one in the title). One of the sections of code that is causing problems is shown below:
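(The snippet didn't survive this copy; the following is an illustrative reconstruction with made-up tag, class and variable names, but the shape matches the question: BeautifulSoup hands back unicode, and str() then tries to encode it with the ASCII codec.)

    # Illustrative only (Python 2, bs4); names are placeholders
    from bs4 import BeautifulSoup

    html = u'<div class="agent_contact_number">Tel:\xa001234\xa0567890</div>'
    soup = BeautifulSoup(html, 'html.parser')

    agent_telno = soup.find('div', 'agent_contact_number').get_text()
    agent_info = str(u'Agent: ' + agent_telno).strip()   # UnicodeEncodeError raised here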
Here is a stack trace produced on SOME strings when the snippet above is run:
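(The traceback didn't survive either; it ends with the error from the title, along these lines:)

    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)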
I suspect that this is because some pages (or, more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English. Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
You need to read the Python Unicode HOWTO. This error is the very first example. Basically, stop using str() to convert from unicode to encoded text/bytes. Instead, properly use .encode() to encode the string, or work entirely in unicode.
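(The code sample for this answer is missing from the copy; a minimal sketch of the .encode() approach it describes, with placeholder variable names:)

    # Python 2 sketch: stay in unicode and encode explicitly, naming the codec,
    # instead of letting str() pick ASCII.
    agent_contact = u'Jane Smith'
    agent_telno = u'01234\xa0567890'      # contains the non-breaking space u'\xa0'

    agent_info = (agent_contact + u' ' + agent_telno).encode('utf-8').strip()
    print agent_info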
This is a classic python unicode pain point! Consider the following:
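(The interactive examples are missing from this copy; a Python 2 sketch using the non-breaking space u'\xa0' from the question's error:)

    >>> a = u'price:\xa0100'   # unicode string containing a non-breaking space
    >>> a
    u'price:\xa0100'
    >>> print a                # works on a UTF-8 terminal
    price: 100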
All good so far, but if we call str(a), let's see what happens:
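(Continuing the sketch:)

    >>> str(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)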
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
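(Sketch continued; the bytes shown are simply what UTF-8 uses to represent u'\xa0':)

    >>> a.encode('utf-8')
    'price:\xc2\xa0100'
    >>> str(a.encode('utf-8'))   # already a byte string, so str() is now a no-op
    'price:\xc2\xa0100'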
Voilà! The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8. For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
I found an elegant workaround that removes the offending symbols and keeps the value a plain string, as follows:
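(The code is missing here; a sketch of the usual ASCII round-trip with errors='ignore', where 'yourstring' is a placeholder name:)

    # Python 2: characters that can't be represented in ASCII are silently dropped.
    yourstring = u'price:\xa0100'
    yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
    print repr(yourstring)   # u'price:100'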
A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. LC_ALL set to "C". In Debian they discourage setting it: see the Debian wiki on Locale.
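(For reference, a sketch of how the misconfigured locale shows up from inside Python; sys.stdout.encoding and locale.getpreferredencoding() are standard-library attributes:)

    # With LC_ALL=C both of these typically report ASCII (ANSI_X3.4-1968),
    # which is why print falls back to the ASCII codec and fails on u'\xa0'.
    import locale
    import sys

    print sys.stdout.encoding
    print locale.getpreferredencoding()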
I've actually found that in most of my cases, just stripping out those characters is much simpler:
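(The snippet is missing; a sketch of stripping the non-ASCII characters before printing, with a placeholder variable name:)

    text = u'price:\xa0100'
    print text.encode('ascii', 'ignore')   # prints "price:100"; u'\xa0' is dropped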
Simple helper functions found here.
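(The linked helpers aren't reproduced in this copy; a sketch of what such safe-conversion helpers usually look like in Python 2. The names and exact behaviour here are assumptions, not the linked code.)

    def safe_str(obj):
        """Return a byte string for obj without raising UnicodeEncodeError."""
        try:
            return str(obj)
        except UnicodeEncodeError:
            return unicode(obj).encode('utf-8')

    def safe_unicode(obj):
        """Return a unicode object for obj without raising UnicodeDecodeError."""
        try:
            return unicode(obj)
        except UnicodeDecodeError:
            return str(obj).decode('utf-8', 'replace')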
The problem is that you're trying to print a unicode character, but your terminal doesn't support it. You can try installing the language-pack-en package, which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print). On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually. Then, when writing the code, make sure you use the right encoding in your code. For example:
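(The code sample is missing; a sketch of declaring the source-file encoding and encoding explicitly on output, which is an assumption about what the answer showed:)

    # -*- coding: utf-8 -*-
    # Keep text as unicode inside the program and encode explicitly
    # when sending it to the outside world.
    text = u'Price:\xa0\u00a3100'
    print text.encode('utf-8')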
If you still have a problem, double-check your system configuration, such as your locale and language settings.
Demonstrating the problem and the solution in a fresh VM.
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
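(The snippet didn't survive the copy; the trick usually posted with this kind of explanation is the default-encoding override below, given as a sketch rather than the answer's exact code, and note that it is widely discouraged:)

    # Python 2 only: setdefaultencoding() is removed from sys by site.py,
    # so the module has to be reloaded to get it back.
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')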
I had this idea after reading Ned's presentation. I don't claim to fully understand why this works, though. So if anyone can edit this answer or add a comment to explain it, I'd appreciate it.