getting rid of —

someone · Jul 1, 2009

Hello,

How can I remove the '—' sign from a string, or split at that character?
I get a Unicode error when I try:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1: ordinal not in range(128)

Thanks, Pet

The script is declared with # -*- coding: UTF-8 -*-

Benjamin Peterson · Jul 1, 2009

Please paste your code. I suspect that you are mixing unicode and normal strings.

MRAB · Jul 1, 2009

It sounds like you're mixing bytestrings with Unicode strings. I can't
be any more helpful because you haven't shown the code.

Tep · Jul 2, 2009

Oh, I'm sorry. Here it is

def cleanInput(input):
    return input.replace('—', '')

Tep · Jul 2, 2009

I also need:

# input is html source code, I have a problem with only this character
# input = 'foo — bar'
# return should be foo
def splitInput(input):
    parts = input.split(' — ')
    return parts[0]

Thanks!

Simon Forman · Jul 3, 2009

Okay, people want to help you, but you must make it easy for us.

Post again with a small piece of code that is runnable as-is and that
causes the traceback you're talking about, AND post the complete
traceback too, as-is.

I just tried a bit of your code above in my interpreter here and it
worked fine:

|>>> data = 'foo — bar'
|>>> data.split('—')
|['foo ', ' bar']
|>>> data = u'foo — bar'
|>>> data.split(u'—')
|[u'foo ', u' bar']

Figure out the smallest piece of "html source code" that causes the
problem and include that with your next post.

HTH,
~Simon

You might also read this: http://catb.org/esr/faqs/smart-questions.html

Tep · Jul 3, 2009

The problem was that I had converted the "html source code" to a unicode
object and didn't encode it back to UTF-8 before using split...
Thanks for the help, and sorry for the not-so-smart question.
Pet

Mark Tolonen · Jul 3, 2009

You'd still benefit from posting some code. You shouldn't be converting
back to utf-8 to do a split; you should be calling split with a Unicode
string on the Unicode version of the "html source code". Also make sure your
file is actually saved in the encoding you declare. I print your symbol in
two encodings to illustrate why I suspect this.

Below, assume "data" is your "html source code" as a Unicode string:

# -*- coding: UTF-8 -*-
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

OUTPUT:

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Note that using the Unicode string in split() works. Also note the decode
byte in the error message when using a non-Unicode string to split the
Unicode data. In your original error message the decode byte that caused an
error was 0x97, which is 'EM DASH' in Windows-1252 encoding. Make sure to
save your source code in the encoding you declare. If I save the above
script in windows-1252 encoding and change the coding line to windows-1252 I
get the same results, but the decode byte is 0x97.

# coding: windows-1252
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: ordinal not in range(128)

-Mark
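
For readers on Python 3, where the two string types are str (text) and bytes, the same points can be sketched roughly as follows. This is an illustrative aside, not part of the script above: the names EM_DASH, utf8_bytes, cp1252_bytes and mixed_ok are made up for the example, and Python 3 rejects the mixed call with a TypeError rather than attempting the implicit ASCII decode that Python 2 performs.

```python
# Illustrative Python 3 sketch; names here are assumptions, not from the thread.
EM_DASH = "\u2014"  # U+2014 EM DASH

# The same character is stored as different bytes in different encodings.
utf8_bytes = EM_DASH.encode("utf-8")
cp1252_bytes = EM_DASH.encode("windows-1252")
print(repr(utf8_bytes))    # b'\xe2\x80\x94'
print(repr(cp1252_bytes))  # b'\x97'

data = "foo " + EM_DASH + " bar"

# Splitting text with a text separator works.
print(data.split(EM_DASH))  # ['foo ', ' bar']

# Splitting text with a bytes separator is rejected in Python 3.
# Python 2 instead tried an implicit ASCII decode of the separator,
# which is where the UnicodeDecodeError came from.
try:
    data.split(utf8_bytes)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# Bytes from outside the program are decoded once, with the matching codec.
print(cp1252_bytes.decode("windows-1252") == EM_DASH)  # True
```

The byte reported in a decode error is a useful hint: 0xe2 is the first byte of the UTF-8 form of the em dash, while 0x97 is its single-byte Windows-1252 form.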

Tep · Jul 3, 2009

Mark Tolonen said:
You'd still benefit from posting some code.

I've posted the code below.

Mark Tolonen said:
Also make sure your file is actually saved in the encoding you declare.

The file was indeed in windows-1252; I've changed this. For the errors, see
below.

#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re

def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)
    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')
    return parts[0]

def getWebPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, '', headers)
    response = urllib2.urlopen(req)
    the_page = unicode(response.read(), 'UTF-8')
    return the_page

def main():
    url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
    title = getTitle(getWebPage(url))
    print title[0]

if __name__ == "__main__":
    main()

Traceback (most recent call last):
  File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
    main()
  File "C:\user\Projects\test\src\new_main.py", line 24, in main
    title = getTitle(getWebPage(url))
FULL TITLE Бахрейн — Уикипедия
  File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
    parts = title.split(' — ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

MRAB · Jul 3, 2009

#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re
def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)

The input is Unicode, so it's probably better for the regular expression
to also be Unicode:

    title = re.search(u'<title>(.*?)</title>', input)

(In the current implementation it actually doesn't matter.)

    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')

The title is Unicode, so the string with which you're splitting should
also be Unicode:

    parts = title.split(u' — ')

    return parts[0]
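
The same rule carries over to Python 3, sketched here without the network fetch. The sample bytes and the names get_title, raw_page and page_text are assumptions for illustration and are not part of the posted script. The idea is to decode the downloaded bytes once, then keep the pattern, the separator and the data all as text.

```python
import re

# Hypothetical stand-in for the bytes that response.read() would return.
raw_page = "<html><title>Бахрейн \u2014 Уикипедия</title></html>".encode("utf-8")


def get_title(page_text):
    """Return the part of the <title> before ' \u2014 ', or None if there is no title."""
    match = re.search(r"<title>(.*?)</title>", page_text)
    if match is None:
        return None
    # The separator is text (str), the same type as the data being split.
    return match.group(1).split(" \u2014 ")[0]


# bytes -> text, done once at the boundary where the data enters the program.
page_text = raw_page.decode("utf-8")
title = get_title(page_text)
print(title)  # Бахрейн
```

Returning None for a page without a title is a design choice made for this sketch; the posted script would raise AttributeError in that case because re.search returns None.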

Tep · Jul 3, 2009

MRAB said:
The title is Unicode, so the string with which you're splitting should
also be Unicode:

    parts = title.split(u' — ')

Oh, so simple. I'm new to Python and still feel uncomfortable with
the Unicode stuff.

Thanks to all for the help!
