getting rid of —

someone

Hello,

How can I remove the '—' character from a string, or split a string at that character?
I get a Unicode error when I try:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1: ordinal not in range(128)


Thanks, Pet

The script declares # -*- coding: UTF-8 -*-
 
Benjamin Peterson

Please paste your code. I suspect that you are mixing unicode and normal strings.
 
MRAB

It sounds like you're mixing bytestrings with Unicode strings. I can't
be any more helpful because you haven't shown the code.
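
To illustrate what "mixing" looks like, here is a minimal sketch. It is not the original poster's code, which had not been posted at this point. In Python 2, passing a byte string to a method of a Unicode string makes the interpreter decode the bytes as ASCII behind the scenes; the explicit .decode('ascii') call below stands in for that hidden step so the snippet behaves the same on Python 2 and 3. The names text and sep_bytes are made up for the example.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; `text` and `sep_bytes` are hypothetical names.

text = u'foo \u2014 bar'                 # Unicode string containing an em dash (U+2014)
sep_bytes = u'\u2014'.encode('utf-8')    # the same character as UTF-8 bytes: e2 80 94

# Python 2 runs the equivalent of this decode implicitly when a byte string
# is mixed with a Unicode string, and it fails on any byte >= 0x80.
try:
    sep_bytes.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Keeping both sides Unicode avoids the implicit decode entirely.
parts = text.split(u'\u2014')
print(parts[0].strip())   # foo
```

The printed message names the first non-ASCII byte, which is the same kind of detail that appears in the error quoted in the question.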
 
Tep

Oh, I'm sorry. Here it is:

def cleanInput(input):
    return input.replace('—', '')
 
Tep

I also need:

# input is HTML source code; I only have a problem with this character
# input = 'foo — bar'
# return value should be 'foo'
def splitInput(input):
    parts = input.split(' — ')
    return parts[0]


Thanks!
 
Simon Forman

Okay, people want to help you, but you must make it easy for us.

Post again with a small piece of code that is runnable as-is and that
causes the traceback you're talking about, AND post the complete
traceback too, as-is.

I just tried a bit of your code above in my interpreter here and it
worked fine:

|>>> data = 'foo — bar'
|>>> data.split('—')
|['foo ', ' bar']
|>>> data = u'foo — bar'
|>>> data.split(u'—')
|[u'foo ', u' bar']

Figure out the smallest piece of "html source code" that causes the
problem and include that with your next post.

HTH,
~Simon

You might also read this: http://catb.org/esr/faqs/smart-questions.html
 
Tep

The problem was that I had converted the "html source code" to a unicode
object and didn't encode it back to UTF-8 before using split...
Thanks for the help, and sorry for the not-so-smart question.
Pet
 
Mark Tolonen

You'd still benefit from posting some code. You shouldn't be converting
back to utf-8 to do a split; you should be calling split with a Unicode
string on the Unicode version of the "html source code". Also make sure your
file is actually saved in the encoding you declare. Below I print your
symbol in two encodings to illustrate why I suspect this.

Below, assume "data" is your "html source code" as a Unicode string:

# -*- coding: UTF-8 -*-
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')


OUTPUT:

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Note that using the Unicode string in split() works. Also note the decode
byte in the error message when using a non-Unicode string to split the
Unicode data. In your original error message the decode byte that caused an
error was 0x97, which is 'EM DASH' in Windows-1252 encoding. Make sure to
save your source code in the encoding you declare. If I save the above
script in windows-1252 encoding and change the coding line to windows-1252 I
get the same results, but the decode byte is 0x97.

# coding: windows-1252
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: ordinal not in range(128)

-Mark
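
The byte named in the error message can serve as a quick hint about how the source file was really saved. Here is a small sketch of that check, assuming the only non-ASCII character involved is the em dash; first_undecodable_byte is a hypothetical helper written for this example, not part of any library.

```python
EM_DASH = u'\u2014'

def first_undecodable_byte(raw):
    """Return the first byte (as an int) that the ASCII codec rejects, or None."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte.
        return bytearray(raw)[exc.start]
    return None

for name in ('utf-8', 'windows-1252'):
    encoded = EM_DASH.encode(name)
    print('%s: first bad byte 0x%02x' % (name, first_undecodable_byte(encoded)))
```

A reported 0x97 points to a file saved as windows-1252, and 0xe2 to one saved as UTF-8, which matches the two tracebacks above.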
 
Tep

Mark Tolonen said:
You'd still benefit from posting some code.

I've posted the code below.

Mark Tolonen said:
Also make sure your file is actually saved in the encoding you declare.

The file was indeed in windows-1252; I've changed this. For the errors, see
below.
#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re
def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)
    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')
    return parts[0]


def getWebPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, '', headers)
    response = urllib2.urlopen(req)
    the_page = unicode(response.read(), 'UTF-8')
    return the_page


def main():
    url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
    title = getTitle(getWebPage(url))
    print title[0]


if __name__ == "__main__":
    main()


Traceback (most recent call last):
  File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
    main()
  File "C:\user\Projects\test\src\new_main.py", line 24, in main
    title = getTitle(getWebPage(url))
FULL TITLE Ñðхрõùý  ãøúøÿõôøÑ�
  File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
    parts = title.split(' — ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
 
MRAB

Tep said:
def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)

The input is Unicode, so it's probably better for the regular expression
to also be Unicode:

    title = re.search(u'<title>(.*?)</title>', input)

(In the current implementation it actually doesn't matter.)

Tep said:
    parts = title.split(' — ')

The title is Unicode, so the string with which you're splitting should
also be Unicode:

    parts = title.split(u' — ')
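
Putting both suggestions together, here is a self-contained sketch of the title extraction with Unicode used throughout. It is a variation on the posted code, not a drop-in replacement: get_title and the sample page are made up for illustration, the network fetch is replaced by an in-memory byte string, and the em dash is written as the escape \u2014 so the result does not depend on how the file is saved.

```python
# -*- coding: utf-8 -*-
import re

def get_title(page):
    """Return the <title> text before the em-dash separator, or None if there is no <title>."""
    match = re.search(u'<title>(.*?)</title>', page)
    if match is None:
        return None
    # Unicode separator for Unicode text: no implicit ASCII decode takes place.
    return match.group(1).split(u' \u2014 ')[0]

# Stand-in for the bytes a web server would send (hypothetical sample page).
raw = u'<html><title>foo \u2014 bar</title></html>'.encode('utf-8')

# Decode once at the input boundary, then work with Unicode only.
page = raw.decode('utf-8')
print(get_title(page))   # foo
```

Decoding once when the data comes in and encoding only when it is written out keeps byte strings away from the string operations in between.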


 
Tep

MRAB said:
The title is Unicode, so the string with which you're splitting should
also be Unicode:

     parts = title.split(u' — ')

Oh, so simple. I'm new to Python and still feel uncomfortable with
Unicode stuff.

Thanks to all for the help!
 
