umlauts

Discussion in 'Python' started by Arian Kuschki, Oct 17, 2009.

  1. Hi all

    this has been bugging me for a long time and I do not seem to be able to
    understand what to do. I always have problems when dealing with input text
    that contains umlauts. Consider the following:

    In [1]: import urllib

    In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")

    In [3]: xml = f.read()

    In [4]: f.close()

    In [5]: print xml
    ------> print(xml)
    <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
    tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
    ><forecast_information><cit

    y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
    data=""/><longitude_e6 data=""/><forecast_date
    data="2009-10-17"/><current_date_time data="2009-10
    -17 14:20:00 +0000"/><unit_system
    data="SI"/></forecast_information><current_conditions><condition data="Meistens
    bew�kt"/><temp_f data="43"/><temp_c data="6"/><h
    umidity data="Feuchtigkeit: 87�%"/><icon
    data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
    Windgeschwindigkeiten von 13 km/h"/></curr
    ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
    data="1"/><high data="7"/><icon
    data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
    ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
    data="So."/><low data="-1"/><high data="8"/><icon
    data="/ig/images/weather/chance_of_sno
    w.gif"/><condition data="Vereinzelt
    Schnee"/></forecast_conditions><forecast_conditions><day_of_week
    data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
    mages/weather/mostly_sunny.gif"/><condition data="Teils
    sonnig"/></forecast_conditions><forecast_conditions><day_of_week
    data="Di."/><low data="0"/><high data="8"
    /><icon data="/ig/images/weather/sunny.gif"/><condition
    data="Klar"/></forecast_conditions></weather></xml_api_reply>

    As you can see, the umlauts in the XML are not displayed properly. When I want
    to process this text (for example with xml.sax), I get error messages because
    the parser can't read it.

    I've tried to read up on this and there is a lot of information on the web, but
    nothing seems to work for me. For example setting the coding to UTF like this:
    # -*- coding: utf-8 -*- or using the decode() string method.

    I always have this kind of problem when input contains umlauts, not just in
    this case. My locale (on Ubuntu) is en_GB.UTF-8.

    Cheers
    Arian
     
    Arian Kuschki, Oct 17, 2009
    #1

  2. Arian Kuschki wrote:
    > I've tried to read up on this and there is a lot of information on the web, but
    > nothing seems to work for me. For example setting the coding to UTF like this:
    > # -*- coding: utf-8 -*- or using the decode() string method.


    The encoding declaration of the Python source file has nothing to do with
    this. It only affects unicode literals in the source (in Python 2.x, that's u"...")

    >
    > I always have this kind of problem when input contains umlauts, not just in
    > this case. My locale (on Ubuntu) is en_GB.UTF-8.


    If we assume the data on the website is correct (it appears to be when I
    open it in Firefox), then your problem is most probably your display/terminal.

    What does this show you in your interactive interpreter?

    >>> print "\xc3\xb6"

    ö

    For me, it's o-umlaut, ö. This is because the above bytes are the
    sequence for ö in utf-8.

    If this shows something else, you need to adjust your terminal settings.
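
    For readers on Python 3, the same check can be sketched with explicit
    byte strings. This is an illustration only and not part of the original
    exchange; the variable names are made up. It decodes the two bytes from
    the print test once as UTF-8 and once as ISO-8859-1.

```python
# Python 3 sketch: the two bytes from the print test, decoded two ways.
raw = b"\xc3\xb6"

as_utf8 = raw.decode("utf-8")         # intended reading: one character
as_latin1 = raw.decode("iso-8859-1")  # wrong guess: one character per byte

print(as_utf8 == "\u00f6")            # True: a single o-umlaut
print(as_latin1 == "\u00c3\u00b6")    # True: two characters (A-tilde, pilcrow)
print(len(as_utf8), len(as_latin1))   # 1 2
```

    A terminal that interprets its input as ISO-8859-1 would show the
    two-character form, which is the usual sign of a mismatched terminal
    encoding.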

    Diez
     
    Diez B. Roggisch, Oct 17, 2009
    #2

  3. StarWing


    try this?

    # vim: set fencoding=utf-8:
    import urllib
    import xml.sax as sax, xml.sax.handler as handler

    f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    xml = f.read()
    xml = xml.decode("cp1252")
    f.close()

    class my_handler(handler.ContentHandler):
        def startElement(self, name, attrs):
            print "begin:", name, attrs

        def endElement(self, name):
            print "end:", name

    sax.parseString(xml, my_handler())
     
    StarWing, Oct 17, 2009
    #3
  4. StarWing wrote:
    >
    > try this?
    >
    > # vim: set fencoding=utf-8:
    > import urllib
    > import xml.sax as sax, xml.sax.handler as handler
    >
    > f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    > xml = f.read()
    > xml = xml.decode("cp1252")
    > f.close()
    >
    > class my_handler(handler.ContentHandler):
    >     def startElement(self, name, attrs):
    >         print "begin:", name, attrs
    >
    >     def endElement(self, name):
    >         print "end:", name
    >
    > sax.parseString(xml, my_handler())


    This is wrong. XML is a *byte*-based format that explicitly declares
    its encoding. So decoding a byte string to a unicode object and then
    passing it to a parser stops working the moment you have data where

    - characters fall outside your default system encoding (usually ASCII)
    - the system encoding and the declared encoding differ

    Besides, I don't see what the whole SAX stuff is supposed to do
    that the direct print and the decode() don't do - smells like
    cargo cult to me.
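
    The point that the parser does the decoding itself can be sketched in
    Python 3 as follows. The sample document is made up for illustration
    (it is not the Google response), and xml.etree.ElementTree is used here
    only because it keeps the example short. The bytes are ISO-8859-1 and the
    XML declaration says so, so the parser can be handed the raw bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: Latin-1 bytes with a matching encoding declaration.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><condition data="bew\u00f6lkt"/>'
raw = doc.encode("iso-8859-1")

# The parser receives bytes and decodes them according to the declaration.
root = ET.fromstring(raw)
print(root.get("data") == "bew\u00f6lkt")  # True
```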

    Diez
     
    Diez B. Roggisch, Oct 17, 2009
    #4
  6. StarWing

    On 18 October, 12:50 am, "Diez B. Roggisch" <> wrote:
    > This is wrong. XML is a *byte*-based format that explicitly declares
    > its encoding. So decoding a byte string to a unicode object and then
    > passing it to a parser stops working the moment you have data where
    >
    >   - characters fall outside your default system encoding (usually ASCII)
    >   - the system encoding and the declared encoding differ
    >
    > Besides, I don't see what the whole SAX stuff is supposed to do
    > that the direct print and the decode() don't do - smells like
    > cargo cult to me.
    >
    > Diez


    Yes, XML is a *byte*-based format, and so are UTF-8 and the code pages
    (cp936, cp1252, etc.). So XML usually declares its encoding in the
    header, but that did not happen here.

    In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
    sys.setdefaultencoding(), and f.read() returns a str. So the result must
    be undecoded bytes (i.e. raw XML data), and decoding it with the right
    code page is safe (note that the site is google.de).

    In Python 3.1, read() returns a bytes object, so we *must* decode it;
    otherwise we can't pass it to a parser.
     
    StarWing, Oct 17, 2009
    #6
  7. Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate

    >What does this show you in your interactive interpreter?
    >
    >>>> print "\xc3\xb6"

    >ö
    >
    >For me, it's o-umlaut, ö. This is because the above bytes are the
    >sequence for ö in utf-8.
    >
    >If this shows something else, you need to adjust your terminal settings.


    for me it also prints the correct o-umlaut (ö), so that was not the problem.


    All of the below result in xml that shows all umlauts correctly when printed:

    xml.decode("cp1252")
    xml.decode("cp1252").encode("utf-8")
    xml.decode("iso-8859-1")
    xml.decode("iso-8859-1").encode("utf-8")

    But when I then want to parse the XML, it only works if I
    do both decode and encode. If I only decode, I get the following error:
    SAXParseException: <unknown>:1:1: not well-formed (invalid token)

    Do I understand right that since the encoding was not specified in the xml
    response, it should have been utf-8 by default? And that if it had indeed been utf-8 I
    would not have had the encoding problem in the first place?
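
    On the question of the default: the XML specification says that a
    document without an encoding declaration (and without a byte-order mark)
    is read as UTF-8. A small Python 3 sketch of that behaviour, using a
    made-up sample document and a hypothetical helper named parses():

```python
import xml.sax

# Made-up sample with an umlaut and no encoding declaration.
text = '<?xml version="1.0"?><condition data="bew\u00f6lkt"/>'

def parses(raw):
    """Return True if xml.sax accepts the byte string, False otherwise."""
    try:
        xml.sax.parseString(raw, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

utf8_ok = parses(text.encode("utf-8"))         # matches the XML default
latin1_ok = parses(text.encode("iso-8859-1"))  # undeclared Latin-1 bytes
print(utf8_ok, latin1_ok)  # True False
```

    So if the response had been UTF-8, or had declared its real encoding,
    the parser would most likely have accepted it without any manual
    decoding.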

    Anyway, thanks everybody, this has helped me a lot.

    Arian


     
    Arian Kuschki, Oct 17, 2009
    #7
  8. StarWing wrote:
    > On 10月18æ—¥, 上åˆ12æ—¶50分, "Diez B. Roggisch" <> wrote:
    >> StarWing schrieb:
    >>
    >>
    >>
    >>> On 10月17æ—¥, 下åˆ9æ—¶54分, Arian Kuschki <>
    >>> wrote:
    >>>> Hi all
    >>>> this has been bugging me for a long time and I do not seem to be able to
    >>>> understand what to do. I always have problems when dealing input text that
    >>>> contains umlauts. Consider the following:
    >>>> In [1]: import urllib
    >>>> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    >>>> In [3]: xml = f.read()
    >>>> In [4]: f.close()
    >>>> In [5]: print xml
    >>>> ------> print(xml)
    >>>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
    >>>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"><forecast_information><cit
    >>>> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
    >>>> data=""/><longitude_e6 data=""/><forecast_date
    >>>> data="2009-10-17"/><current_date_time data="2009-10
    >>>> -17 14:20:00 +0000"/><unit_system
    >>>> data="SI"/></forecast_information><current_conditions><condition data="Meistens
    >>>> bew kt"/><temp_f data="43"/><temp_c data="6"/><h
    >>>> umidity data="Feuchtigkeit: 87 %"/><icon
    >>>> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
    >>>> Windgeschwindigkeiten von 13 km/h"/></curr
    >>>> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
    >>>> data="1"/><high data="7"/><icon
    >>>> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
    >>>> ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
    >>>> data="So."/><low data="-1"/><high data="8"/><icon
    >>>> data="/ig/images/weather/chance_of_sno
    >>>> w.gif"/><condition data="Vereinzelt
    >>>> Schnee"/></forecast_conditions><forecast_conditions><day_of_week
    >>>> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
    >>>> mages/weather/mostly_sunny.gif"/><condition data="Teils
    >>>> sonnig"/></forecast_conditions><forecast_conditions><day_of_week
    >>>> data="Di."/><low data="0"/><high data="8"
    >>>> /><icon data="/ig/images/weather/sunny.gif"/><condition
    >>>> data="Klar"/></forecast_conditions></weather></xml_api_reply>
    >>>> As you can see the umlauts in the XML are not displayed properly. When I want
    >>>> to process this text (for example with xml.sax), I get error messages because
    >>>> the parses can't read this.
    >>>> I've tried to read up on this and there is a lot of information on the web, but
    >>>> nothing seems to work for me. For example setting the coding to UTF like this:
    >>>> # -*- coding: utf-8 -*- or using the decode() string method.
    >>>> I always have this kind of problem when input contains umlauts, not just in
    >>>> this case. My locale (on Ubuntu) is en_GB.UTF-8.
    >>>> Cheers
    >>>> Arian
    >>> try this?
    >>> # vim: set fencoding=utf-8:
    >>> import urllib
    >>> import xml.sax as sax, xml.sax.handler as handler
    >>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    >>> xml = f.read()
    >>> xml = xml.decode("cp1252")
    >>> f.close()
    >>> class my_handler(handler.ContentHandler):
    >>>     def startElement(self, name, attrs):
    >>>         print "begin:", name, attrs
    >>>     def endElement(self, name):
    >>>         print "end:", name
    >>> sax.parseString(xml, my_handler())

    >> This is wrong. XML is a *byte*-based format which explicitly states its
    >> encoding. So decoding a byte-string to a unicode-object and then
    >> passing it to a parser stops working the moment you have data that
    >>
    >> - is outside your default system encoding (usually ascii)
    >> - is declared in an encoding that differs from the system encoding
    >>
    >> Besides, I don't see where the whole SAX-stuff is supposed to do
    >> anything the direct print and the decode() don't do - smells like
    >> cargo-cult to me.
    >>
    >> Diez

    >
    > yes, XML is a *byte*-based format, and so are utf-8 and the code pages
    > (cp936, cp1252, etc.). So XML usually declares its encoding in the header,
    > but that didn't work here.
    >
    > In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
    > sys.setdefaultencoding(), and f.read() returns a str. So it must be
    > undecoded, byte-based data (i.e. raw XML), and decoding it with the right
    > code page is safe (note that the webpage is google.de).
    >
    > In Python 3.1, read() returns a bytes object, so we *must* decode it,
    > otherwise we can't pass it to a parser.


    You didn't get my point. An XML parser only *takes* a byte-string.
    Decoding is its business. So your last sentence above is wrong.

    Because regardless of the Python version, if you feed the parser a
    unicode-object, Python will first encode that to a byte-string, possibly
    giving a UnicodeError (maybe this automatic conversion is gone in Py3K,
    but then you get a type error instead).

    So to make the above work (if one wants to parse the xml), the proper
    thing to do would be

    xml = xml.decode("cp1252").encode("utf-8")

    and then feed that. Of course the really good thing would be to fix the
    webpage, but that's beyond our capabilities I fear...
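
A minimal sketch of the decode-then-encode step described above, written in Python 3 syntax rather than the Python 2 used in this thread. The RAW byte string is a hypothetical stand-in for the server response (cp1252 bytes with no encoding declaration), and DataCollector, parse_data and parses_cleanly are illustrative names, not part of any library.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for the response: cp1252 bytes, no encoding declared.
# \xf6 is o-umlaut in cp1252 but is not valid UTF-8 at this position.
RAW = b'<?xml version="1.0"?><condition data="Meistens bew\xf6lkt"/>'


class DataCollector(ContentHandler):
    """Collects the value of every 'data' attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        value = attrs.get("data")
        if value is not None:
            self.values.append(value)


def parse_data(xml_bytes):
    """Parse a byte string and return the collected 'data' values."""
    handler = DataCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.values


def parses_cleanly(xml_bytes):
    """Return True if the parser accepts the bytes, False on a parse error."""
    try:
        parse_data(xml_bytes)
        return True
    except xml.sax.SAXParseException:
        return False


# Re-encode: interpret the bytes as cp1252, then hand the parser UTF-8 bytes,
# which is what it assumes when no encoding is declared.
fixed = RAW.decode("cp1252").encode("utf-8")

print(parses_cleanly(RAW))
print(parse_data(fixed))
```

The parser still receives a byte string in both cases; only the bytes themselves change. This only helps if cp1252 really is the encoding the server used, which is an assumption here.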

    Diez
     
    Diez B. Roggisch, Oct 17, 2009
    #8
  9. Arian Kuschki schrieb:
    > Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate
    >
    >> What does this show you in your interactive interpreter?
    >>
    >>>>> print "\xc3\xb6"

    >> ö
    >>
    >> For me, it's o-umlaut, ö. This is because the above bytes are the
    >> sequence for ö in utf-8.
    >>
    >> If this shows something else, you need to adjust your terminal settings.

    >
    > for me it also prints the correct o-umlaut (ö), so that was not the problem.
    >
    >
    > All of the below result in xml that shows all umlauts correctly when printed:
    >
    > xml.decode("cp1252")
    > xml.decode("cp1252").encode("utf-8")
    > xml.decode("iso-8859-1")
    > xml.decode("iso-8859-1").encode("utf-8")
    >
    > But when I want to parse the xml then, it only works if I
    > do both decode and encode. If I only decode, I get the following error:
    > SAXParseException: <unknown>:1:1: not well-formed (invalid token)
    >
    > Do I understand right that since the encoding was not specified in the xml
    > response, it should have been utf-8 by default? And that if it had indeed been utf-8 I
    > would not have had the encoding problem in the first place?


    Yes. XML without explicit encoding is implicitly UTF-8, and the page is
    borked using cp* or latin* without saying so.
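
To illustrate the point about the missing declaration, here is a small sketch in Python 3 syntax. It uses xml.etree.ElementTree instead of xml.sax purely for brevity, and the sample bytes are a hypothetical stand-in for the real response. The same Latin-1 bytes are rejected without a declaration and accepted once the prolog names the encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with an o-umlaut stored as the single byte \xf6 (Latin-1).
BODY = b'<condition data="Meistens bew\xf6lkt"/>'

# Same body, once without and once with an explicit encoding declaration.
undeclared = b'<?xml version="1.0"?>' + BODY
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY


def data_or_error(xml_bytes):
    """Return the 'data' attribute of the root element, or None on a parse error."""
    try:
        return ET.fromstring(xml_bytes).get("data")
    except ET.ParseError:
        return None


print(data_or_error(undeclared))  # treated as UTF-8, so the lone \xf6 is rejected
print(data_or_error(declared))    # decoded by the parser as ISO-8859-1
```

With the declaration in place, no manual decode() or encode() is needed, because the parser does the decoding itself.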


    Diez
     
    Diez B. Roggisch, Oct 18, 2009
    #9
  10. Diez B. Roggisch schrieb:
    > Arian Kuschki schrieb:
    >> Whoa, that was quick! Thanks for all the answers, I'll try to
    >> recapitulate
    >>
    >>> What does this show you in your interactive interpreter?
    >>>
    >>>>>> print "\xc3\xb6"
    >>> ö
    >>>
    >>> For me, it's o-umlaut, ö. This is because the above bytes are the
    >>> sequence for ö in utf-8.
    >>>
    >>> If this shows something else, you need to adjust your terminal settings.

    >>
    >> for me it also prints the correct o-umlaut (ö), so that was not the
    >> problem.
    >>
    >>
    >> All of the below result in xml that shows all umlauts correctly when
    >> printed:
    >>
    >> xml.decode("cp1252")
    >> xml.decode("cp1252").encode("utf-8")
    >> xml.decode("iso-8859-1")
    >> xml.decode("iso-8859-1").encode("utf-8")
    >>
    >> But when I want to parse the xml then, it only works if I
    >> do both decode and encode. If I only decode, I get the following error:
    >> SAXParseException: <unknown>:1:1: not well-formed (invalid token)
    >>
    >> Do I understand right that since the encoding was not specified in the
    >> xml response, it should have been utf-8 by default? And that if it had
    >> indeed been utf-8 I would not have had the encoding problem in the
    >> first place?

    >
    > Yes. XML without explicit encoding is implicitly UTF-8, and the page is
    > borked using cp* or latin* without saying so.


    Ok, after reading some other posts in this thread, this assumption seems
    not to hold. The HTTP protocol allows for other encodings to be given
    implicitly. Which I think is an atrocity.
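
If the encoding is carried by the HTTP Content-Type header rather than by the XML itself, one way to honour it is sketched below in Python 3 syntax. The header values are made-up examples (the thread does not show what the server actually sent), and charset_from_content_type and to_utf8 are hypothetical helper names. The fallback of utf-8 is likewise an assumption of this sketch.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def to_utf8(body, content_type):
    """Decode body using the header's charset and re-encode it as UTF-8."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset).encode("utf-8")


# Made-up header value and body bytes, for illustration only.
print(charset_from_content_type("text/xml; charset=ISO-8859-1"))
print(to_utf8(b"bew\xf6lkt", "text/xml; charset=ISO-8859-1"))
```

In Python 3 the response object returned by urllib.request.urlopen exposes the same lookup as response.headers.get_content_charset(). Note that re-encoding to UTF-8 is only safe to feed to a parser if the XML prolog does not itself declare a different encoding.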

    Diez
     
    Diez B. Roggisch, Oct 18, 2009
    #10