Re: umlauts

Discussion in 'Python' started by MRAB, Oct 17, 2009.

  1. MRAB

    MRAB Guest

    Arian Kuschki wrote:
    > Hi all
    >
    > this has been bugging me for a long time and I do not seem to be able to
    > understand what to do. I always have problems when dealing with input text that
    > contains umlauts. Consider the following:
    >
    > In [1]: import urllib
    >
    > In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    >
    > In [3]: xml = f.read()
    >
    > In [4]: f.close()
    >
    > In [5]: print xml
    > ------> print(xml)
    > <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
    > tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
    >> <forecast_information><cit

    > y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
    > data=""/><longitude_e6 data=""/><forecast_date
    > data="2009-10-17"/><current_date_time data="2009-10
    > -17 14:20:00 +0000"/><unit_system
    > data="SI"/></forecast_information><current_conditions><condition data="Meistens
    > bew�kt"/><temp_f data="43"/><temp_c data="6"/><h
    > umidity data="Feuchtigkeit: 87�%"/><icon
    > data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
    > Windgeschwindigkeiten von 13 km/h"/></curr
    > ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
    > data="1"/><high data="7"/><icon
    > data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
    > ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
    > data="So."/><low data="-1"/><high data="8"/><icon
    > data="/ig/images/weather/chance_of_sno
    > w.gif"/><condition data="Vereinzelt
    > Schnee"/></forecast_conditions><forecast_conditions><day_of_week
    > data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
    > mages/weather/mostly_sunny.gif"/><condition data="Teils
    > sonnig"/></forecast_conditions><forecast_conditions><day_of_week
    > data="Di."/><low data="0"/><high data="8"
    > /><icon data="/ig/images/weather/sunny.gif"/><condition
    > data="Klar"/></forecast_conditions></weather></xml_api_reply>
    >
    > As you can see the umlauts in the XML are not displayed properly. When I want
    > to process this text (for example with xml.sax), I get error messages because
    > the parser can't read this.
    >
    > I've tried to read up on this and there is a lot of information on the web, but
    > nothing seems to work for me. For example setting the coding to UTF like this:
    > # -*- coding: utf-8 -*- or using the decode() string method.
    >
    > I always have this kind of problem when input contains umlauts, not just in
    > this case. My locale (on Ubuntu) is en_GB.UTF-8.
    >

    The string you received from the website is a bytestring and you're just
    printing it to your console, which is configured for UTF-8. However, the
    bytestring isn't valid UTF-8, so the console is replacing the invalid
    parts with the funny characters.

    You should decode the bytestring to Unicode and then re-encode it to
    UTF-8. I don't know what encoding the website is actually using; here
    I'm assuming ISO-8859-1:

    print xml.decode("iso-8859-1").encode("utf-8")
     
    MRAB, Oct 17, 2009
    #1
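
A minimal sketch of the decode-then-encode step suggested above, written for Python 3 (the thread itself uses Python 2). The sample bytes are an assumption: they stand in for the server response, with 0xF6 being "ö" in ISO-8859-1, and ISO-8859-1 itself is only the guess made in the post.

```python
# Sample bytes, assumed to resemble what the server sent (0xF6 is "ö" in ISO-8859-1).
raw = b'<condition data="Meistens bew\xf6lkt"/>'

# Strict UTF-8 decoding fails, because 0xF6 does not start a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A lenient decode mimics a UTF-8 console: invalid bytes become U+FFFD.
lenient = raw.decode("utf-8", errors="replace")

# Decoding with the assumed real encoding recovers the text.
text = raw.decode("iso-8859-1")

# Re-encode only at the output boundary, e.g. for a UTF-8 terminal or file.
utf8_bytes = text.encode("utf-8")

print(utf8_ok)
print(utf8_bytes)
```

The point of the sketch is that the text is handled as Unicode in between, and bytes appear only at the input and output edges.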

  2. MRAB wrote:
    > Arian Kuschki wrote:
    >> [snip]

    > The string you received from the website is a bytestring and you're just
    > printing it to your console, which is configured for UTF-8. However, the
    > bytestring isn't valid UTF-8, so the console is replacing the invalid
    > parts with the funny characters.


    This is weird. I looked at the site in Firefox - and it was displayed
    correctly, including umlauts. Bringing up the info-dialog claims the
    page is UTF-8, the XML itself says so as well (implicitly, through the
    missing declaration of an encoding) - but it clearly is *not* UTF-8.

    One would expect Google to be better at this...

    Diez
     
    Diez B. Roggisch, Oct 17, 2009
    #2
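
A small sketch of why a parser rejects such a document when it falls back to UTF-8, and how decoding first avoids the error. It uses Python 3 and `xml.etree.ElementTree` rather than `xml.sax` (a swap for brevity); the sample bytes and the helper name `parse_weather` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document without an encoding declaration; 0xF6 is "ö" in ISO-8859-1
# (assumed payload, not the real response).
raw = b'<?xml version="1.0"?><condition data="Meistens bew\xf6lkt"/>'


def parse_weather(data, charset=None):
    """Parse XML bytes; decode them first when an external charset is known."""
    if charset is not None:
        data = data.decode(charset)
    return ET.fromstring(data)


# With no declaration the parser assumes UTF-8 and rejects the 0xF6 byte.
try:
    parse_weather(raw)
    parsed_as_utf8 = True
except ET.ParseError:
    parsed_as_utf8 = False

# Supplying the charset from outside the document makes the parse succeed.
element = parse_weather(raw, charset="iso-8859-1")
condition = element.get("data")

print(parsed_as_utf8)
```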

  4. StarWing

    StarWing Guest

    On Oct 18, 12:14 am, MRAB wrote:
    > Arian Kuschki wrote:
    > > [snip]
    >
    > The string you received from the website is a bytestring and you're just
    > printing it to your console, which is configured for UTF-8. However, the
    > bytestring isn't valid UTF-8, so the console is replacing the invalid
    > parts with the funny characters.
    >
    > You should decode the bytestring to Unicode and then re-encode it to
    > UTF-8. I don't know what encoding the website is actually using; here
    > I'm assuming ISO-8859-1:
    >
    > print xml.decode("iso-8859-1").encode("utf-8")


    In 2.6, str.decode returns a unicode object, so you can print it directly.
    In 3.1, bytes.decode returns a str, so you can also print it directly.

    So decode("cp1252") alone is enough.
     
    StarWing, Oct 17, 2009
    #4
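
For comparison, a short Python 3 sketch of how "cp1252" and "iso-8859-1" relate. The sample bytes are made up for illustration; the two codecs agree on the umlaut and no-break-space bytes and differ only in the 0x80-0x9F range.

```python
# Illustrative bytes: 0xF6 is "ö", 0xA0 is a no-break space in both codecs.
sample = b"bew\xf6lkt, 87\xa0%"

latin1_text = sample.decode("iso-8859-1")
cp1252_text = sample.decode("cp1252")
agree_on_umlauts = latin1_text == cp1252_text

# The codecs differ in 0x80-0x9F: cp1252 puts printable characters there,
# while iso-8859-1 maps those bytes to C1 control characters.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin1 = b"\x80".decode("iso-8859-1")

# cp1252 leaves a few bytes undefined; iso-8859-1 accepts all 256 byte values.
try:
    b"\x81".decode("cp1252")
    cp1252_accepts_0x81 = True
except UnicodeDecodeError:
    cp1252_accepts_0x81 = False

print(agree_on_umlauts, cp1252_accepts_0x81)
```

Either codec would recover the umlauts here; which one is right depends on what the server actually declares.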
  5. I just checked and I see the following in the headers:
    Content-Type text/xml; charset=UTF-8

    Where does it say ISO-8859-1?

    On Sat 17, 20:57 +0200, I V wrote:

    > On Sat, 17 Oct 2009 18:54:10 +0200, Diez B. Roggisch wrote:
    >
    > > This is weird. I looked at the site in Firefox - and it was displayed
    > > correctly, including umlauts. Bringing up the info-dialog claims the
    > > page is UTF-8, the XML itself says so as well (implicitly, through the
    > > missing declaration of an encoding) - but it clearly is *not* UTF-8.

    >
    > The headers correctly identify it as ISO-8859-1, which overrides the
    > implicit specification of UTF-8. I'm not sure why Firefox is reporting it
    > as UTF-8 (it does that for me, too); I can see the umlauts, so it's
    > clearly processing it as ISO-8859-1.
    > --
    > http://mail.python.org/mailman/listinfo/python-list


     
    Arian Kuschki, Oct 17, 2009
    #5
  6. Hm yes, that is true. In Firefox on the other hand, the response header is
    "Content-Type text/xml; charset=UTF-8"

    On Sat 17, 13:16 -0700, Mark Tolonen wrote:

    >
    > "Diez B. Roggisch" <> wrote in message
    > news:-berlin.de...
    > [snip]
    > >This is weird. I looked at the site in Firefox - and it was
    > >displayed correctly, including umlauts. Bringing up the
    > >info-dialog claims the page is UTF-8, the XML itself says so as
    > >well (implicitly, through the missing declaration of an encoding) -
    > >but it clearly is *not* UTF-8.
    > >
    > >One would expect Google to be better at this...
    > >
    > >Diez

    >
    > According to the XML 1.0 specification:
    >
    > "Although an XML processor is required to read only entities in the
    > UTF-8 and UTF-16 encodings, it is recognized that other encodings
    > are used around the world, and it may be desired for XML processors
    > to read entities that use them. In the absence of external character
    > encoding information (such as MIME headers), parsed entities which
    > are stored in an encoding other than UTF-8 or UTF-16 must begin with
    > a text declaration..."
    >
    > So UTF-8 and UTF-16 are the defaults supported without an xml
    > declaration in the absence of external encoding information. But we
    > have external character encoding information:
    >
    > >>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    > >>> f.headers.dict['content-type']
    > 'text/xml; charset=ISO-8859-1'
    >
    > So the page seems correct.
    >
    > -Mark
     
    Arian Kuschki, Oct 17, 2009
    #6
  7. Mark Tolonen

    Mark Tolonen Guest

    "Diez B. Roggisch" <> wrote in message
    news:-berlin.de...
    [snip]
    > This is weird. I looked at the site in Firefox - and it was displayed
    > correctly, including umlauts. Bringing up the info-dialog claims the page
    > is UTF-8, the XML itself says so as well (implicitly, through the missing
    > declaration of an encoding) - but it clearly is *not* UTF-8.
    >
    > One would expect Google to be better at this...
    >
    > Diez


    According to the XML 1.0 specification:

    "Although an XML processor is required to read only entities in the UTF-8
    and UTF-16 encodings, it is recognized that other encodings are used around
    the world, and it may be desired for XML processors to read entities that
    use them. In the absence of external character encoding information (such as
    MIME headers), parsed entities which are stored in an encoding other than
    UTF-8 or UTF-16 must begin with a text declaration..."

    So UTF-8 and UTF-16 are the defaults supported without an xml declaration in
    the absence of external encoding information. But we have external
    character encoding information:

    >>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
    >>> f.headers.dict['content-type']

    'text/xml; charset=ISO-8859-1'

    So the page seems correct.

    -Mark
     
    Mark Tolonen, Oct 17, 2009
    #7
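
A hedged sketch of using that external encoding information in Python 3: read the charset from a Content-Type value and decode the body with it. The helper name `charset_from_content_type`, the UTF-8 fallback, and the sample body are assumptions for illustration; the header value is the one shown in the post.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lower-cases the name and returns None if absent.
    return msg.get_content_charset() or default


header = "text/xml; charset=ISO-8859-1"   # value reported in the post
body = b"bew\xf6lkt"                      # assumed sample body bytes

charset = charset_from_content_type(header)
text = body.decode(charset)

print(charset)
```

When fetching with `urllib.request.urlopen` in Python 3, the response's `headers` object offers the same `get_content_charset()` method, so the manual parsing is only needed when you have the raw header string.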
  8. Neil Hodgson

    Neil Hodgson Guest

    The server is sniffing the User-Agent header to decide whether to
    send UTF-8 or ISO-8859-1. Try this code:

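    import urllib2

    # Send a browser-like User-Agent; the server chooses the charset based on it.
    r = urllib2.Request("http://www.google.de/ig/api?weather=Muenchen",
                        None, {"User-Agent": "Mozilla/5.0"})
    f = urllib2.urlopen(r)
    i = f.info()  # response headers, including Content-Type and its charset
    print(i)
    xml = f.read()
    f.close()
    print(xml)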

    Neil
     
    Neil Hodgson, Oct 17, 2009
    #8
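
A rough Python 3 counterpart of the snippet above, as a sketch only: `urllib2` became `urllib.request`, and the helper names `build_request` and `fetch_text` are hypothetical. The URL is the one from the thread and may no longer respond, so the example builds the request without sending it.

```python
import urllib.request

# URL taken from the thread; the endpoint may no longer exist.
URL = "http://www.google.de/ig/api?weather=Muenchen"


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, fallback_charset="utf-8"):
    """Fetch url and decode the body with the charset the server declares."""
    with urllib.request.urlopen(build_request(url)) as response:
        charset = response.headers.get_content_charset() or fallback_charset
        return response.read().decode(charset)


request = build_request(URL)
# Request stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

Calling `fetch_text(URL)` would perform the actual download and return a `str`, decoded according to whatever charset the server sends for that User-Agent.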