Encoding troubles

Discussion in 'Python' started by JB, May 17, 2010.

  1. JB

    JB Guest

    I'm working on the webapp of our company intranet and I had a question
    about proper handling of user input that's causing encoding issues.

    Some of the users take notes in Microsoft Office and copy/paste these
    into textareas of the webapp. Some of the characters from Word, such
    as hyphens (–) and apostrophes (’), are in an odd encoding. When passed
    to the database using sqlalchemy they appear as – and other
    characters.

    What's the proper handling (conversion?) of user input before it gets
    to my database? Do I need to start making a list of the offending
    characters and .replace them? Or is there a means to decode/encode the
    user input to something more generic? Thanks for your time.
     
    JB, May 17, 2010
    #1

  2. Neil Hodgson

    Neil Hodgson Guest

    JB:

    > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
    > to the database using sqlalchemy they appear as – and other
    > characters.


    The encoding is UTF-8. Normally the best way to handle encodings is
    to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
    and perform most processing in Unicode.
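
To illustrate why the pasted characters come out garbled, here is a minimal Python 3 sketch (in Python 3, `bytes.decode` plays the role of `unicode(s, "UTF-8")`). It assumes the browser sent UTF-8 bytes and that some later layer read them as Windows-1252; the variable names are only illustrative.

```python
# Assumption: the browser submitted an en dash (U+2013) encoded as UTF-8.
raw = "\u2013".encode("utf-8")  # three bytes: E2 80 93

# Decode once, with the right codec, as early as possible.
text = raw.decode("utf-8")

# Decoding the same bytes with the wrong codec yields three junk characters
# instead of one dash, which matches the symptom described in the question.
garbled = raw.decode("windows-1252")

print(repr(text), repr(garbled))
```

After the decode step, the rest of the application can work with `text` as an ordinary Unicode string.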

    Neil
     
    Neil Hodgson, May 18, 2010
    #2

  3. Bryan

    Bryan Guest

    Neil Hodgson wrote:
    > JB:
    >
    > > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
    > > to the database using sqlalchemy they appear as – and other
    > > characters.

    >
    >    The encoding is UTF-8. Normally the best way to handle encodings is
    > to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
    > and perform most processing in Unicode.


    Good advice to work in Unicode (and in Python 3.X str is unicode), but
    I'd guess the encoding he's getting is "Windows-1252". The default
    character set of HTTP is ISO-8859-1, but Microsoft likes to use
    Windows-1252 in its place.

    What to do about it? First, try specifying utf-8 in the form
    containing the textarea, as in

    <form action="process.cgi" accept-charset="utf-8">

    Note that specifying ISO-8859-1 will not work, in that Microsoft will
    still use Windows-1252. I've heard they've gotten better at supporting
    utf-8, but I haven't tested.

    When a request comes in, check for a Content-Type header that names
    the character set, which should be:

    Content-Type: application/x-www-form-urlencoded; charset=utf-8

    Then you can decode to a unicode object as Neil Hodgson explained.
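
The header check described above could be sketched roughly as follows in Python 3, using only the standard library. The function names `charset_from_content_type` and `decode_form` are hypothetical, and falling back to UTF-8 when no charset is named is an assumption, not something the header guarantees.

```python
from email.message import Message
from urllib.parse import parse_qs


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_form(body, content_type):
    """Decode a urlencoded form body into a dict of Unicode strings."""
    charset = charset_from_content_type(content_type)
    # The percent-encoded body itself is plain ASCII; the charset only
    # matters when the %XX escapes are turned back into characters.
    return parse_qs(body.decode("ascii"), encoding=charset)


fields = decode_form(
    b"note=caf%C3%A9+%E2%80%93+ok",
    "application/x-www-form-urlencoded; charset=utf-8",
)
print(fields)
```

A web framework normally does this step for you; the sketch only shows where the charset from the header enters the decoding.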

    In case you still have to deal with Windows-1252, Python knows how to
    translate Windows-1252 to the best-fit in Unicode. In current Python
    2.x:

    ustring = unicode(raw_string, 'Windows-1252')

    In Python 3.X, what comes from a socket is bytes, and str means
    unicode:

    ustring = str(raw_bytes, 'Windows-1252')
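
If the charset of incoming data is unknown, one common heuristic (an assumption here, not something the server can verify) is to try strict UTF-8 first and fall back to Windows-1252 only when that fails. A minimal Python 3 sketch, with the hypothetical helper name `decode_user_input`:

```python
def decode_user_input(raw_bytes):
    """Decode bytes as UTF-8, falling back to Windows-1252 on failure."""
    try:
        # Strict UTF-8 rejects most Windows-1252 text that contains
        # Word punctuation, so a success is a fairly reliable signal.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in Windows-1252, so replace
        # them rather than raising a second error.
        return raw_bytes.decode("windows-1252", errors="replace")


print(repr(decode_user_input(b"\xe2\x80\x93")))  # UTF-8 en dash
print(repr(decode_user_input(b"\x96")))          # Windows-1252 en dash
```

Both calls return the same single character, U+2013, so downstream code does not need to care which encoding the client used.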


    Of course this all assumes that JB's database likes Unicode. If it
    chokes, then alternatives include encoding back to utf-8 and storing
    as binary, or translating characters to some best-fit in the set the
    database supports.
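
Both alternatives can be sketched in Python 3. The mapping below is a small hand-picked table of typical Word punctuation, not a complete list, and the names `WORD_PUNCTUATION` and `to_best_fit_ascii` are hypothetical; it also assumes the database can only hold ASCII.

```python
# Assumption: a hand-picked table of common Word punctuation (not exhaustive).
WORD_PUNCTUATION = {
    0x2013: "-",    # en dash
    0x2014: "--",   # em dash
    0x2018: "'",    # left single quote
    0x2019: "'",    # right single quote / apostrophe
    0x201C: '"',    # left double quote
    0x201D: '"',    # right double quote
    0x2026: "...",  # ellipsis
}


def to_best_fit_ascii(text):
    """Map known punctuation to ASCII and replace anything else with '?'."""
    translated = text.translate(WORD_PUNCTUATION)
    return translated.encode("ascii", errors="replace").decode("ascii")


note = "It\u2019s 5\u20136 pages"

# Alternative 1: encode back to UTF-8 and store the bytes in a binary column.
as_binary = note.encode("utf-8")

# Alternative 2: translate to a best fit in the character set the database supports.
as_ascii = to_best_fit_ascii(note)

print(as_binary, as_ascii)
```

The best-fit route loses information (the original dash cannot be recovered), so storing UTF-8 is usually preferable when the database allows it.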


    --
    --Bryan Olson
     
    Bryan, May 18, 2010
    #3
