Ascii to Unicode.

Discussion in 'Python' started by Joe Goldthwaite, Jul 28, 2010.

  1. Thanks to all of you who responded. I guess I was working from the wrong
    premise. I was thinking that a file could write any kind of data and that
    once I had my Unicode string, I could just write it out with a standard
    file.write() operation.

    What was actually happening is that the file.write() operation was
    generating the error until I re-encoded the string as utf-8. This is what
    worked:

    import unicodedata

    input = file('ascii.csv', 'rb')
    output = file('unicode.csv','wb')

    for line in input.xreadlines():
        unicodestring = unicode(line, 'latin1')
        output.write(unicodestring.encode('utf-8'))  # This second encode is what I was missing.

    input.close()
    output.close()

    A number of you pointed out what I was doing wrong but I couldn't understand
    it until I realized that the write operation didn't work until it was using
    a properly encoded Unicode string. I thought I was getting the error on the
    initial latin-1 to Unicode conversion, not in the write operation.

    This still seems odd to me. I would have thought that the unicode function
    would return a properly encoded byte stream that could then simply be
    written to disk. Instead it seems like you have to re-encode the byte stream
    to some kind of escaped Ascii before it can be written back out.

    Thanks to all of you who took the time to respond. I really do appreciate
    it. I think with my mental block, I couldn't have figured it out without
    your help.
     
    Joe Goldthwaite, Jul 28, 2010
    #1

  2. On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote:

    > This still seems odd to me. I would have thought that the unicode
    > function would return a properly encoded byte stream that could then
    > simply be written to disk. Instead it seems like you have to re-encode
    > the byte stream to some kind of escaped Ascii before it can be written
    > back out.


    I'm afraid that's not even wrong. The unicode function returns a unicode
    string object, not a byte-stream, just as the list function returns a
    sequence of objects, not a byte-stream.

    Perhaps this will help:

    http://www.joelonsoftware.com/articles/Unicode.html


    Summary:

    ASCII is not a synonym for bytes, no matter what some English-speakers
    think. ASCII is an encoding from bytes like \x41 to characters like "A".

    Unicode strings are a sequence of code points. A code point is a number,
    implemented in some complex fashion that you don't need to care about.
    Each code point maps conceptually to a letter; for example, the English
    letter A is represented by the code point U+0041 and the Arabic letter
    Ain is represented by the code point U+0639.

    You shouldn't make any assumptions about the size of each code-point, or
    how they are put together. You shouldn't expect to write code points to a
    disk and have the result make sense, any more than you could expect to
    write a sequence of tuples or sets or dicts to disk in any sensible
    fashion. You have to serialise it to bytes first, and that's what the
    encode method does. Decode does the opposite, taking bytes and creating
    unicode strings from them.

    For historical reasons -- backwards compatibility with files already
    created, back in the Bad Old Days before unicode -- there are a whole
    slew of different encodings available. There is no 1:1 mapping between
    bytes and strings. If all you have are the bytes, there is literally no
    way of knowing what string they represent (although sometimes you can
    guess). You need to know what the encoding used was, or take a guess, or
    make repeated decodings until something doesn't fail and hope that's the
    right one.

    As a general rule, Python will try encoding/decoding using the ASCII
    encoding unless you tell it differently.

    Any time you are writing to disk, you need to serialise the objects,
    regardless of whether they are floats, or dicts, or unicode strings.
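    To make that serialise/deserialise symmetry concrete, here is a minimal
    round trip at the interactive prompt (a sketch; any encoding would do,
    utf-8 is just one choice):

    >>> s = u'caf\xe9'             # a unicode string: a sequence of code points
    >>> data = s.encode('utf-8')   # serialise to bytes
    >>> data
    'caf\xc3\xa9'
    >>> data.decode('utf-8') == s  # deserialise back to an equal string
    True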


    --
    Steven
     
    Steven D'Aprano, Jul 29, 2010
    #2

  3. Joe Goldthwaite wrote:
    > import unicodedata
    >
    > input = file('ascii.csv', 'rb')
    > output = file('unicode.csv','wb')
    >
    > for line in input.xreadlines():
    >     unicodestring = unicode(line, 'latin1')
    >     output.write(unicodestring.encode('utf-8'))  # This second encode is what I was missing.


    Actually, I see two problems here:
    1. "ascii.csv" is not an ASCII file but a Latin-1 encoded file, so there
    starts the first confusion.
    2. "unicode.csv" is not a "Unicode" file, because Unicode is not a file
    format. Rather, it is a UTF-8 encoded file, which is one encoding of
    Unicode. This is the second confusion.

    > A number of you pointed out what I was doing wrong but I couldn't
    > understand it until I realized that the write operation didn't work until
    > it was using a properly encoded Unicode string.


    The write function wants bytes! Encoding a string in your favourite encoding
    yields bytes.
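    In Python 2 terms, encoding turns a unicode object into a str (a byte
    string), which is what write() accepts. A quick sketch at the prompt:

    >>> type(u'\xe1')
    <type 'unicode'>
    >>> type(u'\xe1'.encode('utf-8'))
    <type 'str'>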

    > This still seems odd to me. I would have thought that the unicode
    > function would return a properly encoded byte stream that could then
    > simply be written to disk.


    No, unicode() takes a byte stream and decodes it according to the given
    encoding. You then get an internal representation of the string, a unicode
    object. This representation typically resembles UCS2 or UCS4, which are
    more suitable for internal manipulation than UTF-8. This object is a string
    btw, so typical stuff like concatenation etc. is supported. However, the
    internal representation is a sequence of Unicode codepoints, not a
    guaranteed sequence of bytes, which is what you need in a file.

    > Instead it seems like you have to re-encode the byte stream to some
    > kind of escaped Ascii before it can be written back out.


    As mentioned above, you have a string. For writing, that string needs to be
    transformed to bytes again.


    Note: You can also configure a file to read one encoding or write another.
    You then get unicode objects from the input which you can feed to the
    output. The important difference is that you only specify the encoding in
    one place, and it will probably even be more performant. I'd have to search
    to find you the corresponding library calls though, but a starting point is
    http://docs.python.org.

    Good luck!

    Uli

    --
    Sator Laser GmbH
    Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
     
    Ulrich Eckhardt, Jul 29, 2010
    #3
  4. Hi Steven,

    I read through the article you referenced. I understand Unicode better now.
    I wasn't completely ignorant of the subject. My confusion is more about how
    Python is handling Unicode than Unicode itself. I guess I'm fighting my own
    misconceptions. I do that a lot. It's hard for me to understand how things
    work when they don't function the way I *think* they should.

    Here's the main source of my confusion. In my original sample, I had read a
    line in from the file and used the unicode function to create a
    unicodestring object;

    unicodestring = unicode(line, 'latin1')

    What I thought this step would do is translate the line to an internal
    Unicode representation. The problem character \xe1 would have been
    translated into a correct Unicode representation for the accented "a"
    character.

    Next I tried to write the unicodestring object to a file thusly;

    output.write(unicodestring)

    I would have expected the write function to request the byte string from the
    unicodestring object and simply write that byte string to a file. I thought
    that at this point, I should have had a valid Unicode latin1 encoded file.
    Instead I get an error that the character \xe1 is invalid.

    The fact that the \xe1 character is still in the unicodestring object tells
    me it wasn't translated into whatever python uses for its internal Unicode
    representation. Either that or the unicodestring object returns the
    original string when it's asked for a byte stream representation.
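    For example, at the interactive prompt (a sketch of what I was seeing;
    the sample text is made up):

    >>> unicodestring = unicode('Mart\xe1', 'latin1')
    >>> unicodestring
    u'Mart\xe1'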

    Instead of just writing the unicodestring object, I had to do this;

    output.write(unicodestring.encode('utf-8'))

    This is doing what I thought the other steps were doing. It's translating
    the internal unicodestring byte representation to utf-8 and writing it out.
    It still seems strange and I'm still not completely clear as to what is
    going on at the byte stream level for each of these steps.
     
    Joe Goldthwaite, Jul 29, 2010
    #4
  5. Hi Ulrich,

    Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
    few characters above the 128 range that are causing Postgresql Unicode
    errors. Those characters work fine in the Windows world but they're not the
    correct byte representation for Unicode. What I'm attempting to do is
    translate those upper range characters into the correct Unicode
    representations so that they look the same in the Postgresql database as
    they did in the CSV file.

    I wrote up the source of my confusion to Steven so I won't duplicate it
    here. Your comment on defining the encoding of the file directly instead
    of using functions to encode and decode the data led me to the codecs
    module. Using it, I can define the encoding at file open time and then just
    read and write the lines. I ended up with this;

    import codecs

    input = codecs.open('ascii.csv', encoding='cp1252')
    output = codecs.open('unicode.csv', mode='wb', encoding='utf-8')

    output.writelines(input.readlines())

    input.close()
    output.close()

    This is doing exactly the same thing but it's much clearer to me. Readlines
    translates the input using the cp1252 codec and writelines encodes it to
    utf-8 and writes it out. And as you mentioned, it's probably higher
    performance. I haven't tested that but since both programs do the job in
    seconds, performance isn't an issue.

    Thanks again to everyone who posted. I really do appreciate it.
     
    Joe Goldthwaite, Jul 29, 2010
    #5
  6. Joe Goldthwaite wrote:
    > Hi Steven,
    >
    > I read through the article you referenced. I understand Unicode better now.
    > I wasn't completely ignorant of the subject. My confusion is more about how
    > Python is handling Unicode than Unicode itself. I guess I'm fighting my own
    > misconceptions. I do that a lot. It's hard for me to understand how things
    > work when they don't function the way I *think* they should.
    >
    > Here's the main source of my confusion. In my original sample, I had read a
    > line in from the file and used the unicode function to create a
    > unicodestring object;
    >
    > unicodestring = unicode(line, 'latin1')
    >
    > What I thought this step would do is translate the line to an internal
    > Unicode representation. The problem character \xe1 would have been
    > translated into a correct Unicode representation for the accented "a"
    > character.


    Correct. At this point you have a unicode string.

    > Next I tried to write the unicodestring object to a file thusly;
    >
    > output.write(unicodestring)
    >
    > I would have expected the write function to request the byte string from the
    > unicodestring object and simply write that byte string to a file. I thought
    > that at this point, I should have had a valid Unicode latin1 encoded file.
    > Instead I get an error that the character \xe1 is invalid.


    Here's the problem -- there is no byte string representing the unicode
    string, they are completely different. There are dozens of different
    possible encodings to go from unicode to a byte-string (of which UTF-8
    is one such possibility).

    > The fact that the \xe1 character is still in the unicodestring object tells
    > me it wasn't translated into whatever python uses for its internal Unicode
    > representation. Either that or the unicodestring object returns the
    > original string when it's asked for a byte stream representation.


    Wrong. It so happens that some of the unicode points are the same as
    some (but not all) of the ascii and upper-ascii values. When you
    attempt to write a unicode string without specifying which encoding you
    want, python falls back to ascii (not upper-ascii) so any character
    outside the 0-127 range is going to raise an error.
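    A minimal sketch of that failure mode at the prompt:

    >>> u'\xe1'.encode('ascii')
    Traceback (most recent call last):
    ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in
    position 0: ordinal not in range(128)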

    > Instead of just writing the unicodestring object, I had to do this;
    >
    > output.write(unicodestring.encode('utf-8'))
    >
    > This is doing what I thought the other steps were doing. It's translating
    > the internal unicodestring byte representation to utf-8 and writing it out.
    > It still seems strange and I'm still not completely clear as to what is
    > going on at the byte stream level for each of these steps.



    Don't think of unicode as a byte stream. It's a bunch of numbers that
    map to a bunch of symbols. The byte stream only comes into play when
    you want to send unicode somewhere (file, socket, etc) and you then have
    to encode the unicode into bytes.

    Hope this helps!

    ~Ethan~
     
    Ethan Furman, Jul 29, 2010
    #6
  7. On Thu, Jul 29, 2010 at 10:59 AM, Joe Goldthwaite <> wrote:
    > Hi Ulrich,
    >
    > Ascii.csv isn't really a latin-1 encoded file.  It's an ascii file with a
    > few characters above the 128 range that are causing Postgresql Unicode
    > errors.  Those characters work fine in the Windows world but they're not the
    > correct byte representation for Unicode. What I'm attempting to do is
    > translate those upper range characters into the correct Unicode
    > representations so that they look the same in the Postgresql database as
    > they did in the CSV file.


    Having bytes outside of the ASCII range means, by definition, that the
    file is not ASCII encoded. ASCII only defines bytes 0-127. Bytes
    outside of that range mean either the file is corrupt, or it's in a
    different encoding. In this case, you've been able to determine the
    correct encoding (latin-1) for those errant bytes, so the file itself
    is thus known to be in that encoding.

    Carey
     
    Carey Tilden, Jul 29, 2010
    #7
  8. Joe Goldthwaite wrote:
    > Hi Ulrich,
    >
    > Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
    > few characters above the 128 range . . .


    It took me a while to get this point too (if you already have "gotten
    it", I apologize, but the above comment leads me to believe you haven't).

    *Every* file is an encoded file... even your UTF-8 file is encoded using
    the UTF-8 format. Someone correct me if I'm wrong, but I believe
    lower-ascii (0-127) matches up to the first 128 Unicode code points, so
    while those first 128 code-points translate easily to ascii, ascii is
    still an encoding, and if you have characters higher than 127, you don't
    really have an ascii file -- you have (for example) a cp1252 file (which
    also, not coincidentally, shares the first 128 characters/code points
    with ascii).
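    You can see the difference above 127 by decoding a single byte both ways
    (a sketch; \x80 happens to be the Euro sign in cp1252 but only a control
    character in latin-1):

    >>> '\x80'.decode('cp1252')
    u'\u20ac'
    >>> '\x80'.decode('latin-1')
    u'\x80'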

    Hopefully I'm not adding to the confusion. ;)

    ~Ethan~
     
    Ethan Furman, Jul 29, 2010
    #8
  9. On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
    > This still seems odd to me. I would have thought that the unicode function
    > would return a properly encoded byte stream that could then simply be
    > written to disk. Instead it seems like you have to re-encode the byte stream
    > to some kind of escaped Ascii before it can be written back out.


    Here's what's really going on.

    Unicode strings within Python have to be indexable. So the internal
    representation of Unicode has (usually) two bytes for each character,
    so they work like arrays.

    UTF-8 is a stream format for Unicode. It's slightly compressed;
    each character occupies 1 to 4 bytes, and the base ASCII characters
    (0..127 only, not 128..255) occupy one byte each. The format is
    described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
    stream has to be parsed from the beginning to keep track of where each
    Unicode character begins. So it's not a suitable format for
    data being actively worked on in memory; it can't be easily indexed.

    That's why it's necessary to convert to UTF-8 before writing
    to a file or socket.
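    The 1-to-4 byte spread is easy to see at the prompt (a sketch):

    >>> for ch in [u'A', u'\xe1', u'\u20ac', u'\U00010000']:
    ...     print repr(ch), len(ch.encode('utf-8'))
    ...
    u'A' 1
    u'\xe1' 2
    u'\u20ac' 3
    u'\U00010000' 4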

    John Nagle
     
    John Nagle, Jul 29, 2010
    #9
  10. John Nagle wrote:
    > On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
    >> This still seems odd to me. I would have thought that the unicode
    >> function
    >> would return a properly encoded byte stream that could then simply be
    >> written to disk. Instead it seems like you have to re-encode the byte
    >> stream
    >> to some kind of escaped Ascii before it can be written back out.

    >
    > Here's what's really going on.
    >
    > Unicode strings within Python have to be indexable. So the internal
    > representation of Unicode has (usually) two bytes for each character,
    > so they work like arrays.
    >
    > UTF-8 is a stream format for Unicode. It's slightly compressed;
    > each character occupies 1 to 4 bytes, and the base ASCII characters
    > (0..127 only, not 128..255) occupy one byte each. The format is
    > described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
    > stream has to be parsed from the beginning to keep track of where each
    > Unicode character begins. So it's not a suitable format for
    > data being actively worked on in memory; it can't be easily indexed.
    >

    Not entirely correct. The advantage of UTF-8 is that although different
    codepoints might be encoded into different numbers of bytes, it's easy to
    tell whether a particular byte is the first in its sequence, so you
    don't have to parse from the start of the file. It is true, however, that
    it can't be easily indexed.
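    That self-synchronising property comes from the bit patterns:
    continuation bytes always have the form 10xxxxxx, so a one-line test
    spots them (a sketch):

    >>> data = u'\u20ac'.encode('utf-8')    # '\xe2\x82\xac'
    >>> [(ord(b) & 0xC0) == 0x80 for b in data]
    [False, True, True]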

    > That's why it's necessary to convert to UTF-8 before writing
    > to a file or socket.
    >
     
    MRAB, Jul 29, 2010
    #10
  11. On Thu, 29 Jul 2010 11:14:24 -0700, Ethan Furman wrote:

    > Don't think of unicode as a byte stream. It's a bunch of numbers that
    > map to a bunch of symbols.


    Not only are Unicode strings a bunch of numbers ("code points", in
    Unicode terminology), but the numbers are not necessarily all the same
    width.

    The full Unicode system allows for 1,114,112 characters, far more than
    will fit in a two-byte code point. The Basic Multilingual Plane (BMP)
    includes the first 2**16 (65536) of those characters, or code points
    U+0000 through U+FFFF; there are a further 16 supplementary planes of
    2**16 characters each, or code points U+10000 through U+10FFFF.

    As I understand it (and I welcome corrections), some implementations of
    Unicode only support the BMP and use a fixed-width implementation of 16-
    bit characters for efficiency reasons. Supporting the entire range of
    code points would require either a fixed-width of 21-bits (which would
    then probably be padded to four bytes), or a more complex variable-width
    implementation.

    It looks to me like Python uses a 16-bit implementation internally, which
    leads to some rather unintuitive results for code points in the
    supplementary planes...

    >>> c = chr(2**18)
    >>> c
    '\U00040000'
    >>> len(c)
    2
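
    Presumably those two items are the UTF-16 surrogate pair for the code
    point (a sketch; values worked out by hand, so corrections welcome):

    >>> c[0], c[1]
    ('\ud8c0', '\udc00')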


    --
    Steven
     
    Steven D'Aprano, Jul 30, 2010
    #11
  12. "Joe Goldthwaite" <> wrote in message
    news:5A04846ED83745A8A99A944793792810@NewMBP...
    > Hi Steven,
    >
    > I read through the article you referenced. I understand Unicode better
    > now.
    > I wasn't completely ignorant of the subject. My confusion is more about
    > how
    > Python is handling Unicode than Unicode itself. I guess I'm fighting my
    > own
    > misconceptions. I do that a lot. It's hard for me to understand how
    > things
    > work when they don't function the way I *think* they should.
    >
    > Here's the main source of my confusion. In my original sample, I had read
    > a
    > line in from the file and used the unicode function to create a
    > unicodestring object;
    >
    > unicodestring = unicode(line, 'latin1')
    >
    > What I thought this step would do is translate the line to an internal
    > Unicode representation.


    Correct.

    > The problem character \xe1 would have been
    > translated into a correct Unicode representation for the accented "a"
    > character.


    Which just so happens to be u'\xe1', which probably adds to your confusion
    later :^) The first 256 Unicode code points map to latin1.

    >
    > Next I tried to write the unicodestring object to a file thusly;
    >
    > output.write(unicodestring)
    >
    > I would have expected the write function to request the byte string from
    > the
    > unicodestring object and simply write that byte string to a file. I
    > thought
    > that at this point, I should have had a valid Unicode latin1 encoded file.
    > Instead I get an error that the character \xe1 is invalid.


    Incorrect. The unicodestring object doesn't save the original byte string,
    so there is nothing to "request".

    > The fact that the \xe1 character is still in the unicodestring object
    > tells
    > me it wasn't translated into whatever python uses for its internal Unicode
    > representation. Either that or the unicodestring object returns the
    > original string when it's asked for a byte stream representation.


    Both incorrect. As I mentioned earlier, the first 256 Unicode code points
    map to latin1. It *was* translated to a Unicode code point whose value
    (but not internal representation!) is the same as the latin1 byte.
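    One way to see that value-level match (a sketch):

    >>> ord(u'\xe1')    # the code point U+00E1 ...
    225
    >>> ord('\xe1')     # ... has the same number as the latin1 byte
    225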

    > Instead of just writing the unicodestring object, I had to do this;
    >
    > output.write(unicodestring.encode('utf-8'))


    This is exactly what you need to do...explicitly encode the Unicode string
    into a byte string.

    > This is doing what I thought the other steps were doing. It's translating
    > the internal unicodestring byte representation to utf-8 and writing it
    > out.
    > It still seems strange and I'm still not completely clear as to what is
    > going on at the byte stream level for each of these steps.


    I'm surprised that by now no one has mentioned the codecs module. You
    originally stated you are using Python 2.4.4, which I looked up and does
    support the codecs module.

    import codecs

    infile = codecs.open('ascii.csv', 'r', 'latin1')
    outfile = codecs.open('unicode.csv', 'w', 'utf-8')
    for line in infile:
        outfile.write(line)
    infile.close()
    outfile.close()

    As you can see, codecs.open takes a parameter for the encoding of the file.
    Lines read are automatically decoded into Unicode; Unicode lines written are
    automatically encoded into a byte stream.

    -Mark
     
    Mark Tolonen, Jul 30, 2010
    #12
  13. On Thu, 29 Jul 2010 23:49:40 +0000, Steven D'Aprano wrote:

    > It looks to me like Python uses a 16-bit implementation internally,


    It typically uses the platform's wchar_t, which is 16-bit on Windows and
    (typically) 32-bit on Unix.

    IIRC, it's possible to build Python with 32-bit Unicode on Windows, but
    that will be inefficient (because it has to convert to/from 16-bit
    when calling Windows API functions) and will break any C modules which
    pass the pointer to the internal buffer directly to API functions.
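    You can check which flavour you have from sys.maxunicode (a sketch):

    >>> import sys
    >>> sys.maxunicode    # 65535 on a narrow build, 1114111 on a wide one
    65535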
     
    Nobody, Jul 30, 2010
    #13
  14. In message <>, Joe
    Goldthwaite wrote:

    > Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
    > few characters above the 128 range that are causing Postgresql Unicode
    > errors. Those characters work fine in the Windows world but they're not
    > the correct byte representation for Unicode.


    In other words, the encoding you want to decode from in this case is
    windows-1252.
     
    Lawrence D'Oliveiro, Jul 30, 2010
    #14
  15. In message <4c51d3b6$0$1638$>, John Nagle wrote:

    > UTF-8 is a stream format for Unicode. It's slightly compressed ...


    “Variable-length” is not the same as “compressed”.

    Particularly if you’re mainly using non-Roman scripts...
     
    Lawrence D'Oliveiro, Jul 30, 2010
    #15
  16. In message <>, Joe
    Goldthwaite wrote:

    > Next I tried to write the unicodestring object to a file thusly;
    >
    > output.write(unicodestring)
    >
    > I would have expected the write function to request the byte string from
    > the unicodestring object and simply write that byte string to a file.


    Encoded according to which encoding?
     
    Lawrence D'Oliveiro, Jul 30, 2010
    #16
  17. On Jul 30, 4:18 am, Carey Tilden <> wrote:
    > In this case, you've been able to determine the
    > correct encoding (latin-1) for those errant bytes, so the file itself
    > is thus known to be in that encoding.


    The most probable "correct" encoding is, as already stated and agreed
    by the OP, cp1252.
     
    John Machin, Jul 31, 2010
    #17
