A Unicode problem -HELP

Discussion in 'Python' started by manstey, May 12, 2006.

  1. manstey

    manstey Guest

    I am writing a program to translate a list of ascii letters into a
    different language that requires unicode encoding. This is what I have
    done so far:

    1. I have # -*- coding: UTF-8 -*- as my first line.
    2. In Wing IDE I have set Default Encoding to UTF-8
    3. I have imported codecs and opened and written my file, which doesn't
    have a BOM, as encoding=UTF-8
    4. I have written a dictionary for translation, with entries such as
    {'F':u'\u0254'} and a function to do the translation

    Everything works fine, except that my output file, when loaded in
    unicode aware emeditor has
    (u'F', u'\u0254')

    But I want to display it as:
    ('F', 'ɔ') # where the ɔ is a back-to-front 'c'

    So my questions are:
    1. How do I do this?
    2. Do I need to change any of my steps above?
    manstey, May 12, 2006
    #1

  2. manstey wrote:
    > 1. I have # -*- coding: UTF-8 -*- as my first line.
    > 2. In Wing IDE I have set Default Encoding to UTF-8
    > 3. I have imported codecs and opened and written my file, which doesn't
    > have a BOM, as encoding=UTF-8
    > 4. I have written a dictionary for translation, with entries such as
    > {'F':u'\u0254'} and a function to do the translation
    >
    > Everything works fine, except that my output file, when loaded in
    > unicode aware emeditor has
    > (u'F', u'\u0254')


    I couldn't quite follow this description: what is "your output file"
    (in what step is it created?), and how does

    (u'F', u'\u0254')

    get into this file? What is the precise Python statement that
    produces that line of output?

    > So my questions are:
    > 1. How do I do this?


    Most likely, you use (directly or indirectly) the repr() function
    to convert a tuple into that string. You shouldn't do that;
    instead, you should format the elements of the tuple yourself, e.g.
    through

    print >>f, u"('%s', '%s')" % value
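
A minimal sketch of the difference between the two renderings. It is written for Python 3 rather than the Python 2 used in this thread (an assumption, so that it runs on current interpreters). Python 3's built-in ascii() escapes non-ASCII characters much as Python 2's repr() did; the names `value`, `escaped` and `formatted` are illustrative only.

```python
# Python 3 sketch; `value` stands in for one (source, target) pair.
value = ('F', '\u0254')

# ascii() escapes non-ASCII characters, similar to repr() in Python 2,
# so the open o is rendered as a six-character escape sequence.
escaped = ascii(value)

# Formatting the elements yourself keeps the real character.
formatted = "('%s', '%s')" % value
```

Here `escaped` holds the backslash-u form that ended up in the output file, while `formatted` holds the actual open o character.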

    Regards,
    Martin
    Martin v. Löwis, May 12, 2006
    #2

  3. manstey

    manstey Guest

    Hi Martin,

    Here is how I write:

    input_file = open(input_file_loc, 'r')
    output_file = open(output_file_loc, 'w')
    for line in input_file:
        output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples

    (u'F', u'\u0254') are two of the many unicode tuple elements returned
    by the three functions.

    What am I doing wrong?
    manstey, May 17, 2006
    #3
  4. Ben Finney

    Ben Finney Guest

    "manstey" <> writes:

    > input_file = open(input_file_loc, 'r')
    > output_file = open(output_file_loc, 'w')
    > for line in input_file:
    >     output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples


    If you mean that 'word_info', 'parse' and 'gloss' are three functions
    that return tuples, then you get that return value by calling them.

    >>> def foo():
    ...     return "foo's return value"
    ...
    >>> def bar(baz):
    ...     return "bar's return value (including '%s')" % baz
    ...
    >>> print foo()
    foo's return value
    >>> print bar
    <function bar at 0x401fe80c>
    >>> print bar("orange")
    bar's return value (including 'orange')

    --
    \ "A man must consider what a rich realm he abdicates when he |
    `\ becomes a conformist." -- Ralph Waldo Emerson |
    _o__) |
    Ben Finney
    Ben Finney, May 17, 2006
    #4
  5. manstey

    manstey Guest

    I'm a newbie at python, so I don't really understand how your answer
    solves my unicode problem.

    I have done more reading on unicode and then tried my code in IDLE
    rather than WING IDE, and discovered that it works fine in IDLE, so I
    think WING has a problem with unicode. For example, in WING this code
    returns an error:

    a={'a':u'\u0254'}
    print a['a']


    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0254' in
    position 0: ordinal not in range(128)

    but in IDLE it correctly prints open o

    So, assuming I now work in IDLE, all I want help with is how to read in
    an ascii string and convert its letters to various unicode values and
    save the resulting 'string' to a utf-8 text file. Is this clear?

    so in pseudo code
    1. F is converted to \u0254, $ is converted to \u0283, C is converted
    to \u02A6\u02C1, etc.
    (i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc)
    2. I read in a file with lines like:
    F$
    FCF$
    $$C$ etc
    3. I convert this to
    \u0254\u0283
    \u0254\u02A6\u02C1\u0254 etc
    4. i save the results in a new file

    when i read the new file in a unicode editor (EmEditor), i don't see
    \u0254\u02A6\u02C1\u0254, but I see the actual characters (open o, esh,
    ts digraph, modifier letter reversed glottal stop, etc.)
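
The four steps above can be sketched roughly as follows. This is Python 3, not the Python 2 of the thread (an assumption made so it runs today), and the table contents, the `translate` helper and the temporary output path are all hypothetical stand-ins rather than the poster's real code.

```python
import os
import tempfile

# Hypothetical translation table built from the mappings listed above.
TRANSLATE = {
    'F': '\u0254',        # open o
    '$': '\u0283',        # esh
    'C': '\u02a6\u02c1',  # ts digraph + modifier letter reversed glottal stop
}

def translate(text):
    """Replace each known character; leave anything else unchanged."""
    return ''.join(TRANSLATE.get(ch, ch) for ch in text)

lines = ['F$', 'FCF$', '$$C$']
converted = [translate(line) for line in lines]

# Write the result as UTF-8; the path is a throwaway temp file.
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with open(out_path, 'w', encoding='utf-8', newline='\n') as out:
    for line in converted:
        out.write(line + '\n')

# Reading the raw bytes back shows UTF-8 sequences, not \u escapes.
with open(out_path, 'rb') as f:
    raw = f.read()
```

The file on disk contains the encoded characters themselves, which is why a Unicode-aware editor displays open o, esh and so on rather than backslash escapes.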

    I'm sure this is straightforward but I can't get it to work.

    All help appreciated!
    manstey, May 17, 2006
    #5
  6. Ben Finney

    Ben Finney Guest

    "manstey" <> writes:

    > I'm a newbie at python, so I don't really understand how your answer
    > solves my unicode problem.


    Since your replies fail to give any context of the existing
    discussion, I could only go by the content of what you'd written in
    that message. I didn't see a problem with anything Unicode -- I saw
    three objects being added together, which you told us were function
    objects. That's the problem I pointed out.

    --
    \ "When a well-packaged web of lies has been sold to the masses |
    `\ over generations, the truth will seem utterly preposterous and |
    _o__) its speaker a raving lunatic." -- Dresden James |
    Ben Finney
    Ben Finney, May 17, 2006
    #6
  7. manstey wrote:
    > input_file = open(input_file_loc, 'r')
    > output_file = open(output_file_loc, 'w')
    > for line in input_file:
    >     output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples
    >
    > (u'F', u'\u0254') are two of the many unicode tuple elements returned
    > by the three functions.
    >
    > What am I doing wrong?


    Well, the primary problem is that you don't tell us what you are really
    doing. For example, it is very hard to believe that this is the actual
    code that you are running:

    If word_info, parse, and gloss are functions, the code should read

    input_file = open(input_file_loc, 'r')
    output_file = open(output_file_loc, 'w')
    for line in input_file:
        output_file.write(str(word_info() + parse() + gloss()))

    I.e. you need to call the functions for this code to make any sense.
    You have probably chosen to edit the code in order to not show us
    your real code. Unfortunately, since you are a newbie in Python,
    you make errors in doing so, and omit important details. That makes
    it very difficult to help you.

    Regards,
    Martin
    Martin v. Löwis, May 17, 2006
    #7
  8. manstey

    manstey Guest

    OK, I apologise for not being clearer.

    1. Here is my input data file, line 2:
    gn1:1,1.2 R")$I73YT R")$IYT@ncfsa

    2. Here is my output data file, line 2:
    u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
    u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
    '', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

    3. Here is my main program:
    # -*- coding: UTF-8 -*-
    import codecs

    import splitFunctions
    import surfaceIPA

    # Constants for file location

    # Working directory constants
    dir_root = 'E:\\'
    dir_relative = '2 Core\\2b Data\\Data Working\\'

    # Input file constants
    input_file_name = 'in.grab.txt'
    input_file_loc = dir_root + dir_relative + input_file_name
    # Initialise input file
    input_file = codecs.open(input_file_loc, 'r', 'utf-8')

    # Output file constants
    output_file_name = 'out.grab.txt'
    output_file_loc = dir_root + dir_relative + output_file_name
    # Initialise output file
    output_file = codecs.open(output_file_loc, 'w', 'utf-8') # unicode

    i = 0
    for line in input_file:
        if line[0] != '>': # Ignore headers
            i += 1
            if i != 1:
                word_info = splitFunctions.splitGrab(line, i)
                parse=splitFunctions.splitParse(word_info[10])
                gloss=surfaceIPA.surfaceIPA(word_info[6],word_info[8],word_info[9],parse)
                a=str(word_info + parse + gloss).encode('utf-8')
                a=a[1:len(a)-1]
                output_file.write(a)
                output_file.write('\n')

    input_file.close()
    output_file.close()

    print 'done'


    4. Here is my problem:
    At the end of my output file, where my unicode character \u0254 (OPEN
    O) appears, the file has '\xc9\x94'

    What I want is an output file like:

    'gn', '1', '1', '1', '2', '-', ..... 'ɔ'

    where ɔ is an open O, and would display correctly in the appropriate
    font.

    Once I can get it to display properly, I will rewrite gloss so that it
    returns a proper translation of 'R")$I73YT', which will be a string of
    unicode characters.

    Is this clearer? The other two functions are basic. splitGrab turns
    'gn1:1,1.2 R")$I73YT R")$IYT@ncfsa' into 'gn 1 1 1 2 R")$I73YT R")$IYT
    @ ncfsa' and splitParse turns the final piece of this 'ncfsa' into 'n c
    f s a'. They have to be done separately as splitParse involves some
    translation and program logic. SurfaceIPA reads in 'R")$I73YT' and
    other data to produce the unicode string. At the moment it just returns
    two dummy strings and u'\u0254'.encode('utf-8').

    All help is appreciated!

    Thanks
    manstey, May 17, 2006
    #8
  9. manstey wrote:
    > a=str(word_info + parse + gloss).encode('utf-8')
    > a=a[1:len(a)-1]
    >
    > Is this clearer?


    Indeed. The problem is your usage of str() to "render" the output.
    As word_info+parse+gloss is a list (or is it a tuple?), str() will
    already produce "Python source code", i.e. an ASCII byte string
    that can be read back into the interpreter; all Unicode is gone
    from that string. If you want comma-separated output, you should
    do this:

    def comma_separated_utf8(items):
        result = []
        for item in items:
            result.append(item.encode('utf-8'))
        return ", ".join(result)

    and then
    a = comma_separated_utf8(word_info + parse + gloss)

    Then you don't have to drop the parentheses from a anymore, as
    it won't have parentheses in the first place.

    As the encoding will be done already in the output file,
    the following should also work:

    a = u", ".join(word_info + parse + gloss)

    This would make "a" a comma-separated unicode string, so that
    the subsequent output_file.write(a) encodes it as UTF-8.
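
A small Python 3 sketch of that second variant (the thread uses Python 2, so this is an adaptation, not the original code). The three tuples are made-up stand-ins, and an in-memory buffer wrapped in a UTF-8 text layer plays the role of the file opened with codecs.open.

```python
import io

# Stand-ins for the three tuples; the real values come from
# splitGrab, splitParse and surfaceIPA in the program above.
word_info = ('gn', '1')
parse = ('n', 'c')
gloss = ('\u0254',)

# Joining keeps everything as text; nothing is escaped.
a = ', '.join(word_info + parse + gloss)

# An in-memory binary buffer wrapped in a UTF-8 text layer stands in
# for the output file opened with the 'utf-8' encoding.
buffer = io.BytesIO()
output_file = io.TextIOWrapper(buffer, encoding='utf-8', newline='\n')
output_file.write(a)
output_file.write('\n')
output_file.flush()

# The bytes that would land on disk.
written = buffer.getvalue()
```

The encoding happens exactly once, inside the file object, so the caller never has to call encode() on the joined string.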

    If that doesn't work, I would like to know what the exact
    value of gloss is, do

    print "GLOSS IS", repr(gloss)

    to print it out.

    Regards,
    Martin
    Martin v. Löwis, May 17, 2006
    #9
  10. Tim Roberts

    Tim Roberts Guest

    "manstey" <> wrote:
    >
    >I have done more reading on unicode and then tried my code in IDLE
    >rather than WING IDE, and discovered that it works fine in IDLE, so I
    >think WING has a problem with unicode.


    Rather, its output defaults to ASCII.
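
The error quoted earlier in the thread can be reproduced by encoding explicitly, as in this Python 3 sketch (an adaptation; in Python 2 the ASCII encoding step happened implicitly inside print). The variable names are illustrative.

```python
open_o = '\u0254'

# Encoding to ASCII fails because U+0254 is outside range(128).
try:
    open_o.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding to UTF-8 succeeds and yields two bytes.
utf8_bytes = open_o.encode('utf-8')
```

So the character itself is fine; what differs between the two environments is the encoding applied to the output stream.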

    >So, assuming I now work in IDLE, all I want help with is how to read in
    >an ascii string and convert its letters to various unicode values and
    >save the resulting 'string' to a utf-8 text file. Is this clear?
    >
    >so in pseudo code
    >1. F is converted to \u0254, $ is converted to \u0283, C is converted
    >to \u02A6\u02C1, etc.
    >(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc)
    >2. I read in a file with lines like:
    >F$
    >FCF$
    >$$C$ etc
    >3. I convert this to
    >\u0254\u0283
    >\u0254\u02A6\u02C1\u0254 etc
    >4. i save the results in a new file
    >
    >when i read the new file in a unicode editor (EmEditor), i don't see
    >\u0254\u02A6\u02C1\u0254, but I see the actual characters (open o, esh,
    >ts digraph, modifier letter reversed glottal stop, etc.)


    Of course. Isn't that exactly what you wanted? The Python string
    u"\u0254" contains one character (Latin small open o). It does NOT contain
    6 characters. If you write that to a file, that file will contain 1
    character -- 2 bytes.

    If you actually want the 6-character string \u0254 written to a file, then
    you need to escape the \u special code: "\\u0254". However, I don't see
    what good that would do you. The \u escape is a Python source code thing.
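
The character counts can be checked directly. This is a Python 3 sketch (the u"" prefix from the thread is no longer needed there), and the variable names are illustrative.

```python
open_o = '\u0254'      # the \u escape is resolved when the source is read
escaped = '\\u0254'    # backslash escaped: six literal characters

char_count = len(open_o)              # number of characters in the string
utf8_bytes = open_o.encode('utf-8')   # what a UTF-8 file would contain
escaped_count = len(escaped)          # length of the literal escape text
```

The first string is one character that occupies two bytes in UTF-8; the second is the six-character text of the escape sequence itself.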

    >I'm sure this is straightforward but I can't get it to work.


    I think it is working exactly as you want.
    --
    - Tim Roberts,
    Providenza & Boekelheide, Inc.
    Tim Roberts, May 17, 2006
    #10
  11. Ben Finney

    Ben Finney Guest

    "manstey" <> writes:

    > 1. Here is my input data file, line 2:
    > gn1:1,1.2 R")$I73YT R")$IYT@ncfsa


    Your program is reading this using the 'utf-8' encoding. When it does
    so, all the characters you show above will be read in happily as you
    see them (so long as you view them with the 'utf-8' encoding), and
    converted to Unicode characters representing the same thing.

    Do you have any other information that might indicate this is *not*
    utf-8 encoded data?

    > 2. Here is my output data file, line 2:
    > u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
    > u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
    > '', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'


    As you can see, reading the file with 'utf-8' encoding and writing it
    out again as 'utf-8' encoding, the characters (as you posted them in
    the message) have been faithfully preserved by Unicode processing and
    encoding.


    Bear in mind that when you present the "input data file, line 2" to
    us, your message is itself encoded using a particular character
    encoding. (In the case of the message where you wrote the above, it's
    'utf-8'.) This means we may or may not be seeing the exact same bytes
    you see in the input file; we're seeing characters in the encoding you
    used to post the message.

    You need to know what encoding was used when the data in that file was
    written. You can then read the file using that encoding, and convert
    the characters to unicode for processing inside your program. When you
    write them out again, you can choose the 'utf-8' encoding as you have
    done.
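
A minimal Python 3 sketch of that read-decode-process-encode cycle. The file names, the sample text and the choice of Latin-1 as the source encoding are assumptions for illustration only; the thread's own input is read as UTF-8.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'in.txt')    # hypothetical file names
dst_path = os.path.join(workdir, 'out.txt')

# Simulate an input file that was saved as Latin-1 (assumed encoding).
with open(src_path, 'wb') as f:
    f.write('caf\u00e9\n'.encode('latin-1'))

# Decode with the encoding the file was actually written in ...
with open(src_path, 'r', encoding='latin-1', newline='') as src:
    text = src.read()

# ... and encode as UTF-8 on the way out.
with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
    dst.write(text)

with open(dst_path, 'rb') as f:
    utf8_bytes = f.read()
```

Inside the program the data is plain text; the byte-level encoding only matters at the two file boundaries.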

    Have you read this excellent article on understanding the programming
    implications of character sets and Unicode?

    "The Absolute Minimum Every Software Developer Absolutely,
    Positively Must Know About Unicode and Character Sets (No
    Excuses!)"
    <URL:http://www.joelonsoftware.com/articles/Unicode.html>

    --
    \ "I'd like to see a nude opera, because when they hit those high |
    `\ notes, I bet you can really see it in those genitals." -- Jack |
    _o__) Handey |
    Ben Finney
    Ben Finney, May 17, 2006
    #11
  12. manstey

    manstey Guest

    Hi Martin,

    Thanks very much. Your def comma_separated_utf8(items): approach raises
    an exception in codecs.py, so I tried a = u", ".join(word_info + parse +
    gloss), which works perfectly. So I want to understand exactly why this
    works. word_info and parse and gloss are all tuples. does str convert
    the three into an ascii string? but the join method retains their
    unicode status.

    In the text file, the unicode characters appear perfectly, so I'm very
    happy.

    cheers
    matthew
    manstey, May 17, 2006
    #12
  13. manstey wrote:
    > Thanks very much. Your def comma_separated_utf8(items): approach raises
    > an exception in codecs.py, so I tried a = u", ".join(word_info + parse +
    > gloss), which works perfectly. So I want to understand exactly why this
    > works. word_info and parse and gloss are all tuples. does str convert
    > the three into an ascii string?


    Correct: a tuple is converted into a string with (contents), where
    contents is achieved through comma-separating repr() of each tuple
    element. repr(a_unicode_string) creates a \x or \u representation.

    > but the join method retains their unicode status.


    Correct. The result is a Unicode string if the joiner is a Unicode
    string, and all tuple elements are Unicode strings. If one is not,
    a conversion to Unicode is attempted.
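
In Python 2 that attempted conversion used the default ASCII codec, so a byte string that was already UTF-8 encoded could not be converted. That is probably also why writing the pre-encoded result of comma_separated_utf8() to a file opened through codecs raised an exception. The Python 3 sketch below makes the conversion explicit, since Python 3 never converts implicitly; the variable names are illustrative.

```python
encoded = '\u0254'.encode('utf-8')   # b'\xc9\x94', already bytes

# Explicit stand-in for Python 2's implicit byte string -> unicode step.
try:
    encoded.decode('ascii')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Python 3 refuses to join text and bytes at all.
try:
    ', '.join(['F', encoded])
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Keeping every element as text until the final write avoids both failures.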

    > In the text file, the unicode characters appear perfectly, so I'm very
    > happy.


    Glad it works.

    Regards,
    Martin
    Martin v. Löwis, May 17, 2006
    #13