Re: string to unicode

Discussion in 'Python' started by Chris Angelico, Aug 15, 2011.

  1. On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff wrote:
    > if I am using the standard csv library to read contents of a csv file which
    > contains Unicode strings (short example: '\xe8\x9f\x92\xe8\x9b\x87'), how do
    > I use a python Unicode method such as decode or encode to transform this
    > string type into a python unicode type? Must I know the encoding (byte
    > groupings) of the Unicode? Can I get this from the file? Perhaps I need to
    > open the file with particular attributes?
    >


    Start here:

    http://www.joelonsoftware.com/articles/Unicode.html

    The CSV file, being stored on disk, cannot contain Unicode strings; it
    can only contain bytes. If you know the encoding (e.g. UTF-8, UCS-2,
    etc.), then you can decode the bytes using that. If you don't, your best
    bet is to ask whoever produced the file; failing that, check the first
    few bytes: if they are "\xFF\xFE", "\xFE\xFF" or "\xEF\xBB\xBF", then
    the file is probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those
    being the encodings of the BOM). There may be other clues, too, but
    normally it's best to get the encoding separately from the data rather
    than try to guess it from the data itself.
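
As a rough illustration of the paragraph above, here is a minimal sketch of BOM sniffing followed by decoding. It assumes Python 3, where the `'\xe8...'` byte string from the question is a `bytes` object and the decoded result is `str`. The names `BOMS`, `sniff_bom` and `to_text` are made up for this example, and the UTF-8 fallback is only an assumption that happens to fit the sample bytes.

```python
import codecs

# BOM -> codec name. The 4-byte UTF-32 marks are tested before the UTF-16
# ones, because the UTF-32LE BOM begins with the same two bytes as the
# UTF-16LE BOM. These codecs consume the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def to_text(raw, fallback='utf-8'):
    """Decode raw bytes via the BOM if present, else via the fallback."""
    return raw.decode(sniff_bom(raw) or fallback)


# The bytes from the question carry no BOM; decoded as UTF-8 they yield
# two characters from six bytes.
sample = b'\xe8\x9f\x92\xe8\x9b\x87'
text = to_text(sample)
print(len(sample), len(text))
```

A missing BOM says nothing about the encoding, so the `fallback` argument is where an encoding obtained from the file's producer would go.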

    Chris Angelico
    Chris Angelico, Aug 15, 2011

  2. Chris Angelico wrote:

    > On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff wrote:
    >> if I am using the standard csv library to read contents of a csv file
    >> which contains Unicode strings (short example:
    >> '\xe8\x9f\x92\xe8\x9b\x87'), how do I use a python Unicode method such as
    >> decode or encode to transform this string type into a python unicode
    >> type? Must I know the encoding (byte groupings) of the Unicode? Can I get
    >> this from the file? Perhaps I need to open the file with particular
    >> attributes?

    >
    > Start here:
    >
    > http://www.joelonsoftware.com/articles/Unicode.html
    >
    > The CSV file, being stored on disk, cannot contain Unicode strings; it
    > can only contain bytes. If you know the encoding (e.g. UTF-8, UCS-2,
    > etc.), then you can decode the bytes using that. If you don't, your best
    > bet is to ask whoever produced the file; failing that, check the first
    > few bytes: if they are "\xFF\xFE", "\xFE\xFF" or "\xEF\xBB\xBF", then
    > the file is probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those
    > being the encodings of the BOM). There may be other clues, too, but
    > normally it's best to get the encoding separately from the data rather
    > than try to guess it from the data itself.


    As this problem really is not a new one, there are several more – if I may
    say so – pythonic approaches:

    <http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file>

    Improving Billy Mays' "matching brackets" checker, chardet worked for me
    (the test file was UTF-8-encoded):

    -----------------------------------------------------------------------
    # encoding: utf-8
    '''
    Created on 2011-07-18

    @author: Thomas 'PointedEars' Lahn, based on an idea of
    Billy Mays
    in <news:j01ph6$knt$>
    '''
    import sys, os, chardet

    # Maps each closing bracket to its opening counterpart.
    pairs = {u'}': u'{', u')': u'(', u']': u'[',
             u'”': u'“', u'›': u'‹', u'»': u'«',
             u'】': u'【', u'〉': u'〈', u'》': u'《',
             u'」': u'「', u'』': u'『'}
    valid = set(v for pair in pairs.items() for v in pair)

    if __name__ == '__main__':
        for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
            for name in filenames:
                stack = []        # unmatched openers as (char, line, col)
                first_bad = None  # first offending bracket, if any

                file_path = os.path.join(dirpath, name)

                with open(file_path, 'rb') as f:
                    # Keep the lines in a list: they are needed twice, once
                    # for detection and once for the bracket scan.
                    lines = f.readlines()

                # chardet reports None when it cannot guess; falling back
                # to UTF-8 here is an assumption, not part of chardet.
                encoding = (chardet.detect(b''.join(lines))['encoding']
                            or 'utf-8')

                chars = ((c, line_no, col)
                         for line_no, line in enumerate(lines, 1)
                         for col, c in enumerate(line.decode(encoding), 1)
                         if c in valid)
                for c, line_no, col in chars:
                    if c in pairs:
                        if stack and stack[-1][0] == pairs[c]:
                            stack.pop()
                        elif first_bad is None:
                            first_bad = (c, line_no, col)
                    else:
                        stack.append((c, line_no, col))

                # An opener that was never closed is an error as well.
                if first_bad is None and stack:
                    first_bad = stack[0]

                print('%s: %s' % (name, "good" if first_bad is None
                                  else "bad '%s' at %s:%s" % first_bad))
    -----------------------------------------------------------------------
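
To tie this back to the original CSV question, here is a minimal sketch of reading a CSV file once an encoding is known or guessed. It assumes Python 3, where the `csv` module works on text and `open()` does the decoding. The helper name `read_csv_rows` and the sample file contents are invented for the illustration. The `encoding` argument could come from the file's producer, from a BOM, or from a chardet guess such as `chardet.detect(raw)['encoding']` as used in the script above.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding):
    """Return the rows of a CSV file as lists of decoded text cells."""
    # newline='' leaves line-ending handling to the csv module, which
    # matters for quoted fields that contain embedded newlines.
    with open(path, encoding=encoding, newline='') as f:
        return list(csv.reader(f))


# Demo: write a small UTF-8 file containing the bytes from the question,
# then read it back as text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'wb') as f:
        f.write(b'id,name\r\n1,\xe8\x9f\x92\xe8\x9b\x87\r\n')
    rows = read_csv_rows(path, 'utf-8')

print(rows[1][1] == '\u87d2\u86c7')  # True
```

If the guessed encoding is wrong, `open()` will typically raise `UnicodeDecodeError` while reading, or silently produce the wrong characters for encodings that accept any byte sequence, which is one more reason to get the encoding from the source of the file.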

    HTH

    --
    PointedEars

    Please do not Cc: me.
    Thomas 'PointedEars' Lahn, Aug 15, 2011
