convert string with raw binary data to unicode

Discussion in 'Python' started by Achim Domma, Feb 12, 2004.

  1. Achim Domma

    Achim Domma Guest

    Hi,

    I want to pass raw binary data from a file to a COM object. I read the data
    from file like this:

    data = file('path_to_file','rb').read()

    If passed to a COM object, data is converted to unicode in the way one would
    expect for strings, i.e. a lot of zeros are filled in. I want each pair of
    characters from data to be interpreted as one unicode character. I read the
    documentation about codecs but cannot find a suitable codec. I also tried to
    read the data like this:

    data = codecs.open('path_to_file','rb','???').read()

    I tried to use UCS2 for the ???, but this encoding does not exist. A posting
    found via Google suggests using UTF-16, but this is not the same and raises
    an error.

    This shouldn't be a big problem, but I can't figure out how to solve it. Can
    anybody help?

    regards,
    Achim
     
    Achim Domma, Feb 12, 2004
    #1

  2. "Achim Domma" writes:

    > Hi,
    >
    > I want to pass raw binary data from a file to a COM object. I read the data
    > from file like this:
    >
    > data = file('path_to_file','rb').read()
    >
    > If passed to a COM object, data is converted to unicode in the way one would
    > expect for strings. I.e. a lot of zeros are filled in. I want each two
    > characters from data to be interpreted as one unicode character. I read the
    > docu about codecs but can not find a suitable codec. I also tried to read
    > the data like this:
    >
    > data = codecs.open('path_to_file','rb','???').read()
    >
    > I tried to use UCS2 for the ???, but this encoding does not exist. A posting
    > found via google supposes to use UTF-16 but this is not the same and raises
    > an error.
    >
    > This shouldn't be a big problem, but I can figure out how to solve it. Can
    > anybody help?


    If I understand your problem correctly, you want to construct a unicode
    object containing arbitrary data in its internal buffer.

    And if I understand Python's unicode implementation correctly, then I
    would say it isn't possible, since unicode objects do not contain
    binary data, they contain characters (or whatever they are called in the
    unicode world).

    OTOH, it should be possible to write a small extension wrapping the
    PyUnicode_FromUnicode() function to accept arbitrary data.

    Would it also be possible to write a codec which does this?

    Note that the 'if's above are probably big 'if's...
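
To make the point about characters versus binary data concrete, here is a minimal sketch. It assumes Python 3.4 or later (not the Python version used in the thread), and the sample bytes are made up for illustration. It shows a 16-bit value that strict UTF-16 decoding rejects, and how the `surrogatepass` error handler lets the same bytes through so that they survive a round trip.

```python
# Sample data (an assumption): 0xD800 is a lone high surrogate, followed by 'A'.
raw = b"\x00\xd8\x41\x00"

try:
    raw.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    # Not every sequence of 16-bit values is valid UTF-16 text.
    strict_ok = False

# 'surrogatepass' keeps the lone surrogate as a code point instead of failing.
text = raw.decode("utf-16-le", errors="surrogatepass")
round_trip = text.encode("utf-16-le", errors="surrogatepass")

print("strict decode succeeded:", strict_ok)
print(ascii(text))
print("bytes preserved:", round_trip == raw)
```

Whether a COM layer accepts such a string unchanged is a separate question that this sketch does not answer.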

    Thomas
     
    Thomas Heller, Feb 12, 2004
    #2

  3. Neil Hodgson

    Neil Hodgson Guest

    Achim Domma:

    > data = codecs.open('path_to_file','rb','???').read()
    >
    > I tried to use UCS2 for the ???, but this encoding does not exist. A posting
    > found via google supposes to use UTF-16 but this is not the same and raises
    > an error.


    It is better to show the error message when sending queries to a news
    group. You may want to look at the 'errors' argument which can be one of:

    'strict'             Raise ValueError (or a subclass); this is the default.
    'ignore'             Ignore the character and continue with the next.
    'replace'            Replace with a suitable replacement character.
    'xmlcharrefreplace'  Replace with the appropriate XML character reference.
    'backslashreplace'   Replace with backslashed escape sequences.

    Take a look at the results after using, say, 'backslashreplace' and you
    may find that much of your file is not UTF-16 or that it is byte swapped or
    that there are just a few bad characters in a header or similar.
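
A small sketch of how the 'errors' argument changes the outcome when decoding. It assumes Python 3.5 or later, where 'backslashreplace' is also accepted for decoding ('xmlcharrefreplace' applies to encoding only), and the sample bytes are invented for the example rather than taken from the original poster's file.

```python
# Sample data (an assumption): 'A', 'B' in little-endian UTF-16 followed by
# one stray byte that cannot form a complete 16-bit unit.
raw = b"\x41\x00\x42\x00\x43"

try:
    raw.decode("utf-16-le")  # errors="strict" is the default
    strict_error = None
except UnicodeDecodeError as exc:
    # exc.start is the byte offset where decoding failed.
    strict_error = exc

ignored = raw.decode("utf-16-le", errors="ignore")
replaced = raw.decode("utf-16-le", errors="replace")
# Decoding with 'backslashreplace' needs Python 3.5 or later.
escaped = raw.decode("utf-16-le", errors="backslashreplace")

print("strict failed at byte offset:", strict_error.start)
for name, text in [("ignore", ignored),
                   ("replace", replaced),
                   ("backslashreplace", escaped)]:
    print(name, ascii(text))
```

Comparing the three results against the strict error position is one way to see whether a file has only a few bad spots or is not UTF-16 at all.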

    Neil
     
    Neil Hodgson, Feb 12, 2004
    #3
  4. Achim Domma wrote:
    > Hi,
    >
    > I want to pass raw binary data from a file to a COM object. I read the data
    > from file like this:
    >
    > data = file('path_to_file','rb').read()
    >
    > If passed to a COM object, data is converted to unicode in the way one would
    > expect for strings. I.e. a lot of zeros are filled in. I want each two
    > characters from data to be interpreted as one unicode character. I read the
    > docu about codecs but can not find a suitable codec. I also tried to read
    > the data like this:
    >
    > data = codecs.open('path_to_file','rb','???').read()
    >
    > I tried to use UCS2 for the ???, but this encoding does not exist. A posting
    > found via google supposes to use UTF-16 but this is not the same and raises
    > an error.
    >
    > This shouldn't be a big problem, but I can figure out how to solve it. Can
    > anybody help?


    Try utf-16-le or utf-16-be (depending on endianness of the data) as
    encoding.
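
As a rough illustration of this suggestion, the sketch below decodes a few hand-written bytes with both byte orders. It assumes Python 3, and both the sample bytes and the helper name `decode_pairs` are made up for the example; they do not come from the thread.

```python
import codecs


def decode_pairs(raw, little_endian=True):
    """Interpret every two bytes of raw as one 16-bit code unit.

    Hypothetical helper for illustration; it only picks the codec name.
    """
    encoding = "utf-16-le" if little_endian else "utf-16-be"
    return raw.decode(encoding)


# Sample data (an assumption): 'A', 'B' and the euro sign, little-endian.
raw = b"\x41\x00\x42\x00\xac\x20"

text_le = decode_pairs(raw, little_endian=True)
text_be = decode_pairs(raw, little_endian=False)

# Plain "utf-16" reads the byte order from a leading byte order mark (BOM)
# and drops the mark from the result.
text_bom = (codecs.BOM_UTF16_LE + raw).decode("utf-16")

print(ascii(text_le))
print(ascii(text_be))
print(ascii(text_bom))
```

The fixed-endian codecs neither expect nor consume a BOM, which is probably closer to the "UCS2" behaviour asked about, provided the data contains no unpaired surrogate values.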


    --
    Sjoerd Mullender
     
    Sjoerd Mullender, Feb 17, 2004
    #4
