converting to and from octal escaped UTF-8

Discussion in 'Python' started by Michael Goerz, Dec 3, 2007.

  1. Hi,

    I am writing Unicode strings into a special text file that requires
    non-ASCII characters to be written as octal-escaped UTF-8 codes.

    For example, the letter "Í" (Latin capital I with acute, code point 205)
    would come out as "\303\215".

    I will also have to read back from the file later on and convert the
    escaped characters back into a unicode string.

    Does anyone have any suggestions on how to go from "Í" to "\303\215" and
    vice versa?

    I know I can get the code point by doing
    >>> "Í".decode('utf-8').encode('unicode_escape')

    but there doesn't seem to be any similar method for getting the octal
    escaped version.

    Thanks,
    Michael
     
    Michael Goerz, Dec 3, 2007
    #1
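
For reference, the arithmetic behind the example in the question can be checked with a short sketch. It assumes Python 3 (the thread itself uses Python 2), where iterating over a bytes object yields integers; the variable names are only illustrative.

```python
# Python 3 sketch: derive the octal escapes for a single character by hand.
ch = "\u00cd"                  # "Í", Latin capital I with acute
utf8 = ch.encode("utf-8")      # the character's UTF-8 bytes
# Format every byte as a backslash followed by three octal digits.
escaped = "".join("\\%03o" % b for b in utf8)
print(ord(ch), list(utf8), escaped)   # 205 [195, 141] \303\215
```

Code point 205 is stored as two UTF-8 bytes, and each byte becomes one octal escape.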

  2. Michael Goerz wrote:
    > Does anyone have any suggestions on how to go from "Í" to "\303\215" and
    > vice versa?

    I've come up with the following solution. It's not very pretty, but it
    works (no bugs, I hope). Can anyone think of a better way to do it?

    Michael
    _________

    import binascii

    def escape(s):
        hexstring = binascii.b2a_hex(s)
        result = ""
        while len(hexstring) > 0:
            (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
            octbyte = oct(int(hexbyte, 16)).zfill(3)
            result += "\\" + octbyte[-3:]
        return result

    def unescape(s):
        result = ""
        while len(s) > 0:
            if s[0] == "\\":
                (octbyte, s) = (s[1:4], s[4:])
                try:
                    result += chr(int(octbyte, 8))
                except ValueError:
                    result += "\\"
                    s = octbyte + s
            else:
                result += s[0]
                s = s[1:]
        return result

    print escape("\303\215")
    print unescape('adf\\303\\215adf')
     
    Michael Goerz, Dec 3, 2007
    #2

  3. On Dec 2, 8:38 pm, Michael Goerz <4ward.com> wrote:
    > I've come up with the following solution. It's not very pretty, but it
    > works (no bugs, I hope). Can anyone think of a better way to do it?

    Looks like escape() can be a bit simpler...

    def escape(s):
        result = []
        for char in s:
            result.append("\%o" % ord(char))
        return ''.join(result)

    Regards,
    Jordan
     
    MonkeeSage, Dec 3, 2007
    #3
  4. MonkeeSage wrote:
    > Looks like escape() can be a bit simpler...

    Very neat! Thanks a lot...
    Michael
     
    Michael Goerz, Dec 3, 2007
    #4
  5. Michael Goerz wrote:
    > Does anyone have any suggestions on how to go from "Í" to "\303\215" and
    > vice versa?

    Perhaps something along the lines of:

    >>> def encode(source):
    ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
    ...
    >>> def decode(encoded):
    ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
    ...     return bytes.decode('utf8')
    ...
    >>> encode(u"Í")
    '\\303\\215'
    >>> print decode(_)
    Í
    >>>


    HTH
    Michael
     
    Michael Spencer, Dec 3, 2007
    #5
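
The session in post #5 is Python 2. A rough Python 3 equivalent might look like the sketch below; this port is an assumption, not something posted in the thread. The main difference is that iterating over bytes yields integers, so the `ord()` and `chr()` calls are no longer needed.

```python
def encode(source):
    # Escape every UTF-8 byte of the text as a backslash plus octal digits.
    return "".join("\\%o" % b for b in source.encode("utf-8"))

def decode(encoded):
    # Split on the backslashes, parse each piece as octal, rebuild the bytes.
    raw = bytes(int(c, 8) for c in encoded.split("\\")[1:])
    return raw.decode("utf-8")

print(encode("\u00cd"))            # \303\215
print(decode(encode("\u00cd")))    # Í
```

As in the original, this decode() only accepts input that consists entirely of escapes; any plain text between the escapes makes int() raise ValueError.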
  6. On Dec 2, 11:46 pm, Michael Spencer <> wrote:
    > Perhaps something along the lines of:

    Nice one. :) If I might suggest a slight variation to handle cases
    where the "encoded" string contains plain text as well as octal
    escapes...

    def decode(encoded):
        for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
            encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
        return encoded.decode('utf8')

    This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
    as well as "adf\\303\\215adf".

    Regards,
    Jordan
     
    MonkeeSage, Dec 3, 2007
    #6
  7. On Dec 3, 1:31 am, MonkeeSage <> wrote:
    > def decode(encoded):
    >     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
    >         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    >     return encoded.decode('utf8')

    err...

    def decode(encoded):
        for octc in re.findall(r'\\(\d{3})', encoded):
            encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
        return encoded.decode('utf8')
     
    MonkeeSage, Dec 3, 2007
    #7
  8. MonkeeSage wrote:
    > def decode(encoded):
    >     for octc in re.findall(r'\\(\d{3})', encoded):
    >         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    >     return encoded.decode('utf8')

    Great suggestions from both of you! I came up with my "final" solution
    based on them. It encodes only non-ASCII and non-printable characters,
    and stays in unicode strings for both input and output. Low ASCII values
    now also encode into a 3-digit octal sequence, so that decode can catch
    them properly.

    Thanks a lot,
    Michael

    ____________

    import re

    def encode(source):
        encoded = ""
        for character in source:
            if (ord(character) < 32) or (ord(character) > 128):
                for byte in character.encode('utf8'):
                    encoded += ("\%03o" % ord(byte))
            else:
                encoded += character
        return encoded.decode('utf-8')

    def decode(encoded):
        decoded = encoded.encode('utf-8')
        for octc in re.findall(r'\\(\d{3})', decoded):
            decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
        return decoded.decode('utf8')


    orig = u"blaÍblub" + chr(10)
    enc = encode(orig)
    dec = decode(enc)
    print orig
    print enc
    print dec
     
    Michael Goerz, Dec 3, 2007
    #8
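
The point in post #8 about padding low values to three digits can be illustrated with a small sketch. It assumes Python 3 and a decoder that always reads exactly three octal digits after the backslash; the variable names are illustrative only.

```python
newline = ord("\n")                    # 10, which is 12 in octal

# A newline followed by the plain character "1", escaped two ways.
unpadded = "\\%o" % newline + "1"      # \121
padded = "\\%03o" % newline + "1"      # \0121

# Read three digits after the backslash, as the regex \\(\d{3}) would.
wrong = chr(int(unpadded[1:4], 8))                 # swallows the "1"
right = chr(int(padded[1:4], 8)) + padded[4:]      # newline, then "1"
print(repr(wrong), repr(right))        # 'Q' '\n1'
```

Without padding, the escape runs into the following digit and decodes to a different character.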
  9. >>>>> Michael Goerz <4ward.com> (MG) wrote:

    >MG> if (ord(character) < 32) or (ord(character) > 128):


    If you encode chars < 32 it seems more appropriate to also encode 127.

    Moreover your code is quadratic in the size of the string so if you use
    long strings it would be better to use join.
    --
    Piet van Oostrum <>
    URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
    Private email:
     
    Piet van Oostrum, Dec 4, 2007
    #9
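
Both suggestions from post #9 could be combined roughly as follows. This is a Python 3 sketch under stated assumptions, not code from the thread: the threshold `>= 127` is one reading of "also encode 127", and it also covers code point 128, which the original `> 128` test lets through unescaped.

```python
def encode(source):
    parts = []                         # collect pieces, join once at the end
    for ch in source:
        cp = ord(ch)
        if cp < 32 or cp >= 127:       # control characters, DEL, non-ASCII
            parts.extend("\\%03o" % b for b in ch.encode("utf-8"))
        else:
            parts.append(ch)           # printable ASCII passes through
    return "".join(parts)

print(encode("bla\u00cdblub\n"))       # bla\303\215blub\012
```

Collecting the pieces in a list and joining once avoids building a new string on every iteration.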
  10. On Dec 3, 8:10 am, Michael Goerz <4ward.com> wrote:
    > def decode(encoded):
    >     decoded = encoded.encode('utf-8')
    >     for octc in re.findall(r'\\(\d{3})', decoded):
    >         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    >     return decoded.decode('utf8')

    An optimization: in decode(), store the matches as keys in a dict, so
    that you only do the string replacement once for each unique escape
    sequence...

    def decode(encoded):
        decoded = encoded.encode('utf-8')
        matches = {}
        for octc in re.findall(r'\\(\d{3})', decoded):
            matches[octc] = None
        for octc in matches:
            decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
        return decoded.decode('utf8')

    Untested...

    Regards,
    Jordan
     
    MonkeeSage, Dec 4, 2007
    #10
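
An alternative to collecting unique matches is to let re.sub() do all the replacements in one pass with a callback. The sketch below assumes Python 3 and is not from the thread. The name OCTAL_ESCAPE and the stricter pattern `[0-3][0-7]{2}` (octal digits only, values up to \377) are choices made here so that every match fits in a single byte.

```python
import re

# A backslash followed by exactly three octal digits, at most \377.
OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")

def decode(encoded):
    raw = encoded.encode("utf-8")
    # Each escape is replaced by the single byte it names; text that was
    # already substituted is not scanned again.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")

print(repr(decode("bla\\303\\215blub\\012")))   # 'blaÍblub\n'
```

Because the scan happens once, an escaped backslash such as `\134` followed by digits is not reinterpreted as a new escape, which repeated replace() calls could do.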