Good cross-version ASCII serialisation protocol for simple types

Discussion in 'Python' started by Paul Moore, Feb 23, 2013.

  1. Paul Moore

    Paul Moore Guest

    I need to transfer some data (nothing fancy, some dictionaries, strings, numbers and lists, basically) between 2 Python processes. However, the data (string values) is potentially not ASCII, but the transport is (I'm piping between 2 processes, but thanks to nasty encoding issues, the only characters I can be sure won't be mangled are ASCII).

    What's the best ASCII-only protocol to use that's portable between versions of Python back to about 2.6/2.7 and in the stdlib, so I don't need external modules?

    At the moment, I'm using

    encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))

    The double-encoding ensures that non-ASCII characters don't make it into the result.

    This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).
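    For reference, the double-encoding above round-trips like this (a runnable sketch of the same hack, with a made-up sample object):

    ```python
    import json

    obj = {"name": "caf\u00e9", "values": [1, 2.5, None, True]}

    # Pass 1: JSON-encode the object (the result may contain non-ASCII).
    inner = json.dumps(obj)
    # Pass 2: replace each character with its code point, so the outer
    # document is nothing but ASCII digits, commas, and brackets.
    encoded = json.dumps([ord(c) for c in inner])

    # Decoding reverses both layers.
    decoded = json.loads("".join(chr(n) for n in json.loads(encoded)))
    ```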

    Thanks,
    Paul
    Paul Moore, Feb 23, 2013
    #1

  2. On Sun, Feb 24, 2013 at 2:45 AM, Paul Moore <> wrote:
    > At the moment, I'm using
    >
    > encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    > decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
    >
    > The double-encoding ensures that non-ASCII characters don't make it into the result.
    >
    > This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).


    Hmm. How likely is it that you'll have non-ASCII characters in the
    input? If they're fairly uncommon, you could use UTF-7 - it's fairly
    space-efficient when the input is mostly ASCII, but inefficient on
    other characters.

    Not sure what the problem is with bytes vs strings; you can always do
    an encode("ascii") or decode("ascii") to convert 7-bit strings between
    those types.
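    Concretely, that bridging is what makes base64 usable string-to-string; a sketch (helper names are my own, not stdlib):

    ```python
    import base64

    def to_ascii_str(text):
        # str -> UTF-8 bytes -> base64 bytes -> 7-bit str
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    def from_ascii_str(payload):
        # Reverse each step in turn.
        return base64.b64decode(payload.encode("ascii")).decode("utf-8")

    message = "asdf\u1234zxcv"
    wire = to_ascii_str(message)
    ```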

    With that covered, I'd just go with a single JSON packaging, and work
    with the resulting Unicode string.

    Python 2.6:
    >>> s=u"asdf\u1234zxcv"
    >>> s.encode("utf-7").decode("ascii")

    u'asdf+EjQ-zxcv'

    Python 3.3:
    >>> s=u"asdf\u1234zxcv"
    >>> s.encode("utf-7").decode("ascii")

    'asdf+EjQ-zxcv'
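    For completeness, the decode direction is the same two steps in reverse (shown here on Python 3):

    ```python
    s = "asdf\u1234zxcv"
    wire = s.encode("utf-7").decode("ascii")  # ASCII-only transport string
    # Undo the ASCII bridging, then the UTF-7 encoding.
    roundtrip = wire.encode("ascii").decode("utf-7")
    ```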

    Another option would be to JSON-encode in pure-ASCII mode:

    >>> json.dumps([s], ensure_ascii=True)

    '["asdf\\u1234zxcv"]'

    Would that cover it?

    ChrisA
    Chris Angelico, Feb 23, 2013
    #2

  3. On 23-2-2013 16:45, Paul Moore wrote:
    > I need to transfer some data (nothing fancy, some dictionaries, strings, numbers and
    > lists, basically) between 2 Python processes. However, the data (string values) is
    > potentially not ASCII, but the transport is (I'm piping between 2 processes, but
    > thanks to nasty encoding issues, the only characters I can be sure won't be mangled
    > are ASCII).
    >
    > What's the best ASCII-only protocol to use that's portable between versions of Python
    > back to about 2.6/2.7 and in the stdlib, so I don't need external modules?
    >
    > At the moment, I'm using
    >
    > encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    > decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
    >
    > The double-encoding ensures that non-ASCII characters don't make it into the result.


    Eww.

    >
    > This works fine, but is there something simpler (i.e., less of a hack!) that I could
    > use? (Base64 and the like don't work because they encode bytes->strings, not
    > strings->strings).


    For Python < 3.0, strings and bytes are the same type:

    >>> import base64
    >>> base64.b64encode("hello there")

    'aGVsbG8gdGhlcmU='
    >>> base64.b64decode(_)

    'hello there'
    >>>



    Other than that, maybe a simple repr(stuff) / ast.literal_eval(string) might do the job?
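    One caveat with that: on Python 3, repr() leaves printable non-ASCII characters literal, so ascii() is the variant that guarantees a 7-bit wire form. A sketch:

    ```python
    import ast

    stuff = {"greeting": "h\u00e9llo", "counts": [1, 2, 3]}

    # ascii() escapes the accented character; plain repr() would keep
    # it literal on Python 3 and break an ASCII-only transport.
    wire = ascii(stuff)
    recovered = ast.literal_eval(wire)
    ```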


    Irmen
    Irmen de Jong, Feb 23, 2013
    #3
  4. Paul Moore writes:

    > I need to transfer some data (nothing fancy, some dictionaries,
    > strings, numbers and lists, basically) between 2 Python
    > processes. However, the data (string values) is potentially not
    > ASCII, but the transport is (I'm piping between 2 processes, but
    > thanks to nasty encoding issues, the only characters I can be sure
    > won't be mangled are ASCII).
    >
    > What's the best ASCII-only protocol to use that's portable between
    > versions of Python back to about 2.6/2.7 and in the stdlib, so I
    > don't need external modules?
    >
    > At the moment, I'm using
    >
    > encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    > decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
    >
    > The double-encoding ensures that non-ASCII characters don't make it
    > into the result.
    >
    > This works fine, but is there something simpler (i.e., less of a
    > hack!) that I could use? (Base64 and the like don't work because
    > they encode bytes->strings, not strings->strings).


    I don't know much of these things but I've been using Python's
    json.dump and json.load for a couple of weeks now and they seem to use
    ASCII-friendly escapes automatically, writing a four-character string
    as "\u00e4\u00e4ni" instead of using the UTF-8 characters that my
    environment is set to handle. That's written to stdout which is then
    directed to a file in a shell script, and I copy-pasted it here from
    the resulting file.

    I'm using Python 3.3, though.
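    That matches json's documented default: ensure_ascii is True unless you turn it off, so dumps always emits \uXXXX escapes. A quick check:

    ```python
    import json

    word = "\u00e4\u00e4ni"    # the four-character string above
    dumped = json.dumps(word)  # ensure_ascii defaults to True
    ```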
    Jussi Piitulainen, Feb 23, 2013
    #4
  5. Paul Moore

    Paul Moore Guest

    On Saturday, 23 February 2013 16:06:11 UTC, Jussi Piitulainen wrote:
    > I don't know much of these things but I've been using Python's
    > json.dump and json.load for a couple of weeks now and they seem to use
    > ASCII-friendly escapes automatically, writing a four-character string
    > as "\u00e4\u00e4ni" instead of using the UTF-8 characters that my
    > environment is set to handle.


    Thanks. When I tried to write a short program to demo what I was doing, I realised that my problem was actually with my test code, not with json. Here's my test code:

    import json, subprocess
    CODE="""
    import json
    p = {'x': '\N{EURO SIGN}'}
    print json.dumps(p)
    """
    data_bytes = subprocess.check_output(['py', '-2', '-c', CODE])
    data = json.loads(data_bytes.decode('ASCII'))
    print(data)

    The problem is that I'm not using a raw string for CODE, so the Euro sign is being put into the string literally, and that causes all sorts of encoding-related fun that I didn't intend!

    As you say, json actually works fine for this application, so thanks for pointing that out. I thought it shouldn't need to be as hard as I was making it!!!
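    The raw-string difference is visible without even spawning a subprocess: in a non-raw literal the outer interpreter expands the escape, while a raw literal preserves the escape text for the child interpreter to expand.

    ```python
    # Non-raw: the outer interpreter expands the escape, so an actual
    # Euro sign ends up embedded in the child's source code.
    expanded = "\N{EURO SIGN}"

    # Raw: the thirteen characters of the escape survive verbatim.
    survives = r"\N{EURO SIGN}"
    ```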

    Paul.
    Paul Moore, Feb 23, 2013
    #5
