Good cross-version ASCII serialisation protocol for simple types

Discussion in 'Python' started by Paul Moore, Feb 23, 2013.

  1. Paul Moore

    Paul Moore Guest

    I need to transfer some data (nothing fancy, some dictionaries, strings, numbers and lists, basically) between 2 Python processes. However, the data (string values) is potentially not ASCII, but the transport is (I'm piping between 2 processes, but thanks to nasty encoding issues, the only characters I can be sure won't be mangled are ASCII).

    What's the best ASCII-only protocol to use that's portable between versions of Python back to about 2.6/2.7 and in the stdlib, so I don't need external modules?

    At the moment, I'm using

    encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))

    The double-encoding ensures that non-ASCII characters don't make it into the result.

    This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).
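    For reference, the double-encoding above round-trips like this (a runnable sketch of the same hack, with a made-up sample object):

    ```python
    import json

    obj = {"name": "caf\u00e9", "values": [1, 2.5, None, True]}

    # Pass 1: JSON-encode the object (the result may contain non-ASCII).
    inner = json.dumps(obj)
    # Pass 2: replace each character with its code point, so the outer
    # document is nothing but ASCII digits, commas, and brackets.
    encoded = json.dumps([ord(c) for c in inner])

    # Decoding reverses both layers.
    decoded = json.loads("".join(chr(n) for n in json.loads(encoded)))
    ```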

    Thanks,
    Paul
    Paul Moore, Feb 23, 2013
    #1

  2. On Sun, Feb 24, 2013 at 2:45 AM, Paul Moore <> wrote:
    > At the moment, I'm using
    >
    > encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    > decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
    >
    > The double-encoding ensures that non-ASCII characters don't make it into the result.
    >
    > This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).


    Hmm. How likely is it that you'll have non-ASCII characters in the
    input? If they're fairly uncommon, you could use UTF-7 - it's fairly
    space-efficient when the input is mostly ASCII, but inefficient on
    other characters.

    Not sure what the problem is with bytes vs strings; you can always do
    an encode("ascii") or decode("ascii") to convert 7-bit strings between
    those types.
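    Concretely, that bridging is what makes base64 usable string-to-string; a sketch (helper names are my own, not stdlib):

    ```python
    import base64

    def to_ascii_str(text):
        # str -> UTF-8 bytes -> base64 bytes -> 7-bit str
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    def from_ascii_str(payload):
        # Reverse each step in turn.
        return base64.b64decode(payload.encode("ascii")).decode("utf-8")

    message = "asdf\u1234zxcv"
    wire = to_ascii_str(message)
    ```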

    With that covered, I'd just go with a single JSON packaging, and work
    with the resulting Unicode string.

    Python 2.6:
    >>> s=u"asdf\u1234zxcv"
    >>> s.encode("utf-7").decode("ascii")

    u'asdf+EjQ-zxcv'

    Python 3.3:
    >>> s=u"asdf\u1234zxcv"
    >>> s.encode("utf-7").decode("ascii")

    'asdf+EjQ-zxcv'
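    For completeness, the decode direction is the same two steps in reverse (shown here on Python 3):

    ```python
    s = "asdf\u1234zxcv"
    wire = s.encode("utf-7").decode("ascii")  # ASCII-only transport string
    # Undo the ASCII bridging, then the UTF-7 encoding.
    roundtrip = wire.encode("ascii").decode("utf-7")
    ```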

    Another option would be to JSON-encode in pure-ASCII mode:

    >>> json.dumps([s], ensure_ascii=True)

    '["asdf\\u1234zxcv"]'

    Would that cover it?

    ChrisA
    Chris Angelico, Feb 23, 2013
    #2

  3. On 23-2-2013 16:45, Paul Moore wrote:
    > I need to transfer some data (nothing fancy, some dictionaries, strings, numbers and
    > lists, basically) between 2 Python processes. However, the data (string values) is
    > potentially not ASCII, but the transport is (I'm piping between 2 processes, but
    > thanks to nasty encoding issues, the only characters I can be sure won't be mangled
    > are ASCII).
    >
    > What's the best ASCII-only protocol to use that's portable between versions of Python
    > back to about 2.6/2.7 and in the stdlib, so I don't need external modules?
    >
    > At the moment, I'm using
    >
    > encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    > decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
    >
    > The double-encoding ensures that non-ASCII characters don't make it into the result.


    Eww.

    >
    > This works fine, but is there something simpler (i.e., less of a hack!) that I could
    > use? (Base64 and the like don't work because they encode bytes->strings, not
    > strings->strings).


    For Python < 3.0, strings and bytes are the same type:

    >>> import base64
    >>> base64.b64encode("hello there")

    'aGVsbG8gdGhlcmU='
    >>> base64.b64decode(_)

    'hello there'
    >>>



    Other than that, maybe a simple repr(stuff) / ast.literal_eval(string) might do the job?
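    One caveat with that: on Python 3, repr() leaves printable non-ASCII characters literal, so ascii() is the variant that guarantees a 7-bit wire form. A sketch:

    ```python
    import ast

    stuff = {"greeting": "h\u00e9llo", "counts": [1, 2, 3]}

    # ascii() escapes the accented character; plain repr() would keep
    # it literal on Python 3 and break an ASCII-only transport.
    wire = ascii(stuff)
    recovered = ast.literal_eval(wire)
    ```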


    Irmen
    Irmen de Jong, Feb 23, 2013
    #3
  4. Paul Moore writes:

    > I need to transfer some data (nothing fancy, some dictionaries,
    > strings, numbers and lists, basically) between 2 Python
    > processes. However, the data (string values) is potentially not
    > ASCII, but the transport is (I'm piping between 2 processes, but
    > thanks to nasty encoding issues, the only characters I can be sure
    > won't be mangled are ASCII).
    >
    > What's the best ASCII-only protocol to use that's portable between
    > versions of Python back to about 2.6/2.7 and in the stdlib, so I
    > don't need external modules?
    >
    > At the moment, I'm using
    >
    > encoded = json.dumps([ord(c) for c in json.dumps(obj)])
    > decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
    >
    > The double-encoding ensures that non-ASCII characters don't make it
    > into the result.
    >
    > This works fine, but is there something simpler (i.e., less of a
    > hack!) that I could use? (Base64 and the like don't work because
    > they encode bytes->strings, not strings->strings).


    I don't know much of these things but I've been using Python's
    json.dump and json.load for a couple of weeks now and they seem to use
    ASCII-friendly escapes automatically, writing a four-character string
    as "\u00e4\u00e4ni" instead of using the UTF-8 characters that my
    environment is set to handle. That's written to stdout which is then
    directed to a file in a shell script, and I copy-pasted it here from
    the resulting file.

    I'm using Python 3.3, though.
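    That matches json's documented default: ensure_ascii is True unless you turn it off, so dumps always emits \uXXXX escapes. A quick check:

    ```python
    import json

    word = "\u00e4\u00e4ni"    # the four-character string above
    dumped = json.dumps(word)  # ensure_ascii defaults to True
    ```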
    Jussi Piitulainen, Feb 23, 2013
    #4
  5. Paul Moore

    Paul Moore Guest

    On Saturday, 23 February 2013 16:06:11 UTC, Jussi Piitulainen wrote:
    > I don't know much of these things but I've been using Python's
    > json.dump and json.load for a couple of weeks now and they seem to use
    > ASCII-friendly escapes automatically, writing a four-character string
    > as "\u00e4\u00e4ni" instead of using the UTF-8 characters that my
    > environment is set to handle.


    Thanks. When I tried to write a short program to demo what I was doing, I realised that my problem was actually with my test code, not with json. Here's my test code:

    import json, subprocess
    CODE="""
    import json
    p = {'x': '\N{EURO SIGN}'}
    print json.dumps(p)
    """
    data_bytes = subprocess.check_output(['py', '-2', '-c', CODE])
    data = json.loads(data_bytes.decode('ASCII'))
    print(data)

    The problem is that I'm not using a raw string for CODE, so the Euro sign is being put into the string literally, and that causes all sorts of encoding-related fun that I didn't intend!

    As you say, json actually works fine for this application, so thanks for pointing that out. I thought it shouldn't need to be as hard as I was making it!!!
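    The raw-string difference is visible without even spawning a subprocess: in a non-raw literal the outer interpreter expands the escape, while a raw literal preserves the escape text for the child interpreter to expand.

    ```python
    # Non-raw: the outer interpreter expands the escape, so an actual
    # Euro sign ends up embedded in the child's source code.
    expanded = "\N{EURO SIGN}"

    # Raw: the thirteen characters of the escape survive verbatim.
    survives = r"\N{EURO SIGN}"
    ```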

    Paul.
    Paul Moore, Feb 23, 2013
    #5
