Good cross-version ASCII serialisation protocol for simple types


Paul Moore

I need to transfer some data (nothing fancy, some dictionaries, strings, numbers and lists, basically) between 2 Python processes. However, the data (string values) is potentially not ASCII, but the transport is (I'm piping between 2 processes, but thanks to nasty encoding issues, the only characters I can be sure won't be mangled are ASCII).

What's the best ASCII-only protocol to use that's portable between versions of Python back to about 2.6/2.7 and in the stdlib, so I don't need external modules?

At the moment, I'm using

encoded = json.dumps([ord(c) for c in json.dumps(obj)])
decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))

The double-encoding ensures that non-ASCII characters don't make it into the result.

This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).
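Spelled out as a runnable sketch (the sample value is an assumption for illustration), the double-encoding round trip looks like this:

```python
import json

def encode(obj):
    # Inner dumps() serialises the object; the outer dumps() then replaces
    # every character with its integer code point, so the wire string can
    # only contain ASCII digits, commas, spaces and brackets.
    return json.dumps([ord(c) for c in json.dumps(obj)])

def decode(encoded):
    # Reverse: rebuild the inner JSON text from the code points, then parse it.
    return json.loads(''.join(chr(n) for n in json.loads(encoded)))

obj = {'price': '\u20ac10', 'items': [1, 2]}
wire = encode(obj)
assert all(ord(c) < 128 for c in wire)   # ASCII-only on the wire
assert decode(wire) == obj
```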

Thanks,
Paul
 

Chris Angelico

> At the moment, I'm using
>
> encoded = json.dumps([ord(c) for c in json.dumps(obj)])
> decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
>
> The double-encoding ensures that non-ASCII characters don't make it into the result.
>
> This works fine, but is there something simpler (i.e., less of a hack!) that I could use? (Base64 and the like don't work because they encode bytes->strings, not strings->strings).

Hmm. How likely is it that you'll have non-ASCII characters in the
input? If they're fairly uncommon, you could use UTF-7 - it's fairly
space-efficient when the input is mostly ASCII, but inefficient on
other characters.

Not sure what the problem is with bytes vs strings; you can always do
an encode("ascii") or decode("ascii") to convert 7-bit strings between
those types.

With that covered, I'd just go with a single JSON packaging, and work
with the resulting Unicode string.

Python 2.6: u'asdf+EjQ-zxcv'

Python 3.3: 'asdf+EjQ-zxcv'
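The UTF-7 round trip can be sketched like this (the text value is assumed for illustration; the +EjQ- run encodes U+1234):

```python
# UTF-7 keeps ASCII characters as-is and switches to a base64-like
# +...- run for anything else, so mostly-ASCII input stays compact.
text = 'asdf\u1234zxcv'
wire = text.encode('utf-7')          # bytes, pure ASCII
assert wire == b'asdf+EjQ-zxcv'
assert wire.decode('utf-7') == text
```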

Another option would be to JSON-encode in pure-ASCII mode:

'["asdf\\u1234zxcv"]'
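That is just json.dumps with its default ensure_ascii=True, which escapes anything outside ASCII; a quick check (sample string assumed):

```python
import json

# With the default ensure_ascii=True, non-ASCII characters come out
# as \uXXXX escapes, so the serialised form is safe for an ASCII pipe.
wire = json.dumps(['asdf\u1234zxcv'])
assert wire == '["asdf\\u1234zxcv"]'
assert all(ord(c) < 128 for c in wire)
assert json.loads(wire) == ['asdf\u1234zxcv']
```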

Would that cover it?

ChrisA
 

Irmen de Jong

> I need to transfer some data (nothing fancy, some dictionaries, strings, numbers and
> lists, basically) between 2 Python processes. However, the data (string values) is
> potentially not ASCII, but the transport is (I'm piping between 2 processes, but
> thanks to nasty encoding issues, the only characters I can be sure won't be mangled
> are ASCII).
>
> What's the best ASCII-only protocol to use that's portable between versions of Python
> back to about 2.6/2.7 and in the stdlib, so I don't need external modules?
>
> At the moment, I'm using
>
> encoded = json.dumps([ord(c) for c in json.dumps(obj)])
> decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
>
> The double-encoding ensures that non-ASCII characters don't make it into the result.
Eww.


> This works fine, but is there something simpler (i.e., less of a hack!) that I could
> use? (Base64 and the like don't work because they encode bytes->strings, not
> strings->strings).

For Python < 3.0, strings and bytes are the same type anyway, so that objection doesn't apply there.


Other than that, maybe a simple repr(stuff) / ast.literal_eval(string) might do the job?
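A sketch of that route (sample data assumed). One caveat: on Python 3, repr() leaves printable non-ASCII characters unescaped, so ascii() (Python 3 only) is the variant used here to guarantee a 7-bit wire format:

```python
import ast

obj = {'x': '\u20ac', 'n': [1, 2.5]}

# ascii() behaves like Python 2's repr(): non-ASCII characters become
# backslash escapes, which ast.literal_eval() turns back into the object.
wire = ascii(obj)
assert all(ord(c) < 128 for c in wire)
assert ast.literal_eval(wire) == obj
```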


Irmen
 

Jussi Piitulainen

Paul said:
> I need to transfer some data (nothing fancy, some dictionaries,
> strings, numbers and lists, basically) between 2 Python
> processes. However, the data (string values) is potentially not
> ASCII, but the transport is (I'm piping between 2 processes, but
> thanks to nasty encoding issues, the only characters I can be sure
> won't be mangled are ASCII).
>
> What's the best ASCII-only protocol to use that's portable between
> versions of Python back to about 2.6/2.7 and in the stdlib, so I
> don't need external modules?
>
> At the moment, I'm using
>
> encoded = json.dumps([ord(c) for c in json.dumps(obj)])
> decoded = json.loads(''.join([chr(n) for n in json.loads(encoded)]))
>
> The double-encoding ensures that non-ASCII characters don't make it
> into the result.
>
> This works fine, but is there something simpler (i.e., less of a
> hack!) that I could use? (Base64 and the like don't work because
> they encode bytes->strings, not strings->strings).

I don't know much of these things but I've been using Python's
json.dump and json.load for a couple of weeks now and they seem to use
ASCII-friendly escapes automatically, writing a four-character string
as "\u00e4\u00e4ni" instead of using the UTF-8 characters that my
environment is set to handle. That's written to stdout which is then
directed to a file in a shell script, and I copy-pasted it here from
the resulting file.

I'm using Python 3.3, though.
 

Paul Moore

> I don't know much of these things but I've been using Python's
> json.dump and json.load for a couple of weeks now and they seem to use
> ASCII-friendly escapes automatically, writing a four-character string
> as "\u00e4\u00e4ni" instead of using the UTF-8 characters that my
> environment is set to handle.

Thanks. When I tried to write a short program to demo what I was doing, I realised that my problem was actually with my test code, not with json. Here's my test code:

import json, subprocess
CODE="""
import json
p = {'x': '\N{EURO SIGN}'}
print json.dumps(p)
"""
data_bytes = subprocess.check_output(['py', '-2', '-c', CODE])
data = json.loads(data_bytes.decode('ASCII'))
print(data)

The problem is that I'm not using a raw string for CODE, so the Euro sign is being put into the string literally, and that causes all sorts of encoding-related fun that I didn't intend!
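With the raw-string fix applied, the test behaves as intended. The version below is a Python-3-only variant of it (sys.executable in place of the py -2 launcher, and a print() call in the child, both assumptions so it runs on a single interpreter):

```python
import json
import subprocess
import sys

# The r-prefix means the child process, not this script, interprets
# the \N{EURO SIGN} escape inside its own string literal.
CODE = r"""
import json
p = {'x': '\N{EURO SIGN}'}
print(json.dumps(p))
"""
data_bytes = subprocess.check_output([sys.executable, '-c', CODE])
# The child's json.dumps uses ensure_ascii=True, so this decode cannot fail.
data = json.loads(data_bytes.decode('ascii'))
assert data == {'x': '\u20ac'}
```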

As you say, json actually works fine for this application, so thanks for pointing that out. I thought it shouldn't need to be as hard as I was making it!!!

Paul.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,754
Messages
2,569,528
Members
45,000
Latest member
MurrayKeync

Latest Threads

Top