Specifying an API for a streaming parser

tyler

Howdy folks, I'm working on a JSON Python module [1] and I'm struggling with an
appropriate syntax for dealing with incrementally parsing streams of data as
they come in (off a socket or file object).

The underlying C-level parsing library that I'm using (Yajl [2]) already uses a
callback system internally for handling such things, but I'm worried about:
* Ease of use, simplicity
* Python method invocation overhead going from C back into Python
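
To make the callback concern concrete, here is a minimal sketch of what a callback-style Python interface might look like. The RecordingHandler class, its method names, and emit_events() are hypothetical, only loosely modeled on the kind of events a callback parser reports. The driver walks an already-decoded document purely to show the call pattern; a real binding would fire these calls from C as bytes arrive.

```python
import json


class RecordingHandler:
    """Hypothetical handler: one Python method per parse event."""

    def __init__(self):
        self.events = []

    def on_value(self, value):  # null, boolean, number or string
        self.events.append(("value", value))

    def on_map_key(self, key):
        self.events.append(("map_key", key))

    def on_start_map(self):
        self.events.append(("start_map",))

    def on_end_map(self):
        self.events.append(("end_map",))

    def on_start_array(self):
        self.events.append(("start_array",))

    def on_end_array(self):
        self.events.append(("end_array",))


def emit_events(value, handler):
    """Walk a decoded document and fire the events in document order."""
    if isinstance(value, dict):
        handler.on_start_map()
        for key, item in value.items():
            handler.on_map_key(key)
            emit_events(item, handler)
        handler.on_end_map()
    elif isinstance(value, list):
        handler.on_start_array()
        for item in value:
            emit_events(item, handler)
        handler.on_end_array()
    else:
        handler.on_value(value)


handler = RecordingHandler()
emit_events(json.loads('{"rc": 0, "data": [1, 2]}'), handler)
for event in handler.events:
    print(event)
```

Even this small document triggers nine Python-level method calls, which is where the invocation overhead would come from.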

One of the ideas I've had is to "iterparse" a la:

This effectively builds a generator over the JSON string coming off of the `fp`
object; each call to generator.next() reads more of the stream object.
This has some shortcomings, however:
* For JSON like '''{"rc":0,"data":<large JSON object>}''', the iterloads()
  function would block for some time when processing the value of the "data"
  key.
* Presumes the developer has prior knowledge of the kind of JSON strings
  being passed in
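
As a rough illustration of the generator idea and of the first shortcoming, here is a minimal sketch built on the standard library's json.JSONDecoder.raw_decode(). It assumes that iterloads() yields one complete top-level value at a time and that `fp` is a text-mode stream; the signature and the chunk_size parameter are assumptions, not part of py-yajl.

```python
import io
import json


def iterloads(fp, chunk_size=4096):
    """Yield each complete top-level JSON value read from a text stream."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)  # "" signals end of input
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # value not complete yet: wait for more input
            # A number that touches the end of the buffer may be truncated
            # ("12" of "123"), so hold it back until more input or EOF.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and end == len(buffer) and chunk:
                break
            yield value
            buffer = buffer[end:]
        if not chunk:
            if buffer.strip():
                raise ValueError("incomplete or malformed JSON at end of stream")
            return


stream = io.StringIO('{"rc": 0} [1, 2] 123')
for value in iterloads(stream, chunk_size=4):
    print(value)
```

Nothing is yielded until a whole top-level value has been buffered, so a document such as {"rc":0,"data":<large JSON object>} produces no output until its last byte arrives.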

I've searched around, following this "iterloads" notion, for a tree-generator
and I came up with nothing.

Any suggestions on how to accomplish iterloads, or perhaps a suggestion for a
more sensible syntax for incrementally parsing objects from the stream and
passing them up into Python?

Cheers,
-R. Tyler Ballance
--------------------------------------
Jabber: (e-mail address removed)
GitHub: http://github.com/rtyler
Twitter: http://twitter.com/agentdero
Blog: http://unethicalblogger.com



[1] http://github.com/rtyler/py-yajl
[2] http://lloyd.github.com/yajl/




Nobody

One option is to return values as opaque objects with .type() and .data()
methods. The opaque object can be returned as soon as the parser starts to
parse the value.

If the user calls the .data() method for an atomic object (string,
number, boolean, null) before parsing is complete, the call will block.
For composite objects (array, object), the call will return an iterator
immediately.

If the user never calls the data() method, there's no need to convert the
element to a Python value. If the object's refcount reaches zero while
parsing is still ongoing, the parser can discard any existing data and
discard further data as it is read.

E.g. a program to read JSON data from stdin and print the data back to
stdout in (approximately) JSON format would look like:

import sys
import json  # the hypothetical streaming module proposed here, not the stdlib json

def print_json(f, node):
    if node.type() == json.NULL:
        f.write("null")
    elif node.type() == json.BOOL:
        f.write("true" if node.data() else "false")
    elif node.type() == json.NUMBER:
        f.write(str(node.data()))  # write() needs a string, not a number
    elif node.type() == json.STRING:
        f.write('"' + node.data() + '"')
    elif node.type() == json.ARRAY:
        f.write('[')
        for i, v in enumerate(node.data()):
            if i > 0: f.write(',')
            print_json(f, v)
        f.write(']')
    elif node.type() == json.OBJECT:
        f.write('{')
        for i, (k, v) in enumerate(node.data()):
            if i > 0: f.write(',')
            print_json(f, k)
            f.write(": ")
            print_json(f, v)
        f.write('}')

root = json.parse(sys.stdin)
print_json(sys.stdout, root)

For greater pythonicity, you could make the composite types implement the
iterator interface directly, so the data() method becomes redundant (if
called, it would just return "self"), and use distinct classes for the
distinct types (so that you can use type() or isinstance()), i.e.:

import sys
import json  # the hypothetical streaming module proposed here, not the stdlib json

def print_json(f, node):
    if isinstance(node, json.Null):
        f.write("null")
    elif isinstance(node, json.Bool):
        f.write("true" if node.data() else "false")
    elif isinstance(node, json.Number):
        f.write(str(node.data()))  # write() needs a string, not a number
    elif isinstance(node, json.String):
        f.write('"' + node.data() + '"')
    elif isinstance(node, json.Array):
        f.write('[')
        for i, v in enumerate(node):
            if i > 0: f.write(',')
            print_json(f, v)
        f.write(']')
    elif isinstance(node, json.Object):
        f.write('{')
        for i, (k, v) in enumerate(node):
            if i > 0: f.write(',')
            print_json(f, k)
            f.write(": ")
            print_json(f, v)
        f.write('}')

root = json.parse(sys.stdin)
print_json(sys.stdout, root)

If you think that some of the individual strings may be large, you could
make the String class implement the iterator interface (or even the file
interface with .read() etc), to allow the data to be read incrementally:

    elif isinstance(node, json.String):
        f.write('"')
        for s in node:
            f.write(s)
        f.write('"')
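
One way the incremental string reading might work is sketched below. The iter_string() helper and its chunk_size parameter are hypothetical; it assumes a text-mode stream positioned just after the opening quote, and it borrows the standard library's json.loads() to decode each piece. A surrogate pair split across two pieces is not recombined in this sketch.

```python
import io
import json


def iter_string(fp, chunk_size=1024):
    """Yield decoded pieces of a JSON string whose opening quote was consumed."""
    raw = ""
    while True:
        ch = fp.read(1)
        if ch == "":
            raise ValueError("unterminated string")
        if ch == '"':
            break  # closing quote: the string is complete
        raw += ch
        if ch == "\\":
            # Keep an escape sequence whole so a piece never ends mid-escape.
            esc = fp.read(1)
            raw += esc
            if esc == "u":
                raw += fp.read(4)
        if len(raw) >= chunk_size:
            yield json.loads('"' + raw + '"')
            raw = ""
    if raw:
        yield json.loads('"' + raw + '"')


stream = io.StringIO('hello \\"big\\" world", "next": 1}')
for piece in iter_string(stream, chunk_size=5):
    print(repr(piece))
```

Only chunk_size characters (plus at most one escape sequence) are held in memory at a time, and the stream is left positioned just after the closing quote.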

The main point is to allow the node object to be returned as soon as the
type is known, without having to wait until the data has been fully
parsed, and to require a separate step (for which there may be various
choices) in order to actually retrieve the data.
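
To show that the idea is workable, here is a minimal pure-Python sketch of such lazy nodes. Everything in it (Reader, Node, parse(), drain(), the type constants) is an assumption made for illustration, not an existing API. It decides the type from the first character, reads data only when data() is called, reads object keys eagerly, and skips unread children by draining them rather than by watching refcounts.

```python
import io
import json  # stdlib json, used here only to decode scalar tokens

NULL, BOOL, NUMBER, STRING, ARRAY, OBJECT = range(6)
FIRST_CHAR = {"{": OBJECT, "[": ARRAY, '"': STRING, "n": NULL, "t": BOOL, "f": BOOL}


class Reader:
    """One character of lookahead over a text stream."""

    def __init__(self, fp):
        self.fp, self.ahead = fp, None

    def peek(self):
        if self.ahead is None:
            self.ahead = self.fp.read(1)  # "" at end of input
        return self.ahead

    def take(self):
        ch, self.ahead = self.peek(), None
        return ch

    def skip_ws(self):
        while self.peek() != "" and self.peek() in " \t\r\n":
            self.take()
        return self.peek()


class Node:
    """Lazy value: the type is known at once, the data only on request."""

    def __init__(self, reader):
        self.reader = reader
        self.kind = FIRST_CHAR.get(reader.skip_ws(), NUMBER)
        self.started, self.result = False, None

    def type(self):
        return self.kind

    def data(self):
        if not self.started:
            self.started = True
            if self.kind in (ARRAY, OBJECT):
                self.result = self._children()  # an iterator, returned at once
            else:
                self.result = json.loads(self._scalar_text())
        return self.result

    def drain(self):
        """Consume whatever the caller left unread so the parent can go on."""
        rest = self.data()
        if self.kind in (ARRAY, OBJECT):
            for _ in rest:
                pass

    def _scalar_text(self):
        r = self.reader
        if self.kind != STRING:  # number, true, false or null
            text = ""
            while r.peek() != "" and r.peek() not in ",]} \t\r\n":
                text += r.take()
            return text
        text = r.take()  # opening quote
        while True:
            ch = r.take()
            if ch == "":
                raise ValueError("unterminated string")
            text += ch
            if ch == "\\":
                text += r.take()  # keep the escaped character
            elif ch == '"':
                return text

    def _children(self):
        r, closer = self.reader, "]" if self.kind == ARRAY else "}"
        r.take()  # opening bracket
        if r.skip_ws() == closer:
            r.take()
            return
        while True:
            if self.kind == OBJECT:
                key = Node(r)
                key.data()  # keys are read eagerly
                r.skip_ws()
                if r.take() != ":":
                    raise ValueError("expected ':'")
            child = Node(r)
            yield (key, child) if self.kind == OBJECT else child
            child.drain()
            r.skip_ws()
            sep = r.take()
            if sep == closer:
                return
            if sep != ",":
                raise ValueError("expected ',' or closing bracket")


def parse(fp):
    return Node(Reader(fp))


root = parse(io.StringIO('{"rc": 0, "data": [1, 2, 3]}'))
for key, value in root.data():
    if key.data() == "rc":
        print(value.data())
```

In this sketch drain() still decodes the values it skips; a C implementation could skip the bytes without building any Python objects, which is the saving the refcount idea is after.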
 
