Specifying an API for a streaming parser

Discussion in 'Python' started by tyler@monkeypox.org, Dec 4, 2009.

  1. tyler Guest

    Howdy folks, I'm working on a JSON Python module [1] and I'm struggling with an
    appropriate syntax for dealing with incrementally parsing streams of data as
    they come in (off a socket or file object).

    The underlying C-level parsing library that I'm using (Yajl [2]) already uses a
    callback system internally for handling such things, but I'm worried about:
    * Ease of use, simplicity
    * Python method invocation overhead going from C back into Python

    One of the ideas I've had is to "iterparse" a la:

    >>> for k, v in yajl.iterloads(fp):
    ...     print('key, value', k, v)
    >>>


    Effectively, iterloads() would build a generator over the JSON text coming
    off the `fp` object, reading more of the stream each time generator.next()
    is called (see the sketch after this list). This has some shortcomings,
    however:
    * For JSON like: '''{"rc":0,"data":<large JSON object>}''' the iterloads()
    function would block for some time when processing the value of the "data"
    key.
    * Presumes the developer has prior knowledge of the kind of JSON strings
    being passed in
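    To make the shape concrete, here's a rough proof-of-concept built on the
    stdlib json module's raw_decode() (only a sketch, nothing py-yajl-specific;
    it still materializes each value in full, which is exactly the first
    shortcoming above):

    import json

    def iterloads(fp, bufsize=4096):
        # Sketch only: yields each top-level (key, value) pair of a JSON
        # object as soon as it can be decoded from the buffered input.
        # Known wart: a bare number ending exactly at a buffer boundary
        # could be decoded prematurely.
        decoder = json.JSONDecoder()
        buf = fp.read(bufsize)
        pos = buf.index('{') + 1        # assume '{' arrives in the first read
        while True:
            start = pos
            try:
                while buf[pos] in ' \t\r\n,':
                    pos += 1            # skip whitespace and pair separators
                if buf[pos] == '}':
                    return              # end of the top-level object
                key, colon = decoder.raw_decode(buf, pos)
                colon = buf.index(':', colon) + 1
                while buf[colon] in ' \t\r\n':
                    colon += 1
                value, pos = decoder.raw_decode(buf, colon)
            except (ValueError, IndexError):
                pos = start             # incomplete pair: rewind, read more
                chunk = fp.read(bufsize)
                if not chunk:
                    raise ValueError('truncated JSON stream')
                buf += chunk
                continue
            yield key, value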

    I've searched around, following this "iterloads" notion, for a tree-generator
    and I came up with nothing.

    Any suggestions on how to accomplish iterloads, or perhaps a suggestion for a
    more sensible syntax for incrementally parsing objects from the stream and
    passing them up into Python?

    Cheers,
    -R. Tyler Ballance
    --------------------------------------
    Jabber:
    GitHub: http://github.com/rtyler
    Twitter: http://twitter.com/agentdero
    Blog: http://unethicalblogger.com



    [1] http://github.com/rtyler/py-yajl
    [2] http://lloyd.github.com/yajl/



    tyler, Dec 4, 2009
    #1

  2. Nobody Guest

    On Fri, 04 Dec 2009 13:51:15 -0800, tyler wrote:

    > Howdy folks, I'm working on a JSON Python module [1] and I'm struggling with
    > an appropriate syntax for dealing with incrementally parsing streams of data
    > as they come in (off a socket or file object).
    >
    > [snip]
    >
    > Any suggestions on how to accomplish iterloads, or perhaps a suggestion for a
    > more sensible syntax for incrementally parsing objects from the stream and
    > passing them up into Python?


    One option is to return values as opaque objects with .type() and .data()
    methods. The opaque object can be returned as soon as the parser starts to
    parse the value.

    If the user calls the .data() method for an atomic object (string,
    number, boolean, null) before parsing is complete, the call will block.
    For composite objects (array, object), the call will return an iterator
    immediately.

    If the user never calls the data() method, there's no need to convert the
    element to a Python value at all. And if the object's refcount reaches zero
    while parsing is still ongoing, the parser can drop whatever data it has
    accumulated for that element and discard the rest of it as it is read.
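    Sketching the consumer-visible node under this scheme: all it really needs
    is a latch that the parser releases once the value is complete. Something
    like this (the names and the parser-in-its-own-thread model are just my
    assumption, one option among several):

    import threading

    class Node(object):
        # One node per JSON value.  The parser, running in its own
        # thread under this design, calls _set() when the value is done.
        def __init__(self, type_):
            self._type = type_
            self._ready = threading.Event()
            self._value = None

        def type(self):
            return self._type

        def data(self):
            # Blocks for atomic values until the parser finishes them.
            # For composites the parser calls _set() with an iterator as
            # soon as the opening bracket is seen, so this returns at once.
            self._ready.wait()
            return self._value

        def _set(self, value):          # called from the parser side
            self._value = value
            self._ready.set()

    A coroutine-driven parser could present the same interface without the
    extra thread; the only fixed point is that data() is where the consumer
    waits.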

    E.g. a program to read JSON data from stdin and print the data back to
    stdout in (approximately) JSON format would look like:

    import sys

    def print_json(f, node):
        if node.type() == json.NULL:
            f.write("null")
        elif node.type() == json.BOOL:
            f.write("true" if node.data() else "false")
        elif node.type() == json.NUMBER:
            f.write(str(node.data()))
        elif node.type() == json.STRING:
            f.write('"' + node.data() + '"')  # no escaping: approximate JSON
        elif node.type() == json.ARRAY:
            f.write('[')
            for i, v in enumerate(node.data()):
                if i > 0: f.write(',')
                print_json(f, v)
            f.write(']')
        elif node.type() == json.OBJECT:
            f.write('{')
            for i, (k, v) in enumerate(node.data()):
                if i > 0: f.write(',')
                print_json(f, k)
                f.write(": ")
                print_json(f, v)
            f.write('}')

    root = json.parse(sys.stdin)
    print_json(sys.stdout, root)

    For greater pythonicity, you could make the composite types implement the
    iterator interface directly, so the data() method becomes redundant (if
    called, it would just return "self"), and use distinct classes for the
    distinct types (so that you can use type() or isinstance()), i.e.:

    def print_json(f, node):
        if isinstance(node, json.Null):
            f.write("null")
        elif isinstance(node, json.Bool):
            f.write("true" if node.data() else "false")
        elif isinstance(node, json.Number):
            f.write(str(node.data()))
        elif isinstance(node, json.String):
            f.write('"' + node.data() + '"')
        elif isinstance(node, json.Array):
            f.write('[')
            for i, v in enumerate(node):
                if i > 0: f.write(',')
                print_json(f, v)
            f.write(']')
        elif isinstance(node, json.Object):
            f.write('{')
            for i, (k, v) in enumerate(node):
                if i > 0: f.write(',')
                print_json(f, k)
                f.write(": ")
                print_json(f, v)
            f.write('}')

    root = json.parse(sys.stdin)
    print_json(sys.stdout, root)

    If you think that some of the individual strings may be large, you could
    make the String class implement the iterator interface (or even the file
    interface with .read() etc), to allow the data to be read incrementally:

    elif isinstance(node, json.String):
        f.write('"')
        for s in node:
            f.write(s)
        f.write('"')

    The main point is to allow the node object to be returned as soon as the
    type is known, without having to wait until the data has been fully
    parsed, and to require a separate step (for which there may be various
    choices) in order to actually retrieve the data.
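    To tie this back to your '''{"rc":0,"data":<large JSON object>}'''
    example: under this interface the consumer sees "rc" immediately and can
    stream "data" lazily instead of blocking on it (same hypothetical json
    module and print_json() as above):

    root = json.parse(sys.stdin)        # returns as soon as '{' is seen
    for k, v in root:                   # pairs appear as they are parsed
        if k.data() == 'rc':
            print('rc =', v.data())     # small atom: blocks only briefly
        elif k.data() == 'data':
            print_json(sys.stdout, v)   # composite: recurse and stream,
                                        # never holding the whole value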
    Nobody, Dec 5, 2009
    #2
