parsing xml (xmpp) with ruby

Discussion in 'Ruby' started by Eric Will, Sep 27, 2008.

  1. Eric Will

    Eric Will Guest

    Hello World,

    I am writing an XMPP (Jabber) server in Ruby. XMPP uses XML for its
    protocol. This means I have to do a good deal of XML parsing, in Ruby.

    Right now I am using REXML to parse the individual stanzas as they
    come in. However, in order to do this without REXML complaining of
    "multiple root elements" (that is, XMPP is streaming XML over a TCP
    socket, so I only get the root element once) I have to wrap every
    incoming chunk of XMPP with my own <root/> tag, and then ignore that
    after REXML parses it. I am currently unhappy with this approach.
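
    For readers following along, the wrapping approach described above might look roughly like the sketch below. The helper name `parse_stanzas` and the sample chunk are illustrative assumptions, not code from the actual server, and the sketch assumes each chunk contains only complete stanzas (a chunk that ends mid-stanza would make REXML raise a parse error).

```ruby
require 'rexml/document'

# Hypothetical helper: wrap a chunk of complete stanzas in a throwaway
# <root> element so REXML sees a single root, then return only the children.
def parse_stanzas(chunk)
  doc = REXML::Document.new("<root>#{chunk}</root>")
  doc.root.elements.to_a
end

# Made-up sample input: two stanzas arriving in one read from the socket.
chunk = "<presence/><message to='a@example.com'><body>hi</body></message>"
stanzas = parse_stanzas(chunk)
names = stanzas.map(&:name)
```

    The artificial root never leaves the helper, so the rest of the code only ever sees the stanza elements.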

    Another option is to use REXML's stream parsing. I don't really like
    this idea. It seems the only benefit of using SAX(ish) parsing is when
    you're dealing with huge documents that you don't want to load into
    memory. This isn't the case. I get maybe 5-10 objects per parse. Most
    of the people I've talked to in XMPP insist on using SAX (or something
    like it, such as REXML's stream parsing). The other reason I don't
    like REXML's stream parsing (or libxml's SAX) is because I have to
    provide a class instance for it to use for the event-parsing, and this
    class has to be a giant state machine, which seems wrong to me. I
    don't want to have to write a complicated class to, in effect, parse
    the XML myself when the XML parser should be doing this for me.
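
    For reference, the kind of listener class being described could be sketched as follows. This is only an illustration under assumptions: the class name `StanzaCollector` and the sample stream are made up, and the input is a complete string rather than a live socket. The only state it tracks is the nesting depth and the element currently being built; each direct child of the stream root is rebuilt as a REXML element.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: rebuilds every direct child of the stream root
# (one XMPP stanza) as a small REXML::Element tree.
class StanzaCollector
  include REXML::StreamListener

  attr_reader :stanzas

  def initialize
    @stanzas = []
    @current = nil # element currently being built, nil outside a stanza
    @depth = 0
  end

  def tag_start(name, attrs)
    @depth += 1
    return if @depth == 1 # the stream root itself is not a stanza

    element = REXML::Element.new(name)
    attrs.each { |key, value| element.add_attribute(key, value) }
    @current.add_element(element) if @current
    @current = element
  end

  def tag_end(_name)
    @depth -= 1
    return if @current.nil? # closing tag of the stream root

    if @depth == 1
      @stanzas << @current # a whole stanza has been read
      @current = nil
    else
      @current = @current.parent
    end
  end

  def text(data)
    @current.add_text(data) if @current
  end
end

# Made-up sample stream with a single root and two stanzas.
xml = "<stream><presence/><message to='a@example.com'><body>hi</body></message></stream>"
collector = StanzaCollector.new
REXML::Document.parse_stream(xml, collector)
```

    After parsing, `collector.stanzas` holds one element per stanza, without any wrapper element having been added to the input.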

    The other options include using hpricot to do the incoming parsing
    (since it's C, and way faster than REXML) and continue to use REXML
    for generating the outgoing XML (I can't seem to figure out how to do
    this in hpricot, if it's even possible). Although, XMPP requires XML
    well-formedness, and hpricot does not do validation (to the best of my
    knowledge). I also like xml-simple, but it uses REXML underneath it
    all, so I'm left with the same issues.

    My real question is, is there a GOOD REASON to switch from the scheme I
    currently use? A number of people seem to think it's the "Wrong Thing"
    to do, but I'm not quite sure what the "Right Thing" is. I don't think
    it's SAX.

    Thanks for any feedback.

    -- rakaur
    Eric Will, Sep 27, 2008
    #1

  2. Dejan Dimic

    Dejan Dimic Guest

    Every problem can have multiple solutions.

    Personally, I would go for SAX processing of the incoming XML
    stream.
    It can't be that hard to build an event-driven solution, and the state
    machine should not be more complicated than the DOM node processing.
    The benefit is that you can start building the response while you
    are still processing the XML input.
    You can't get much faster than that.

    If you think it's not your cup of tea, that's totally OK.

    If you have to parse chunks of XML data, then hpricot is my favorite
    choice.
    While analyzing the DOM for nodes of interest, preferably with XPath,
    you should build the response.
    You can do that with hpricot too.

    In a word, do it as you see fit, and then try to make it better. :)
    Dejan Dimic, Sep 27, 2008
    #2

  3. Robert Klemme

    On 27.09.2008 21:27, Eric Will wrote:
    > Another option is to use REXML's stream parsing. I don't really like
    > this idea. It seems the only benefit of using SAX(ish) parsing is when
    > you're dealing with huge documents that you don't want to load into
    > memory. This isn't the case. I get maybe 5-10 objects per parse. Most
    > of the people I've talked to in XMPP insist on using SAX (or something
    > like it, such as REXML's stream parsing). The other reason I don't
    > like REXML's stream parsing (or libxml's SAX) is because I have to
    > provide a class instance for it to use for the event-parsing, and this
    > class has to be a giant state machine, which seems wrong to me. I
    > don't want to have to write a complicated class to, in effect, parse
    > the XML myself when the XML parser should be doing this for me.


    Well, this is not true. You can have multiple classes cooperating in
    doing XML stream parsing. You need one instance for receiving the
    events but that can delegate to any number of other instances. A scheme
    I usually use is to have a class per element type and the front end
    instance keeps a stack of those.
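
    A minimal sketch of that scheme, using REXML's stream listener interface, might look like this. All class names (`ElementHandler`, `MessageHandler`, `FrontEnd`) and the sample stream are hypothetical, and only one element type gets its own class here; anything else falls back to a generic handler.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical generic handler: one instance per element occurrence.
class ElementHandler
  attr_reader :name, :attrs, :text

  def initialize(name, attrs)
    @name = name
    @attrs = attrs
    @text = String.new
  end

  def on_text(data)
    @text << data
  end

  # Called when a child element has been read completely; no-op by default.
  def on_child(_child); end
end

# Hypothetical handler for <message/>: remembers the text of its <body/>.
class MessageHandler < ElementHandler
  attr_reader :body

  def on_child(child)
    @body = child.text if child.name == 'body'
  end
end

# The single instance that receives the parser events. It keeps a stack of
# handlers and delegates each event to the handler on top of the stack.
class FrontEnd
  include REXML::StreamListener

  HANDLERS = { 'message' => MessageHandler }.freeze

  attr_reader :finished

  def initialize
    @stack = []
    @finished = []
  end

  def tag_start(name, attrs)
    @stack.push(HANDLERS.fetch(name, ElementHandler).new(name, attrs))
  end

  def text(data)
    @stack.last.on_text(data) unless @stack.empty?
  end

  def tag_end(_name)
    handler = @stack.pop
    parent = @stack.last
    parent.on_child(handler) if parent
    # Only the stream root is left on the stack: a whole stanza is done.
    @finished << handler if @stack.size == 1
  end
end

# Made-up sample stream.
xml = "<stream><presence/><message to='a@example.com'><body>hi</body></message></stream>"
front = FrontEnd.new
REXML::Document.parse_stream(xml, front)
```

    Each handler only knows about its own element, so supporting another element type means adding one class and one entry in `HANDLERS` rather than growing a single large state machine.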

    Typically XML is parsed to instantiate classes of a particular object
    model that is built to implement the business logic (in your case
    message exchange). It is a waste of resources to create an XML DOM and
    then traverse it in order to transform it into other objects. Also, not
    all input data is needed in every case. That's why stream parsing has
    serious advantages over DOM parsing.

    OTOH, if you can do all your processing efficiently on the DOM then
    maybe that is a better way. In your situation I would still choose the
    stream approach because it also better fits the way the data is provided.

    My 0.02 EUR

    robert
    Robert Klemme, Sep 27, 2008
    #3
  4. Eric Will

    Eric Will Guest

    On Sat, Sep 27, 2008 at 5:39 PM, Robert Klemme wrote:

    > Well, this is not true. You can have multiple classes cooperating in doing
    > XML stream parsing. You need one instance for receiving the events but that
    > can delegate to any number of other instances. A scheme I usually use is to
    > have a class per element type and the front end instance keeps a stack of
    > those.


    This is how I'd implement it. I just don't wanna.

    > Typically XML is parsed to instantiate classes of a particular object model
    > that is built to implement the business logic (in your case message
    > exchange). It is a waste of resources to create an XML DOM and then
    > traverse it in order to transform it into other objects. Also, not all
    > input data is needed in every case. That's why stream parsing has serious
    > advantages over DOM parsing.


    The thing is, I'm only parsing out like 5-10 objects at a time. It's
    nothing huge to traverse, but I'm thinking it'll be a hard performance
    hit to keep on like that when I try to scale.

    > robert


    -- rakaur
    Eric Will, Sep 27, 2008
    #4