ElementTree should parse string and file in the same way

Discussion in 'Python' started by Peter Pei, Dec 31, 2007.

  1. Peter Pei

    Peter Pei Guest

    One bad design about elementtree is that it has different ways parsing a
    string and a file, even worse they return different objects:
    1) When you parse a file, you can simply call parse, which returns a
    elementtree, on which you can then apply xpath;
    2) To parse a string (xml section), you can call XML or fromstring, but both
    return element instead of elementtree. This alone is bad. To make it worse,
    you have to create an elementtree from this element before you can utilize
    xpath.
    Peter Pei, Dec 31, 2007
    #1

  2. Paddy

    Paddy Guest

    On Dec 31, 3:42 am, "Peter Pei" <> wrote:
    > One bad design about elementtree is that it has different ways parsing a
    > string and a file, even worse they return different objects:
    > 1) When you parse a file, you can simply call parse, which returns a
    > elementtree, on which you can then apply xpath;
    > 2) To parse a string (xml section), you can call XML or fromstring, but both
    > return element instead of elementtree. This alone is bad. To make it worse,
    > you have to create an elementtree from this element before you can utilize
    > xpath.


    I haven't tried this, but you should be able to wrap your text string
    so that it looks like a file using the stringio module and pass that
    to elementtree:

    http://blog.doughellmann.com/2007/04/pymotw-stringio-and-cstringio.html
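
    A minimal sketch of that suggestion, assuming Python 3, where the
    file-like string wrapper lives in the io module (in the Python 2 of this
    thread it was the StringIO or cStringIO module). The sample XML and the
    variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
xml_text = "<root><child name='a'/><child name='b'/></root>"

# Wrap the string in a file-like object so parse() can read from it.
tree = ET.parse(io.StringIO(xml_text))

# parse() returns an ElementTree; getroot() gives the top-level Element.
root = tree.getroot()
child_names = [child.get("name") for child in root.findall("child")]
```

    With this, a string goes through the same parse() call as a file and
    comes back as the same kind of object.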

    - Paddy.
    Paddy, Dec 31, 2007
    #2

  3. Re: ElementTree should parse string and file in the same way

    Peter Pei wrote:
    > One bad design about elementtree is that it has different ways parsing a
    > string and a file, even worse they return different objects:
    > 1) When you parse a file, you can simply call parse, which returns a
    > elementtree, on which you can then apply xpath;


    ElementTree doesn't support XPath. In case you mean the simpler ElementPath
    language that is supported by the find*() methods, I do not see a reason why
    you can't use it on elements.
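
    As a rough illustration, assuming Python 3 and the standard library
    xml.etree.ElementTree module, the find*() methods can be called directly
    on the Element that fromstring() returns. The sample document is
    hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
root = ET.fromstring("<doc><item id='1'/><group><item id='2'/></group></doc>")

# fromstring() returns an Element, and the find*() methods work on it directly.
first_item = root.find("item")         # first direct child named "item"
direct_items = root.findall("item")    # all direct children named "item"
all_items = root.findall(".//item")    # all descendants named "item"
```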


    > 2) To parse a string (xml section), you can call XML or fromstring, but
    > both return element instead of elementtree. This alone is bad. To make
    > it worse, you have to create an elementtree from this element before you
    > can utilize xpath.


    a) how hard is it to write a wrapper function around fromstring() that wraps
    the result Element in an ElementTree object and returns it?

    b) the same as above applies: I can't see the problem you are talking about.
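
    Such a wrapper might look like the sketch below, assuming Python 3.
    The name parse_string is hypothetical and not part of ElementTree.

```python
import xml.etree.ElementTree as ET

def parse_string(text):
    """Hypothetical helper: parse XML text and return an ElementTree."""
    return ET.ElementTree(ET.fromstring(text))

tree = parse_string("<root><child/></root>")
```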

    Stefan
    Stefan Behnel, Dec 31, 2007
    #3
  4. Peter Pei

    Peter Pei Guest

    Re: ElementTree should parse string and file in the same way

    You are talking shit. It is never about whether it is hard to write a
    wrapper. It is about bad design. I should be able to parse a string and a
    file in exactly same way, and that should be provided as part of the
    package.

    Looks like you are just a code monkey not a designer, so I forgive you. You
    didn't understand the issue I described? That's your issue. You are not at
    the same level to talk to me, so chill.
    ===================================================================


    "Stefan Behnel" <> wrote in message
    news:...
    > Peter Pei wrote:
    >> One bad design about elementtree is that it has different ways parsing a
    >> string and a file, even worse they return different objects:
    >> 1) When you parse a file, you can simply call parse, which returns a
    >> elementtree, on which you can then apply xpath;

    >
    > ElementTree doesn't support XPath. In case you mean the simpler
    > ElementPath
    > language that is supported by the find*() methods, I do not see a reason
    > why
    > you can't use it on elements.
    >
    >
    >> 2) To parse a string (xml section), you can call XML or fromstring, but
    >> both return element instead of elementtree. This alone is bad. To make
    >> it worse, you have to create an elementtree from this element before you
    >> can utilize xpath.

    >
    > a) how hard is it to write a wrapper function around fromstring() that
    > wraps
    > the result Element in an ElementTree object and returns it?
    >
    > b) the same as above applies: I can't see the problem you are talking
    > about.
    >
    > Stefan
    Peter Pei, Jan 1, 2008
    #4
  5. Peter Pei

    Peter Pei Guest

    Re: ElementTree should parse string and file in the same way

    To be preise, XPath is not fully supported. Don't be a smart asshole.
    =====================================================================
    "Stefan Behnel" <> wrote in message
    news:...
    > Peter Pei wrote:
    >> One bad design about elementtree is that it has different ways parsing a
    >> string and a file, even worse they return different objects:
    >> 1) When you parse a file, you can simply call parse, which returns a
    >> elementtree, on which you can then apply xpath;

    >
    > ElementTree doesn't support XPath. In case you mean the simpler
    > ElementPath
    > language that is supported by the find*() methods, I do not see a reason
    > why
    > you can't use it on elements.
    >
    >
    >> 2) To parse a string (xml section), you can call XML or fromstring, but
    >> both return element instead of elementtree. This alone is bad. To make
    >> it worse, you have to create an elementtree from this element before you
    >> can utilize xpath.

    >
    > a) how hard is it to write a wrapper function around fromstring() that
    > wraps
    > the result Element in an ElementTree object and returns it?
    >
    > b) the same as above applies: I can't see the problem you are talking
    > about.
    >
    > Stefan
    Peter Pei, Jan 1, 2008
    #5
  6. Re: ElementTree should parse string and file in the same way

    On Tue, 01 Jan 2008 01:53:47 +0000, Peter Pei wrote:

    > You are talking shit. It is never about whether it is hard to write a
    > wrapper. It is about bad design. I should be able to parse a string and
    > a file in exactly same way, and that should be provided as part of the
    > package.


    Oh my, somebody decided to start the new year with all guns blazing.

    Before abusing anyone else, have you considered asking *why* ElementTree
    does not treat files and strings the same way? I believe the writer of
    ElementTree, Fredrik Lundh, frequents this newsgroup.

    It may be that Fredrik doesn't agree with you that you should be able to
    parse a string and a file the same way, in which case there's nothing you
    can do but work around it. On the other hand, perhaps he just hasn't had
    a chance to implement that functionality, and would welcome a patch.

    Fredrik, if you're reading this, I'm curious what your reason is. I don't
    have an opinion on whether you should or shouldn't treat files and
    strings the same way. Over to you...



    --
    Steven
    Steven D'Aprano, Jan 1, 2008
    #6
  7. Re: ElementTree should parse string and file in the same way

    Peter Pei wrote:
    > To be preise

    [...]

    Preise the lord, not me. :)

    Happy New Year!

    Stefan
    Stefan Behnel, Jan 1, 2008
    #7
  8. Re: ElementTree should parse string and file in the same way

    Steven D'Aprano schrieb:
    > On Tue, 01 Jan 2008 01:53:47 +0000, Peter Pei wrote:
    >
    >> You are talking shit. It is never about whether it is hard to write a
    >> wrapper. It is about bad design. I should be able to parse a string and
    >> a file in exactly same way, and that should be provided as part of the
    >> package.

    >
    > Oh my, somebody decided to start the new year with all guns blazing.
    >
    > Before abusing anyone else, have you considered asking *why* ElementTree
    > does not treat files and strings the same way? I believe the writer of
    > ElementTree, Fredrik Lundh, frequents this newsgroup.
    >
    > It may be that Fredrik doesn't agree with you that you should be able to
    > parse a string and a file the same way, in which case there's nothing you
    > can do but work around it. On the other hand, perhaps he just hasn't had
    > a chance to implement that functionality, and would welcome a patch.
    >
    > Fredrik, if you're reading this, I'm curious what your reason is. I don't
    > have an opinion on whether you should or shouldn't treat files and
    > strings the same way. Over to you...


    I think the decision is pretty clear to everybody who is a code-monkey
    and not a Peter-Pei-School-of-Excellent-And-Decent-Designers-attendant:

    when building an XML document, you start from an Element or ElementTree
    and often do things like


    root_element = <some_element>
    for child in some_objects:
        root_element.append(XML("""<child attribute="%i"/>""" %
                                child.attribute))

    Which is such a common usage-pattern that it would be extremely annoying
    to get a document from XML/fromstring and then needing to extract the
    root-element from it.
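
    A runnable variant of this pattern, as a sketch under the assumption of
    Python 3 and the standard library module. The root tag and the attribute
    values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Start from a plain Element and append fragments parsed from strings.
root_element = ET.Element("root")
for value in (1, 2, 3):
    # XML() returns an Element, so it can be appended without unwrapping.
    root_element.append(ET.XML('<child attribute="%i"/>' % value))

attributes = [child.get("attribute") for child in root_element]
```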

    And codemonkeys know that in python

    doc = et.parse(StringIO(string))

    is just one import away, which people who attend to
    Peter-Pei-School-of-Excellent-And-Decent-Designers may have not learned
    yet - because they are busy praising themselves and coating each other
    in edible substances before stepping out into the world and having all
    code-monkeys lick off their greatness in awe.

    http://www.youtube.com/watch?v=FM7Rpf1x7RU

    Diez
    Diez B. Roggisch, Jan 1, 2008
    #8
  9. Re: ElementTree should parse string and file in the same way

    On Tue, 01 Jan 2008 13:36:57 +0100, Diez B. Roggisch wrote:

    > And codemonkeys know that in python
    >
    > doc = et.parse(StringIO(string))
    >
    > is just one import away


    Yes, but to play devil's advocate for a moment,

    doc = et.parse(string_or_file)

    would be even simpler.

    Is there any reason why it should not behave that way? It could be as
    simple as adding a couple of lines to the parse method:

    if isinstance(arg, str):
        from StringIO import StringIO
        arg = StringIO(arg)

    I'm not saying it *should*, I'm asking if there's a reason it *shouldn't*.

    "I find it aesthetically distasteful" would be a perfectly acceptable
    answer -- not one I would agree with, but I could accept it.



    --
    Steven
    Steven D'Aprano, Jan 1, 2008
    #9
  10. Re: ElementTree should parse string and file in the same way

    Steven D'Aprano wrote:
    > On Tue, 01 Jan 2008 13:36:57 +0100, Diez B. Roggisch wrote:
    >
    >> And codemonkeys know that in python
    >>
    >> doc = et.parse(StringIO(string))
    >>
    >> is just one import away

    >
    > Yes, but to play devil's advocate for a moment,
    >
    > doc = et.parse(string_or_file)
    >
    > would be even simpler.


    I assume the problem with this is that it would be ambiguous. You can
    already use either a string or a file with ``et.parse``. A string is
    interpreted as a file name, while a file object is used directly.

    How would you differentiate between a string that's supposed to be a
    file name, and a string that's supposed to be XML?
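
    To make the ambiguity concrete, here is a small sketch, assuming Python 3
    and the standard library module. It writes a throwaway temporary file so
    it is self-contained; the variable names are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<root><child/></root>")
    path = handle.name

try:
    # A string argument is treated as a file name...
    tree_from_name = ET.parse(path)
    # ...while an open file object is read directly.
    with open(path, "rb") as source:
        tree_from_file = ET.parse(source)
    # XML text passed to parse() is looked up as a file name and fails.
    try:
        ET.parse("<root/>")
        xml_text_accepted = True
    except OSError:
        xml_text_accepted = False
finally:
    os.remove(path)
```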

    Steve
    Steven Bethard, Jan 1, 2008
    #10
  11. Re: ElementTree should parse string and file in the same way

    On Tue, 01 Jan 2008 12:59:44 -0700, Steven Bethard wrote:

    > Steven D'Aprano wrote:
    >> On Tue, 01 Jan 2008 13:36:57 +0100, Diez B. Roggisch wrote:
    >>
    >>> And codemonkeys know that in python
    >>>
    >>> doc = et.parse(StringIO(string))
    >>>
    >>> is just one import away

    >>
    >> Yes, but to play devil's advocate for a moment,
    >>
    >> doc = et.parse(string_or_file)
    >>
    >> would be even simpler.

    >
    > I assume the problem with this is that it would be ambiguous. You can
    > already use either a string or a file with ``et.parse``. A string is
    > interpreted as a file name, while a file object is used directly.


    Ah! I wasn't aware that parse() operated on either an open file object or
    a string file name. That's an excellent reason for not treating strings
    the same as files in ElementTree.



    > How would you differentiate between a string that's supposed to be a
    > file name, and a string that's supposed to be XML?


    Well, naturally I wouldn't.

    I *could*, if I assumed that a multi-line string that started with "<"
    was XML, and a single-line string with the path separator character or
    ending in ".xml" was a file name, but that sort of Do What I Mean coding
    is foolish in a library function that can't afford to occasionally Do The
    Wrong Thing.
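
    A sketch of how such guessing can go wrong. Both functions are
    hypothetical, and "/" is assumed as the path separator.

```python
def looks_like_xml(text):
    """Guess: a multi-line string that starts with '<' is XML."""
    return "\n" in text and text.lstrip().startswith("<")

def looks_like_filename(text):
    """Guess: a single-line string with '/' or ending in '.xml' is a file name."""
    return "\n" not in text and ("/" in text or text.endswith(".xml"))

# A one-line XML document is misread as a file name, because "/" appears in it.
misread = looks_like_filename("<root/>")
```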


    --
    Steven
    Steven D'Aprano, Jan 1, 2008
    #11
  12. Peter Pei

    Peter Pei Guest

    Re: ElementTree should parse string and file in the same way

    To answer something posted deep down... It is fine with me if there are two
    functions - one to parse a file or file handler and one to parse a string,
    yet the returned objects should be consistent.
    Peter Pei, Jan 2, 2008
    #12
  13. Re: ElementTree should parse string and file in the same way

    Steven D'Aprano wrote:

    > Fredrik, if you're reading this, I'm curious what your reason is. I don't
    > have an opinion on whether you should or shouldn't treat files and
    > strings the same way. Over to you...


    as Diez shows, it's all about use cases.

    and as anyone who's used my libraries or read my code knows, I'm a big
    fan of minimalistic but highly composable object API:s and liberal use
    of short helper functions to wire them up to fit the task at hand.

    kitchen sink API design is a really bad idea, for more reasons than I
    can fit in this small editor window.

    </F>
    Fredrik Lundh, Jan 2, 2008
    #13
  14. Chris Mellon

    Chris Mellon Guest

    Re: ElementTree should parse string and file in the same way

    On Jan 2, 2008 8:56 AM, Fredrik Lundh <> wrote:
    > Steven D'Aprano wrote:
    >
    > > Fredrik, if you're reading this, I'm curious what your reason is. I don't
    > > have an opinion on whether you should or shouldn't treat files and
    > > strings the same way. Over to you...

    >
    > as Diez shows, it's all about use cases.
    >
    > and as anyone who's used my libraries or read my code knows, I'm a big
    > fan of minimalistic but highly composable object API:s and liberal use
    > of short helper functions to wire them up to fit the task at hand.
    >
    > kitchen sink API design is a really bad idea, for more reasons than I
    > can fit in this small editor window.
    >


    On that note, I really don't like APIs that take either a file name or
    a file object - I can open my own files, thanks. File objects are
    fantastic abstractions and open(fname) is even shorter than
    StringIO(somedata).

    My take on the API decision in question was always that a file is
    inherently an XML *document*, while a string is inherently an XML
    *fragment*.
    Chris Mellon, Jan 2, 2008
    #14
  15. Re: ElementTree should parse string and file in the same way

    Hi,

    Chris Mellon wrote:
    > On that note, I really don't like APIs that take either a file name or
    > a file object - I can open my own files, thanks.


    .... and HTTP URLs, and FTP URLs. In lxml, there is a performance difference
    between passing an open file (which is read in Python space using the read()
    method) and passing a file name or URL, which is passed on to libxml2 (and
    thus doesn't require the GIL at parse time). That's only one reason why I like
    APIs that allow me to pass anything that points to a file - be it an open file
    object, a local file path or a URL - and they just Do The Right Thing with it.

    I find that totally pythonic.


    > open(fname) is even shorter than StringIO(somedata).


    It doesn't serve the same purpose, though.


    > My take on the API decision in question was always that a file is
    > inherently an XML *document*, while a string is inherently an XML
    > *fragment*.


    Not inherently, no. I know some people who do web processing with an XML
    document coming in as a string (from an HTTP request) and a result XML
    document going out as a string. I don't think that's an uncommon use case.

    Stefan
    Stefan Behnel, Jan 3, 2008
    #15
  16. Re: ElementTree should parse string and file in the same way

    Stefan Behnel wrote:

    >> My take on the API decision in question was always that a file is
    >> inherently an XML *document*, while a string is inherently an XML
    >> *fragment*.

    >
    > Not inherently, no. I know some people who do web processing with an XML
    > document coming in as a string (from an HTTP request) /.../


    in which case you probably want to stream the raw XML through the parser
    *as it arrives*, to reduce latency (to do that, either parse from a
    file-like object, or feed data directly to a parser instance, via the
    consumer protocol).
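
    A minimal sketch of feeding a parser instance piece by piece, assuming
    Python 3 and the standard library XMLParser. The chunks are hypothetical
    and stand in for data arriving from a network connection.

```python
import xml.etree.ElementTree as ET

# Hypothetical chunks, standing in for data arriving from a socket.
chunks = ["<root><item>fir", "st</item><item>sec", "ond</item></root>"]

parser = ET.XMLParser()
for chunk in chunks:
    parser.feed(chunk)   # hand over each piece as it arrives
root = parser.close()    # returns the root Element once input is complete

items = [item.text for item in root.findall("item")]
```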

    also, putting large documents in a *single* Python string can be quite
    inefficient. it's often more efficient to use lists of string fragments.
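
    One way to parse such a list without joining it first, as a sketch
    assuming a Python 3 standard library, where ElementTree provides
    fromstringlist(). The fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments, e.g. chunks collected from a network read.
fragments = ["<root>", "<a>one</a>", "<a>two</a>", "</root>"]

# Parse straight from the list, without first joining it into one big string.
root = ET.fromstringlist(fragments)
texts = [a.text for a in root.findall("a")]
```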

    </F>
    Fredrik Lundh, Jan 3, 2008
    #16
  17. Re: ElementTree should parse string and file in the same way

    Fredrik Lundh wrote:
    > Stefan Behnel wrote:
    >
    >>> My take on the API decision in question was always that a file is
    >>> inherently an XML *document*, while a string is inherently an XML
    >>> *fragment*.

    >>
    >> Not inherently, no. I know some people who do web processing with an XML
    >> document coming in as a string (from an HTTP request) /.../

    >
    > in which case you probably want to stream the raw XML through the parser
    > *as it arrives*, to reduce latency (to do that, either parse from a
    > file-like object, or feed data directly to a parser instance, via the
    > consumer protocol).


    It depends on the abstraction the web framework provides. If it allows you to
    do that, especially in an event driven way, that's obviously the most
    efficient implementation (and both ElementTree and lxml support this use
    pattern just fine). However, some frameworks just pass the request content
    (such as a POSTed document) in a dictionary or as callback parameters, in
    which case there's little room for optimisation.


    > also, putting large documents in a *single* Python string can be quite
    > inefficient. it's often more efficient to use lists of string fragments.


    That's a pretty general statement. Do you mean in terms of reading from that
    string (which at least in lxml is a straight forward extraction of a char*/len
    pair which is passed into libxml2), constructing that string (possibly from
    partial strings, which temporarily *is* expensive) or just keeping the string
    in memory?

    At least lxml doesn't benefit from iterating over a list of strings and
    passing it to libxml2 step-by-step, compared to reading from a straight
    in-memory string. Here are some numbers:

    $ cat listtest.py
    from lxml import etree

    # a list of strings is more memory expensive than a straight string
    doc_list = ["<root>"] + ["<a>test</a>"] * 2000 + ["</root>"]
    # document construction temporarily ~doubles memory size
    doc = "".join(doc_list)

    def readlist():
        tree = etree.fromstringlist(doc_list)

    def readdoc():
        tree = etree.fromstring(doc)

    $ python -m timeit -s 'from listtest import readlist,readdoc' 'readdoc()'
    1000 loops, best of 3: 1.74 msec per loop

    $ python -m timeit -s 'from listtest import readlist,readdoc' 'readlist()'
    100 loops, best of 3: 2.46 msec per loop

    The performance difference stays somewhere around 20-30% even for larger
    documents. So, as expected, there's a trade-off between temporary memory size,
    long-term memory size and parser performance here.

    Stefan
    Stefan Behnel, Jan 3, 2008
    #17
  18. Re: ElementTree should parse string and file in the same way

    Stefan Behnel wrote:

    >> also, putting large documents in a *single* Python string can be quite
    >> inefficient. it's often more efficient to use lists of string fragments.

    >
    > That's a pretty general statement. Do you mean in terms of reading from that
    > string (which at least in lxml is a straight forward extraction of a char*/len
    > pair which is passed into libxml2), constructing that string (possibly from
    > partial strings, which temporarily *is* expensive) or just keeping the string
    > in memory?


    overall I/O throughput. it's of course construction and internal
    storage that are the main issues here; every extra copy has a cost, and
    if you're working with multi-megabyte resources, the extra expenses
    quickly become noticeable.

    </F>
    Fredrik Lundh, Jan 3, 2008
    #18