Hpricot - best way to parse based on comments

Discussion in 'Ruby' started by Jerome ---, Nov 20, 2006.

  1. Jerome ---

    Jerome --- Guest

    I am trying to parse some files that contain comments like this:

    <html>
    <body>

    <!-- BEGIN ad_content -->

    images, text, etc...

    <!-- END ad_content -->

    Interesting text of site here.

    </body>
    </html>


    I am wondering how to go about extracting the data within the comments
    block using Hpricot. I am not aware of a way to refer to commented HTML
    through CSS or XPath selectors.

    Thanks for any ideas!

    - Jerome

    --
    Posted via http://www.ruby-forum.com/.
    Jerome ---, Nov 20, 2006
    #1

  2. On 11/20/06, Jerome --- wrote:
    > I am trying to parse some files that contain comments like this:
    > ...
    > I am not aware of a way to refer to commented HTML
    > through CSS or XPath selectors.


    The XPath comment() node test selects every comment node in a document.

    For example, using the XMLStarlet command-line tool (the XPath
    expression follows the -m flag):
    keith@devel ~ $ xml sel -t -m '//comment()' -v '.' -n simple.xml
    one comment
    two comment

    keith@devel ~ $ cat simple.xml
    <simple>
    <!-- one comment -->
    <foo/>
    <!-- two comment -->
    <bar/>
    </simple>
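
    The same query can be sketched in Ruby. This is a minimal illustration
    that uses REXML from the standard distribution rather than Hpricot, since
    the thread does not confirm that Hpricot accepts comment() in its
    selectors; the variable names are made up for the example. Note that
    comment() returns the comment nodes themselves, not the markup that sits
    between a BEGIN/END pair.

```ruby
require 'rexml/document'

# Same document as simple.xml above.
xml = <<~XML
  <simple>
  <!-- one comment -->
  <foo/>
  <!-- two comment -->
  <bar/>
  </simple>
XML

doc = REXML::Document.new(xml)

# '//comment()' matches every comment node anywhere in the tree.
# REXML::Comment#string holds the text between <!-- and -->.
comments = REXML::XPath.match(doc, '//comment()').map { |c| c.string.strip }

puts comments
```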


    HTH,
    Keith
    Keith Fahlgren, Nov 20, 2006
    #2

  3. Ken Bloom

    Ken Bloom Guest

    On Tue, 21 Nov 2006 07:52:12 +0900, Jerome --- wrote:

    > I am trying to parse some files that contain comments like this:
    >
    > <html>
    > <body>
    >
    > <!-- BEGIN ad_content -->
    >
    > images, text, etc...
    >
    > <!-- END ad_content -->
    >
    > Interesting text of site here.
    >
    > </body>
    > </html>
    >
    >
    > I am wondering how to go about extracting the data within the comments
    > block using Hpricot. I am not aware of a way to refer to commented HTML
    > through CSS or XPath selectors.
    >
    > Thanks for any ideas!
    >
    > - Jerome
    >


    Why not gsub out the unwanted sections before parsing with Hpricot? Or,
    if the data you want sits between the comments, use a regexp to narrow
    the document down to just the text between the BEGIN and END markers,
    and parse only that with Hpricot.
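
    Both variants of this suggestion can be sketched with plain Ruby string
    methods. This is only an illustration under the assumption that the
    markers always appear as a matched BEGIN/END pair and are never nested;
    the names ad_block, ad_content and without_ads are invented for the
    example.

```ruby
html = <<~HTML
  <html>
  <body>

  <!-- BEGIN ad_content -->

  images, text, etc...

  <!-- END ad_content -->

  Interesting text of site here.

  </body>
  </html>
HTML

# Non-greedy match across newlines (/m), capturing what lies between the
# two marker comments. Assumes markers are paired and not nested.
ad_block = /<!--\s*BEGIN ad_content\s*-->(.*?)<!--\s*END ad_content\s*-->/m

# Variant 1: keep only the text between the markers.
ad_content = html[ad_block, 1].to_s.strip

# Variant 2: drop the marked section and keep the rest of the page.
without_ads = html.gsub(ad_block, '')

puts ad_content
puts without_ads
```

    Either string can then be handed to the parser, for example
    Hpricot(ad_content), assuming the hpricot gem is installed.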

    --Ken Bloom

    --
    Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
    Department of Computer Science. Illinois Institute of Technology.
    http://www.iit.edu/~kbloom1/
    Ken Bloom, Nov 21, 2006
    #3
