Hpricot elem index/position

Discussion in 'Ruby' started by henryturnerlists@googlemail.com, Oct 6, 2008.

  1. Guest

    Hey,

    Trying to find the String index of an Hpricot::Elem within its doc.
    For example..

    doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
    elem = doc.search("a")[1]
    elem.start #=> 10 ( the first '<' of the second a tag.)

    and eventually the following would be good..

    elem.length #=> 12
    elem.end #=> 21

    Any thoughts appreciated!
    Henners
    , Oct 6, 2008
    #1

  2. Mark Thomas Guest

    On Oct 6, 10:19 am, ""
    <> wrote:
    > Hey,
    >
    > Trying to find the String index of an Hpricot::Elem within its doc.
    > For example..
    >
    > doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
    > elem = doc.search("a")[1]
    > elem.start #=> 10 ( the first '<' of the second a tag.)
    >
    > and eventually the following would be good..
    >
    > elem.length #=> 12
    > elem.end #=> 21
    >
    > Any thoughts appreciated!
    > Henners


    My first thought is: Why do you want that information? Character
    position is meaningless in an XML and HTML DOM. Whitespace can change
    character positions without affecting the DOM at all.

    -- Mark.
    Mark Thomas, Oct 6, 2008
    #2

  3. Guest

    Hi Mark,

    I'm writing a broken link reporting type tool. When I find a dodgy tag
    I'd like to be able to relay the character position and or line number
    to the user. Useful for debugging.

    Thanks -h

    , Oct 7, 2008
    #3
  4. Mark Thomas Guest

    On Oct 7, 3:58 am, ""
    <> wrote:
    > Hi Mark,
    >
    > I'm writing a broken link reporting type tool. When I find a dodgy tag
    > I'd like to be able to relay the character position and or line number
    > to the user. Useful for debugging.


    So, are you really interested in broken *links* (as in a GET does not
    return a 200 result code) or broken HTML? I have done the former via
    AJAX (jQuery sends links to a backend rails action, and if it is
    broken changes the class of the link to display a red background). The
    latter may be able to be done with libxml, which reports the character
    position of broken input.

    -- Mark.
    Mark Thomas, Oct 7, 2008
    #4
  5. Guest

    Well, I suppose there are incorrectly formatted links too... I was
    talking about correctly formatted links that point to a 400+ status
    code resource. Something libxml would not pick up since I guess you're
    talking about its syntax checking bit.

    Since the entire document is accessible from the Hpricot::Elem it
    seems plausible to count the characters up to and after the element. A
    15min look at the source didn't reveal anything obvious.. Have a nasty
    feeling that this type of thing would have to be done in the compiled
    C section of it..
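    A rough sketch of the counting idea above, working on the raw source string instead of going through Hpricot. `locate_element` is a hypothetical helper, not an Hpricot method, and it assumes simple, non-nested tags matched with a naive regex, so treat it as an illustration only.

```ruby
# Hypothetical helper (not part of Hpricot): find the n-th (0-based)
# <tag>...</tag> span in the raw source and report where it sits.
# Assumption: tags of the same name are not nested, and the markup
# contains no comments or CDATA that would confuse the regex.
def locate_element(source, tag, n)
  name = Regexp.escape(tag)
  pattern = /<#{name}\b[^>]*>.*?<\/#{name}\s*>/mi

  pos = 0
  match = nil
  (n + 1).times do
    match = pattern.match(source, pos)
    return nil unless match   # fewer than n + 1 elements in the source
    pos = match.end(0)
  end

  start  = match.begin(0)
  length = match[0].length
  {
    start:  start,
    length: length,
    end:    start + length - 1,               # index of the closing '>'
    line:   source[0...start].count("\n") + 1 # 1-based line number
  }
end

source = "<a>bob</a><a>james</a><a>dan</a>"
p locate_element(source, "a", 1)
```

    For the document from the first post this yields start 10, length 12 and end 21. It only works if the string searched is the exact text that was fed to the parser.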

    , Oct 7, 2008
    #5
  6. Mark Thomas Guest

    On Oct 7, 10:28 am, ""
    <> wrote:
    > Well, I suppose there are incorrectly formatted links too... I was
    > talking about correctly formatted links that point to a 400+ status
    > code resource. Something libxml would not pick up since I guess you're
    > talking about its syntax checking bit.

    Well, libxml stores the line number of every element. So you can
    extract all links, check them, and print out element.line_num for each
    one that fails the check.

    Here's some starter code:

    #----------------------------------------------

    require 'rubygems'
    require 'xml'   # libxml-ruby

    # Ask libxml to record the source line of every node
    XML::Parser.default_line_numbers = true

    html = <<END_HTML
    <html>
    <head><title>test</title></head>
    <body>
    Here is a <a href="http://brok.en">broken link.</a>
    </body>
    </html>
    END_HTML

    parser = XML::Parser.string html
    doc = parser.parse

    # Placeholder check: reports every link as broken
    def broken?(link)
      true
    end

    doc.find("//a[@href]").each do |link|
      if broken?(link)
        puts "Broken link to #{link['href']} on line #{link.line_num}"
      end
    end
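
    The `broken?` stub above always returns true. One possible way to fill it in, using only Ruby's standard library, is sketched here. The helper names, the HEAD request, the 5-second timeout and the choice to skip non-HTTP hrefs are all assumptions for illustration, not something the thread specifies.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: a status of 400 or above counts as broken.
def broken_status?(code)
  code.to_i >= 400
end

# Hypothetical replacement for the stub. Takes the href string, so
# call it as broken?(link['href']).
def broken?(href, timeout: 5)
  uri = URI.parse(href)

  # Relative, mailto: and other non-HTTP hrefs cannot be fetched here,
  # so they are left unreported instead of being flagged.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.head(uri.request_uri)
  end
  broken_status?(response.code)
rescue StandardError
  # Unparseable URL, DNS failure, refused connection, timeout, TLS error...
  true
end
```

    Some servers answer HEAD with 405 even though GET would succeed, so a real checker might want to retry with GET before reporting a link.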
    Mark Thomas, Oct 7, 2008
    #6
  7. Mark Thomas Guest

    On Oct 7, 1:36 pm, I wrote:
    > Well, libxml stores the line number of every element. So you can
    > extract all links, check them, and print out element.line_num for each
    > one that fails the check.


    Oops, my example mistakenly used the XML parser, so replace that with
    XML::HTMLParser since you are parsing HTML.

    -- Mark.
    Mark Thomas, Oct 8, 2008
    #7
  8. Guest

    Thanks for the hint towards libxml-ruby! I didn't even know it
    existed. Can't see anything for character position but very happy
    indeed. Will have a go at implementing it myself when poss..

    cheers
    -h
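
    One way the character position could be derived from a line number is sketched below: sum the lengths of the preceding lines, then search the reported line for the tag text. `char_position` is a hypothetical helper, and it assumes the line number is 1-based and that the search string appears on that line exactly as written in the source.

```ruby
# Hypothetical helper: 0-based character offset of the first occurrence
# of `needle` on the 1-based line `line_num`, or nil if the line does
# not exist or does not contain it.
def char_position(source, line_num, needle)
  offset = 0
  source.each_line.with_index(1) do |line, number|
    if number == line_num
      column = line.index(needle)
      return column && offset + column
    end
    offset += line.length   # includes the trailing newline
  end
  nil
end

html = <<~END_HTML
  <html>
  <head><title>test</title></head>
  <body>
  Here is a <a href="http://brok.en">broken link.</a>
  </body>
  </html>
END_HTML

p char_position(html, 4, '<a href="http://brok.en"')
```

    The offsets are in characters, not bytes, and they only line up with the original file if the same string is passed to both the parser and this helper.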

    , Oct 8, 2008
    #8
