hpricot - can't extract what I want from page

Discussion in 'Ruby' started by Adam Akhtar, Mar 28, 2008.

  1. Adam Akhtar

    Adam Akhtar Guest

    Hi, I'm starting to use Hpricot and I'm having problems extracting
    descriptions of various concert events from a page. Here is a sample of
    the HTML:


    <p>
    <a name="concerts"/>
    <span class="heading">Concerts</span>
    <br/>
    <span class="subheading">POPULAR</span>
    <br/>
    <br/>
    <span class="textbold">Middle Field! Vol.4</span >
    <br/>
    Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
    28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
    <br/>
    <br/>
    <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
    <br/>
    Japanese pianist and soul singer performing with Andy Wulf and Kaori
    Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
    Tel: 03-3215-1555.
    <br/>
    ...
    ...
    ...
    etc

    I can get the artist and band names fine using
    names = doc.search("//span[@class='textbold']")

    but I can't get the descriptions. The descriptions aren't
    individually wrapped in any tags; they are just clumped together
    under the paragraph tag, separated by line breaks (<br/>).

    So I thought I'd just try
    descriptions =
    doc.search("/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p")
    but when I try to puts descriptions, nothing is printed to the screen.

    How would I go about getting this info? Any tips or ideas?

    Thanks
    --
    Posted via http://www.ruby-forum.com/.
    Adam Akhtar, Mar 28, 2008
    #1

  2. Adam Akhtar

    Adam Akhtar Guest

    More info:

    The original website can be found at
    http://metropolis.co.jp/tokyo/recent/listings.asp

    I used Firebug to retrieve the XPath of the desired paragraph
    (excerpted above). When I pass it to doc.search, it doesn't retrieve
    anything at all.


    Does anyone know why? I'm banging my head against the wall.
    Adam Akhtar, Mar 28, 2008
    #2

  3. Todd Benson

    Todd Benson Guest

    Wow! It looks nice, but the HTML is really ugly. This would be
    pretty hard to scrape on a regular basis. For artists, there is a
    mix of <strong></strong> tags, <span class="textbold"></span> tags,
    and I noticed one artist with no surrounding tags at all (Ex-press
    Ver.2).

    It can be really hard to work with inconsistent HTML, but I suppose it
    could be done to some degree of accuracy. Any Hpricot masters out
    there? I'm sure you'd have to attack it with regexps as well. Maybe
    turning the page into text and then parsing that is a better idea after all.

    Todd
    Todd Benson, Mar 28, 2008
    #3
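
Todd's suggestion of treating the markup as text and attacking it with regexps could look roughly like the sketch below. It is only an illustration on the sample from the first post: extract_events is a made-up helper name, it assumes every event name sits in a <span class="textbold">, and it would miss the <strong> and untagged artists mentioned above.

```ruby
# Sample copied from the first post; in practice this string would come
# from the downloaded page.
html = <<~HTML
  <span class="textbold">Middle Field! Vol.4</span >
  <br/>
  Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
  28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
  <br/>
  <br/>
  <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
  <br/>
  Japanese pianist and soul singer performing with Andy Wulf and Kaori
  Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
  Tel: 03-3215-1555.
  <br/>
HTML

# extract_events is a hypothetical helper, not part of any library.
def extract_events(html)
  # Capture the span text, then everything up to the next name span
  # (or the end of the string).
  pattern = %r{<span class="textbold">(.*?)</span\s*>(.*?)(?=<span class="textbold">|\z)}m
  html.scan(pattern).map do |name, rest|
    # Replace <br/> tags with spaces and collapse runs of whitespace so
    # each description ends up on one line.
    description = rest.gsub(%r{<br\s*/?>}i, " ").gsub(/\s+/, " ").strip
    [name.strip, description]
  end
end

extract_events(html).each do |name, description|
  puts name
  puts description
  puts
end
```

Because the pattern is tied to one class name, any change in the page's markup would need a matching change here.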
  4. Adam Akhtar

    Adam Akhtar Guest

    Thanks, Todd, for the reply. Yes, even I thought that it was badly designed,
    and I don't have any web design experience at all. In fact, I learned the
    basics of HTML, XML and XPath just for this.

    Although those inconsistencies will prove to be a problem in the future,
    the one I'm having right now is getting any information at all. Surely
    when I pass the XPath for the paragraph element which contains
    all the artists' names and event descriptions, it should return something
    rather than nothing. Is that right? Every time I try to print the result
    of the search to the screen, it just comes out blank. Does anyone know
    why?



    Adam Akhtar, Mar 28, 2008
    #4
  5. Thomas Wieczorek

    On Fri, Mar 28, 2008 at 11:42 AM, Dan Diebolt <> wrote:
    > Firebug inserts tbody steps into XPaths that reach into tables, even if
    > the <tbody> tag is not in the HTML source. Try removing the tbody steps,
    > and debug using shorter XPaths that initially address content further up
    > in the hierarchy.


    Yes, Firefox does that to make the document more (X)HTML-conformant. It
    took me a while to get the hang of it. You might download the page using
    open-uri, open it in your favourite editor, search for the text and
    work your way up through the tags.
    Most sites don't use <tbody>, so just try the path without it.
    Thomas Wieczorek, Mar 28, 2008
    #5
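
A minimal sketch of the advice above, using plain Ruby string handling: strip the tbody steps from the Firebug path before handing it to doc.search. strip_tbody is a hypothetical helper name, not part of Hpricot, and whether the shortened path then matches still depends on the real page source.

```ruby
# strip_tbody is a hypothetical helper, not an Hpricot method.
# It removes "/tbody" (or "/tbody[1]") steps from an XPath string.
def strip_tbody(xpath)
  # Only match a whole step: "/tbody" followed by "/" or the end of the path.
  xpath.gsub(%r{/tbody(?:\[\d+\])?(?=/|\z)}, "")
end

# The path Firebug reported in the first post.
firebug_path = "/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p"

puts strip_tbody(firebug_path)
```

If the cleaned path still returns nothing, shortening it step by step from the right shows which element is the first one that fails to match.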
  6. Adam Akhtar

    Adam Akhtar Guest

    OK, I have tried taking out the tbody steps completely and got some of the
    text back. I'll experiment to see if I can get all of it.

    Re: Tidy

    I installed the gem and got the example code:

    require 'tidy'
    Tidy.path = '/usr/lib/libtidy.so'
    html = '<html><title>title</title>Body</html>'
    xml = Tidy.open(:show_warnings=>true) do |tidy|
      tidy.options.output_xml = true
      puts tidy.options.show_warnings
      xml = tidy.clean(html)
      puts tidy.errors
      puts tidy.diagnostics
      xml
    end
    puts xml

    Now I have to change the path to wherever the lib is. I found
    Tidy's folder in my lib directory and changed the above to this:

    Tidy.path = 'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib'

    and it complains that there is no such file. I tried

    Tidy.path =
    'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib.rb'

    as that's the proper extension of the tidylib file, but again it won't
    work.

    I can't find any tidylib file with an .so extension.

    Banging my head even more now ;-)

    Adam Akhtar, Mar 28, 2008
    #6
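
A note on the Tidy.path problem above, offered as an assumption rather than a confirmed fix: as far as I know, Tidy.path has to point at the compiled HTML Tidy library (libtidy.so on Linux, a tidy DLL on Windows), which is installed separately from the gem, and not at a Ruby file inside the gem folder. The sketch below only checks which of a few candidate locations exists. first_existing_path and the Windows path are made up for illustration and need adjusting for the actual machine.

```ruby
# first_existing_path is a hypothetical helper: it returns the first
# candidate that is an existing file, or nil if none of them exist.
def first_existing_path(candidates)
  candidates.find { |path| File.file?(path) }
end

candidates = [
  '/usr/lib/libtidy.so',   # path used in the gem's example code
  'C:/ruby/bin/tidy.dll'   # assumed location of a separately downloaded DLL
]

library = first_existing_path(candidates)
if library
  puts "Found a Tidy library at #{library}"
else
  puts "No compiled Tidy library found in the candidate locations"
end
```

Only once a real library file is found does it make sense to assign it to Tidy.path.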
  7. Adam Akhtar

    Adam Akhtar Guest

    Adam Akhtar, Mar 28, 2008
    #7
  8. daniel hoey

    daniel hoey Guest

    Once you have the 'name' node, you can use next_node to get the following
    nodes in the document.
    This method should work for your example:

    def print_names_and_descriptions(hpricot_doc)
      names = hpricot_doc.search("//span[@class='textbold']")

      names.each do |name|
        # Walk forward from the name until we reach a text node that
        # contains a word character; stop if we run out of nodes.
        node = name.next_node
        node = node.next_node until node.nil? ||
          (node.text? && node.inner_text =~ /\w+/)

        puts name.inner_text
        puts node.to_s.strip
        puts
      end
    end
    daniel hoey, Mar 31, 2008
    #8
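
The same sibling-walking idea can be tried without Hpricot. The sketch below swaps in REXML, which ships with Ruby, so the calls differ (next_sibling_node instead of next_node) and the sample had to be made well-formed ("&" written as "&amp;"), because REXML rejects markup that Hpricot would tolerate. names_and_descriptions is a hypothetical helper name.

```ruby
require 'rexml/document'

# Sample from the first post, made well-formed for REXML.
xml = <<~XML
  <p>
  <span class="textbold">Middle Field! Vol.4</span>
  <br/>
  Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
  28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
  <br/>
  <br/>
  <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
  <br/>
  Japanese pianist and soul singer performing with Andy Wulf and Kaori
  Kobayashi. Mar 28 &amp; 29, 7 &amp; 9:30pm, ¥3,150. Cotton Club, Marunouchi.
  Tel: 03-3215-1555.
  <br/>
  </p>
XML

# Hypothetical helper: returns [name, description] pairs.
def names_and_descriptions(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//span[@class='textbold']").map do |span|
    # Skip <br/> elements and whitespace-only text until a text node
    # with a word character turns up, or the siblings run out.
    node = span.next_sibling_node
    node = node.next_sibling_node until node.nil? ||
      (node.is_a?(REXML::Text) && node.value =~ /\w/)
    [span.text, node ? node.value.strip : nil]
  end
end

names_and_descriptions(xml).each do |name, description|
  puts name
  puts description
  puts
end
```

A strict XML parser like this only works on clean input, which is why a forgiving parser such as Hpricot is the better fit for the real page.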