How to use ReXML "in the wild"?

Discussion in 'Ruby' started by Kenneth McDonald, Dec 16, 2008.

  1. I'd very much like to use REXML's XPath features to extract info from
    Google's financial info pages, but I find that REXML chokes on the
    JavaScript. Here's the result of trying to read in a page with this
    bit of code:

    require "rexml/document"
    require 'net/http'

    Net::HTTP.start('finance.google.com') do |http|
      response = http.get('/finance?fstype=ii&q=NYSE:WAT')
      rdoc = REXML::Document.new(response.body)
    end

    ==========
    Output:

    /usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
    #<RuntimeError: Illegal character '&' in raw string
    " (REXML::parseException)
    (function(){
    var d=navigator.userAgent.toLowerCase().indexOf("msie")!=-1;function
    e(){var b=document.styleSheets;for(var a=b.length-1;a>=0;--a){var
    c=b[a].href;if(c)if(c.indexOf("styles/finance_")!=-1||
    c.indexOf("styles_")!=-1)return b[a]}return null}function f(){var
    b=e();if(b){var a=b.rules;return
    a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}
    function g(){if(document.scripts)for(var b=0;b">
    /usr/local/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
     
    Kenneth McDonald, Dec 16, 2008
    #1

  2. Kenneth McDonald

    Guest

    Hi Kenneth,
    > I'd very much like to use REXML's XPath features to extract info from
    > Google's financial info pages, but I find that REXML chokes on the
    > JavaScript. Here's the result of trying to read in a page with this
    > bit of code:


    Don't try that ;) REXML in the wild == epic FAIL. At this level, you might
    want to try Hpricot or Nokogiri. At a bit higher level, scRUBYt!
    You can read about web scraping in Ruby here (my most successful article
    ever, it was even mentioned in Learning Ruby from O'Reilly):

    http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

    > Is there a good way to get around this problem? If not, I guess it's
    > back to regular expressions...


    Web scraping with regular expressions is almost never a good idea.

    Try scRUBYt!:

    require 'rubygems'
    require 'scrubyt'

    data = Scrubyt::Extractor.define do
      fetch 'http://finance.google.com/finance?fstype=ii&q=NYSE:WAT'

      body '/html/body' do
        revenue '/div[4]/div[2]/table/tr[2]' do
          ending_9_27 '/td[2]'
          ending_6_28 '/td[3]'
        end

        gross_profit '/div[4]/div[2]/table/tr[2]' do
          ending_9_27 '/td[2]'
        end
      end
    end

    puts data.to_xml

    output:

    <root>
      <body>
        <revenue>
          <ending_9_27>386.31</ending_9_27>
          <ending_6_28>398.77</ending_6_28>
        </revenue>
        <gross_profit>
          <ending_9_27>386.31</ending_9_27>
        </gross_profit>
      </body>
    </root>


    HTH,
    Peter
    ___
    http://scrubyt.org
    http://www.rubyrailways.com
     
    , Dec 16, 2008
    #2

  3. Kenneth McDonald

    Phlip Guest

    Kenneth McDonald wrote:

    > I'd very much like to use REXML's XPath features to extract info from
    > Google's financial info pages, but I find that REXML chokes on the
    > JavaScript. Here's the result of trying to read in a page with this
    > bit of code:


    I have studied REXML for many years, and I still can't figure out how to get it
    to recognize an &mdash; or similar advanced entity.

    Like the other responder said, give up while you still can. libxml-ruby is also
    stable enough to give a shot - oh yeah, except it crashes on non-tiny inputs.

    Aaaand...

    > /usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
    > #<RuntimeError: Illegal character '&' in raw string


    That's because REXML and your web browser disagree on the definition of
    well-formed. Your browser accepts a naked & inside a JavaScript tag, but REXML
    does not. REXML is technically correct, and your browser would have accepted
    &amp;&amp; here, but...

    > a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}


    ...browsers cannot correctly interpolate & appearing inside JavaScript literal
    strings, because some lowlife coder using Notepad might have actually wanted
    "&amp;" when they wrote "&amp;" - such as with document.write().

    So, because REXML cannot accept normal HTML, due to hits and misses of standards
    compliance on all sides - you are better off with a dedicated parser!
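
A small sketch of that disagreement, using only REXML. The three markup strings and the helper name `well_formed?` are invented for illustration. The first string has a naked `&&` inside the script and should be rejected, while the escaped and CDATA-wrapped versions should parse.

```ruby
require 'rexml/document'

# Hypothetical helper: true if REXML can parse the markup, false if it
# raises a parse error.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

# Illustrative inputs, not taken from the real page.
NAKED   = '<html><script>var ok = a && b</script></html>'
ESCAPED = '<html><script>var ok = a &amp;&amp; b</script></html>'
CDATA   = '<html><script><![CDATA[var ok = a && b]]></script></html>'

puts well_formed?(NAKED)   # naked & is not well-formed XML
puts well_formed?(ESCAPED) # &amp;&amp; is fine
puts well_formed?(CDATA)   # a CDATA section is fine too

# Once the input is well-formed, XPath works and the entity is unescaped.
doc = REXML::Document.new(ESCAPED)
puts REXML::XPath.first(doc, '//script').text
```

Since you cannot control how a third-party page escapes its scripts, this only shows why the parse fails. It is not a way to make REXML accept the page.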

    --
    Phlip
     
    Phlip, Dec 16, 2008
    #3
