Spidering the web to find RDF

Discussion in 'XML' started by Mark Watson, Oct 2, 2003.

  1. Mark Watson

    Mark Watson Guest

    Last year, I ran an experiment: I let a very polite
    web spider run for a few days, trying to find RDF markup
    embedded in web pages. I found close to zero RDF - not
    encouraging!

    In a recent post, I complained about not being able to
    embed RDF in XHTML (at least, there is no standard way to do it
    and still pass the W3C XHTML validator). Another poster
    (Jeen Broekstra) provided a good example of simply
    linking to an RDF file at the same site.

    I was concerned about spiders being able to find
    links to RDF because there is no standard for this.
    Then a few minutes ago I had one of those "Duh!" experiences:

    A spider looking for RDF can look for embedded RDF
    in HTML and also examine every link that is on the
    same site and see if the file extension (if there is one)
    is ".rdf". If such a link is found, assume that
    it describes the page linking to it.
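
    A minimal sketch of the link-scanning half of this heuristic, using only
    the Python standard library. The names LinkCollector and
    same_site_rdf_links are hypothetical, and "same site" is assumed here to
    mean "same host name"; finding RDF embedded in the HTML itself is not
    covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> and <link> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_rdf_links(page_url, html):
    """Return absolute URLs on the page's own host whose path ends in .rdf."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        # Assumption: "same site" means the host names match exactly.
        if parts.netloc.lower() == page_host and parts.path.lower().endswith(".rdf"):
            found.append(absolute)
    return found
```

    Each URL returned would then be treated as a candidate description of
    page_url, which is exactly the assumption stated above and nothing more.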

    Anyway, I will try my experiment again (when I have
    time to set it up) and report the results. I hope that
    lots of people link to separate RDF files on their sites
    and my results will be better than last year when I
    only looked for embedded RDF.

    -Mark
    Mark Watson, Oct 2, 2003
    #1

  2. Nick Kew

    Nick Kew Guest

    In article <>, one of infinite monkeys
    at the keyboard of (Mark Watson) wrote:

    > A spider looking for RDF can look for embedded RDF
    > in HTML and also examine every link that is on the
    > same site and see if the file extension (if there is one)
    > ends in ".rdf".


    Ahem ... the last few characters of a URL have absolutely no significance
    except by convention. A spider that did that would be broken.

    It could, however, look for links with the type="application/rdf+xml"
    attribute. It would find a couple in my pages, for instance.
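
    A rough sketch of that check, using Python's standard html.parser. It
    trusts the type attribute declared in the markup and ignores what the
    URL looks like. The names TypedLinkFinder and find_typed_rdf_links are
    hypothetical, and scanning <a> as well as <link> elements is an
    assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "application/rdf+xml"


class TypedLinkFinder(HTMLParser):
    """Find links whose type attribute declares RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        declared = (attributes.get("type") or "").strip().lower()
        href = attributes.get("href")
        # Only the declared type matters; the URL's ending is never inspected.
        if declared == RDF_TYPE and href:
            self.rdf_urls.append(urljoin(self.base_url, href))


def find_typed_rdf_links(base_url, html):
    finder = TypedLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_urls
```

    Note that a link whose target ends in ".html" is still picked up if the
    attribute is present, and a link ending in ".rdf" without the attribute
    is not.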

    > If such a link is found, assume that
    > it describes the page linking to it.


    Wouldn't it be better to believe the RDF concerning its own subject?

    > only looked for embedded RDF.


    I played with embedding RDF (for automatically-generated reports),
    but abandoned the idea as a nonstarter.

    --
    Nick Kew

    In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
    Nick Kew, Oct 3, 2003
    #2

  3. Nick Kew wrote:

    > In article <>,
    > one of infinite monkeys at the keyboard of
    > (Mark Watson) wrote:
    >
    > > A spider looking for RDF can look for embedded RDF
    > > in HTML and also examine every link that is on the
    > > same site and see if the file extension (if there is one)
    > > ends in ".rdf".

    >
    > Ahem ... the last few characters of a URL have absolutely no
    > significance except by convention. A spider that did that
    > would be broken.
    >
    > It could, however, look for links with the
    > type="application/rdf+xml" attribute. It would find a couple
    > in my pages, for instance.


    That would, however, only work if the web server from which the
    file is hosted is aware of this mime type. I don't know if Apache
    comes preconfigured with it these days but I'll bet that older
    versions won't spot it (for example, my rdf file would not be
    found since the department web server serves it as text/plain).

    You're right that this is the correct way of processing it, but
    for now, being slightly more opportunistic and looking for
    extensions (as well as trying to parse text/xml files) would
    probably give much better results.
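
    One way to sketch this opportunistic approach: treat the extension and
    an XML content type only as reasons to try parsing, and accept the file
    as RDF only if its root element is rdf:RDF. The function name
    looks_like_rdf is hypothetical, including application/xml alongside
    text/xml is an assumption, and RDF/XML documents that omit the rdf:RDF
    wrapper would be missed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Root element of a typical RDF/XML document, in ElementTree's {namespace}tag form.
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
# Assumption: application/xml is worth sniffing as well as text/xml.
XML_TYPES = ("text/xml", "application/xml")


def looks_like_rdf(url, content_type, body):
    """Guess whether a fetched document is RDF/XML."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/rdf+xml":
        return True  # correctly labelled by the server
    has_rdf_extension = urlparse(url).path.lower().endswith(".rdf")
    if not (has_rdf_extension or media_type in XML_TYPES):
        return False  # no hint at all, so do not bother parsing
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # not well-formed XML
    return root.tag == RDF_ROOT
```

    With this, an ".rdf" file served as text/plain is still recognised,
    while an arbitrary text/xml document is only accepted if it really has
    an rdf:RDF root.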

    Jeen
    --
    Jeen Broekstra http://www.cs.vu.nl/~jbroeks/

    New York is real. The rest is done with mirrors.
    Jeen Broekstra, Oct 3, 2003
    #3
  4. Nick Kew

    Nick Kew Guest

    In article <>, one of infinite monkeys
    at the keyboard of Jeen Broekstra <> wrote:

    >> It could, however, look for links with the
    >> type="application/rdf+xml" attribute. It would find a couple
    >> in my pages, for instance.

    >
    > That would, however, only work if the web server from which the
    > file is hosted is aware of this mime type.



    Nope. I said attribute.
    <link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">

    > I don't know if Apache
    > comes preconfigured with it these days but I'll bet that older


    Neither do I; in any case it wouldn't do anything for the above example
    which I deliberately (and perfectly legitimately) ended with .html
    The server should of course serve it with the correct MIME type,
    but that's another issue.

    > You're right that this is the correct way of processing it, but
    > for now, being slightly more opportunistic and looking for
    > extensions (as well as trying to parse text/xml files) would
    > probably give much better results.


    Even if .rdf gets something, it'll miss out on lots of .cgi, .php,
    .xml and other things. It's simply broken.

    Relying on the attribute will also miss many instances.
    It is simply a more correct thing than ".rdf" to look for
    in (X)HTML links.

    --
    Nick Kew

    In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
    Nick Kew, Oct 3, 2003
    #4
  5. Mark Watson

    Mark Watson Guest

    Jeen Broekstra <> wrote in message news:<>...
    > You're right that this is the correct way of processing it, but
    > for now, being slightly more opportunistic and looking for
    > extensions (as well as trying to parse text/xml files) would
    > probably give much better results.


    It sounds like what I need to do is to roll all the ideas for spidering
    RDF together and be as opportunistic as possible in collecting RDF.

    So, I will use both Nick's and Jeen's ideas.

    Thanks,
    Mark
    Mark Watson, Oct 3, 2003
    #5
  6. Nick Kew wrote:
    > In article <>, one of
    > infinite monkeys at the keyboard of Jeen Broekstra
    > <> wrote:
    >
    > >> It could, however, look for links with the
    > >> type="application/rdf+xml" attribute. It would find a
    > >> couple in my pages, for instance.

    > >
    > > That would, however, only work if the web server from which the
    > > file is hosted is aware of this mime type.

    >
    >
    > Nope. I said attribute.
    > <link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">
    >


    Blimey. My bad, I completely misread your post.

    Jeen
    --
    Jeen Broekstra http://www.cs.vu.nl/~jbroeks/

    Write a wise saying and your name will live forever.
    -- Anonymous
    Jeen Broekstra, Oct 3, 2003
    #6
  7. Nick Kew

    Nick Kew Guest

    In article <>, one of infinite monkeys
    at the keyboard of (Mark Watson) wrote:

    > It sounds like what I need to do is to roll all the ideas for spidering
    > RDF together and be as opportunistic as possible in collecting RDF.


    My previous post was just a correction to something you said, which I
    felt called for correction because it so often leads to confusion.

    My *practical* suggestion would be to send HEAD requests from the spider
    to ascertain the type of any URL before actually fetching it. Then fetch
    HTML and XHTML pages to spider for more links, and RDF pages for your
    collection.
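
    A minimal sketch of that HEAD-first flow with Python's urllib. The
    function names, the "collect"/"spider"/"skip" labels, the User-Agent
    string and the timeout are all hypothetical, and this is not the
    spidering software mentioned in this post. It relies on the server
    reporting a correct Content-Type.

```python
from urllib.request import Request, urlopen

HTML_TYPES = {"text/html", "application/xhtml+xml"}
RDF_TYPES = {"application/rdf+xml"}


def classify(content_type):
    """Map a Content-Type header value to what the spider should do."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type in RDF_TYPES:
        return "collect"  # fetch and add to the RDF collection
    if media_type in HTML_TYPES:
        return "spider"   # fetch and scan for more links
    return "skip"         # anything else is not fetched


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, if any."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "rdf-spider-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def plan_fetch(url):
    """Decide what to do with a URL before downloading its body."""
    return classify(head_content_type(url))
```

    A polite spider would also want to respect robots.txt and pause between
    requests; that is left out here to keep the sketch short.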

    I happen to have spidering software that'll do all that - among other
    things:) Though I have the feeling you may not have the budget for it,
    given the experimental nature of your task.

    --
    Nick Kew
    Nick Kew, Oct 3, 2003
    #7
