Cleaning the mess of newssite HTML

Discussion in 'HTML' started by Matej Cepl, Oct 6, 2004.

  1. Matej Cepl (Guest)

    Hi,

    can anybody help me clean up really messy HTML from a news site
    into really clean XHTML, which I would then like to subject to
    some qualitative analysis (probably exporting to plain ASCII in
    the meantime, but not necessarily)? I can do a little cleaning
    by hand, but with some hundreds of web pages, I hoped that I
    could create an XSL stylesheet for the conversion.

    I have downloaded this page
    (http://news.bostonherald.com/localRegional/view.bg?articleid=40476&format=text;
    a copy is available at
    http://www.ceplovi.cz/matej/tmp/downloaded.html). Then I ran it
    through tidy (http://www.ceplovi.cz/matej/tmp/tidyfied.html). I
    would love to get some really minimal, HTML 2.0-like XHTML
    (something like http://www.ceplovi.cz/matej/tmp/clean.xhtml).

    Is there any tool for doing things like that? I hoped to write
    an XSL stylesheet myself, but I am quite a newbie in the XSL
    arena, and there are some things I did not manage to do:

    1) How do I tell the XSL processor to "skip everything between
    <body> and the block-level tag that contains the same text as
    <title>, minus some constant text or regex (e.g.,
    "\s*BostonHerald.com.*:" or at least "BostonHerald.com -
    Local/Regional News:")"? Of course, all the leftover closing
    tags should be omitted as well.
    2) How do I remove all tables without removing their content?

    Does anybody know of any such solution, or at least an example
    of such a thing?
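
For illustration, the two numbered questions can be sketched in Python (standard library only) rather than XSL. This is a rough sketch, not an existing tool: it assumes the page has already been run through tidy so that it parses as well-formed XML, and the names unwrap, drop_before_headline, TABLE_TAGS, BLOCK_TAGS and SITE_PREFIX are made up for the example. SITE_PREFIX is a non-greedy, anchored variant of the regex quoted in question 1.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical names and tag sets, chosen for this sketch only.
TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "td", "th"}
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div"}
SITE_PREFIX = r"^\s*BostonHerald\.com.*?:\s*"


def local(tag):
    """Tag name without an XML namespace such as the XHTML one."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def norm(text):
    """Collapse runs of whitespace so texts can be compared."""
    return " ".join(text.split())


def find_first(root, name):
    return next((e for e in root.iter() if local(e.tag) == name), None)


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(parent, tags=TABLE_TAGS):
    """Question 2: drop elements named in `tags`, keeping their content."""
    i = 0
    while i < len(parent):
        child = parent[i]
        unwrap(child, tags)  # clean the subtree first
        if local(child.tag) in tags:
            kids = list(child)
            parent.remove(child)
            _append_text(parent, i, child.text)
            for offset, kid in enumerate(kids):
                parent.insert(i + offset, kid)
            _append_text(parent, i + len(kids), child.tail)
            i += len(kids)  # spliced-in children are already clean
        else:
            i += 1


def drop_before_headline(root, prefix=SITE_PREFIX):
    """Question 1: drop everything in <body> that precedes the block
    element whose text equals <title> minus the site prefix."""
    title, body = find_first(root, "title"), find_first(root, "body")
    if title is None or body is None:
        return False
    headline = norm(re.sub(prefix, "", "".join(title.itertext()), count=1))
    if not headline:
        return False
    parents = {kid: par for par in body.iter() for kid in par}
    for elem in body.iter():
        if (elem is not body and local(elem.tag) in BLOCK_TAGS
                and norm("".join(elem.itertext())) == headline):
            node = elem
            while node is not body:  # walk up, trimming what comes before
                par = parents[node]
                for sib in list(par)[:list(par).index(node)]:
                    par.remove(sib)
                par.text = None
                node = par
            return True
    return False
```

Usage would be along the lines of root = ET.parse("tidyfied.html").getroot(), then unwrap(root), then drop_before_headline(root), then ET.tostring(root, encoding="unicode"). Unwrapping the tables first makes it more likely that the headline element ends up directly under <body>; any non-table ancestors of the headline are kept as wrappers rather than removed.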

    Thanks for any help,

    Matej Cepl

    --
    Matej Cepl,
    GPG Finger: 89EF 4BC6 288A BF43 1BAB 25C3 E09F EF25 D964 84AC
    138 Highland Ave. #10, Somerville, Ma 02143, (617) 623-1488
