Matej Cepl
Hi,
can anybody help me clean really messy HTML from a news site
into really clean XHTML, which I would then like to subject to
some qualitative analysis (probably exporting to plain ASCII in
the meantime, but not necessarily)? I can do a little cleaning by
hand, but with some hundreds of web pages, I hoped that I could
create an XSL stylesheet for the conversion.
I have downloaded this page
(http://news.bostonherald.com/localRegional/view.bg?articleid=40476&format=text;
a copy is available at
http://www.ceplovi.cz/matej/tmp/downloaded.html). Then I ran it
through tidy (http://www.ceplovi.cz/matej/tmp/tidyfied.html). I
would love to get some really minimal, HTML 2.0-like XHTML
(something like http://www.ceplovi.cz/matej/tmp/clean.xhtml).
Is there any tool for doing things like that? I hoped to create
an XSL stylesheet myself, but I am quite a newbie in the XSL
arena, and there are some things which I did not manage to do:
1) How do I tell the XSL processor to skip everything between
<body> and the block-level tag which contains the same text as
<title>, but without some constant text or regex (e.g.,
"\s*BostonHerald.com.*:" or at least "BostonHerald.com -
Local/Regional")? Of course, all the closing tags left over
should be omitted as well.
2) How do I remove all tables without removing their content?
Does anybody know of any such solution, or at least an example
of such a thing?
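To make (2) concrete, here is the behaviour I am after, sketched
in Python rather than XSL. It is only a rough illustration, not a
finished tool: it assumes the tidied page parses as well-formed
XML, and the names TABLE_TAGS, local_name and unwrap_tables are
made up for this example. Every table-related element is replaced
by its own content, so the text and any nested markup survive.

```python
import xml.etree.ElementTree as ET

# Elements treated as "table markup" (an assumption; adjust as needed).
TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "td", "th",
              "caption", "colgroup", "col"}


def local_name(tag):
    """Strip a '{namespace}' prefix, so XHTML-namespaced tags match too."""
    return tag.rsplit("}", 1)[-1]


def unwrap_tables(parent):
    """Replace table elements below `parent` by their content, in place."""
    # Collect the content of `parent` as a flat sequence of text strings
    # and elements, in document order.
    items = []
    if parent.text:
        items.append(parent.text)
    for child in list(parent):
        unwrap_tables(child)  # clean the subtree first (bottom-up)
        if local_name(child.tag) in TABLE_TAGS:
            # Hoist the child's text and children into the parent.
            if child.text:
                items.append(child.text)
            for grandchild in list(child):
                tail = grandchild.tail
                grandchild.tail = None
                items.append(grandchild)
                if tail:
                    items.append(tail)
        else:
            # Keep the child itself; its tail is re-attached below.
            tail = child.tail
            child.tail = None
            items.append(child)
        if local_name(child.tag) in TABLE_TAGS:
            tail = child.tail
        if tail:
            items.append(tail)

    # Rebuild the parent from the flat sequence.
    parent.text = None
    for child in list(parent):
        parent.remove(child)
    last = None
    for item in items:
        if isinstance(item, str):
            if last is None:
                parent.text = (parent.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        else:
            parent.append(item)
            last = item


if __name__ == "__main__":
    sample = ("<body><p>Intro</p><table><tr><td><p>Story</p></td>"
              "<td>side <b>bar</b> end</td></tr></table><p>Outro</p></body>")
    root = ET.fromstring(sample)
    unwrap_tables(root)
    print(ET.tostring(root, encoding="unicode"))
```

Note that text from neighbouring cells is simply concatenated, so
cells which were separated only by the table layout may run
together. What I am looking for is the XSL equivalent of this.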
Thanks for any help,
Matej Cepl