How do I check the well-formedness of HTML files?

drgonzo120

Hello,

As my first assignment in my first job, I have to check the well-formedness of about 1000 HTML files.

I assume there must already be some Java classes, packages or libraries on the net that do this. I can't be the first one who has had to do this.

So, does anybody know of any libraries available online that do this?

Thanks!
 
drgonzo120

It will be a console program, so I guess I need classes that accept an HTML file and check it.
 
Oliver Wong

drgonzo120 said:
It will be a console program, so I guess I need classes that accept an HTML file and check it.

See hiwa's reply, and also consider JTidy.

- Oliver
 
Martin Gregorie

Oliver said:
See hiwa's reply, and also consider JTidy.

Take a look at the HTML Tidy project, http://tidy.sourceforge.net

The original HTML Tidy is a C command-line utility, but there are Java and Perl versions (JTidy is one of them), all referenced from the project site. It's worth a visit: there are other useful things too, such as HTML editors that integrate HTML Tidy.
 
drgonzo120

Hello, what I need to do is quite simple.

For example, this is a sample from the HTML files:

<table border=1 width="100%" >
<tr>
<td width=20%><noindex>Betreft :</noindex></td>
<td colspan=3>
<betreft><P><A NAME="b_betreft"></A>Kinderrechten: implementatie van
het VN-verdrag<BR>Jaarlijkse verslaggeving van de Vlaamse regering aan
het Vlaams Parlement en aan de kinderrechtencommissaris omtrent de
implementatie van het VN-verdrag van 20 november 1989 inzake de rechten
van het kind<BR>Tweede verslag d.d. 29 september 2000 <A
NAME="e_betreft"></A></betreft>
</td></tr>

For each HTML file I need to extract the contents of these special tags, <betreft> and others, and create XML files out of them. Is it possible to read an HTML file as an XML file and run XPath queries on it?

Or should I just extract the tags as if it were a plain text file?

"JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML."

But nowhere can I find a good reference for JTidy.

I still don't know how I'm going to do it. Maybe I'll write it all myself.

Greetings
 
Andy Dingley

drgonzo120 said:
As my first assignment in my first job, I have to check the well-formedness of about 1000 HTML files.

Why use Java? The usual tool for this is HTML Tidy, which you can
drive perfectly adequately from the command line with a couple of lines
of shell script.
 
Martin Gregorie

drgonzo120 said:
I still don't know how I'm going to do it. Maybe I'll write it all myself.

Have you looked at the HTML, HTMLEditorKit and HTMLDocument classes (in javax.swing.text.html)?

The HTMLEditorKit contains a parser that I used as the basis for a URL checker. It extracts <A> tags from HTML pages, sets up a URL instance from each href attribute and checks whether it is accessible. Access failures are reported for manual examination and fixing.
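
A minimal sketch of the tag-extraction step described above, using the parser callback from javax.swing.text.html. The class name LinkExtractor and the method extractHrefs are made up for illustration, and this is not the original URL checker: it only collects href values and leaves the accessibility check out.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href attribute of every <A> tag.
public class LinkExtractor {

    static List<String> extractHrefs(Reader html) throws IOException {
        final List<String> hrefs = new ArrayList<>();

        // The parser calls back once per start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    // Named anchors such as <A NAME="..."> have no href; skip them.
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return hrefs;
    }
}
```

Each collected string could then be passed to a URL instance and tested for accessibility. Note that this parser is lenient and built around standard HTML tags, so how it treats custom tags such as <betreft> would need to be checked before relying on it for that.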
 
Oliver Wong

drgonzo120 said:
For each HTML file I need to extract the contents of these special tags, <betreft> and others, and create XML files out of them. Is it possible to read an HTML file as an XML file and run XPath queries on it?

This is possible if and only if the HTML file actually is an XML file (the HTML and XML file formats overlap, but are not identical). Otherwise, you'll first need something like "XMLTidy" (a fictional product I just made up) to fix the broken XML, for instance by making sure every opening tag is balanced by a closing tag. I noticed that in your example the <table>, <P> and <BR> tags are never closed.
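
To make the distinction concrete, here is a rough sketch using only the XML APIs that ship with the JDK (JAXP). The class and method names (XmlCheck, firstError, xpathText) are invented for this example. It treats the input strictly as XML, so ordinary HTML with unquoted attributes or unclosed tags will be reported as broken.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
public class XmlCheck {

    /** Returns null when the markup is well-formed XML, otherwise a description of the first error. */
    static String firstError(String markup) throws Exception {
        try {
            // DefaultHandler rethrows fatal errors, so a parse that completes means well-formed input.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    /** Evaluates an XPath expression against well-formed markup and returns its string value. */
    static String xpathText(String markup, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Unquoted attribute value and unclosed <br>: not well-formed XML.
        String broken = "<td width=20%>Betreft :<br></td>";
        // The same kind of fragment after repair.
        String fixed = "<td width=\"20%\"><betreft>Kinderrechten<br/>Tweede verslag</betreft></td>";

        System.out.println(firstError(broken));
        System.out.println(firstError(fixed));
        System.out.println(xpathText(fixed, "//betreft"));
    }
}
```

For about 1000 files, the same check could be run in a loop over the directory, reading each file instead of a string. Files that fail would first need repair by a tool such as JTidy before the XPath step can work.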

- Oliver
 
