Convert HTML to XML

earth_792

Hello, All!

Does anyone have any ideas on how to convert HTML into XML using
Java? I know there are Java APIs that go the opposite way (XML to XHTML).
Any ideas or links would be helpful. :)


Thank you
 
Andy Dingley

Does anyone have any ideas on how to convert HTML into XML using
Java?

This depends on what you mean by "HTML". If it's guaranteed to be
well-formed and valid, then it's a simple matter - use an SGML or HTML
parser, then output the DOM as XML.
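
A minimal sketch of the well-formed case, assuming the input is already XHTML-style markup. The JDK does not ship an SGML parser, so this swaps in the standard XML parser (`DocumentBuilder`) and serializes the DOM with an identity `Transformer`; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parse well-formed markup, then write the DOM back out as XML.
class WellFormedToXml {

    static String toXml(String wellFormedMarkup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // The XML parser rejects tag soup with a SAXException.
            Document dom = builder.parse(new InputSource(new StringReader(wellFormedMarkup)));

            // A Transformer created without a stylesheet performs an identity copy.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (SAXException e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        } catch (Exception e) {
            // Parser/transformer configuration or I/O problems.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("<html><body><p>Hello<br/>world</p></body></html>"));
    }
}
```

Anything that is not well-formed makes `toXml` throw, which is exactly the point at which the "tag soup" problem starts.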

If it's "typical" HTML "tag soup", then this is fundamentally a much
more difficult task. You can't convert with a simple automatic
process, at times you have to infer "what the author meant" rather
than "what they wrote". I suggest reading up on HTML Tidy, which isn't
(AFAIK) ported to Java, but does discuss the problems and their
solutions.

If you're trying to embed HTML in RSS (which is usually an XML
protocol) or similar, then you don't even need to "convert HTML to
XML"; you just need to escape the markup characters (such as "<", ">"
and "&") as entities, or wrap the HTML in a CDATA section. That's _much_
easier: you don't even need an HTML parser, just a simple
character-by-character scan and replace.
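
A rough sketch of that scan-and-replace, plus the CDATA alternative. The class name, method names, and the `<description>` wrapper element are illustrative assumptions, not part of any feed library.

```java
// Hypothetical helper for embedding an HTML fragment inside an XML element.
class XmlEscaper {

    /** Escapes the characters that are significant in XML text content. */
    static String escape(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps the fragment in CDATA, splitting any "]]>" so it cannot end the section early. */
    static String wrapInCdata(String html) {
        return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String html = "<p>Fish &amp; chips</p>";
        System.out.println("<description>" + escape(html) + "</description>");
        System.out.println("<description>" + wrapInCdata(html) + "</description>");
    }
}
```

Either form gives the consumer the original HTML text back once the XML is parsed; neither makes the HTML itself well-formed.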

On the whole though, I can't imagine many cases when it really is
necessary to "convert HTML to XML". Just about the only one is loading
legacy web sites into a new XML-based CMS.


If you give us more context, then you might get more relevant advice.
 
earth_792

**********************
I just want to say "Thank you very much" to all of you for replying to my
post. Now I understand what I should do. My initial problem is to
convert legacy HTML (not well-formed or valid) into new HTML (valid,
with a new presentation). I don't want to cut and paste content from the
legacy pages into the new ones; there are thousands of pages. So I thought
I could convert the HTML into XML and then use XSLT to convert it back into
new HTML. :))
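
The XSLT half of that plan could look roughly like this with the JDK's built-in `javax.xml.transform` API. The stylesheet, the `content` class, and the class name are invented for illustration, and the sketch assumes the input is already well-formed and carries no XML namespace (XHTML with the usual `xmlns` would need prefixed match patterns).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a new page layout to an already well-formed legacy page.
class LegacyPageRestyler {

    // Illustrative stylesheet: keeps the old title and wraps the old body in a new <div>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/html'>"
        + "<html><head><title><xsl:value-of select='head/title'/></title></head>"
        + "<body><div class='content'><xsl:copy-of select='body/node()'/></div></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String restyle(String wellFormedPage) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(wellFormedPage)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Old page</title></head>"
                    + "<body><p>Old text</p></body></html>";
        System.out.println(restyle(page));
    }
}
```

For thousands of pages the same `Transformer` setup would be run per file; the hard part remains getting each legacy page into well-formed shape first.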
 
Hunter Gratzner

Does anyone have any ideas on how to convert HTML into XML using
Java? I know there are Java APIs that go the opposite way (XML to XHTML).
Any ideas or links would be helpful. :)

XHTML is XML. So any conversion HTML -> XHTML fulfills your
requirement. Try JTidy.
 
Daniel Pitts

earth_792 said:
**********************
I just want to say "Thank you very much" to all of you for replying to my
post. Now I understand what I should do. My initial problem is to
convert legacy HTML (not well-formed or valid) into new HTML (valid,
with a new presentation). I don't want to cut and paste content from the
legacy pages into the new ones; there are thousands of pages. So I thought
I could convert the HTML into XML and then use XSLT to convert it back into
new HTML. :))

Look into Tidy, it is a program (there is a Java interface to it too if
you don't want to use the command line). It will reformat HTML into
well-formed HTML. Modern HTML (aka XHTML) *is* XML. So you don't need to
convert it to XML and then back to XHTML.

Hope this helps,
Daniel.
 
Roedy Green

Does anyone have any ideas on how to convert HTML into XML using
Java? I know there are Java APIs that go the opposite way (XML to XHTML).
Any ideas or links would be helpful. :)

There is a program called HTML Tidy that converts HTML to XHTML.
 
Sherman Pendley

Daniel Pitts said:
Look into Tidy, it is a program (there is a Java interface to it too
if you don't want to use the command line). It will reformat HTML
into well-formed HTML. Modern HTML (aka XHTML) *is* XML. So you don't
need to convert it to XML and then back to XHTML.

Agreed about Tidy.

The final output format should be HTML though, not XHTML. XHTML will not
render at all in IE6/7 when served correctly as application/xhtml+xml. IE
will render it when served as text/html, but uses its HTML engine to do
so. That being the case, it's better to give it valid HTML to work with,
than to give it XHTML that relies on the HTML engine's error handling to
parse correctly.

sherm--
 
Daniel Pitts

Sherman said:
Agreed about Tidy.

The final output format should be HTML though, not XHTML. XHTML will not
render at all in IE6/7 when served correctly as application/xhtml+xml. IE
will render it when served as text/html, but uses its HTML engine to do
so. That being the case, it's better to give it valid HTML to work with,
than to give it XHTML that relies on the HTML engine's error handling to
parse correctly.

sherm--

Um, what are you talking about? XHTML *is* valid HTML. If you have to
lie about the content type, that's one thing, but XHTML should be used
going forward. Non-XML HTML has been deprecated, and the sooner browser
writers and content providers realize this, the better the world will be.
 
Sherman Pendley

Daniel Pitts said:
Um, what are you talking about? XHTML *is* valid HTML.

Not at all. XHTML is an XML application. HTML is an SGML application. The
two are not the same. For instance, this is valid XHTML, but not valid HTML:

<img src="foo.jpg" />

Daniel Pitts said:
If you have to
lie about the content type, that's one thing, but XHTML should be used
going forward.

The fact that you have to lie about the content type is what makes XHTML
unusable for the WWW. You're delivering it as HTML, and it will be parsed
as such. Name spaces will not be parsed, and short-tag forms such as the
img example above will be handled as slightly-broken HTML, not as short
form XML tags.
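
For contrast, here is a small sketch of how a real XML parser treats the two spellings of the img example. It uses the JDK's `DocumentBuilder`; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a fragment is well-formed XML.
class EmptyElementCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XML empty-element syntax: accepted.
        System.out.println(isWellFormedXml("<p><img src=\"foo.jpg\" /></p>"));
        // HTML-style img with no end tag: rejected by an XML parser.
        System.out.println(isWellFormedXml("<p><img src=\"foo.jpg\"></p>"));
    }
}
```

A browser given the same markup as text/html never runs this kind of check; it goes through its HTML parser, which is the behaviour described above.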

In other words, IE6 & IE7 don't see XHTML - they see HTML with a few funny
extra slashes here and there. That being the case, why not simply deliver
the HTML correctly, without the XHTML baggage to begin with?

Daniel Pitts said:
non-XML HTML has been deprecated

Nonsense. The W3C's HTML Working Group was resurrected, and the effort to
standardize HTML 5 was started earlier this year:

<http://www.w3.org/html/>

As explained in the "why" link, XHTML was a nice idea, but it didn't pan
out in practice because of dismal browser support.

Daniel Pitts said:
and the sooner
browser writers and content providers realize this, the better the
world will be.

Sometimes the latest hot ticket just doesn't work out. No sense getting
religious about it - just move on.

sherm--
 
Daniel Pitts

Sherman said:
Not at all. XHTML is an XML application. HTML is an SGML application. The
two are not the same. For instance, this is valid XHTML, but not valid HTML:

<img src="foo.jpg" />

Are you sure that's not valid HTML? XML is a subset of SGML, and I
would think that <shortForm /> was valid SGML as well. <br /> is valid
HTML.

Sherman said:
The fact that you have to lie about the content type is what makes XHTML
unusable for the WWW. You're delivering it as HTML, and it will be parsed
as such. Name spaces will not be parsed, and short-tag forms such as the
img example above will be handled as slightly-broken HTML, not as short
form XML tags.

It's called deprecation. Tell your users that they need the latest
browsers to see your site. I know that isn't always possible, but you
can say at a certain point that you're no longer supporting Mosaic and
Netscape 3 :)

Sherman said:
In other words, IE6 & IE7 don't see XHTML - they see HTML with a few funny
extra slashes here and there. That being the case, why not simply deliver
the HTML correctly, without the XHTML baggage to begin with?

Because you gain so much with using XHTML, including the fact that many
popular JavaScript libraries require XHTML-strict to work properly.
Next you're going to tell me that you shouldn't use CSS.
 
Lew

Daniel said:
Are you sure that's not valid HTML? XML is a subset of SGML, and I
would think that <shortForm /> was valid SGML as well. <br /> is valid
HTML.

I use the short form for all my img, br, input and similar tags, and it works
just fine on every browser I've tried.
 
Sherman Pendley

Daniel Pitts said:
Are you sure that's not valid HTML?

Certain. Look it up: <http://w3c.org>

Daniel Pitts said:
It's called deprecation. Tell your users that they need the latest
browsers to see your site.

No. Why would I do such a stupid thing as that?

Daniel Pitts said:
Because you gain so much with using XHTML, including the fact that
many popular JavaScript libraries require XHTML-strict to work
properly.

Is that meant to be a joke?

Daniel Pitts said:
Next you're going to tell me that you shouldn't use CSS.

Um - why would I tell you that?

sherm--
 
Sherman Pendley

Lew said:
I use the short form for all my img, br, input and similar tags, and
it works just fine on every browser I've tried.

The short form is not valid HTML; it is not allowed according to the HTML
specifications found at <http://w3c.org>.

If you'll look at the definition of the term "valid" there, I don't think
you'll find the phrase "works just fine on every browser Lew tried." :)

sherm--
 
Daniel Pitts

Sherman said:
Certain. Look it up: <http://w3c.org>

That's a rather large site to look up the information that says <tag />
is invalid. How about pointing me to at least the right section, eh?

As an aside, I did find this interesting.
<http://www.w3.org/QA/2007/10/shorttags.html>
Apparently there are some shortcuts available to HTML users that aren't
for XML users. For example '<p<a href="/">Some Link</> some text' is
supposedly equivalent to

Sherman said:
No. Why would I do such a stupid thing as that?

Same reason people don't write Java 1.2 code anymore. If your content
is valuable enough, people will upgrade for it.
 
Lew

Sherman said:
If you'll look at the definition of the term "valid" there, I don't think
you'll find the phrase "works just fine on every browser Lew tried." :)

Drat. Now I may have to go and actually learn something.
 
Lasse Reichstein Nielsen

Daniel Pitts said:
Sherman Pendley wrote:
Are you sure that's not valid HTML? XML is a subset of SGML, and I
would think that <shortForm /> was valid SGML as well. <br /> is
valid HTML.

It is (part of) valid HTML, because HTML, as an SGML application, has
the SHORTTAG feature enabled
(<URL:http://www.w3.org/TR/html401/sgml/sgmldecl.html>, notice
"SHORTTAG YES").

However, what it means is probably not what you think it means.
Using shorttag notation, these two paragraphs are equivalent:

<p>this is a test</p>
<p/this is a test/

Writing <p/> means that the closing ">" is not part of the tag,
but part of the text content! Luckily no widely used browser
understands shorttags on elements.
<URL:http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.7>

I'm actually not sure you can use shorttag with "br" elements, as
they are empty and can have no end tag. I guess the validator could
tell, but I'll just avoid it.

Daniel Pitts said:
It's called deprecation. Tell your users that they need the latest
browsers to see your site. I know that isn't always possible, but you
can say at a certain point that you're no longer supporting Mosaic and
Netscape 3 :)

Yes. This problem should be over when we no longer have to support IE 6.
Or is it a problem for IE 7 too?
(there is a proper solution for IE though,

Daniel Pitts said:
Because you gain so much with using XHTML, including the fact that
many popular JavaScript libraries require XHTML-strict to work
properly.

Javascript libraries work with the DOM structure. They don't care
about the syntax of the page markup. That's entirely up to the browser
to parse.
If you were right, then, since IE 6 treats XHTML as malformed HTML
anyway, those libraries shouldn't work in IE 6.

If they use Ajax to request further content, then it's fine to send
XML, but it doesn't have to be XHTML at all, and probably shouldn't.

A lot of JavaScript won't work with properly sent XHTML, because using
an XML parser precludes using the document.write feature.

Daniel Pitts said:
Next you're going to tell me that you shouldn't use CSS.

Well, IE 6 still doesn't support most of CSS2, which has been a
standard since 1998. If you need to support IE6, and you do, then
you'll have to make it work with only the supported subset.

/L
 
