How can I correct an invalid XML document?

Markus

At this time I'm using JDOM (but if you suggest another lib I can
change it). :)

How would you implement an XMLCorrector that can perform the following
tasks?
* find out whether the XML is well-formed (JDOM can do this)
--> If the XML is not well-formed, find the "corrupt" element (I
don't know of such a feature in JDOM)
* validate the XML against a schema (I don't know how to do this in JDOM)
--> If the XML is not valid, find the "corrupt" element

I think these are the main features I want. :)

Example:
?<?version="1.0" encoding="UTF-8"?>
<root>
<sub>öäü</sub>
<sub>
<sub2>
</sub>
</root>

This is not valid XML, of course, because there is a leading ?, öäü
are not valid in UTF-8, and there is no closing tag for sub2.
I want to write a program which finds these errors, corrects some of them
automatically (like the leading ? or the encoding), and lets me choose what
to do with sub2.

It should also validate against a schema and give me all invalid elements,
of course.

Any suggestions for this problem? :)

Kind regards

Markus
 
Oliver Wong

I don't know what kind of crazy voodoo magic you've done, but for
whatever reasons, I can't seem to quote your original post, so I'll
paraphrase it:

Markus wrote:
<paraphrased>
How can I detect if XML is not well-formed, and find out the "corrupt"
element and repair it?
</paraphrased>

You can't solve this problem in general. Take an mp3 of your favorite song
and change its extension so that instead of "my_song.mp3", it's called
"my_song.xml". Now open it up with your XML editor and try to detect the
"corrupt" element, and to repair it.

If you limit your problem domain, you can start doing tricks, like using a
stack to check whether for every "open" tag, there's a corresponding
"closing" tag. But you can't solve this problem in general.
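
To make the stack trick concrete, here is a minimal sketch in Java. It is only an illustration under loud assumptions: the class name TagBalanceChecker and the method firstProblem are made up for this example, and the regular expression is deliberately naive (it ignores comments, CDATA sections, processing instructions and a ">" inside attribute values), so it is not a substitute for a real parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JDOM or any other library.
final class TagBalanceChecker {
    // Naive tag pattern: group 1 = "/" for a closing tag, group 2 = tag name,
    // group 3 = "/" for a self-closing tag such as <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.-]*)[^<>]*?(/?)>");

    /** Returns null if every open tag is closed in order, else a description of the first problem. */
    static String firstProblem(String text) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            String name = m.group(2);
            if (!closing && selfClosing) {
                continue;                       // <x/> opens and closes at once
            }
            if (!closing) {
                open.push(name);                // remember the open tag
                continue;
            }
            if (open.isEmpty()) {
                return "unexpected </" + name + ">";
            }
            String expected = open.pop();       // the innermost open tag must match
            if (!expected.equals(name)) {
                return "expected </" + expected + "> but found </" + name + ">";
            }
        }
        return open.isEmpty() ? null : "unclosed <" + open.peek() + ">";
    }
}
```

For the body of the example in the original post, firstProblem reports that </sub2> was expected where </sub> was found. Note that it can only say where the nesting breaks; whether sub2 should be closed, removed or renamed is still a decision for a human.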

- Oliver
 
Markus

There is no voodoo magic, I think. :)

Of course I want to repair defective XML documents.

But if I implement such a stack, I think it is like writing my own
parser, because I have to load the whole file, read all elements (and
attributes), comments, processing instructions, ... and build a new
tree with all the information.
--> I think this is what a parser does. Or not?

Markus
 
Oliver Wong

Markus said:
There is no voodoo magic, I think. :)

Of course I want to repair defective XML documents.

But if I implement such a stack, I think it is like writing my own
parser, because I have to load the whole file, read all elements (and
attributes), comments, processing instructions, ... and build a new
tree with all the information.
--> I think this is what a parser does. Or not?

Yes, you are correct. I'm not too familiar with JDOM, but I have used
it, and if my recollection is correct, it does NOT parse defective XML
documents.

You will essentially have to write your own parser, or find one already
written that does what you want. I recommend you use an existing XML
tokenizer (something like SAX) to receive a stream of tokens rather than a
stream of characters, and at each step, try to determine whether any errors
are present, and if so, whether it is possible to fix them.
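
As a rough sketch of what SAX gives you here (assuming the JAXP parser that ships with the JDK; the class name WellFormedCheck and the method firstError are invented for this example): a standard SAX parser treats a well-formedness violation as a fatal error and stops, so what you get back is the line and column of the first problem only, not a token stream that continues past it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper built on the standard JAXP/SAX API.
final class WellFormedCheck {
    /** Returns null if the text is well-formed, else "line:column: message" for the first fatal error. */
    static String firstError(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores all content events and rethrows fatal errors.
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // The exception carries the position at which the parser gave up.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O trouble, unrelated to the document itself.
            throw new IllegalStateException(e);
        }
    }
}
```

A correction tool could show that position to the user, let them edit the text, and run the check again until it returns null.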

- Oliver
 
Andrew Thompson

Markus said:
At this time I'm using JDOM (but if you suggest another lib I can
change it). :)

As I understand it, JDOM relies on valid XML, and will be of little
or no use here.

Markus said:
How would you implement an XMLCorrector ...

See Oliver's reply. Once you have an XMLCorrector, you
may as well call it ArtificialIntelligence.

Markus said:
...that can perform the following
tasks. ....
* validate XML against schema (I don't know how to do this in JDOM)
--> If XML is not valid, then find out the "corrupt" element

On the basis that many IDEs allow validation of
an XML document against a DTD, and highlight faulty lines, I
think this should be possible.

For a programmatic check, you might try looking into
the Ant task that is mentioned here:
<https://screensavers.dev.java.net/servlets/ProjectForumMessageView?messageID=9252&forumID=698>

[ The core class that does validation is on the 'tip of my tongue',
but you can probably find the details from the Ant source ..or just
use the existing Ant task for the validation. ]
 
Abhijat Vatsyayan

Pretty vague (as others have already suggested)! It might help if you
create a list of the "types of error" that you want a program to
correct. Once you have the list, we can go through it and
see which ones can be corrected.

If you have something like HTML (which is not XML) in mind, you
can try looking up HTML (SGML?) parsers and then translating the result
to valid XML.
 
John C. Bollinger

Markus said:
There is no voodoo magic, I think. :)

Of course I want to repair defective XML documents.

But if I implement such a stack, I think it is like writing my own
parser, because I have to load the whole file, read all elements (and
attributes), comments, processing instructions, ... and build a new
tree with all the information.
--> I think this is what a parser does. Or not?

Yes, it is. And detecting malformed XML is 100% an XML parsing task.
As Oliver suggested in his response, you might find that SAX simplifies
your work somewhat.

As Oliver also pointed out, however, locating malformations and
invalidities is a far cry from correcting them. With the help of an
external XML DTD or Schema, there is some hope of correcting an XML
instance that has one or a few small malformations. The DTD or Schema
would tell you what element and attribute names to expect in which
contexts. In a similar way, a schema or DTD might enable you to correct
certain minor validity problems, such as misspelled element and
attribute names. On the other hand, there are classes of validity
errors which no general program can ever hope to correct; first among
these is missing required document structure. For example, if I leave a
required attribute off some element, then although a program can
determine what attribute is missing, it cannot know what value that
attribute should have.
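
The missing-attribute case can be sketched with the standard javax.xml.validation API. This is only an illustration: the class name SchemaCheck and the method errors are made up, and it assumes a W3C XML Schema (not a DTD) passed in as a string. The validator reports that a required attribute is absent and where, which is as far as a program can go; the value still has to come from a person.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper built on the standard javax.xml.validation API.
final class SchemaCheck {
    /** Returns one "line:column: message" entry per reported problem; empty means valid. */
    static List<String> errors(String xsd, String xml) {
        List<String> found = new ArrayList<>();
        try {
            Validator validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)))
                    .newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                // Validity errors are recoverable: record them and keep going.
                @Override public void error(SAXParseException e) { found.add(describe(e)); }
                // Well-formedness errors are fatal: stop validating.
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            found.add(describe(e));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    private static String describe(SAXParseException e) {
        return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
    }
}
```

With a schema that declares a required id attribute on root, validating a bare root element yields an entry naming the missing attribute, while a root element that carries the attribute yields an empty list.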
 
Markus

Ok - here is exactly what I want to do:

I want to create an application that helps a user correct an
invalid document.

First I have to ensure that the XML is well-formed and that there are no
other errors except schema errors.
If I find that the document has such errors, I want to correct them
automatically, but it seems that is not really possible, so the user has to
correct these errors manually in a TextArea.

Next I want to find the errors against the schema.
Which parser supports validation against a schema?
For example, if I parse the document with SAX, I get the line and
column number where the error occurs, but as far as I know SAX doesn't
support schema validation.

But is there a way to get the element where the error happens?
--> If there is such an error I only want to show the user this element
for editing in a TextField.
Is this possible with any standard parser?

Useful ideas are welcome. :)

Markus
 
Chris Uppal

Markus said:
I want to create an application that helps a user correct an
invalid document.

It might be easiest to start with one of the open source (not necessarily Open
Source) XML parsers. If what it provides is adequate for you then use it as
is, if not (and it probably won't be) then hack it till it is.

I'd start with XOM myself, but that's not a recommendation based on detailed
knowledge of the "competing" libraries (or of XOM, come to that).

-- chris
 
Oliver Wong

Markus said:
Ok - here is exactly what I want to do:

I want to create an application that helps a user correct an
invalid document.

First I have to ensure that the XML is well-formed and that there are no
other errors except schema errors.
If I find that the document has such errors, I want to correct them
automatically, but it seems that is not really possible, so the user has to
correct these errors manually in a TextArea.

Next I want to find the errors against the schema.
Which parser supports validation against a schema?
For example, if I parse the document with SAX, I get the line and
column number where the error occurs, but as far as I know SAX doesn't
support schema validation.

You could do it in two phases. i.e. use SAX first to make sure that the
XML is well formed (perhaps making corrections until it is well-formed), and
then dump the parse tree into a string, and re-parse it using a different
parser which DOES support schema-validation. And then do a second phase of
(user assisted) corrections here.
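
One way the two phases might look with only the standard JAXP classes, as a sketch rather than a finished design. The names TwoPhaseCheck and check are invented for this example, the schema is assumed to be a W3C XML Schema held in a string, and instead of dumping a parse tree the second phase simply parses the same text again. The second phase sends the SAX events through a ValidatorHandler and keeps a stack of open element names, so each validity error can be tied to the element being processed when it was reported.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper; only the JAXP/SAX classes it uses are real.
final class TwoPhaseCheck {
    /** Returns a list of problems; an empty list means well-formed and valid. */
    static List<String> check(String xsd, String xml) {
        List<String> report = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();      // names of the currently open elements
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);          // schema validation needs namespace-aware events

            // Phase 1: plain parse. The first well-formedness error ends the check.
            try {
                factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            } catch (SAXParseException e) {
                report.add("not well-formed at line " + e.getLineNumber()
                        + ", column " + e.getColumnNumber() + ": " + e.getMessage());
                return report;
            }

            // Phase 2: parse again and pass every SAX event to a schema validator.
            ValidatorHandler validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)))
                    .newValidatorHandler();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) {
                    // The top of the stack is the element being processed right now.
                    report.add("invalid at <" + path.peek() + ">: " + e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            XMLFilterImpl tracker = new XMLFilterImpl(factory.newSAXParser().getXMLReader()) {
                @Override public void startElement(String uri, String local, String qName,
                                                   Attributes atts) throws SAXException {
                    path.push(qName);                 // push before the validator sees the element
                    super.startElement(uri, local, qName, atts);
                }
                @Override public void endElement(String uri, String local, String qName)
                        throws SAXException {
                    super.endElement(uri, local, qName);
                    path.pop();                       // pop after the validator has finished with it
                }
            };
            tracker.setContentHandler(validator);
            tracker.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return report;
    }
}
```

The element name in each report entry is what a GUI could use to decide which element to present for editing; mapping it back to the exact text range would additionally need the line and column from the SAXParseException.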

- Oliver
 
