libxml: is it possible not to use doctype declaration?

ruud grosmann

hi all,

I have tried to find elements in XML documents using the XPath
support in libxml:

require 'xml/libxml'
doc = XML::Document.file( file)
node = doc.find_first( 'doc/p[@att]/@att')

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

Is there a way to tell the Document object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud
 
Phlip

ruud said:
This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

If the doctype is an HTML one, open the document like this:

xp = XML::HTMLParser.new()
xp.string = xhtml
XML::Parser.default_pedantic_parser = false
doc = xp.parse

My assertxpath gem shows how, in the method assert_libxml.
 
ruud grosmann

hi Phlip,

thanks for the suggestion. The document is not an HTML document. It is
an XML document. It is something like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
<p>this is a test</p>
</test>

I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

regards, Ruud
 
Phlip

ruud said:
I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

Use string surgery to yank out the DOCTYPE.
 
ruud grosmann

hi Phlip,

thank you for the hint. I had already done that, but I was wondering if
there is some hidden option that would do it for me.

Is my assumption correct that the class is not documented very well?
After googling for some time I only found something that appeared to
be outdated. That's why I eventually posted my question here.

Is using libxml the right thing to do, or are there smarter alternatives?

thanks, Ruud
 
Tommy Nordgren

Check whether your XML processor supports XML catalog files. They
provide a mapping from web-based paths to local file names.
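
As a rough illustration of the catalog idea: the sketch below writes an OASIS XML catalog that maps the identifiers from the sample document to a local file, then points libxml2 at it through the XML_CATALOG_FILES environment variable. The local DTD location and the constant names are placeholders chosen for this example, Unix-style paths are assumed, and whether a given libxml binding actually consults the catalog should be verified:

```ruby
require 'tmpdir'

# Placeholder locations for this sketch; point LOCAL_DTD at a real copy
# of the DTD on disk.
LOCAL_DTD    = File.join(Dir.tmpdir, 'test.dtd')
CATALOG_PATH = File.join(Dir.tmpdir, 'catalog.xml')

# Map both the public and the system identifier of the sample document
# to the local file.
CATALOG_XML = <<~XML
  <?xml version="1.0"?>
  <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <public publicId="-//FARAWAY//DTD-verweg//NL" uri="file://#{LOCAL_DTD}"/>
    <system systemId="http://some.site.nl/dtd/test.dtd" uri="file://#{LOCAL_DTD}"/>
  </catalog>
XML

File.write(CATALOG_PATH, CATALOG_XML)

# libxml2 reads this variable when it first initialises its catalogs, so
# it has to be set before the first document is parsed.
ENV['XML_CATALOG_FILES'] = CATALOG_PATH
```

With this in place the parser still sees the DOCTYPE, but the lookup should be answered from disk instead of the network.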
 
Phlip

ruud said:
Is using libxml the right thing to do, or are there smarter alternatives?

Libxml-ruby is the most complete & accurate parser of the big three (REXML,
Libxml-ruby, and Hpricot), but its documentation can be very challenging. How
much of the original C Libxml documentation have you been able to read?
 
Phill Davies

I tried to reply to this via the ruby-talk mailing list and it didn't
work. Not sure why not, maybe someone can fill me in on that. Anyway,
here's my take:

To start, the rdoc documentation can be found at
http://libxml.rubyforge.org/rdoc/index.html. Now I don't know this for
sure, but

<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">

doesn't look like a real doctype definition, so if you can pull it out
of your XML (by hand, not programmatically) before trying to parse it,
I'd say that would be a good idea. That being said, there are two
attributes of the XML::Parser class that look like they may be of
interest: default_load_external_dtd and default_validity_checking. Try
setting both of those to false, unless you have a real DTD to validate
against and the example above was fake. Of course, since this is using
XML::Parser instead of XML::Document, I think you would need to do e.g.:

parser = XML::Parser.file(<file>)
parser.default_load_external_dtd = false
parser.default_validity_checking = false
doc = parser.parse

... and then go from there.
Phill D.
 
Phill Davies

Whoops, those were supposed to be class variables. What you really want
to do (I think) is more like:

LibXML::XML::Parser.default_load_external_dtd = false
LibXML::XML::Parser.default_validity_checking = false

And then:

parser = LibXML::XML::Parser.file(<file>)
doc = parser.parse

That seems to work with your example.
 
ruud grosmann

hi Phill,

I've tried it right away. I ended up with the following:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

parser = XML::Parser.file( file)
#parser.default_substitute_entities = false
#parser.default_load_external_dtd = false
#parser.default_validity_checking = false
doc = parser.parse
node = doc.find( xpath).first

But the script still tries to resolve the entity. The doctype
definition is a slightly changed real one. The message I get with the
above code is:

Operation in progress./tmp/ut21.uit:3: I/O warning : failed to load
external entity "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"
e publicaties 1.0//NL" "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"

You were right that the methods are not instance methods, although I
am not sure how to conclude that from the documentation.

Did I do something wrong in the script?

regards, Ruud
 
Phlip

ruud said:
XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false
Did I do something wrong in the script?

When I was researching the difference between the normal XML parser and the HTML
parser, I also observed those variables not working. That's why I didn't bring
them up.
 
Phill Davies

Hey Ruud,
Nope, I can't see that you're doing anything wrong. I guess all I
can say is: could you send the actual XML so I can give it a try
(because when I use your original example it seems to work fine as long
as I set those class variables)? Also, the error message you sent was
broken up; if you could please try to send that again it would probably
help. Here's what I'm using:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
<p>this is a test</p>
</test>

And here's the error I get when I don't set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd"

Thanks,
Phill
 
ruud grosmann

hi Mark,

thanks for this hint. I had decided libxslt was not for me because of
a problem with garbage collection after starting to use it (see other
post).
So a good alternative is welcome. I'll check it out later this week.

regards, Ruud

Mark said:
Give fastxml a try. It's also a ruby interface to libxml.
http://github.com/segfault/fastxml/tree/master
http://fastxml.rubyforge.org/
--mg

 
Robert Klemme

2008/7/30 Phill Davies said:
Hey Ruud,
Nope, I can't see that you're doing anything wrong. I guess all I can say
is: could you send the actual XML so I can give it a try (because when I
use your original example it seems to work fine as long as I set those class
variables)? Also, the error message you sent was broken up; if you could
please try to send that again it would probably help. Here's what I'm using:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
<p>this is a test</p>
</test>

And here's the error I get when I don't set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd"

Hm, Java XML parsers I know have a special callback that you can set
that will deal with resolving external entities. I could not find
anything similar in libxml documentation but maybe I just looked in
the wrong places. With that you could load the file just once (or
even fetch it from some internal memory or file system). Also, I find
it a bit strange that those flags are global - this can introduce
weird bugs when using an application which parses XML concurrently and
needs different flags for each process...

Kind regards

robert
 
ruud grosmann

thanks everybody,

I think I'd rather do a system call to saxon. There are just too many little
bugs and uncertainties for me. Thanks anyway for your efforts and
for helping me.

Regards, Ruud
 
