libxml: is it possible not to use doctype declaration?

ruud grosmann · Jul 29, 2008

hi all,

I have tried to find elements in XML documents with xpath expression
support in libxml:

require 'xml/libxml'
doc = XML::Document.file( file)
node = doc.find_first( 'doc/p[@att]/@att')

This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

Is there a way to tell the Document-object that it should ignore the
doctype declaration if present? Or should I first remove the
declaration from the document before calling new?

regards, Ruud

Phlip · Jul 29, 2008

ruud said:
This works fine, but not if the document contains a doctype
declaration with a system identifier. For some reason, libxml tries to
resolve it, which leads to significant performance issues.

If the document is HTML, open it like this:

xp = XML::HTMLParser.new()
xp.string = xhtml
XML::Parser.default_pedantic_parser = false
doc = xp.parse

My assertxpath gem shows how, in the method assert_libxml.

ruud grosmann · Jul 29, 2008

hi Phlip,

thanks for the suggestion. The document is not an HTML document. It is
an XML document. It is something like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
this is a test
</test>

I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

regards, Ruud

Phlip · Jul 29, 2008

ruud said:
I don't want XML::Document to resolve the URL and wait for a
timeout. I couldn't find anything in the documentation on this.

Use string surgery to yank out the DOCTYPE.

ruud grosmann · Jul 29, 2008

hi Phlip,

thank you for the hint. I had already done that, but I was wondering if
there is some hidden option that would do it for me.

Is my assumption correct that the class is not documented very well?
After googling for some time I only found something that appeared to
be outdated. That is why I eventually posted my question here.

Is using libxml the right thing to do, or are there smarter alternatives?

thanks, Ruud

Tommy Nordgren · Jul 29, 2008

Check whether your XML processor supports XML catalog files. They
provide a mapping from web-based paths to local file names.
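
A rough sketch of the catalog idea, written in Ruby so it can be run as-is. It writes an OASIS XML catalog that maps the public and system identifiers from the example document to a local DTD copy, then points the XML_CATALOG_FILES environment variable at it. The local paths and the one-line DTD are hypothetical, and whether a given libxml-ruby build actually consults the catalog is an assumption to verify: it depends on the underlying libxml2 having catalog support.

```ruby
require 'tmpdir'

# Hypothetical locations: a scratch directory holding a local copy of the DTD
# and the catalog that points at it. Unix-style paths are assumed.
dir       = Dir.mktmpdir('xmlcat')
local_dtd = File.join(dir, 'test.dtd')
catalog   = File.join(dir, 'catalog.xml')

# Placeholder DTD content, only so the mapped file exists.
File.write(local_dtd, "<!ELEMENT test (#PCDATA)>\n")

# Map both the public identifier and the system identifier (the URL)
# to the local file.
File.write(catalog, <<~XML)
  <?xml version="1.0"?>
  <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <public publicId="-//FARAWAY//DTD-verweg//NL" uri="file://#{local_dtd}"/>
    <system systemId="http://some.site.nl/dtd/test.dtd" uri="file://#{local_dtd}"/>
  </catalog>
XML

# libxml2 reads this variable when it initialises its catalogs, so it has to
# be set before the first document is parsed.
ENV['XML_CATALOG_FILES'] = catalog
```

With the mapping in place, a catalog-aware parser should read the DTD from disk instead of fetching the URL, which avoids both the network wait and the need to edit the documents.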

Phlip · Jul 29, 2008

ruud said:
Is using libxml the right thing to do, or are there smarter alternatives?

Libxml-ruby is the most complete & accurate parser of the big three (REXML,
Libxml-ruby, and Hpricot), but its documentation can be very challenging. How
much of the original C Libxml documentation have you been able to read?

Phill Davies · Jul 29, 2008

I tried to reply to this via the ruby-talk mailing list and it didn't
work. Not sure why not, maybe someone can fill me in on that. Anyway,
here's my take:

To start, the rdoc documentation can be found at
http://libxml.rubyforge.org/rdoc/index.html. Now I don't know this for
sure, but

<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">

doesn't look like a real doctype definition, so if you can pull it out
of your xml (by hand, not programmatically) before trying to parse it,
I'd say that would be a good idea. That being said, there are two
attributes of the XML::Parser class that look like they may be of
interest: default_load_external_dtd and default_validity_checking. Try
setting both of those to false, unless you have a real dtd to validate
against and the example above was fake. Of course, since this is using
XML::Parser instead of XML::Document I think you would need to do e.g.:

parser = XML::Parser.file(<file>)
parser.default_load_external_dtd = false
parser.default_validity_checking = false
doc = parser.parse

... and then go from there.
Phill D.

Phill Davies · Jul 30, 2008

Whoops, those were supposed to be class variables. What you really want
to do (I think) is more like:

LibXML::XML::Parser.default_load_external_dtd = false
LibXML::XML::Parser.default_validity_checking = false

And then:

parser = LibXML::XML::Parser.file(<file>)
doc = parser.parse

That seems to work with your example.

ruud grosmann · Jul 30, 2008

hi Phill,

I've tried it right away. I ended up with the following:

XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

parser = XML::Parser.file( file)
#parser.default_substitute_entities = false
#parser.default_load_external_dtd = false
#parser.default_validity_checking = false
doc = parser.parse
node = doc.find( xpath).first

But the script still tries to resolve the entity. The doctype
definition is a slightly changed real one. The message I get with the
above code is:

Operation in progress./tmp/ut21.uit:3: I/O warning : failed to load
external entity "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"
e publicaties 1.0//NL" "http://ruud.grosmann.nl/op/dtd/publicatie.dtd"

You were right that the methods are not instance methods, although I
am not sure how to conclude that from the documentation.

Did I do something wrong in the script?

regards, Ruud

Phlip · Jul 30, 2008

ruud said:
XML::Parser.default_load_external_dtd = false
XML::Parser.default_validity_checking = false
XML::Parser.default_substitute_entities = false

Did I do something wrong in the script?

When I was researching the difference between the normal XML parser and the HTML
parser, I also observed those variables not working. That's why I didn't bring
them up.

Phill Davies · Jul 30, 2008

Hey Ruud,
Nope, I can't see that you're doing anything wrong. I guess all I
can say is that if you can send the actual XML, I can give it a try
(because when I use your original example it seems to work fine as long
as I set those class variables). Also, the error message you sent was
broken up; if you could please try to send that again it would probably
help. Here's what I'm using:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd">
<test>
this is a test
</test>

And here's the error I get when I don't set those class variables:

test.xml:2: I/O warning : failed to load HTTP resource
TYPE test PUBLIC "-//FARAWAY//DTD-verweg//NL"
"http://some.site.nl/dtd/test.dtd"

Thanks,
Phill

ruud grosmann · Jul 31, 2008

hi Mark,

thanks for this hint. I had decided libxslt was not for me because of
a problem with garbage collection after starting to use it (see other
post).
So a good alternative is welcome. I'll check it out later this week.

regards, Ruud

Mark said:
Give fastxml a try. It's also a ruby interface to libxml.
http://github.com/segfault/fastxml/tree/master
http://fastxml.rubyforge.org/
--mg

Robert Klemme · Aug 1, 2008

Hm, Java XML parsers I know have a special callback that you can set
that will deal with resolving external entities. I could not find
anything similar in libxml documentation but maybe I just looked in
the wrong places. With that you could load the file just once (or
even fetch it from some internal memory or file system). Also, I find
it a bit strange that those flags are global - this can introduce
weird bugs when using an application which parses XML concurrently and
needs different flags for each process...
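
To illustrate the concern about global flags, here is one possible way to scope class-level settings to a block and restore them afterwards. This is only a sketch: FakeParser is a stand-in class invented so the example runs without libxml-ruby, and with_class_settings and SETTINGS_LOCK are hypothetical names, not part of any library.

```ruby
# Stand-in for a class with process-wide settings (hypothetical, not the
# real parser); it only exists so the sketch is runnable on its own.
class FakeParser
  class << self
    attr_accessor :default_load_external_dtd
  end
  self.default_load_external_dtd = true
end

SETTINGS_LOCK = Mutex.new

# Temporarily override class-level settings and restore them afterwards.
# The mutex serialises callers, so threads cannot observe each other's
# overrides, at the cost of running one guarded block at a time.
def with_class_settings(klass, overrides)
  SETTINGS_LOCK.synchronize do
    saved = overrides.keys.to_h { |name| [name, klass.public_send(name)] }
    begin
      overrides.each { |name, value| klass.public_send("#{name}=", value) }
      yield
    ensure
      # Runs even if the block raises, so the old values always come back.
      saved.each { |name, value| klass.public_send("#{name}=", value) }
    end
  end
end

inside = with_class_settings(FakeParser, default_load_external_dtd: false) do
  FakeParser.default_load_external_dtd
end
```

This only addresses threads within one Ruby process; separate processes each have their own copy of the flags anyway.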

Kind regards

robert

ruud grosmann · Aug 2, 2008

thanks everybody,

I think I'd rather do a system call to saxon. There are just too many
little bugs and uncertainties for me. Thanks anyway for your efforts
and for helping me.

Regards, Ruud
