resolving an entity

Dean A. Hoover · Dec 6, 2003

I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

When I run the following against a chunk of xml
containing &copy;, I get the following:

org.xml.sax.SAXParseException: Reference to undefined entity "&copy;".
at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3176)
at org.apache.crimson.parser.Parser2.expandEntityInContent(Parser2.java:2513)
at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2422)
at org.apache.crimson.parser.Parser2.content(Parser2.java:1833)
at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:500)
at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:281)
at Article.main(Article.java:18)

What can I do to catch these references in my code and output replacement
text for them?

Thanks.
Dean Hoover

Here are the two Java files:
---
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Article
{
    public static void main(String argv[])
    {
        String file = argv[0];
        PrintWriter pw = new PrintWriter(System.out);
        DefaultHandler handler = new LoadXML(pw, LoadXML.TYPE_HTML);
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try
        {
            SAXParser reader = factory.newSAXParser();
            reader.parse(new File(file), handler);
        }
        catch (Exception e)
        {
            e.printStackTrace();
            return;
        }

        pw.flush();
    }
}
---
import java.io.*;
import java.util.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class LoadXML extends DefaultHandler
{
    public static final int TYPE_HTML = 1;
    public static final int TYPE_TEXT = 2;

    public LoadXML
    (
        java.io.Writer writer,
        int type
    )
    {
        elements_ = new Stack();
        writer_ = writer;
        type_ = type;
    }

    public InputSource resolveEntity
    (
        String publicId,
        String systemId
    ) throws SAXException
    {
        String s = "stuff";
        return new InputSource(new CharArrayReader(s.toCharArray()));
    }

    public void startDocument() throws SAXException
    {
    }

    public void endDocument() throws SAXException
    {
    }

    public void startElement
    (
        String uri,
        String localName,
        String qName,
        Attributes attributes
    ) throws SAXException
    {
        String elementName = qName;
        elements_.push(elementName);

        try
        {
            if (elementName.equals("p"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
            else if (elementName.equals("title"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
            else if (elementName.equals("by"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
            else if (elementName.equals("copyright"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
        }
        catch (IOException e)
        {
            throw new SAXException(e);
        }
    }

    public void endElement
    (
        String uri,
        String localName,
        String qName
    ) throws SAXException
    {
        String elementName = qName;
        elements_.pop();

        try
        {
            if (type_ == TYPE_HTML)
            {
                if (elementName.equals("p") || elementName.equals("title") ||
                    elementName.equals("by") || elementName.equals("copyright"))
                {
                    writer_.write("\n");
                }
                else if (elementName.equals("br"))
                {
                    writer_.write(" \n");
                }
            }
        }
        catch (IOException e)
        {
            throw new SAXException(e);
        }
    }

    public void characters
    (
        char[] ch,
        int start,
        int length
    ) throws SAXException
    {
        try
        {
            String content = new String(ch, start, length);
            String top = (String)elements_.peek();
            String text =
                content.replaceAll("\n", " ").replaceAll(" +", " ").trim();

            if (text.length() == 0)
                return;

            if (type_ == TYPE_HTML)
            {
                if (top.equals("p") || top.equals("title") ||
                    top.equals("by") || top.equals("copyright"))
                    writer_.write(text);
            }
        }
        catch (IOException e)
        {
            throw new SAXException(e);
        }
    }

    private Stack elements_;
    private java.io.Writer writer_;
    private int type_;
}

Maarten Wiltink · Dec 6, 2003

Dean A. Hoover said:
I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

As I understand it, that's quite impossible. The case is defined
in the spec, and without a DTD you don't get to choose what
entities are defined or not.

But DTD may not mean what you think it does. Would it be permissible
for this document to have an internal DTD subset?

<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
<root>&copy;</root>

A quick reading of the XML spec suggests (but I may have missed
something) that this is a correct construction in XML.

Groetjes,
Maarten Wiltink

Dean A. Hoover · Dec 7, 2003

Maarten said:
As I understand it, that's quite impossible. The case is defined
in the spec, and without a DTD you don't get to choose what
entities are defined or not.

But DTD may not mean what you think it does. Would it be permissible
for this document to have an internal DTD subset?

<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
<root>&copy;</root>

A quick reading of the XML spec suggests (but I may have missed
something) that this is a correct construction in XML.

I really don't want any DTD in the document at all. I am writing
some code that will parse an xml document and output either html
or plain text depending on a parameter. In the case of HTML it
would output "&copy;", in the case of plain text it would output
"(c)". I have other similar context based entities to handle as
well.

Dean

Martin Honnen · Dec 7, 2003

Dean said:
I really don't want any DTD in the document at all. I am writing
some code that will parse an xml document and output either html
or plain text depending on a parameter. In the case of HTML it
would output "&copy;", in the case of plain text it would output
"(c)". I have other similar context based entities to handle as
well.

Well, if you write your own parser then you can of course parse
something like XML but with references to undefined entities. But then
don't attempt to parse it with an XML parser, which expects entities to
be defined.

Maarten Wiltink · Dec 7, 2003

Dean A. Hoover said:
[...]
I really don't want any DTD in the document at all. I am writing
some code that will parse an xml document and output either html
or plain text depending on a parameter. In the case of HTML it
would output "&copy;", in the case of plain text it would output
"(c)". I have other similar context based entities to handle as
well.

That's reasonable, but entities simply aren't the solution.
Would using processing instructions instead be acceptable?

In XSLT, you could even source in the transformation itself
with document('') and switch treatment of <?copy?> based on
the output method.

I'm working under the assumption that you want the source to
be well-formed XML, valid if possible.

Groetjes,
Maarten Wiltink
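
The processing-instruction idea can be sketched with plain SAX, with no XSLT involved. This is only an illustration under assumptions: the class name PiDemo, the render method, and the choice of <?copy?> as the marker are made up for the example, and the handler collects text into a string instead of writing to the Writer used in LoadXML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: maps the processing instruction <?copy?>
// to output-specific replacement text.
public class PiDemo
{
    static String render(String xml, final boolean html) throws Exception
    {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler()
        {
            @Override
            public void characters(char[] ch, int start, int length)
            {
                out.append(ch, start, length);
            }

            // SAX reports <?copy?> here; the target is the name after "<?".
            @Override
            public void processingInstruction(String target, String data)
            {
                if ("copy".equals(target))
                    out.append(html ? "&copy;" : "(c)");
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception
    {
        String xml = "<copyright><?copy?> 2003</copyright>";
        System.out.println(render(xml, true));   // HTML flavour
        System.out.println(render(xml, false));  // plain-text flavour
    }
}
```

The source stays well-formed without any DTD, and the same switch on the output type can be extended to other markers.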

Richard Tobin · Dec 8, 2003

I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

Well, this is not *real* XML.

The simplest thing to do would be to read the file into a string and
prepend an internal subset that declares the entities in question.
This will be easy if you know that there isn't an XML declaration or
DOCTYPE declaration in the file and you know the file's encoding.
Otherwise it will be more tedious.

-- Richard
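
A minimal sketch of the prepend approach, assuming the easy case described above: the input has no XML declaration and no DOCTYPE of its own, and it is already available as a Java string, so encoding is not an issue. The class name PrependSubset, the method names, and the rootName parameter are made up for the example. For HTML output the entity value is double-escaped so that the reference expands to the literal text "&copy;" rather than to another entity reference.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: declares the entity in a prepended internal subset.
public class PrependSubset
{
    // Assumption: xml has no XML declaration and no DOCTYPE of its own.
    static String withSubset(String xml, String rootName, boolean html)
    {
        // "&#38;#38;copy;" becomes the replacement text "&#38;copy;",
        // which in turn is parsed as the characters "&copy;".
        String value = html ? "&#38;#38;copy;" : "(c)";
        return "<!DOCTYPE " + rootName
            + " [ <!ENTITY copy \"" + value + "\"> ]>\n" + xml;
    }

    static String text(String xml, String rootName, boolean html)
        throws Exception
    {
        final StringBuilder out = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        parser.parse(
            new InputSource(new StringReader(withSubset(xml, rootName, html))),
            new DefaultHandler()
            {
                @Override
                public void characters(char[] ch, int start, int length)
                {
                    out.append(ch, start, length);
                }
            });
        return out.toString();
    }

    public static void main(String[] args) throws Exception
    {
        String xml = "<copyright>&copy; 2003</copyright>";
        System.out.println(text(xml, "copyright", true));   // HTML flavour
        System.out.println(text(xml, "copyright", false));  // plain-text flavour
    }
}
```

With this in place the handler sees the replacement text as ordinary character data, so the existing characters() logic in LoadXML would not need to know about entities at all.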
