resolving an entity

Discussion in 'XML' started by Dean A. Hoover, Dec 6, 2003.

  1. I am writing a parser for xml that will not have
    an associated DTD. I want to be able to handle
    certain character references (e.g., &copy;) in
    the program.

    When I run the code below against a chunk of xml
    containing &copy;, I get the following:

    org.xml.sax.SAXParseException: Reference to undefined entity "&copy;".
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3176)
    at org.apache.crimson.parser.Parser2.expandEntityInContent(Parser2.java:2513)
    at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2422)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1833)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:500)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:281)
    at Article.main(Article.java:18)

    What can I do to catch these references in my code and output replacement
    text for it?

    Thanks.
    Dean Hoover

    Here are the two Java files:
    ---
    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class Article
    {
        public static void main(String argv[])
        {
            String file = argv[0];
            PrintWriter pw = new PrintWriter(System.out);
            DefaultHandler handler = new LoadXML(pw, LoadXML.TYPE_HTML);
            SAXParserFactory factory = SAXParserFactory.newInstance();

            try
            {
                SAXParser reader = factory.newSAXParser();
                reader.parse(new File(file), handler);
            }
            catch (Exception e)
            {
                e.printStackTrace();
                return;
            }

            pw.flush();
        }
    }
    ---
    import java.io.*;
    import java.util.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class LoadXML extends DefaultHandler
    {
        public static final int TYPE_HTML = 1;
        public static final int TYPE_TEXT = 2;

        public LoadXML
        (
            java.io.Writer writer,
            int type
        )
        {
            elements_ = new Stack();
            writer_ = writer;
            type_ = type;
        }

        public InputSource resolveEntity
        (
            String publicId,
            String systemId
        ) throws SAXException
        {
            String s = "stuff";
            return new InputSource(new CharArrayReader(s.toCharArray()));
        }

        public void startDocument() throws SAXException
        {
        }

        public void endDocument() throws SAXException
        {
        }

        public void startElement
        (
            String uri,
            String localName,
            String qName,
            Attributes attributes
        ) throws SAXException
        {
            String elementName = qName;
            elements_.push(elementName);

            try
            {
                if (elementName.equals("p"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-text\">");
                }
                else if (elementName.equals("title"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-title\">");
                }
                else if (elementName.equals("by"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-by\">");
                }
                else if (elementName.equals("copyright"))
                {
                    if (type_ == TYPE_HTML)
                        writer_.write("<p class=\"article-copyright\">");
                }
            }
            catch (IOException e)
            {
                throw new SAXException(e);
            }
        }

        public void endElement
        (
            String uri,
            String localName,
            String qName
        ) throws SAXException
        {
            String elementName = qName;
            elements_.pop();

            try
            {
                if (type_ == TYPE_HTML)
                {
                    if (elementName.equals("p") || elementName.equals("title") ||
                        elementName.equals("by") || elementName.equals("copyright"))
                    {
                        writer_.write("</p>\n");
                    }
                    else if (elementName.equals("br"))
                    {
                        writer_.write("<br/>\n");
                    }
                }
            }
            catch (IOException e)
            {
                throw new SAXException(e);
            }
        }

        public void characters
        (
            char[] ch,
            int start,
            int length
        ) throws SAXException
        {
            try
            {
                String content = new String(ch, start, length);
                String top = (String)elements_.peek();
                String text =
                    content.replaceAll("\n", " ").replaceAll(" +", " ").trim();

                if (text.length() == 0)
                    return;

                if (type_ == TYPE_HTML)
                {
                    if (top.equals("p") || top.equals("title") ||
                        top.equals("by") || top.equals("copyright"))
                        writer_.write(text);
                }
            }
            catch (IOException e)
            {
                throw new SAXException(e);
            }
        }

        private Stack elements_;
        private java.io.Writer writer_;
        private int type_;
    }
    Dean A. Hoover, Dec 6, 2003
    #1

  2. "Dean A. Hoover" <> wrote in message
    news:4qqAb.189389$...
    > I am writing a parser for xml that will not have
    > an associated DTD. I want to be able to handle
    > certain character references (e.g., &copy;) in
    > the program.


    As I understand it, that's quite impossible. The case is defined
    in the spec, and without a DTD you don't get to choose what
    entities are defined or not.

    But DTD may not mean what you think it does. Would it be permissible
    for this document to have an internal DTD subset?

    <?xml version="1.0"?>
    <!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
    <root>&copy;</root>

    A quick reading of the XML spec suggests (but I may have missed
    something) that this is a correct construction in XML.
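
As a rough illustration of the internal-subset suggestion above, the sketch below parses a document whose DOCTYPE declares `copy` and collects the character data a SAX parser reports. The class name `InternalSubsetDemo`, the helper `textOf`, and the replacement text `(c)` are assumptions made for this example and do not come from the thread; it also assumes the default JAXP parser, which expands internal entities declared this way.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original posts.
class InternalSubsetDemo {
    // Parses the XML text and returns all character data the parser reports.
    static String textOf(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The internal subset declares the entity, so the reference resolves.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE root [ <!ENTITY copy '(c)'> ]>\n"
            + "<root>&copy; 2003</root>";
        System.out.println(textOf(xml));
    }
}
```

With the declaration in place the parser should no longer report an undefined entity; the replacement text simply arrives through `characters()`.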

    Groetjes,
    Maarten Wiltink
    Maarten Wiltink, Dec 6, 2003
    #2

  3. Maarten Wiltink wrote:
    > "Dean A. Hoover" <> wrote in message
    > news:4qqAb.189389$...
    >
    >>I am writing a parser for xml that will not have
    >>an associated DTD. I want to be able to handle
    >>certain character references (e.g., &copy;) in
    >>the program.

    >
    >
    > As I understand it, that's quite impossible. The case is defined
    > in the spec, and without a DTD you don't get to choose what
    > entities are defined or not.
    >
    > But DTD may not mean what you think it does. Would it be permissible
    > for this document to have an internal DTD subset?
    >
    > <?xml version="1.0"?>
    > <!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
    > <root>&copy;</root>
    >
    > A quick reading of the XML spec suggests (but I may have missed
    > something) that this is a correct construction in XML.
    >

    I really don't want any DTD in the document at all. I am writing
    some code that will parse an xml document and output either html
    or plain text depending on a parameter. In the case of HTML it
    would output "&copy;", in the case of plain text it would output
    "(c)". I have other similar context based entities to handle as
    well.

    Dean
    Dean A. Hoover, Dec 7, 2003
    #3
  4. Dean A. Hoover wrote:

    > Maarten Wiltink wrote:
    >
    >> "Dean A. Hoover" <> wrote in message
    >> news:4qqAb.189389$...
    >>
    >>> I am writing a parser for xml that will not have
    >>> an associated DTD. I want to be able to handle
    >>> certain character references (e.g., &copy;) in
    >>> the program.

    >>
    >>
    >>
    >> As I understand it, that's quite impossible. The case is defined
    >> in the spec, and without a DTD you don't get to choose what
    >> entities are defined or not.
    >>
    >> But DTD may not mean what you think it does. Would it be permissible
    >> for this document to have an internal DTD subset?
    >>
    >> <?xml version="1.0"?>
    >> <!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
    >> <root>&copy;</root>
    >>
    >> A quick reading of the XML spec suggests (but I may have missed
    >> something) that this is a correct construction in XML.
    >>

    > I really don't want any DTD in the document at all. I am writing
    > some code that will parse an xml document and output either html
    > or plain text depending on a parameter. In the case of HTML it
    > would output "&copy;", in the case of plain text it would output
    > "(c)". I have other similar context based entities to handle as
    > well.


    Well, if you write your own parser then you can of course parse
    something like XML but with references to undefined entities. But then
    don't attempt to parse it with an XML parser which expects entities to
    be defined.

    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Dec 7, 2003
    #4
  5. "Dean A. Hoover" <> wrote in message
    news:uLvAb.190449$...
    > Maarten Wiltink wrote:
    >> "Dean A. Hoover" <> wrote in message
    >> news:4qqAb.189389$...


    >>> I am writing a parser for xml that will not have
    >>> an associated DTD. I want to be able to handle
    >>> certain character references (e.g., &copy;) in
    >>> the program.

    [...]
    > I really don't want any DTD in the document at all. I am writing
    > some code that will parse an xml document and output either html
    > or plain text depending on a parameter. In the case of HTML it
    > would output "&copy;", in the case of plain text it would output
    > "(c)". I have other similar context based entities to handle as
    > well.


    That's reasonable, but entities simply aren't the solution.
    Would using processing instructions instead be acceptable?

    In XSLT, you could even source in the transformation itself
    with document('') and switch treatment of <?copy?> based on
    the output method.
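
On the Java/SAX side, the processing-instruction idea might look roughly like the sketch below: the source uses `<?copy?>` instead of `&copy;`, and the handler picks the replacement from the output type. The names `PiDemo`, `PiHandler`, and `render` are hypothetical, and the sample document is invented for illustration; this is not the XSLT approach mentioned above, only a SAX equivalent under those assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original posts.
class PiDemo {
    // Replaces <?copy?> according to the requested output type.
    static class PiHandler extends DefaultHandler {
        private final boolean html;
        private final StringBuilder out = new StringBuilder();

        PiHandler(boolean html) {
            this.html = html;
        }

        @Override
        public void processingInstruction(String target, String data) {
            // Unknown targets are ignored in this sketch.
            if (target.equals("copy")) {
                out.append(html ? "&copy;" : "(c)");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        String result() {
            return out.toString();
        }
    }

    // Parses the XML text and returns the rendered character content.
    static String render(String xml, boolean html) throws Exception {
        PiHandler handler = new PiHandler(html);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.result();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<copyright><?copy?> 2003 Example Author</copyright>";
        System.out.println(render(xml, true));   // HTML flavour
        System.out.println(render(xml, false));  // plain-text flavour
    }
}
```

The source stays well-formed without any DTD, since a processing instruction needs no declaration. A real HTML writer would also need to escape the character data it copies through, which this sketch leaves out.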

    I'm working under the assumption that you want the source to
    be well-formed XML, valid if possible.

    Groetjes,
    Maarten Wiltink
    Maarten Wiltink, Dec 7, 2003
    #5
  6. In article <4qqAb.189389$>,
    Dean A. Hoover <> wrote:
    >I am writing a parser for xml that will not have
    >an associated DTD. I want to be able to handle
    >certain character references (e.g., &copy;) in
    >the program.


    Well, this is not *real* XML.

    The simplest thing to do would be to read the file into a string and
    prepend an internal subset that declares the entities in question.
    This will be easy if you know that there isn't an XML declaration or
    DOCTYPE declaration in the file and you know the file's encoding.
    Otherwise it will be more tedious.
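
A minimal sketch of the prepend approach, assuming the easy case described above: the file content is already in a string, in a known encoding, with no XML declaration or DOCTYPE of its own. The names `PrependSubsetDemo`, `doctypeFor`, and `parseText` are hypothetical, and the replacement values are assumed to be plain text with no `"`, `%`, `&`, or `<` in them.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original posts.
class PrependSubsetDemo {
    // Builds a DOCTYPE whose internal subset declares each entity.
    // Assumes values contain no '"', '%', '&' or '<'.
    static String doctypeFor(String rootName, Map<String, String> entities) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE " + rootName + " [\n");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            sb.append("  <!ENTITY ").append(e.getKey())
              .append(" \"").append(e.getValue()).append("\">\n");
        }
        return sb.append("]>\n").toString();
    }

    // Prepends the generated DOCTYPE to the body and returns the parsed text.
    static String parseText(String body, String rootName,
                            Map<String, String> entities) throws Exception {
        String xml = doctypeFor(rootName, entities) + body;
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The program chooses the replacement text per output type.
        Map<String, String> plainText = new LinkedHashMap<>();
        plainText.put("copy", "(c)");
        String body = "<article><copyright>&copy; 2003</copyright></article>";
        System.out.println(parseText(body, "article", plainText));
    }
}
```

Because the program builds the subset itself, it can pass a different map for each output type, and the stored documents stay free of any DTD.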

    -- Richard
    --
    Spam filter: to mail me from a .com/.net site, put my surname in the headers.

    FreeBSD rules!
    Richard Tobin, Dec 8, 2003
    #6