non SGML character escape

Srini · Mar 13, 2009

I have some typographical/special characters in our database which
come from users pasting text from documents. I have to take that
data and create an XML file. When I run the XML through the W3C
validator, it fails with:

"Line 37231, Column 135: non SGML character number 25

You have used an illegal character in your text. HTML uses the
standard UNICODE Consortium character repertoire, and it leaves
undefined (among others) 65 character codes (0 to 31 inclusive and 127
to 159 inclusive) that ...... and so on"

I am using the Apache Commons Lang StringEscapeUtils.escapeXml()
method, and I also tried StringEscapeUtils.escapeHtml(). Both fail
to escape these characters.

Can someone point me in the right direction? Is there a utility that
I can use for this?
Even though the XML validator fails, can XSLT processing bypass these
characters when it parses this XML?

Thanks - Srini.

Tom Anderson · Mar 13, 2009

I think what the error report is saying is that there's no way to escape
the characters, because they're control characters that simply aren't
allowed in the document, escaped or not. It's just like if you had Klingon
characters in your database.

Your solution is to remove the characters, and either replace them with
something equivalent that is allowed, or forget about them. ASCII
character 25 is EM, 'end of medium' - what does that mean in your system?
How on earth are your users entering it?
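
A minimal sketch of the "remove the characters" approach, assuming the target is XML 1.0. The class and method names are hypothetical; the ranges follow the Char production of the XML 1.0 specification. Note that XML 1.0 still permits 127 to 159, which the HTML validator message above rejects, so the test may need tightening for HTML output.

```java
final class XmlCharFilter {
    /** True if the code point is allowed by the Char production of XML 1.0. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every illegal code point dropped. */
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The EM control character (code 25) is removed; the rest is kept.
        System.out.println(strip("bonus\u0019es")); // prints "bonuses"
    }
}
```

Running strip() before escapeXml() means the escaper only ever sees characters it can legally emit.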

Srini said:
Can someone point me in the right direction? Is there a utility that
I can use for this?
Even though the XML validator fails, can XSLT processing bypass these
characters when it parses this XML?

It's likely but not certain that XML parsers will choke on the characters
(a standards-compliant parser will), and since parsing is a prerequisite
for XSLT processing, you can't rely on that being possible.
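
A quick way to check this claim is to feed a tiny document to the XML parser bundled with the JDK. This is only a sketch with a hypothetical helper name; it shows that the parser rejects character 25 both raw and as a numeric character reference, which is why escaping cannot help.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ParserCheck {
    /** Returns true if the JDK's bundled XML parser accepts the document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Fatal parse error; the default handler may also log to stderr.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a>ok</a>"));      // true
        System.out.println(isWellFormed("<a>\u0019</a>"));  // false: raw control character
        System.out.println(isWellFormed("<a>&#25;</a>"));   // false: escaped form is illegal too
    }
}
```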

tom

Srini · Mar 16, 2009

I believe these characters come from users copying and pasting from
applications like Word. So would the solution be to just ignore that
particular element when the parser chokes, and to ask users not to
cut and paste from a word processor? But how can you control
users?

Lew · Mar 16, 2009

Mark Space said:
The technique I'm familiar with is to validate before it gets to the
database. If the validation fails, kick it back to the user with a big
red X and the error message "No dice."

More generally, always validate input.

Srini · Mar 17, 2009

Lew said:
More generally, always validate input.

We cannot really validate and ask the user to remove those, because the
user can copy from Word directly into the textarea. In that case, how do
we validate? Apache Commons escapeHtml() and escapeXml() do not do the
job, so what is the workaround? This seems like a pretty common issue to me.

Martin Gregorie · Mar 17, 2009

It sounds to me that about all you can do is to scan the text area when
the user requests it to be written to the database. If you find any non-
SGML characters, replace each one by something valid but obvious, such as
a tilde [~], and ask the user to delete these characters. It doesn't
matter whether the user deletes them or just resubmits because either way
your generated XML will contain only SGML characters.
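
The scan-and-replace step could look roughly like the sketch below. The class name and the exact character class are assumptions: it flags codes 0 to 31 and 127 to 159, as listed in the validator message, but leaves tab, line feed and carriage return alone.

```java
import java.util.regex.Pattern;

final class TildeMarker {
    // Codes 0-31 except tab (9), LF (10) and CR (13), plus 127-159.
    private static final Pattern SUSPECT =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F-\\x9F]");

    /** True if the text contains at least one suspect character. */
    static boolean hasSuspect(String text) {
        return SUSPECT.matcher(text).find();
    }

    /** Replaces each suspect character with a visible tilde. */
    static String mark(String text) {
        return SUSPECT.matcher(text).replaceAll("~");
    }

    public static void main(String[] args) {
        String pasted = "Congress \u0097 and much\u0019";
        if (hasSuspect(pasted)) {
            // Show the marked text back to the user for correction.
            System.out.println(mark(pasted)); // prints "Congress ~ and much~"
        }
    }
}
```

Whether the user edits the tildes or simply resubmits, the stored text no longer contains the flagged codes.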

Srini · Mar 17, 2009

I tried to scan the string but was unable to catch the culprit characters
to replace them with something obvious. I tried the Apache Commons Lang
"CharUtils.isAscii(chr)" method, but they are not getting caught. In
TextPad they appear as thick vertical lines. In the browser they appear
as "diamond brackets, or chevrons".

How do we scan that text and replace though for those non SGML chars??
appreciate any pointers.

Lew · Mar 17, 2009

Please do not quote sigs.

Srini said:
I tried to scan the string but unable to catch that culprit characters
to replace with obvious. I tried with apache commons lang package with
"CharUtils.isAscii(chr)" class but those are not getting caught. They

Can you show us an SSCCE? What you describe does not reveal the
error.
<http://sscce.org/>

Just show an example that demonstrates the unwanted characters "not
getting caught". What you tell us looks like it should have worked.

Srini said:
How do we scan that text and replace though for those non SGML chars??

One question mark suffices to indicate an interrogative.

It's hard to say what you should have done differently given the
information you have provided so far. An SSCCE will help a lot.
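
One possible explanation, assuming the offending character really is code 25 as the validator reports: CharUtils.isAscii() is documented as a 7-bit test, so it is true for every value below 128, control characters included. The sketch below uses plain-Java stand-ins with hypothetical names rather than the library itself.

```java
final class AsciiCheckDemo {
    /** Stand-in for a 7-bit ASCII test: true for any value below 128. */
    static boolean isAscii(char ch) {
        return ch < 128;
    }

    /** Stand-in for an ASCII control test: codes 0-31 and 127. */
    static boolean isAsciiControl(char ch) {
        return ch < 32 || ch == 127;
    }

    public static void main(String[] args) {
        char em = (char) 25; // EM, "end of medium"
        System.out.println(isAscii(em));        // true: an ASCII test lets it through
        System.out.println(isAsciiControl(em)); // true: a control test catches it
    }
}
```

A filter that only rejects non-ASCII input would therefore pass character 25 untouched.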

Srini · Mar 18, 2009

That problem can be "solved" by switching to UTF-8.

Or, at the *input* place, add validation/conversion. We had similar
problems in the past with the copy-pasting from MS word to the html
form, and invalid characters got through. Even though the database
should have rejected these (BLOB oddity I suspect). Major headache until
we used UTF-8 in everything.

Another possibility is to instruct users to paste to notepad first, then
copy-paste from there. They may or may not do it.

Some of the characters are not being escaped. I suspect these
are created in the DB when users simply copy and paste from Word or any
news web site.
Ex: "employee bonuses that members of Congress — and much of the
American public — find indefensible"
Characters that look like thick vertical lines in the above message are
causing this error. (In TextPad they appear as thick vertical lines, but
they appear as diamond brackets in the HTML page.)

Lew · Mar 18, 2009

Srini said:
Ex: "employee bonuses that members of Congress — and much of the
American public — find indefensible"
Characters like thick vertical lines in that above message causing
this error. (in textpad they appear like thick vertical lines but they
appear as diamond brackets in html page )

Neither of which appear here, which makes your example a tad hard to
follow.

RedGrittyBrick · Mar 18, 2009

Lew said:
Neither of which appear here, which makes your example a tad hard to
follow.

I suspect that when Srini writes "thick vertical lines", he/she means
"broad horizontal lines".

Between "Congress " and " and" is 0x97, which is a character in the
Windows-Latin-1 character set. That byte is *not* a printable character in
ISO 8859-1 Latin 1, and Unicode does not put the character at that code
point either. In Windows-Latin-1 it is an em dash, which Unicode encodes as
U+2014, so it can be translated.
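
To illustrate the point about 0x97, here is a small sketch (hypothetical class name) that decodes the same byte with both charsets. It assumes the JDK provides "windows-1252", which standard Java runtimes do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Demo {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes raw bytes as Windows-Latin-1 (Cp1252). */
    static String decodeCp1252(byte[] raw) {
        return new String(raw, CP1252);
    }

    /** Decodes raw bytes as ISO 8859-1. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x97 };
        // Cp1252 maps the byte to the em dash; ISO 8859-1 maps it to a control code.
        System.out.println(Integer.toHexString(decodeCp1252(raw).charAt(0))); // 2014
        System.out.println(Integer.toHexString(decodeLatin1(raw).charAt(0))); // 97
    }
}
```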

RedGrittyBrick · Mar 18, 2009

Your problem is that you are accepting characters from a source in
Windows-Latin-1 encoding (AKA Cp1252) and treating that data as if it
were in some other encoding (e.g. ISO 8859-1 Latin 1).

I expect you need to perform a transformation from one encoding to the
other.

Elsewhere you mention users pasting into a "textarea". If that is a
JTextArea then you probably need to do something like use String's
support for charsetNames to perform an appropriate transformation.
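
As a sketch of such a transformation, assuming the text was already decoded as ISO 8859-1 when it was really Cp1252: re-encode it to the original bytes and decode again with the right charset. The helper name is hypothetical, and any character above U+00FF in the input would be lost as '?' by this round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Re-interprets text that was wrongly decoded as ISO 8859-1 as Cp1252. */
    static String repair(String misdecoded) {
        // ISO 8859-1 maps U+0000..U+00FF one-to-one onto bytes, so this
        // recovers the original byte values.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        String broken = "Congress \u0097 and much";
        String fixed = repair(broken);
        // The C1 control code becomes a real em dash (U+2014).
        System.out.println(Integer.toHexString(fixed.charAt(9))); // 2014
    }
}
```

Fixing the charset at the point where the bytes are first read is cleaner than repairing strings afterwards, if that point is under your control.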

Roedy Green · Mar 19, 2009

Can some one point me in the right direction, is there an utility that
I can use for this???

See http://mindprod.com/products1.html#ENTITIES
there are classes for converting Unicode to &xxxx;-encoded HTML and
back.

There are also classes for dealing with XML entities. So hook them up
back to back and you will have your utility.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"In the central North Pacific, plastic outweighs surface zooplankton 6 to 1."
~ Thomas M. Kostigen
