Special characters and validation

JD · Jan 29, 2009

I frequently receive website copy in the form of Word documents. If I
copy and paste the content directly from Word into my text editor, I
often find that my web pages fail to validate due to "non SGML character
number n" errors.

I decided to write a little tool in C that reads in the copy and
substitutes character entity references for any characters that will
cause the above error. However, I'm confused about what to include in
this program and what to leave out. For example, even though there's an
entity reference for the copyright symbol, I've found I can put this
symbol directly in the source and the page still validates. In that
case, why use the entity reference at all?

Is there a definitive list somewhere of which characters need to be
encoded and which do not?

I use the HTML 4.01 Strict doctype and my documents have ISO-8859-1
encoding according to 'Page Info' in FF3.

rf · Jan 29, 2009

JD said:
I frequently receive website copy in the form of Word documents. If I
copy and paste the content directly from Word into my text editor, I
often find that my web pages fail to validate due to "non SGML
character number n" errors.

This is usually caused by Word's "smart quotes" feature, among others.
All such "helpfull" features can be turned off.

Zach · Jan 29, 2009

Is there a definitive list somewhere of which characters need to be
encoded and which do not?

space
! !
" " "
# #
$ $
% %
& & &
' '
( (
) )
* *
+ +
, ,
- -
. .
/ /
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
: :
; ;
< < <
= =
> > >
? ?
@ @
A A
B B
C C
D D
E E
F F
G G
H H
I I
J J
K K
L L
M M
N N
O O
P P
Q Q
R R
S S
T T
U U
V V
W W
X X
Y Y
Z Z
[ [
\ \
] ]
^ ^
_ _
` `
a a
b b
c c
d d
e e
f f
g g
h h
i i
j j
k k
l l
m m
n n
o o
p p
q q
r r
s s
t t
u u
v v
w w
x x
y y
z z
{ {
| |
} }
~ ~

, ‚
f ƒ
" „
. …
? †
? ‡
^ ˆ
? ‰
S Š
< ‹
O Œ
' ‘
' ’
" “
" ”
. •
- –
- —
~ ˜
T ™
s š
> ›
o œ
Y Ÿ
 
¡ ¡ ¡
¢ ¢ ¢
£ £ £
¤ ¤ ¤
¥ ¥ ¥
¦ ¦ ¦
§ § §
¨ ¨ ¨
© © ©
ª ª ª
« « «
¬ ¬ ¬

® ® ®
¯ ¯ ¯
° ° °
± ± ±
² ² ²
³ ³ ³
´ ´ ´
µ µ µ
¶ ¶ ¶
· · ·
¸ ¸ ¸
¹ ¹ ¹
º º º
» » »
¼ ¼ ¼
½ ½ ½
¾ ¾ ¾
¿ ¿ ¿
À À À
Á Á Á
Â Â Â
Ã Ã Ã
Ä Ä Ä
Å Å Å
Æ Æ Æ
Ç Ç Ç
È È È
É É É
Ê Ê Ê
Ë Ë Ë
Ì Ì Ì
Í Í Í
Î Î Î
Ï Ï Ï
Ð Ð Ð
Ñ Ñ Ñ
Ò Ò Ò
Ó Ó Ó
Ô Ô Ô
Õ Õ Õ
Ö Ö Ö
× × ×
Ø Ø Ø
Ù Ù Ù
Ú Ú Ú
Û Û Û
Ü Ü Ü
Ý Ý Ý
Þ Þ Þ
ß ß ß
à à à
á á á
â â â
ã ã ã
ä ä ä
å å å
æ æ æ
ç ç ç
è è è
é é é
ê ê ê
ë ë ë
ì ì ì
í í í
î î î
ï ï ï
ð ð ð
ñ ñ ñ
ò ò ò
ó ó ó
ô ô ô
õ õ õ
ö ö ö
÷ ÷ ÷
ø ø ø
ù ù ù
ú ú ú
û û û
ü ü ü
ý ý ý
þ þ þ
ÿ ÿ ÿ
? € €

Jukka K. Korpela · Jan 29, 2009

Zach said:
space

Of course, stuff copied from somewhere without any citation and without even
saying how it is supposed to answer the question ranks you as Very Clueless.

Please do not stop using the same forged "identity" before you get a clue.
Thank you in advance.

Jukka K. Korpela · Jan 29, 2009

rf said:
This is usually caused by Word's "smart quotes" feature, among
others. All such "helpfull" features can be turned off.

The only reason to quote the word "helpful" here is that you misspelled it.

"Smart quotes" are the correct quotes. What's wrong here is their encoding,
as opposed to the declared or implied encoding of the page, but that's not
a reason to convert correct characters to something incorrect or at least
inferior.

Zach · Jan 29, 2009

Jukka K. Korpela said:
Of course, stuff copied from somewhere without any citation and without
even saying how it is supposed to answer the question ranks you as Very
Clueless.

Please do not stop using the same forged "identity" before you get a clue.
Thank you in advance.

I answered the guy's question.

Zach.

JD · Jan 30, 2009

Zach said:
I answered the guy's question.

How, by supplying an indiscriminate list of character entity references?
That's like giving somebody the entire alphabet when they ask which
letters are vowels.

Zach · Jan 30, 2009

How, by supplying an indiscriminate list of character entity references?
That's like giving somebody the entire alphabet when they ask which
letters are vowels.

oooooooooooooooooooooooooooooooooooooooooooooooooo

Oh. Oh. If a response isn't to your liking, then say so politely.

oooooooooooooooooooooooooooooooooooooooooooooooooo

You wrote: "Is there a definitive list somewhere of which characters need to
be encoded and which do not?"

I would:
1. transform the text into an array of characters
2. look up the ASCII value of each character
3. check whether that value is below or above certain thresholds
4. if so, see whether it is contained in the list I gave you
5. if it is, substitute
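
The steps above can be sketched in C roughly as follows. This is only an illustration under one assumption that the thread does not confirm: that the pasted text arrives as Windows-1252 bytes, where Word's smart quotes and dashes sit in the range 0x80 to 0x9F (the same characters as the middle block of the list). The names cp1252_high and cp1252_to_latin1_html are made up for this sketch.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F (0 = unassigned). */
static const unsigned cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Copies in[0..in_len) to out, replacing Windows-1252 bytes in 0x80..0x9F
 * with numeric character references. All other bytes (including the five
 * unassigned ones) are copied unchanged. Returns the number of characters
 * written, not counting the terminating NUL, or (size_t)-1 if out is too
 * small. Assumption: the input really is Windows-1252. */
size_t cp1252_to_latin1_html(const unsigned char *in, size_t in_len,
                             char *out, size_t out_cap)
{
    size_t used = 0;
    if (out_cap == 0)
        return (size_t)-1;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char b = in[i];
        char piece[16];
        int n;
        if (b >= 0x80 && b <= 0x9F && cp1252_high[b - 0x80] != 0) {
            /* Steps 3-5: in the problem range and in the table, so substitute. */
            n = snprintf(piece, sizeof piece, "&#%u;", cp1252_high[b - 0x80]);
        } else {
            piece[0] = (char)b;
            n = 1;
        }
        if (used + (size_t)n + 1 > out_cap)
            return (size_t)-1;
        memcpy(out + used, piece, (size_t)n);
        used += (size_t)n;
    }
    out[used] = '\0';
    return used;
}
```

With this sketch, the bytes 0x93 'H' 'i' 0x94 would come out as &#8220;Hi&#8221;, while a byte such as 0xE9 (é in both Windows-1252 and ISO-8859-1) passes through untouched.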

Zach.

Zach · Jan 30, 2009

Ben C said:
It might not have an ASCII value (nor even an ISO-8859-1 value) which is
the whole problem.

If all the characters have ASCII values, then it is not necessary to
check if they are outside any particular range-- the OP was using
ISO-8859-1 of which ASCII is a subset.

Then any character whose Unicode value is outside the range that
ISO-8859-1 can encode needs to be substituted. There's no other list to
check them against, unless you are thinking of using e.g. "&nbsp;" instead
of "&#160;", which is more readable. In that case I suppose you get the
list from http://www.w3.org/TR/REC-html40/sgml/entities.html.

"the OP was using ISO-8859-1 "
Re: http://htmlhelp.com/reference/charset/
Sorry, I don't understand why character for character converting wouldn't
work.

Zach.

Harlan Messinger · Jan 30, 2009

Zach said:
"the OP was using ISO-8859-1 "
Re: http://htmlhelp.com/reference/charset/
Sorry, I don't understand why character for character converting wouldn't
work.

If the source is not encoded as ASCII and contains non-ASCII characters,
then an application that reads the source as though it *were* encoded as
ASCII *will not correctly read the non-ASCII characters*. It can't
convert them to anything if it can't read them.

The list you gave happens to have very little to do with the question
that was asked. It includes characters that are part of the ASCII encoding.
It also includes characters that aren't part of the ASCII encoding. It
also omits thousands of characters that aren't part of the ASCII
encoding. If the encoding to be used to store or transmit them is ASCII,
then all of them numbered above 127 have to be converted to an &
reference. If the encoding to be used is UTF-8 then none of them has to
be. For other encodings, the consequences vary.
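
As a rough illustration of that rule, here is a small C predicate. It is a sketch only: the enum and the function name needs_reference are invented for this example, the ISO-8859-1 case is taken from the rest of the thread, and markup-significant characters such as & and < are a separate matter that it does not handle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the encoding the page will be saved in. */
enum target_encoding { TARGET_ASCII, TARGET_LATIN1, TARGET_UTF8 };

/* Returns true if the Unicode code point cp cannot be written directly in
 * the target encoding and therefore needs a character reference. */
bool needs_reference(uint32_t cp, enum target_encoding target)
{
    switch (target) {
    case TARGET_ASCII:
        return cp > 0x7F;   /* everything numbered above 127 */
    case TARGET_LATIN1:
        return cp > 0xFF;   /* ISO-8859-1 covers code points 0..255 */
    case TARGET_UTF8:
        return false;       /* UTF-8 can encode every code point */
    }
    return false;
}
```

This is also why the copyright symbol (U+00A9) validates as a literal character in an ISO-8859-1 page: needs_reference(0xA9, TARGET_LATIN1) is false, while the same character in an ASCII page would need a reference.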

Jukka K. Korpela · Jan 30, 2009

Zach said:
I answered the guy's question.

No, you didn't. You didn't even give a wrong answer, though your posting
would have been a wrong answer to virtually any question, if it had
addressed a question.

Thank you for following my advice of continuing the use of a cluelessly
forged From field as long as you remain clueless!

Zach · Jan 30, 2009

Ben C said:
It would.

ASCII and ISO-8859-1 are both encodings. ASCII is a subset of
ISO-8859-1. The OP's destination encoding is ISO-8859-1 and his source
encoding is presumably a superset of ISO-8859-1 (perhaps UTF-8).

So we need to decode the source, character for character, and output it
in the destination encoding, using &# thingies for any characters that
aren't in ISO-8859-1.

What we're not doing is decoding ASCII source and outputting it to some
encoding that's a subset of ASCII (if there is such a thing). But that's
what your method seemed to be describing.
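
A minimal C sketch of the decode-and-re-encode approach described above, assuming the source really is UTF-8 (the thread only says "perhaps"). The function names are made up for illustration, and the decoder is simplified: it does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decodes one UTF-8 sequence from s (len bytes available). Stores the code
 * point in *cp and returns the number of bytes consumed, or 0 if the input
 * is malformed or truncated. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;
    if (len == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 3; }
    else return 0;
    if (need >= len) return 0;          /* sequence runs past the input */
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need + 1;
}

/* Converts UTF-8 input to ISO-8859-1 bytes, writing &#N; for any code point
 * above U+00FF. Returns the output length (without the terminating NUL) or
 * (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_latin1_html(const unsigned char *in, size_t in_len,
                           char *out, size_t out_cap)
{
    size_t i = 0, used = 0;
    if (out_cap == 0) return (size_t)-1;
    while (i < in_len) {
        uint32_t cp;
        size_t n = utf8_decode(in + i, in_len - i, &cp);
        if (n == 0) return (size_t)-1;
        i += n;
        if (cp <= 0xFF) {
            /* Representable in ISO-8859-1: emit the single byte. */
            if (used + 2 > out_cap) return (size_t)-1;
            out[used++] = (char)(unsigned char)cp;
        } else {
            /* Not in ISO-8859-1: emit a numeric character reference. */
            int w = snprintf(out + used, out_cap - used, "&#%lu;",
                             (unsigned long)cp);
            if (w < 0 || (size_t)w >= out_cap - used) return (size_t)-1;
            used += (size_t)w;
        }
    }
    out[used] = '\0';
    return used;
}
```

Note that no lookup list is involved here: the only test is whether the code point fits in one ISO-8859-1 byte.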

ooooooooooooooooooooooooooooooooooooooooooooooooooooo
Great, this defines what needs to be done then.
The guy needs two lists
(1.) an ISO-8859-1 list
(2.) a thingies list.

If the char isn't in (1.) then the char must be
converted, using (2.). No big deal then.

Zach.

Zach · Jan 30, 2009

Jukka K. Korpela said:
No, you didn't. You didn't even give a wrong answer, though your posting
would have been a wrong answer to virtually any question, if it had
addressed a question.

Thank you for following my advice of continuing the use of a cluelessly
forged From field as long as you remain clueless!

Zach.

Zach · Jan 30, 2009

Ben C said:
2 isn't a list (assuming you mean &# things)-- those are just numbers.
But you might convert some characters to HTML entities like &nbsp; and
so you might have a list of those.

Aren't these your thingies?
http://www.avenue-it.com/html/asciialphabet.html

Neredbojias · Jan 30, 2009

oooooooooooooooooooooooooooooooooooooooooooooooooo

Oh. Oh. If a response isn't to your liking, then say so politely.

Dear Sir,

Your list sucked the big one.

With warm regards,
JD

Zach · Jan 31, 2009

Neredbojias said:
Dear Sir,

Your list sucked the big one.

With warm regards,
JD

--
Neredbojias
http://www.neredbojias.org/
http://www.neredbojias.net/
The road to Heaven is paved with bad intentions.

Lol!

Zach.

Zach · Jan 31, 2009

Ben C said:
Sort of, but ignore the first 128 entries of the table-- obviously
there's no need to replace 'i' with &#105; in any encoding anyone's
likely to be using these days.

In fact, if I have to replace 5 with &#53; it's not clear how the
browser's going to understand the '5' in "&#53;".

And I think it's likely to be a requirement of an HTML parser that it at
least understand ASCII. Korpela would know but he has already stormed
off in disgust.

The second problem is that that table appears to list only the
characters in Latin 1 (aka ISO-8859-1) although I haven't checked it
thoroughly.

Since the OP's destination encoding was ISO-8859-1, he wouldn't need to
make substitutions for any of the characters in that table.

But he might need to make some for characters outside it-- for example
if his text contains U+1401 Canadian Syllabics E, or U+2207 Nabla, or
any of the many other characters that aren't in Latin 1.

ooooooooooooooooooooooooooooooooooooooooo
Thank you. I have learned a few things.
Zach.
