Special characters and validation

JD

I frequently receive website copy in the form of Word documents. If I
copy and paste the content directly from Word into my text editor, I
often find that my web pages fail to validate due to "non SGML character
number n" errors.

I decided to write a little tool in C that reads in the copy and
substitutes character entity references for any characters that will
cause the above error. However, I'm confused about what to include in
this program and what to leave out. For example, even though there's an
entity reference for the copyright symbol, I've found I can put this
symbol directly in the source and the page still validates. In that
case, why use the entity reference at all?

Is there a definitive list somewhere of which characters need to be
encoded and which do not?

I use the HTML 4.01 Strict doctype and my documents have ISO-8859-1
encoding according to 'Page Info' in Firefox 3.
 
rf

JD said:
I frequently receive website copy in the form of Word documents. If I
copy and paste the content directly from Word into my text editor, I
often find that my web pages fail to validate due to "non SGML
character number n" errors.

This stuff is usually because of Word's "smart quotes" feature, and others.
All such "helpfull" features can be turned off.
 
Zach

Is there a definitive list somewhere of which characters need to be
encoded and which do not?

space
! !
" " "
# #
$ $
% %
& & &
' '
( (
) )
* *
+ +
, ,
- -
. .
/ /
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
: :
; ;
< < &lt;
= =
? ?
@ @
A A
B B
C C
D D
E E
F F
G G
H H
I I
J J
K K
L L
M M
N N
O O
P P
Q Q
R R
S S
T T
U U
V V
W W
X X
Y Y
Z Z
[ [
\ \
] ]
^ ^
_ _
` `
a a
b b
c c
d d
e e
f f
g g
h h
i i
j j
k k
l l
m m
n n
o o
p p
q q
r r
s s
t t
u u
v v
w w
x x
y y
z z
{ {
| |
} }
~ ~
 
, ‚ ‚
f ƒ ƒ
" „ „
. … …
? † †
? ‡ ‡
^ ˆ ˆ
? ‰ ‰
S Š Š
< ‹ ‹
O Œ Œ
' ‘ ‘
' ’ ’
" “ “
" ” ”
. • •
- – –
- — —
~ ˜ ˜
T ™ ™
s š &#353;
o œ œ
Y Ÿ Ÿ
  &nbsp;
¡ ¡ &iexcl;
¢ ¢ &cent;
£ £ &pound;
¤ ¤ &curren;
¥ ¥ &yen;
¦ ¦ &brvbar;
§ § &sect;
¨ ¨ &uml;
© © &copy;
ª ª &ordf;
« « &laquo;
¬ ¬ &not;
­ ­ &shy;
® ® &reg;
¯ ¯ &macr;
° ° &deg;
± ± &plusmn;
² ² &sup2;
³ ³ &sup3;
´ ´ &acute;
µ µ &micro;
¶ ¶ &para;
· · &middot;
¸ ¸ &cedil;
¹ ¹ &sup1;
º º &ordm;
» » &raquo;
¼ ¼ &frac14;
½ ½ &frac12;
¾ ¾ &frac34;
¿ ¿ &iquest;
À À &Agrave;
Á Á &Aacute;
Â Â &Acirc;
Ã Ã &Atilde;
Ä Ä &Auml;
Å Å &Aring;
Æ Æ &AElig;
Ç Ç &Ccedil;
È È &Egrave;
É É &Eacute;
Ê Ê &Ecirc;
Ë Ë &Euml;
Ì Ì &Igrave;
Í Í &Iacute;
Î Î &Icirc;
Ï Ï &Iuml;
Ð Ð &ETH;
Ñ Ñ &Ntilde;
Ò Ò &Ograve;
Ó Ó &Oacute;
Ô Ô &Ocirc;
Õ Õ &Otilde;
Ö Ö &Ouml;
× × &times;
Ø Ø &Oslash;
Ù Ù &Ugrave;
Ú Ú &Uacute;
Û Û &Ucirc;
Ü Ü &Uuml;
Ý Ý &Yacute;
Þ Þ &THORN;
ß ß &szlig;
à à &agrave;
á á &aacute;
â â &acirc;
ã ã &atilde;
ä ä &auml;
å å &aring;
æ æ &aelig;
ç ç &ccedil;
è è &egrave;
é é &eacute;
ê ê &ecirc;
ë ë &euml;
ì ì &igrave;
í í &iacute;
î î &icirc;
ï ï &iuml;
ð ð &eth;
ñ ñ &ntilde;
ò ò &ograve;
ó ó &oacute;
ô ô &ocirc;
õ õ &otilde;
ö ö &ouml;
÷ ÷ &divide;
ø ø &oslash;
ù ù &ugrave;
ú ú &uacute;
û û &ucirc;
ü ü &uuml;
ý ý &yacute;
þ þ &thorn;
ÿ ÿ &yuml;
? € &euro;
 
Jukka K. Korpela

Zach said:

Of course, stuff copied from somewhere without any citation and without even
saying how it is supposed to answer the question ranks you as Very Clueless.

Please do not stop using the same forged "identity" before you get a clue.
Thank you in advance.
 
Jukka K. Korpela

rf said:
This stuff is usually because of Word's "smart quotes" feature, and
others. All such "helpfull" features can be turned off.

The only reason to quote the word "helpful" here is that you misspelled it.

"Smart quotes" are the correct quotes. What's wrong here is their encoding,
as opposed to the declared or implied encoding of the page, but that's not
a reason to convert correct characters to something incorrect or at least
inferior.
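
A minimal C sketch of the encoding mismatch described here, assuming (the thread does not confirm it) that the pasted text arrives as Windows-1252 bytes while the page is declared as ISO-8859-1. In that case only bytes 0x80 to 0x9F differ in meaning, which would also explain why the copyright sign (0xA9) validates as typed. The function name cp1252_to_latin1_html is hypothetical; the table follows the standard Windows-1252 layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   0 marks the five bytes that Windows-1252 leaves undefined. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: copy Windows-1252 text from `in` to `out`,
   replacing defined bytes in 0x80..0x9F with numeric character
   references. All other bytes are copied unchanged, including the
   five undefined ones, which would still need manual attention.
   Returns the number of bytes written (excluding the terminator),
   or -1 if `out` is too small. */
static int cp1252_to_latin1_html(const char *in, char *out, size_t out_size)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; ++in) {
        unsigned char b = (unsigned char)*in;
        char piece[16];
        size_t n;

        if (b >= 0x80 && b <= 0x9F && cp1252_high[b - 0x80] != 0) {
            n = (size_t)snprintf(piece, sizeof piece, "&#%u;",
                                 (unsigned)cp1252_high[b - 0x80]);
        } else {
            piece[0] = (char)b;
            n = 1;
        }
        if (used + n + 1 > out_size)   /* keep room for the terminator */
            return -1;
        memcpy(out + used, piece, n);
        used += n;
    }
    if (used + 1 > out_size)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

With this sketch, a pair of Word "smart quotes" (bytes 0x93 and 0x94) would come out as &#8220; and &#8221;, so the characters themselves are preserved rather than flattened to straight quotes.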
 
Zach

Jukka K. Korpela said:
Of course, stuff copied from somewhere without any citation and without
even saying how it is supposed to answer the question ranks you as Very
Clueless.

Please do not stop using the same forged "identity" before you get a clue.
Thank you in advance.

I answered the guy's question.

Zach,
 
JD

Zach said:
I answered the guy's question.

How, by supplying an indiscriminate list of character entity references?
That's like giving somebody the entire alphabet when they ask which
letters are vowels.
 
Zach

How, by supplying an indiscriminate list of character entity references?
That's like giving somebody the entire alphabet when they ask which
letters are vowels.

oooooooooooooooooooooooooooooooooooooooooooooooooo

Oh. Oh. If a response isn't to your liking, then say so politely.

oooooooooooooooooooooooooooooooooooooooooooooooooo

You wrote: "Is there a definitive list somewhere of which characters need to
be encoded and which do not?"

I would:
1. transform the text into an array of characters
2. see what the ASCII value is of each character
3. see if the ASCII value is below or above certain values
4. if so, see whether it is contained in the list I gave you
5. if it is, substitute

Zach.
 
Zach

Ben C said:
It might not have an ASCII value (nor even an ISO-8859-1 value) which is
the whole problem.


If all the characters have ASCII values, then it is not necessary to
check if they are outside any particular range-- the OP was using
ISO-8859-1 of which ASCII is a subset.


Then any character whose Unicode value is outside the range that
ISO-8859-1 can encode needs to be substituted. There's no other list to
check them against, unless you are thinking of using e.g. "&nbsp;" instead of
"&#160;", which is more readable. In that case I suppose you get the
list from http://www.w3.org/TR/REC-html40/sgml/entities.html.



"the OP was using ISO-8859-1 "
Re: http://htmlhelp.com/reference/charset/
Sorry, I don't understand why character for character converting wouldn't
work.

Zach.
 
Harlan Messinger

Zach said:
"the OP was using ISO-8859-1 "
Re: http://htmlhelp.com/reference/charset/
Sorry, I don't understand why character for character converting wouldn't
work.

If the source is not encoded as ASCII and contains non-ASCII characters,
then an application that reads the source as though it *were* encoded as
ASCII *will not correctly read the non-ASCII characters*. It can't
convert them to anything if it can't read them.

The list you gave happens to have very little to do with the question
that was asked. It includes characters that are part of the ASCII encoding.
It also includes characters that aren't part of the ASCII encoding. It
also omits thousands of characters that aren't part of the ASCII
encoding. If the encoding to be used to store or transmit them is ASCII,
then all of them numbered above 127 have to be converted to an &#n;
reference. If the encoding to be used is UTF-8 then none of them has to
be. For other encodings, the consequences vary.
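
The rule in this post can be written down as a small C predicate. This is only a sketch: the enum and the name needs_reference are illustrative, and the ISO-8859-1 case (code points above 255) is taken from Ben C's remarks quoted elsewhere in the thread rather than from this post. Characters that are special in markup, such as & and <, are a separate matter and are not handled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Target encodings considered in this sketch (illustrative names). */
enum target_encoding { ENC_ASCII, ENC_LATIN1, ENC_UTF8 };

/* Returns true if Unicode code point `cp` cannot be stored directly
   in the target encoding and therefore has to be written as a
   character reference. */
static bool needs_reference(unsigned long cp, enum target_encoding enc)
{
    assert(cp <= 0x10FFFFUL);   /* outside this range is not Unicode */
    switch (enc) {
    case ENC_ASCII:  return cp > 0x7F;   /* ASCII stops at 127 */
    case ENC_LATIN1: return cp > 0xFF;   /* ISO-8859-1 stops at 255 */
    case ENC_UTF8:   return false;       /* UTF-8 can encode everything */
    }
    return true;   /* unknown target: be conservative */
}
```

Under these assumptions the copyright sign (U+00A9) needs a reference only when the target is ASCII, while a curly quote such as U+201C needs one for both ASCII and ISO-8859-1 but not for UTF-8.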
 
Jukka K. Korpela

Zach said:
I answered the guy's question.

No, you didn't. You didn't even give a wrong answer, though your posting
would have been a wrong answer to virtually any question, if it had
addressed a question.

Thank you for following my advice of continuing the use of a cluelessly
forged From field as long as you remain clueless!
 
Zach

Ben C said:
It would.

ASCII and ISO-8859-1 are both encodings. ASCII is a subset of
ISO-8859-1. The OP's destination encoding is ISO-8859-1 and his source
encoding is presumably a superset of ISO-8859-1 (perhaps UTF-8).

So we need to decode the source, character for character, and output it
in the destination encoding, using &# thingies for any characters that
aren't in ISO-8859-1.

What we're not doing is decoding ASCII source and outputting it to some
encoding that's a subset of ASCII (if there is such a thing). But that's
what your method seemed to be describing.
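
The decode-then-re-encode approach quoted above might look roughly like this in C. It is a sketch under the quoted "presumably": the source is taken to be UTF-8, which nobody in the thread confirmed. Both function names are hypothetical, and the decoder is simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s. Stores the code point in
   *cp and returns the number of bytes consumed, or 0 if the sequence
   is malformed. Simplified: overlong forms are not rejected. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)   /* also stops at an early '\0' */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical helper: read UTF-8 from `in`, write ISO-8859-1 bytes to
   `out`, using a numeric character reference for anything above U+00FF.
   Returns bytes written (excluding the terminator), or -1 on malformed
   input or if `out` is too small. */
static int utf8_to_latin1_html(const char *in, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    assert(in != NULL && out != NULL);
    while (*p != '\0') {
        unsigned long cp;
        char piece[16];
        size_t n, len = utf8_decode(p, &cp);

        if (len == 0)
            return -1;
        if (cp <= 0xFF) {
            piece[0] = (char)(unsigned char)cp;   /* fits in Latin-1 */
            n = 1;
        } else {
            n = (size_t)snprintf(piece, sizeof piece, "&#%lu;", cp);
        }
        if (used + n + 1 > out_size)
            return -1;
        memcpy(out + used, piece, n);
        used += n;
        p += len;
    }
    if (out_size == 0)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

Note that no lookup table is involved: whether a character needs a reference follows from its code point alone, and a table of entity names would only matter if named references were preferred over numeric ones.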
ooooooooooooooooooooooooooooooooooooooooooooooooooooo
Great, this defines what needs to be done then.
The guy needs two lists
(1.) an ISO-8859-1 list
(2.) a thingies list.

If the char isn't in (1.) then the char must be
converted, using (2.). No big deal then.

Zach.
 
Zach

Jukka K. Korpela said:
No, you didn't. You didn't even give a wrong answer, though your posting
would have been a wrong answer to virtually any question, if it had
addressed a question.

Thank you for following my advice of continuing the use of a cluelessly
forged From field as long as you remain clueless!

:(

Zach.
 
Neredbojias

oooooooooooooooooooooooooooooooooooooooooooooooooo

Oh. Oh. If a response isn't to your liking, then say so politely.

Dear Sir,

Your list sucked the big one.

With warm regards,
JD
 
Zach

Ben C said:
Sort of, but ignore the first 128 entries of the table-- obviously
there's no need to replace 'i' with &#105; in any encoding anyone's
likely to be using these days.

In fact, if I have to replace 5 with &#53; it's not clear how the
browser's going to understand the '5' in "&#53;".

And I think it's likely to be a requirement of an HTML parser that it at
least understand ASCII. Korpela would know but he has already stormed
off in disgust.

The second problem is that that table appears to list only the
characters in Latin 1 (aka ISO-8859-1) although I haven't checked it
thoroughly.

Since the OP's destination encoding was ISO-8859-1, he wouldn't need to
make substitutions for any of the characters in that table.

But he might need to make some for characters outside it-- for example
if his text contains U+1401 Canadian Syllabics E, or U+2207 Nabla, or
any of the many other characters that aren't in Latin 1.
ooooooooooooooooooooooooooooooooooooooooo
Thank you. I have learned a few things.
Zach.
 
