Special characters and validation

Discussion in 'HTML' started by JD, Jan 29, 2009.

  1. JD

    JD Guest

    I frequently receive website copy in the form of Word documents. If I
    copy and paste the content directly from Word into my text editor, I
    often find that my web pages fail to validate due to "non SGML character
    number n" errors.

    I decided to write a little tool in C that reads in the copy and
    substitutes character entity references for any characters that will
    cause the above error. However, I'm confused about what to include in
    this program and what to leave out. For example, even though there's an
    entity reference for the copyright symbol, I've found I can put this
    symbol directly in the source and the page still validates. In that
    case, why use the entity reference at all?

    Is there a definitive list somewhere of which characters need to be
    encoded and which do not?

    I use the HTML 4.01 Strict doctype and my documents have ISO-8859-1
    encoding according to 'Page Info' in FF3.
     
    JD, Jan 29, 2009
    #1

  2. JD

    rf Guest

    JD wrote:
    > I frequently receive website copy in the form of Word documents. If I
    > copy and paste the content directly from Word into my text editor, I
    > often find that my web pages fail to validate due to "non SGML
    > character number n" errors.


    This is usually caused by Word's "smart quotes" feature, among others.
    All such "helpfull" features can be turned off.
     
    rf, Jan 29, 2009
    #2

  3. JD

    Zach Guest

    "JD" <> wrote in message
    news:...
    <...>
    > Is there a definitive list somewhere of which characters need to be
    > encoded and which do not?
    >


    space
    ! !
    " " &quot;
    # #
    $ $
    % %
    & & &amp;
    ' '
    ( (
    ) )
    * *
    + +
    , ,
    - -
    . .
    / /
    0 0
    1 1
    2 2
    3 3
    4 4
    5 5
    6 6
    7 7
    8 8
    9 9
    : :
    ; ;
    < < &lt;
    = =
    > > &gt;

    ? ?
    @ @
    A A
    B B
    C C
    D D
    E E
    F F
    G G
    H H
    I I
    J J
    K K
    L L
    M M
    N N
    O O
    P P
    Q Q
    R R
    S S
    T T
    U U
    V V
    W W
    X X
    Y Y
    Z Z
    [ [
    \ \
    ] ]
    ^ ^
    _ _
    ` `
    a a
    b b
    c c
    d d
    e e
    f f
    g g
    h h
    i i
    j j
    k k
    l l
    m m
    n n
    o o
    p p
    q q
    r r
    s s
    t t
    u u
    v v
    w w
    x x
    y y
    z z
    { {
    | |
    } }
    ~ ~
     
    , ‚ ‚
    f ƒ ƒ
    " „ „
    . … …
    ? † †
    ? ‡ ‡
    ^ ˆ ˆ
    ? ‰ ‰
    S Š Š
    < ‹ ‹
    O Œ Œ
    ' ‘ ‘
    ' ’ ’
    " “ “
    " ” ”
    . • •
    - – –
    - — —
    ~ ˜ ˜
    T ™ ™
    s š &#353;
    > › ›

    o œ œ
    Y Ÿ Ÿ
      &nbsp;
    ¡ ¡ &iexcl;
    ¢ ¢ &cent;
    £ £ &pound;
    ¤ ¤ &curren;
    ¥ ¥ &yen;
    ¦ ¦ &brvbar;
    § § &sect;
    ¨ ¨ &uml;
    © © &copy;
    ª ª &ordf;
    « « &laquo;
    ¬ ¬ &not;
    ­ ­ &shy;
    ® ® &reg;
    ¯ ¯ &macr;
    ° ° &deg;
    ± ± &plusmn;
    ² ² &sup2;
    ³ ³ &sup3;
    ´ ´ &acute;
    µ µ &micro;
    ¶ ¶ &para;
    · · &middot;
    ¸ ¸ &cedil;
    ¹ ¹ &sup1;
    º º &ordm;
    » » &raquo;
    ¼ ¼ &frac14;
    ½ ½ &frac12;
    ¾ ¾ &frac34;
    ¿ ¿ &iquest;
    À À &Agrave;
    Á Á &Aacute;
    Â Â &Acirc;
    Ã Ã &Atilde;
    Ä Ä &Auml;
    Å Å &Aring;
    Æ Æ &AElig;
    Ç Ç &Ccedil;
    È È &Egrave;
    É É &Eacute;
    Ê Ê &Ecirc;
    Ë Ë &Euml;
    Ì Ì &Igrave;
    Í Í &Iacute;
    Î Î &Icirc;
    Ï Ï &Iuml;
    Ð Ð &ETH;
    Ñ Ñ &Ntilde;
    Ò Ò &Ograve;
    Ó Ó &Oacute;
    Ô Ô &Ocirc;
    Õ Õ &Otilde;
    Ö Ö &Ouml;
    × × &times;
    Ø Ø &Oslash;
    Ù Ù &Ugrave;
    Ú Ú &Uacute;
    Û Û &Ucirc;
    Ü Ü &Uuml;
    Ý Ý &Yacute;
    Þ Þ &THORN;
    ß ß &szlig;
    à à &agrave;
    á á &aacute;
    â â &acirc;
    ã ã &atilde;
    ä ä &auml;
    å å &aring;
    æ æ &aelig;
    ç ç &ccedil;
    è è &egrave;
    é é &eacute;
    ê ê &ecirc;
    ë ë &euml;
    ì ì &igrave;
    í í &iacute;
    î î &icirc;
    ï ï &iuml;
    ð ð &eth;
    ñ ñ &ntilde;
    ò ò &ograve;
    ó ó &oacute;
    ô ô &ocirc;
    õ õ &otilde;
    ö ö &ouml;
    ÷ ÷ &divide;
    ø ø &oslash;
    ù ù &ugrave;
    ú ú &uacute;
    û û &ucirc;
    ü ü &uuml;
    ý ý &yacute;
    þ þ &thorn;
    ÿ ÿ &yuml;
    ? € &euro;
     
    Zach, Jan 29, 2009
    #3
  4. Zach wrote:

    >> Is there a definitive list somewhere of which characters need to be
    >> encoded and which do not?
    >>

    >
    > space


    Of course, stuff copied from somewhere without any citation and without even
    saying how it is supposed to answer the question ranks you as Very Clueless.

    Please do not stop using the same forged "identity" before you get a clue.
    Thank you in advance.

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
     
    Jukka K. Korpela, Jan 29, 2009
    #4
  5. rf wrote:
    > JD wrote:
    >> I frequently receive website copy in the form of Word documents. If I
    >> copy and paste the content directly from Word into my text editor, I
    >> often find that my web pages fail to validate due to "non SGML
    >> character number n" errors.

    >
    > This stuff is usually because of Word's "smart quotes" feature, and
    > others. All such "helpfull" features can be turned off.


    The only reason to quote the word "helpful" here is that you misspelled it.

    "Smart quotes" are the correct quotes. What's wrong here is their encoding,
    which does not match the declared or implied encoding of the page, but that's
    not a reason to convert correct characters to something incorrect or at least
    inferior.
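
To illustrate the point about encoding, here is a minimal sketch (not from the thread) that decodes the same bytes under two encodings. It assumes the pasted copy arrived as Windows-1252 bytes, which is common for Word on Windows but is only an assumption here. Python is used for brevity even though the OP's tool is in C; `raw` and `describe` are hypothetical names.

```python
# Bytes as Word on Windows might produce them (assumed: Windows-1252).
raw = b"\x93Smart quotes\x94"


def describe(data, encoding):
    """Decode data and return the code point of every non-ASCII character."""
    return ["U+%04X" % ord(ch) for ch in data.decode(encoding) if ord(ch) > 127]


# Read as Windows-1252, the two bytes are the curly quotation marks.
as_cp1252 = describe(raw, "cp1252")

# Read as ISO-8859-1, the same bytes are control characters 147 and 148.
as_latin1 = describe(raw, "latin-1")

print(as_cp1252)
print(as_latin1)
```

Under ISO-8859-1 the bytes land on code points 147 and 148, which are control characters rather than quotation marks. That mismatch, not the quotation marks themselves, is the likely source of a "non SGML character number 147" report.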

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
     
    Jukka K. Korpela, Jan 29, 2009
    #5
  6. JD

    Zach Guest

    "Jukka K. Korpela" <> wrote in message
    news:YMlgl.125771$...
    > Zach wrote:
    >
    >>> Is there a definitive list somewhere of which characters need to be
    >>> encoded and which do not?
    >>>

    >>
    >> space

    >
    > Of course, stuff copied from somewhere without any citation and without
    > even say how it is supposed to answer the question ranks you as Very
    > Clueless.
    >
    > Please do not stop using the same forged "identity" before you get a clue.
    > Thank you in advance.
    >
    > --
    > Yucca, http://www.cs.tut.fi/~jkorpela/


    I answered the guy's question.

    Zach,
     
    Zach, Jan 29, 2009
    #6
  7. JD

    JD Guest

    Zach wrote:
    > "Jukka K. Korpela" <> wrote in message
    > news:YMlgl.125771$...
    >> Zach wrote:
    >>
    >>>> Is there a definitive list somewhere of which characters need to be
    >>>> encoded and which do not?
    >>>>
    >>> space

    >> Of course, stuff copied from somewhere without any citation and without
    >> even say how it is supposed to answer the question ranks you as Very
    >> Clueless.
    >>
    >> Please do not stop using the same forged "identity" before you get a clue.
    >> Thank you in advance.
    >>
    >> --
    >> Yucca, http://www.cs.tut.fi/~jkorpela/

    >
    > I answered the guy's question.


    How, by supplying an indiscriminate list of character entity references?
    That's like giving somebody the entire alphabet when they ask which
    letters are vowels.
     
    JD, Jan 30, 2009
    #7
  8. JD

    Zach Guest

    "JD" <> wrote in message
    news:...

    << snipped >>

    >> I answered the guy's question.

    >
    > How, by supplying an indiscriminate list of character entity references?
    > That's like giving somebody the entire alphabet when they ask which
    > letters are vowels.


    oooooooooooooooooooooooooooooooooooooooooooooooooo

    Oh. Oh. If a response isn't to your liking, then say so politely.

    oooooooooooooooooooooooooooooooooooooooooooooooooo

    You wrote: "Is there a definitive list somewhere of which characters need to
    be encoded and which do not?"

    I would:
    1. transform the text into an array of characters
    2. see what the ASCII value is of each character
    3. see if the ASCII value is < or > certain values
    4. if so, see whether it is contained in the list I gave you
    5. if it is, substitute

    Zach.
     
    Zach, Jan 30, 2009
    #8
  9. JD

    Zach Guest

    "Ben C" <> wrote in message
    news:...
    > On 2009-01-30, Zach <> wrote:
    >>
    >> "JD" <> wrote in message
    >> news:...
    >>
    >><< snipped >>
    >>
    >>>> I answered the guy's question.
    >>>
    >>> How, by supplying an indiscriminate list of character entity references?
    >>> That's like giving somebody the entire alphabet when they ask which
    >>> letters are vowels.

    >>
    >> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>
    >> Oh. Oh. If a response isn't to your liking, then say so politely.
    >>
    >> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>
    >> You wrote: "Is there a definitive list somewhere of which characters need
    >> to
    >> be
    >> encoded and which do not?"
    >>
    >> I would:
    >> 1. transform the text into an array of characters
    >> 2. see what the accii value is of each character

    >
    > It might not have an ASCII value (nor even an ISO-8859-1 value) which is
    > the whole problem.
    >
    >> 3. see if the acii value < or > certain values

    >
    > If all the characters have ASCII values, then it is not necessary to
    > check if they are outside any particular range-- the OP was using
    > ISO-8859-1 of which ASCII is a subset.
    >
    >> 4. if so, see whether it is contained in the list I gave you
    >> 5. if it is, substitute

    >
    > Then any character whose Unicode value is outside the range that
    > ISO-8859-1 can encode needs to be substituted. There's no other list to
    > check them against, unless you are thinking of using e.g. "&nbsp;" instead
    > of "&#160;", which is more readable. In that case I suppose you get the
    > list from http://www.w3.org/TR/REC-html40/sgml/entities.html.




    "the OP was using ISO-8859-1 "
    Re: http://htmlhelp.com/reference/charset/
    Sorry, I don't understand why character for character converting wouldn't
    work.

    Zach.
     
    Zach, Jan 30, 2009
    #9
  10. Zach wrote:
    > "Ben C" <> wrote in message
    > news:...
    >> On 2009-01-30, Zach <> wrote:
    >>> "JD" <> wrote in message
    >>> news:...
    >>>
    >>> << snipped >>
    >>>
    >>>>> I answered the guy's question.
    >>>> How, by supplying an indiscriminate list of character entity references?
    >>>> That's like giving somebody the entire alphabet when they ask which
    >>>> letters are vowels.
    >>> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>>
    >>> Oh. Oh. If a response isn't to your liking, then say so politely.
    >>>
    >>> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>>
    >>> You wrote: "Is there a definitive list somewhere of which characters need
    >>> to
    >>> be
    >>> encoded and which do not?"
    >>>
    >>> I would:
    >>> 1. transform the text into an array of characters
    >>> 2. see what the accii value is of each character

    >> It might not have an ASCII value (nor even an ISO-8859-1 value) which is
    >> the whole problem.
    >>
    >>> 3. see if the acii value < or > certain values

    >> If all the characters have ASCII values, then it is not necessary to
    >> check if they are outside any particular range-- the OP was using
    >> ISO-8859-1 of which ASCII is a subset.
    >>
    >>> 4. if so, see whether it is contained in the list I gave you
    >>> 5. if it is, substitute

    >> Then any character whose Unicode value is outside the range that
    >> ISO-8859-1 can encode needs to be substituted. There's no other list to
    >> check them against, unless you are thinking of using e.g. "&nbsp;" instead
    >> of "&#160;", which is more readable. In that case I suppose you get the
    >> list from http://www.w3.org/TR/REC-html40/sgml/entities.html.

    >
    >
    >
    > "the OP was using ISO-8859-1 "
    > Re: http://htmlhelp.com/reference/charset/
    > Sorry, I don't understand why character for character converting wouldn't
    > work.


    If the source is not encoded as ASCII and contains non-ASCII characters,
    then an application that reads the source as though it *were* encoded as
    ASCII *will not correctly read the non-ASCII characters*. It can't
    convert them to anything if it can't read them.

    The list you gave happens to have very little to do with the question
    that was asked. It includes characters that are part of the ASCII encoding.
    It also includes characters that aren't part of the ASCII encoding. It
    also omits thousands of characters that aren't part of the ASCII
    encoding. If the encoding to be used to store or transmit them is ASCII,
    then all of them numbered above 127 have to be converted to an &
    reference. If the encoding to be used is UTF-8 then none of them has to
    be. For other encodings, the consequences vary.
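
A minimal Python sketch of how the answer depends on the target encoding. It relies on the standard `xmlcharrefreplace` error handler of `str.encode`, which replaces any character the chosen encoding cannot represent with a decimal `&#N;` reference. The function name `to_html_bytes` and the sample string are hypothetical, chosen only for illustration.

```python
def to_html_bytes(text, encoding):
    """Encode text, turning unencodable characters into &#N; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# e-acute exists in ISO-8859-1; the curly quotes (U+201C, U+201D) do not.
sample = "caf\u00e9 \u201cquoted\u201d"

ascii_out = to_html_bytes(sample, "ascii")        # everything above 127 replaced
latin1_out = to_html_bytes(sample, "iso-8859-1")  # only the curly quotes replaced
utf8_out = to_html_bytes(sample, "utf-8")         # nothing replaced

for label, data in (("ascii", ascii_out), ("latin-1", latin1_out), ("utf-8", utf8_out)):
    print(label, data)
```

The same input produces three different outputs, which is why no single fixed list of "characters to encode" can exist independently of the page's encoding.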
     
    Harlan Messinger, Jan 30, 2009
    #10
  11. Zach wrote:

    > I answered the guy's question.


    No, you didn't. You didn't even give a wrong answer, though your posting
    would have been a wrong answer to virtually any question, if it had
    addressed a question.

    Thank you for following my advice of continuing the use of cluelessly
    forged From field as long as you remain clueless!

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
     
    Jukka K. Korpela, Jan 30, 2009
    #11
  12. JD

    Zach Guest

    "Ben C" <> wrote in message
    news:...
    > On 2009-01-30, Zach <> wrote:
    >>
    >> "Ben C" <> wrote in message
    >> news:...
    >>> On 2009-01-30, Zach <> wrote:
    >>>>
    >>>> "JD" <> wrote in message
    >>>> news:...
    >>>>
    >>>><< snipped >>
    >>>>
    >>>>>> I answered the guy's question.
    >>>>>
    >>>>> How, by supplying an indiscriminate list of character entity
    >>>>> references?
    >>>>> That's like giving somebody the entire alphabet when they ask which
    >>>>> letters are vowels.
    >>>>
    >>>> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>>>
    >>>> Oh. Oh. If a response isn't to your liking, then say so politely.
    >>>>
    >>>> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>>>
    >>>> You wrote: "Is there a definitive list somewhere of which characters
    >>>> need
    >>>> to
    >>>> be
    >>>> encoded and which do not?"
    >>>>
    >>>> I would:
    >>>> 1. transform the text into an array of characters
    >>>> 2. see what the accii value is of each character
    >>>
    >>> It might not have an ASCII value (nor even an ISO-8859-1 value) which is
    >>> the whole problem.
    >>>
    >>>> 3. see if the acii value < or > certain values
    >>>
    >>> If all the characters have ASCII values, then it is not necessary to
    >>> check if they are outside any particular range-- the OP was using
    >>> ISO-8859-1 of which ASCII is a subset.
    >>>
    >>>> 4. if so, see whether it is contained in the list I gave you
    >>>> 5. if it is, substitute
    >>>
    >>> Then any character whose Unicode value is outside the range that
    >>> ISO-8859-1 can encode needs to be substituted. There's no other list to
    >>> check them against, unless you are thinking of using e.g. "&nbsp;"
    >>> instead of "&#160;", which is more readable. In that case I suppose you
    >>> get the list from http://www.w3.org/TR/REC-html40/sgml/entities.html.

    >>
    >>
    >>
    >> "the OP was using ISO-8859-1 "
    >> Re: http://htmlhelp.com/reference/charset/
    >> Sorry, I don't understand why character for character converting wouldn't
    >> work.

    >
    > It would.
    >
    > ASCII and ISO-8859-1 are both encodings. ASCII is a subset of
    > ISO-8859-1. The OP's destination encoding is ISO-8859-1 and his source
    > encoding is presumably a superset of ISO-8859-1 (perhaps UTF-8).
    >
    > So we need to decode the source, character for character, and output it
    > in the destination encoding, using &# thingies for any characters that
    > aren't in ISO-8859-1.
    >
    > What we're not doing is decoding ASCII source and outputting it to some
    > encoding that's a subset of ASCII (if there is such a thing). But that's
    > what your method seemed to be describing.
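
The decode-then-re-encode approach described above could be sketched roughly as follows in Python. The source encoding is not stated in the thread, so the UTF-8 default is an assumption, and `convert` is a hypothetical name. Named entities come from the standard library's `html.entities.codepoint2name` table (the HTML 4 entity set), with a numeric reference as the fallback.

```python
from html.entities import codepoint2name


def convert(source_bytes, source_encoding="utf-8", target_encoding="iso-8859-1"):
    """Re-encode source_bytes, substituting references for unencodable characters."""
    text = source_bytes.decode(source_encoding)
    pieces = []
    for ch in text:
        try:
            ch.encode(target_encoding)   # representable in the target: keep as-is
            pieces.append(ch)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Prefer a readable named entity, else fall back to a numeric reference.
            pieces.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return "".join(pieces).encode(target_encoding)


source = "\u201cHi\u201d \u2014 \u1401 caf\u00e9".encode("utf-8")
result = convert(source)
print(result)
```

Characters that ISO-8859-1 can represent, such as the e-acute, pass through untouched. Only the curly quotes, the dash and the Canadian Syllabics character are replaced.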

    ooooooooooooooooooooooooooooooooooooooooooooooooooooo
    Great, this defines what needs to be done then.
    The guy needs two lists
    (1.) an ISO-8859-1 list
    (2.) a thingies list.

    If the char isn't in (1.) then the char must be
    converted, using (2.). No big deal then.

    Zach.
     
    Zach, Jan 30, 2009
    #12
  13. JD

    Zach Guest

    "Jukka K. Korpela" <> wrote in message
    news:OQHgl.126223$...
    > Zach wrote:
    >
    >> I answered the guy's question.

    >
    > No, you didn't. You didn't even give a wrong answer, though your posting
    > would have been a wrong answer to virtually any question, if it had
    > addressed a question.
    >
    > Thank you for following my advice of continuing the use of clueslessly
    > forged From field as long as you remain clueless!
    >
    > --
    > Yucca, http://www.cs.tut.fi/~jkorpela/


    :(

    Zach.
     
    Zach, Jan 30, 2009
    #13
  14. JD

    Zach Guest

    "Ben C" <> wrote in message
    news:...
    > 2 isn't a list (assuming you mean &# things)-- those are just numbers.
    > But you might convert some characters to HTML entities like &nbsp; and
    > so you might have a list of those.


    Aren't these your thingies?
    http://www.avenue-it.com/html/asciialphabet.html
     
    Zach, Jan 30, 2009
    #14
  15. JD

    Neredbojias Guest

    On 30 Jan 2009, "Zach" <> wrote:

    >
    > "JD" <> wrote in message
    > news:...
    >
    > << snipped >>
    >
    >>> I answered the guy's question.

    >>
    >> How, by supplying an indiscriminate list of character entity
    >> references? That's like giving somebody the entire alphabet when
    >> they ask which letters are vowels.

    >
    > oooooooooooooooooooooooooooooooooooooooooooooooooo
    >
    > Oh. Oh. If a response isn't to your liking, then say so politely.


    Dear Sir,

    Your list sucked the big one.

    With warm regards,
    JD

    --
    Neredbojias
    http://www.neredbojias.org/
    http://www.neredbojias.net/
    The road to Heaven is paved with bad intentions.
     
    Neredbojias, Jan 31, 2009
    #15
  16. JD

    Zach Guest

    "Neredbojias" <> wrote in message
    news:...
    > On 30 Jan 2009, "Zach" <> wrote:
    >
    >>
    >> "JD" <> wrote in message
    >> news:...
    >>
    >> << snipped >>
    >>
    >>>> I answered the guy's question.
    >>>
    >>> How, by supplying an indiscriminate list of character entity
    >>> references? That's like giving somebody the entire alphabet when
    >>> they ask which letters are vowels.

    >>
    >> oooooooooooooooooooooooooooooooooooooooooooooooooo
    >>
    >> Oh. Oh. If a response isn't to your liking, then say so politely.

    >
    > Dear Sir,
    >
    > Your list sucked the big one.
    >
    > With warm regards,
    > JD
    >
    > --
    > Neredbojias
    > http://www.neredbojias.org/
    > http://www.neredbojias.net/
    > The road to Heaven is paved with bad intentions.
    >> oooooooooooooooooooooooooooooooooooooooooooooooooo

    Lol!

    Zach.
     
    Zach, Jan 31, 2009
    #16
  18. JD

    Zach Guest

    "Ben C" <> wrote in message
    news:...
    > On 2009-01-30, Zach <> wrote:
    >> "Ben C" <> wrote in message
    >> news:...
    >>> 2 isn't a list (assuming you mean &# things)-- those are just numbers.
    >>> But you might convert some characters to HTML entities like &nbsp; and
    >>> so you might have a list of those.

    >>
    >> Aren't these your thingies?
    >> http://www.avenue-it.com/html/asciialphabet.html

    >
    > Sort of, but ignore the first 128 entries of the table-- obviously
    > there's no need to replace 'i' with &#105; in any encoding anyone's
    > likely to be using these days.
    >
    > In fact, if I have to replace 5 with &#53; it's not clear how the
    > browser's going to understand the '5' in "&#53;".
    >
    > And I think it's likely to be a requirement of an HTML parser that it at
    > least understand ASCII. Korpela would know but he has already stormed
    > off in disgust.
    >
    > The second problem is that that table appears to list only the
    > characters in Latin 1 (aka ISO-8859-1) although I haven't checked it
    > thoroughly.
    >
    > Since the OP's destination encoding was ISO-8859-1, he wouldn't need to
    > make substitutions for any of the characters in that table.
    >
    > But he might need to make some for characters outside it-- for example
    > if his text contains U+1401 Canadian Syllabics E, or U+2207 Nabla, or
    > any of the many other characters that aren't in Latin 1.
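
As a rough Python check of which characters in a piece of copy would actually need substituting for an ISO-8859-1 page, something like the following could be used. `unencodable` and the sample string are hypothetical; `unicodedata.name` is the standard library lookup for a character's Unicode name.

```python
import unicodedata


def unencodable(text, encoding="iso-8859-1"):
    """Return the sorted unique characters of text that encoding cannot represent."""
    found = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            found.add(ch)
    return sorted(found)


# 'i', '5' and e-acute are all in Latin 1; U+1401 and U+2207 are not.
sample = "i 5 \u00e9 \u1401 \u2207"
report = [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in unencodable(sample)]

for code_point, name in report:
    print(code_point, name)
```

Only the two characters outside Latin 1 are reported; the ASCII letters, digits and the e-acute need no substitution.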

    ooooooooooooooooooooooooooooooooooooooooo
    Thank you. I have learned a few things.
    Zach.
     
    Zach, Jan 31, 2009
    #18