Tidy using unicode does not validate

groups2 · Mar 15, 2007

If I clean up a page with Tidy (the Firefox validator version)
using a Unicode character set, and then go to the W3C validator, the
W3C validator finds invalid characters. How do I clean it up using
Unicode in Tidy and then validate it with W3C?

If I use ascii in Tidy, W3C accepts it, but I don't want to use
ascii.

Tidy in Firefox has an option for iso-8859 but it doesn't do anything:
the only choices in the actual cleanup are ascii and unicode. What's
up with that?

groups2 · Mar 15, 2007

If I tidy the document using ascii encoding and then change the
stated encoding to utf:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
It validates.

My guess is that Tidy has it backwards and the W3 validation is
correct, but that seems too simple so I'm still wondering what the heck
is going on.
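
A minimal sketch of why the relabelled ASCII file validates, assuming Python 3 and a made-up sample line (not the actual page): every ASCII byte means the same thing in UTF-8, so a file holding only ASCII bytes plus entity references is well-formed under either charset label.

```python
import html

# Hypothetical sample line, similar to what an ASCII cleanup might emit:
# only ASCII bytes, with entity references standing in for the symbols.
ascii_bytes = b"<p>Acme&reg; &mdash; tools</p>"

# The same bytes decode to the same text under both labels.
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
same_text = as_ascii == as_utf8

# Entity references are resolved after decoding, whatever the label says.
resolved = html.unescape(as_utf8)
print(same_text, resolved)
```

So a UTF-8 label on a pure-ASCII file is not a contradiction; the file simply does not use any of the multi-byte sequences UTF-8 allows.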

Andy Dingley · Mar 15, 2007

If I clean up a page

Which page? Does it have a URL?

There is simply no point in discussing encoding issues like this unless
we can see the live page (including HTTP headers).

with tidy (the firefox validator version)

Are you using the 0.8.3.* version of Gueury's FF HTML Validator with a
full DTD validator built in too?

groups2 · Mar 15, 2007

Yes, that's right. Here are some simple examples:

http://reenie.org/test/ascii.htm
cleaned up by tidy with ascii encoding - has html entities for mdash
and reg
passes w3c validation

http://reenie.org/test/unicode.htm
The same file cleaned up by tidy with utf encoding
Source has an mdash and reg symbol (not html entities) which only show
as question marks.
Does not pass validation: "one or more bytes that I cannot interpret
as utf-8"

http://reenie.org/test/unicode2.htm
The same as the first ascii file, has html entities for mdash and reg
but I replaced
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
with
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Passes w3 validation

As far as I can tell Tidy cannot produce the third file, which
makes me wonder if the third file with utf encoding and html entities
is really valid.
If it is, how do I use Tidy to produce a utf-8 file that validates?

Also, another issue: Tidy has only 2 options for cleaning code,
unicode and ascii. There are more in view/character encoding but not
in the cleanup options. Why no ISO-8859-1? I would settle for that.

Andy Dingley · Mar 15, 2007

http://reenie.org/test/unicode.htm
The same file cleaned up by tidy with utf encoding
Source has an mdash and reg symbol (not html entities) which only show
as question marks.
Does not pass validation: "one or more bytes that I cannot interpret
as utf-8"

My tools are currently broken so I can't be sure, but that looks like
an ISO-8859-* file, being served as UTF-8

groups2 · Mar 15, 2007

My tools are currently broken so I can't be sure, but that looks like
an ISO-8859-* file, being served as UTF-8

I'm afraid I don't understand what this means.
Do you mean the page will be served as utf-8 regardless of this in the
header:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Can you explain why it doesn't validate ?

groups2 · Mar 15, 2007

I'll rephrase that. Why is Tidy giving me a document that does not
validate?
Is it because the server is somehow serving the file wrong?

Andy Dingley · Mar 15, 2007

I'll rephrase that. Why is Tidy giving me a document that does not
validate ?

I don't think Tidy fixes errors that are caused by impossible
characters, arising from embedding uninterpretable byte sequences in
documents that conflict with the assumed encoding for that file. If
they're broken, I think they just stay broken.

Is it because the server is somehow serving the file wrong?

I think your server is trying to serve these documents correctly, but
you still haven't shown us the original. Once an encoding error creeps
in it's sometimes impossible to reverse it without knowing what it was
originally supposed to be, and this is beyond an automatic tool that
tries to work from the document alone.

What's the _original_ document that you're asking Tidy to work on?

Tidy can certainly take a document containing correctly encoded
non-ASCII characters, then process it by "Clean up" to produce a
well-formed, valid and correctly encoded UTF-8 document. If you then serve
this as UTF-8, all remains well.

groups2 · Mar 15, 2007

As I said before, http://reenie.org/test/unicode.htm
is http://reenie.org/test/ascii.htm cleaned by Tidy with utf
encoding,
so that would make http://reenie.org/test/ascii.htm the original file.

Andy Dingley · Mar 16, 2007

as I said before, http://reenie.org/test/unicode.htm
is http://reenie.org/test/ascii.htm cleaned by tidy with utf
encoding.

Ok, I think I understand what you've done now.

http://reenie.org/test/unicode.htm is broken. It appears to have an
ISO-8859-1 character in the file being served as a UTF-8 document.

Tidy didn't make this file. AFAIK, the Tidy you're using takes its
input from Firefox and doesn't have any "output to file" feature. You
must have taken its output from the clipboard, pasted it into your
choice of editor and saved it from there. At this point, I can only
assume that the file was a correctly-encoded ISO-8859 file.

The web server then gets to it and serves it up, with UTF-8 encoding
headers or embedded metas in it. Things go wrong _at_this_point_. File
is good (but not UTF-8), web document is bad (mis-labelled and thus
unreadable).
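
The mismatch described above can be reproduced in a few lines. This is only a sketch, assuming Python 3 and an invented sample string; it shows what happens when bytes saved as ISO-8859-1 are decoded under a UTF-8 label.

```python
# Assumed sample text; "\u00ae" is the registered sign.
text = "Acme\u00ae"

# What an editor saving in ISO-8859-1 writes to disk: the sign is one byte, 0xAE.
latin1_bytes = text.encode("iso-8859-1")

# What a strict consumer (such as a validator) does with a UTF-8 label.
try:
    latin1_bytes.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as err:
    outcome = f"invalid UTF-8 at byte offset {err.start}"

# A lenient consumer substitutes a replacement character instead of failing.
lenient = latin1_bytes.decode("utf-8", errors="replace")
print(outcome, repr(lenient))
```

The strict path corresponds to the validator's "bytes that I cannot interpret as utf-8" message, and the lenient path to the placeholder marks a browser displays.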

I suggest you try the "Tidy cleanup" process again, but this time make
sure that your editor's save setting is utf-8. jEdit is a well-behaved
editor here, some others (e.g. Eclipse) aren't. Watch out for Windows
editors, as they often say "Unicode" and mean UTF-16, which isn't
what's wanted at all. Look for a specific UTF-8 option.
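
To illustrate the "Unicode means UTF-16" trap, here is a small sketch assuming Python 3; the sample string is made up, and Python's `utf-16` codec stands in for an editor's "Unicode" save option.

```python
# "a", an em dash (U+2014), "b"
text = "a\u2014b"

utf8_bytes = text.encode("utf-8")    # 1 + 3 + 1 bytes
utf16_bytes = text.encode("utf-16")  # 2-byte BOM, then 2 bytes per character

print(len(utf8_bytes), len(utf16_bytes))

# Labelled as UTF-8, the UTF-16 bytes are not decodable at all:
# the UTF-16 BOM bytes 0xFF and 0xFE never occur in valid UTF-8.
try:
    utf16_bytes.decode("utf-8")
    utf16_ok = True
except UnicodeDecodeError:
    utf16_ok = False
print(utf16_ok)
```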

Gérard Talbot · Mar 16, 2007

Andy Dingley wrote :

I don't think Tidy fixes errors that are caused by impossible
characters,

Correct. It's up to the web author to know
- in which character set the HTML document was written
- how to choose the corresponding character set

Tidy will not fix an impossible task, one well beyond its capabilities and
scope. Tidy is best used to fix well-known bad HTML coding practices.

arising from embedding uninterpretable byte sequences in
documents that conflict with the assumed encoding for that file. If
they're broken, I think they just stay broken.

I think your server is trying to serve these documents correctly, but
you still haven't shown us the original.

Correct. It is useless to start talking about fixing a webpage when we
can't see the page with its actual HTTP headers.

A long time ago, you asked:

Which page? Does it have a URL?

There is simply no point in discussing encoding issues like this unless
we can see the live page (including HTTP headers).

Gérard

groups2 · Mar 16, 2007

Right Right Right.
I just figured out the same thing before I saw your message.
My editor is UltraEdit.
http://reenie.org/test/unicode.htm is saved as DOS and shows question
marks for the characters in Firefox, and doesn't validate with W3.

http://reenie.org/test/unicode3.htm is exactly the same except it is
saved as DOS-UTF8. It shows the characters correctly and validates.

Now when I validate it, W3 gives me a warning:

Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known
to cause problems for some text editors and older browsers. You may
want to consider avoiding its use until it is better supported.

Should I be worried about this?
It seems that the only way to avoid this problem is to leave the file in
DOS and tidy the file in ascii. Is this correct?

I am about to edit quite a few pages so I want to do whatever will
be most common and most recommended in the future. Am I safe in
assuming that will be utf-8?

Jukka K. Korpela · Mar 16, 2007

Scripsit (e-mail address removed):

Byte-Order Mark found in UTF-8 File. - -
Should I be worried about this ?

Not much. The document
http://www.w3.org/International/questions/qa-utf8-bom
is a bit vague and doesn't list the software that doesn't grok the BOM,
but the symptoms it mentions (an extra line or the funny characters ï»¿)
aren't really catastrophic.
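
Where those three funny characters come from can be shown with a short sketch, assuming Python 3 and a made-up file body; `utf-8-sig` is Python's name for "UTF-8, dropping a leading BOM if present".

```python
import codecs

bom = codecs.BOM_UTF8                  # the three bytes EF BB BF
page = bom + "<html>".encode("utf-8")  # hypothetical file contents

# Software that assumes Windows-1252 shows the BOM bytes as three characters.
misread = page.decode("cp1252")

# Plain "utf-8" keeps the BOM as the character U+FEFF;
# "utf-8-sig" strips it while decoding.
kept = page.decode("utf-8")
stripped = page.decode("utf-8-sig")
print(repr(misread), repr(kept), repr(stripped))
```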

It seems that the only way to avoid this problem is to leave the file in
DOS and tidy the file in ascii. Is this correct?

I don't know about the specific software, but leaving the file in ASCII,
presumably with the software presenting any non-ASCII characters as
character or entity references like &#8211; or &ndash;, is a good option, if
you have relatively few non-ASCII characters, so that it's not significant
in terms of amount of data.
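
As a rough illustration of the ASCII-plus-references approach (this is Python's own error handler, assumed here as a stand-in, not what Tidy does internally), non-ASCII characters can be turned into numeric character references while everything else stays plain ASCII:

```python
# Invented sample: a registered sign, an en dash and an em dash.
text = "Acme\u00ae \u2013 tools \u2014 parts"

# "xmlcharrefreplace" swaps each unencodable character for &#NNNN;.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)
```

The result contains only ASCII bytes, so it reads the same whether the page is labelled us-ascii, ISO-8859-1 or utf-8.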

I am about to edit quite a few pages so I want to do whatever will
be most common and most recommended in the future. Am I safe in
assuming that will be utf-8?

UTF-8 is clearly favored in Internet protocol development, and there's no
reason to expect this to change. But of course ASCII is still slightly
better supported.

groups2 · Mar 16, 2007

Scripsit (e-mail address removed):

Not much. The document http://www.w3.org/International/questions/qa-utf8-bom
is a bit vague and doesn't list the software that doesn't grok the BOM,
but the symptoms it mentions (an extra line or the funny characters ï»¿)
aren't really catastrophic.

You don't work for the people I work for!
With every little non-catastrophic "mistake" I see my job slipping
away to India where everyone works for 8 dollars an hour and these
things never happen.

I don't know about the specific software, but leaving the file in ASCII,
presumably with the software presenting any non-ASCII characters as
character or entity references like &#8211; or &ndash;, is a good option, if
you have relatively few non-ASCII characters, so that it's not significant
in terms of amount of data.

OK thanks.

My problem right now is that when I clean up a page in Tidy using
utf encoding, it takes out all the &mdash; entities and adds literal long
dashes. That's fine, but when I validate the page with the W3 validator,
it says the long dashes cannot be interpreted as utf-8. It just doesn't
make any sense. I'm guessing Tidy is wrong to replace &mdash;, but I'm
only guessing.

groups2 · Mar 16, 2007

Oops, my bad: the page had slipped back into a non-utf encoding. Not
sure how that happened. Once I converted it back to utf in my
editor, the page validated with the literal (non-&mdash;) long dashes.

Andy Dingley · Mar 19, 2007

Byte-Order Mark found in UTF-8 File.

There are two UTF-8 encodings: with and without a BOM at the start of
the file.

With (sometimes described as "UTF-8Y" in some Windows tools) is
_obviously_ UTF-8 and so is easier for capable tools to recognise and
deal with unambiguously.

However you should remember that files in ASCII, ISO-8859-* or UTF-8
are all equal until you start using non-ASCII characters. If you add a
BOM to a UTF-8 file, then it is no longer ASCII or ISO-8859-* at all,
no matter what characters it contains. For this reason it's often
advised against, because it will confuse older non-UTF-8-aware
editors.

I use UTF-8 throughout, and I don't use BOMs. I also try to impose
this on our team with a literal clue of iron. If I started actually
poking a few of them with it, I might even stop them re-encoding my
source in UTF-16 or Windows wibble when I'm not looking....

This is one of those problems that's not difficult, but isn't well
understood because you can get a long way relying on the tools and not
understanding any of it yourself. In the end though, it's worth
putting the small amount of effort in to understand it, then it just
ceases to be a problem. Until of course the minions with their UTF-16
defaults sneak back in...

India where [...] these things never happen.

If you would like a megabyte of cheap Indian Java source where these
things _certainly_ happen, then I've got plenty of it.

Jukka K. Korpela · Mar 19, 2007

Scripsit Andy Dingley:

There are two UTF-8 encodings: with and without a BOM at the start of
the file.

No, adding a BOM at the start of a UTF-8 data stream does not turn it into
another encoding, any more than adding some other character does. As the
Unicode FAQ says, it's just a matter of UTF-8 encoded data starting with a
BOM:
http://unicode.org/faq/utf_bom.html#29

With (sometimes described as "UTF-8Y" in some Windows tools) is
_obviously_ UTF-8 and so is easier for capable tools to recognise and
deal with unambiguously.

The practical point is that a UTF-8 encoded BOM is a sequence of octets that
is extremely unlikely to arise from anything other than representing a BOM in
UTF-8. So, yes, it is an almost certain and very simple way of recognizing
a file as UTF-8 encoded. If you use a text editor on a UTF-8 file without a
BOM, the poor editor has a hard time guessing the encoding and may have
to ask the user, who generally has no idea of character encodings.
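
A sketch of the kind of guess an editor might make, assuming Python 3; `guess_encoding` and its ISO-8859-1 fallback are hypothetical, chosen only to illustrate the point, and real editors use more elaborate heuristics.

```python
import codecs

def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical helper: a BOM settles the question, otherwise guess."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # near-certain: these bytes rarely mean anything else
    try:
        data.decode("utf-8")
        return "utf-8"       # decodes cleanly, but this is only a guess
    except UnicodeDecodeError:
        return fallback      # not UTF-8; the real encoding is unknown

print(guess_encoding(codecs.BOM_UTF8 + b"hi"))
print(guess_encoding("\u2014".encode("utf-8")))
print(guess_encoding(b"\xae"))
```

Without the BOM, the second branch can still be fooled, and the fallback is pure assumption, which is exactly the editor's difficulty described above.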

However you should remember that files in ASCII, ISO-8859-* or UTF-8
are all equal until you start using non-ASCII characters. If you add a
BOM to a UTF-8 file, then it is no longer ASCII or ISO-8859-* at all,

That's part of the other side of the "BOM in UTF-8" coin, yes. Besides, if
your document ever gets processed by some simplistic software that expects
everything to be 8-bit characters, it could get rather confused and not
recognize the data as HTML at all. A UTF-8 encoded HTML document without BOM
can be processed smoothly by such software, except of course that it cannot
correctly interpret the _content_ (but it gets all the markup right, except
perhaps CDATA attribute values).
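
The last point can be sketched as follows, assuming Python 3, an invented one-line document, and a deliberately naive regex in place of a real parser: a tool that treats each byte as one 8-bit character still finds the tags, because markup bytes are plain ASCII, but it garbles the content.

```python
import re

# Hypothetical BOM-less UTF-8 document with non-ASCII content.
page = "<p title=\"caf\u00e9\">A \u2014 B</p>".encode("utf-8")

# A simplistic tool: every byte is one ISO-8859-1 character.
as_8bit = page.decode("iso-8859-1")

# Tag names come out right...
tags = re.findall(r"</?([a-zA-Z]+)", as_8bit)

# ...but the three UTF-8 bytes of the em dash become three separate characters.
content = re.search(r">(.*)<", as_8bit).group(1)
print(tags, repr(content))
```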
