accept-charset in forms

Taras_96 · Jan 7, 2007

Hi everyone,

I'm trying to write a webpage in Chinese (with someone that knows
Chinese doing the appropriate translation). I'm using PHP and MySQL. As
such, I wish to use UTF-8 (since it is supported by PHP's multibyte
string functions). However, the official Chinese character encoding is
GB, and I'm pretty sure that Windows uses GB encoding as well (since it
is listed as a MAC in the Windows regional options, plus I tried to
type some characters into notepad2 with the character encoding set to
UTF-8 and all that came out was boxes).

Because I want everything internal to the website to be in UTF-8, I
intend on specifying the accept-charset property in my forms as UTF-8.

What happens when someone either a) types in Chinese (which I assume is
stored in memory/RAM as GB) or b) copies and pastes some Chinese
characters from a document that does not use UTF-8 encoding and posts
the form? Does the browser somehow convert from GB (or any other
encoding that was used) to UTF-8 before sending the data to the server?
If this is the behaviour, then do all (or the majority of) browsers do
this?

Thanks

Taras

Jukka K. Korpela · Jan 8, 2007

Scripsit Taras_96:

the official Chinese character encoding is GB,

You probably mean GB2312, which is a national standard defined in the
People's Republic of China. If you live under Chinese jurisdiction, you may
need to check whether the standard or some other rule really imposes that
encoding on your pages.

I doubt that, though. Standards are usually not enforced by laws.

and I'm pretty sure that Windows uses GB encoding as well

"Windows" is a trade mark for a wide range of operating systems, which may
each use different encodings. Why would that matter?

(since
it is listed as a MAC in the Windows regional options, plus I tried to
type some characters into notepad2 with the character encoding set to
UTF-8 and all that came out was boxes).

So? Whatever that means in detail, why would it matter in HTML authoring?

Because I want everything internal to the website to be in UTF-8, I
intend on specifying the accept-charset property in my forms as UTF-8.

That would be unsafe, since the accept-charset attribute is poorly
documented, and there does not seem to be much reliable information on its
support in browsers.

The safest way is to make the page (containing the form) UTF-8 encoded,
expect the data to arrive in UTF-8 encoding, and check this using some
heuristics like a hidden field containing unusual characters.
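
A minimal sketch of the hidden-field heuristic, written in Python for
illustration rather than PHP. It assumes the form carries a hidden field
(its name and contents are hypothetical) whose value is a fixed string of
unusual characters, and that the handler can see the raw bytes submitted
for it. The names SENTINEL, looks_like_utf8 and decode_field are
illustrative, not part of any library.

```python
# Hypothetical sentinel placed in a hidden form field: a-umlaut, a Chinese
# character and the euro sign, whose UTF-8 byte sequences are distinctive.
SENTINEL = "\u00e4\u4e2d\u20ac"


def looks_like_utf8(raw_sentinel: bytes) -> bool:
    """Return True if the bytes received for the hidden field are
    exactly the UTF-8 encoding of SENTINEL."""
    return raw_sentinel == SENTINEL.encode("utf-8")


def decode_field(raw_value: bytes, raw_sentinel: bytes) -> str:
    """Decode a submitted field, refusing data whose encoding is in doubt."""
    if not looks_like_utf8(raw_sentinel):
        raise ValueError("form data does not appear to be UTF-8")
    # Raises UnicodeDecodeError if the value itself is malformed UTF-8.
    return raw_value.decode("utf-8")
```

If the sentinel arrives in some other encoding, the handler can reject the
submission or try a fallback instead of storing mis-decoded text.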

See also "FORM submission and i18n",
http://ppewww.physics.gla.ac.uk/~flavell/charset/form-i18n.html

What happens when someone either a) types in Chinese (which I assume
is stored in memory/RAM as GB)

Modern Windows systems use UTF-16 internally, no matter what encodings are
used in particular programs. But this doesn't matter; what matters is what
the input method produces and how the browser deals with it. In general,
there is not much you can do about it as an author.

or b) copies and pastes some Chinese
characters from a document that does not use UTF-8 encoding and posts
the form?

The browser is supposed to do the conversion or, rather, the copy & paste
functionality should handle this.

One reason for using UTF-8 is that ultimately users can produce _any_
Unicode character if they just know how to do that. This means that they
can, for example, insert characters that have no representation in GB2312.
What happens then if the page's encoding (and hence the form's encoding) is
GB2312? The specifications are silent. In practice, browsers tend to do odd
things like insert numeric character references (of the form &#number;).
You can handle them in your form handler, but it's easier to use UTF-8 so
that the problem does not arise.
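
To make the point concrete, here is a small Python sketch. It is an
illustration under assumptions, not a description of any particular
browser: Python's "xmlcharrefreplace" error handler stands in for the
browser fallback described above, and the sample characters are arbitrary.

```python
import html

# U+4E2D is in GB2312; the Thai letter U+0E01 is not.
text = "\u4e2d\u0e01"

# Simulated submission from a GB2312-encoded form: the character with no
# GB2312 representation is replaced by a numeric character reference.
submitted = text.encode("gb2312", errors="xmlcharrefreplace")

# Form-handler side: decode the bytes, then resolve the references.
recovered = html.unescape(submitted.decode("gb2312"))
```

Note that the handler cannot tell such a reference apart from a user who
literally typed the same ampersand sequence, which is one more reason the
UTF-8 route is simpler.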

Taras_96 · Jan 10, 2007

Hi,

Firstly, does the document imply that POST should be used over GET
because POST can specify the incoming character encoding (although it
says that some agents might get confused by the specification)? This of
course takes into account the fact that POST may, as a result, be used
incorrectly (it may be used for transactions that are idempotent).

You probably mean GB2312, which is a national standard defined in the
People's Republic of China. If you live under Chinese jurisdiction, you may
need to check whether the standard or some other rule really imposes that
encoding on your pages.

It's not conformance to standards I'm worried about. The ubiquitous
encoding in China is GB2312 - that's what I'm worried about. I've read
the page you linked before (a while ago), and I remember that this
paragraph caught my attention:

"In addition to these considerations, some users may be typing-in or
pasting-in text from an application that uses their local character
coding (practical examples being macRoman on a Mac; or MS-DOS CP850
being copied out of a DOS window on an MS Windows PC), into a text
field of a document that used the author's - different - character
encoding (let's say for the simplest example, iso-8859-1): the user
might then submit the form, disregarding that what they are seeing in
the text field is not what they intended to send. From anecdotal
evidence it appears that some folks analyzing survey responses expected
%xx-representations of 8-bit-coded characters, but sometimes got
clusters of %xx-representations which turned out to be utf-8 instead:
whether this would have been evident or not to the person doing the
submitting was unclear.

Another commonly observed behaviour on Windows platforms is using a
form which is in an iso-8859 coding, but the user pasting in characters
(such as clever-quotes, trademark, euro sign etc.) which only exist in
the corresponding Windows coding, e.g for Latin-1 the codings would be
respectively iso-8859-1 and Windows-1252; in the iso-8859 encodings,
these character positions do not represent displayable characters (they
are in a range reserved for control functions). Some browsers disregard
the mismatch and simply submit the character as the corresponding %xx
code in the range %80-%9F, as if the browser thought it was handling
the Windows coding instead: some replace these inappropriate characters
by some kind of useful (e.g clever-quote replaced by plain quote) or
useless (e.g all unrepresentable characters replaced by question-mark)
substitute; for MSIE5's surprising behaviour see later in this page."
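
The %xx behaviours described in the quoted passage can be reproduced with
a short Python sketch. This only illustrates the encodings involved; which
of these a given browser actually sends is not something the sketch can
show. The variable names are illustrative.

```python
from urllib.parse import quote

curly_quote = "\u201c"  # left double quotation mark ("clever quote")

# Submitted as if the form were Windows-1252: one byte in the %80-%9F range.
as_windows_1252 = quote(curly_quote, encoding="cp1252")

# Submitted as UTF-8: a cluster of three %xx codes.
as_utf8 = quote(curly_quote, encoding="utf-8")

# ISO-8859-1 has no such character; a "useless" substitute is a question mark.
as_latin1_fallback = quote(curly_quote, encoding="latin-1", errors="replace")
```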

This implies to me that, at present, copying and pasting text (which
I'm assuming users will do) from, say, a Word document into a browser
text field can create problems. To avoid or mitigate these problems, I
was thinking of matching the form's encoding to the encoding that is
used in, say, Word documents, to minimise the risk of some type of
conversion mistake. This is why I was interested to see what encoding
Windows uses. Seeing that it wasn't UTF-8, and that GB2312 was
mentioned as the standard in China (and, as I mentioned, the majority
of websites in China are delivered using GB2312), I guessed that the
encoding used in Windows for Chinese characters might be GB2312. Thus,
if a Chinese user copied and pasted from Word (which, in this
hypothetical situation, is using GB2312) into a browser whose form is
encoded in GB2312, then the possibility of some kind of error occurring
is minimised.

As you have noted, GB2312 has a couple of problems. Firstly, it doesn't
cover all of Unicode. For this I was thinking of using GB18030, as this
is a UTF and is compatible with GBK, which is an extension of GB2312. I
am not sure about GB18030, as I haven't found a clear reference to
whether it is a code table or an encoding (many sources refer to GB2312
as an encoding, including Mozilla FF, even though it seems to be a code
table), and the encoding would have to be the same as GB2312 for
characters that are present in both repertoires, in the same way that
the encodings for UTF-8 and ASCII are the same for the characters that
are present in both sets.

However, another problem with using a GB character set (and associated
encoding) is that PHP does not support these encodings internally. To
fix this I was going to use PHP's HTTP input/output conversion
functions, storing everything internally as UTF-8, and only converting
upon output.
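
The compatibility question can be checked directly. The sketch below uses
Python's built-in codecs rather than PHP, purely as an illustration; the
helper to_internal_utf8 is a hypothetical name for the "convert on input,
store as UTF-8" step described above.

```python
common = "\u4e2d\u6587"  # two characters that are present in GB2312
outside = "\u0e01"       # a Thai letter, absent from GB2312

# Characters shared with GB2312 get the same bytes in GB18030, much as
# ASCII characters keep their bytes in UTF-8.
same_bytes = common.encode("gb2312") == common.encode("gb18030")

# Characters outside the two-byte tables still encode in GB18030,
# using its four-byte form.
four_byte = outside.encode("gb18030")


def to_internal_utf8(raw: bytes, source_encoding: str = "gb18030") -> bytes:
    """Transcode incoming form bytes to UTF-8 for internal storage."""
    return raw.decode(source_encoding).encode("utf-8")
```

Because GB18030 decodes GB2312 data unchanged, a handler that assumes
GB18030 on input would also accept text submitted as plain GB2312.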

A problem with UTF-8 is that it isn't supported everywhere, for
instance by mobile phones at the moment. The risk of this is higher in
China, where the official standard is GB18030, and most people seem to
be using GB2312.

"Windows" is a trade mark for a wide range of operating systems, which may
each use different encodings. Why would that matter?
....

So? Whatever that means in detail, why would it matter in HTML authoring?

See above

That would be unsafe, since the accept-charset attribute is poorly
documented, and there does not seem to be much reliable information on its
support in browsers.

OK

Modern Windows systems use UTF-16 internally, no matter what encodings are
used in particular programs. But this doesn't matter; what matters is what
the input method produces and how the browser deals with it. In general,
there is not much you can do about it as an author.

The browser is supposed to do the conversion or, rather, the copy & paste
functionality should handle this.

So if I copy and paste from a Windows document, or from a text document
encoded in UTF-16 for example, into a form whose encoding is UTF-8,
will:
a) the copy and paste function do the conversion
b) the browser do the conversion when the data is sent
c) the conversion not occur
?
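
Whichever component performs it, the conversion in question is a
re-encoding of the same Unicode characters, which loses nothing. A minimal
Python sketch, assuming the source application holds the text as
little-endian UTF-16 (variable names are illustrative):

```python
text = "\u4e2d\u6587"  # two Chinese characters

# How an application working in UTF-16 might hold the text in memory.
in_memory = text.encode("utf-16-le")

# Re-encoding for a UTF-8 form: decode to characters, encode as UTF-8.
on_the_wire = in_memory.decode("utf-16-le").encode("utf-8")
```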

One reason for using UTF-8 is that ultimately users can produce _any_
Unicode character if they just know how to do that. This means that they
can, for example, insert characters that have no representation in GB2312.
What happens then if the page's encoding (and hence the form's encoding) is
GB2312? The specifications are silent. In practice, browsers tend to do odd
things like insert numeric character references (of the form &#number;).
You can handle them in your form handler, but it's easier to use UTF-8 so
that the problem does not arise.

That's why I was going to use GB18030 (if the encoding is the same as
for those characters in GB2312).

I may be on the wrong track with my ideas, but this is what I've pieced
together from the resources out there.

Taras

Jukka K. Korpela · Jan 10, 2007

Scripsit Taras_96:

Firstly, does the document imply that POST should be used over GET
because POST can specify the incoming character encoding

If we take the HTML specifications at their face value, we should stop using
the GET method altogether, since its functionality is defined for ASCII data
only, and we cannot even guarantee that user data does not contain non-ASCII
characters.

In practice, people keep using the GET method and get away with it, for the
most part.
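
The difficulty with GET can be seen in a short Python sketch (an
illustration only; the sample value and variable names are arbitrary). The
same character yields different %xx sequences depending on the encoding
used, and the URL itself carries no label saying which one was chosen.

```python
from urllib.parse import quote, unquote

query_value = "\u4e2d"  # a single Chinese character

sent_as_utf8 = quote(query_value, encoding="utf-8")
sent_as_gb2312 = quote(query_value, encoding="gb2312")

# A server that guesses the wrong encoding gets replacement characters.
misread = unquote(sent_as_gb2312, encoding="utf-8", errors="replace")
```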

It's not conformance to standards I'm worried about. The ubiquitous
encoding in China is GB2312 - that's what I'm worried about.

The question is whether people's browsers in China can handle GB2312 but not
UTF-8. I really can't tell, but I'd be rather surprised if that were the
case.

If the browsers can handle UTF-8, too, the only reason for using GB2312 for
your pages would be efficiency. But then you would have problems with
browsers (outside China, but used by Chinese people or people who can read
Chinese) that handle UTF-8 but not GB2312. Using content negotiation (i.e.
checking, from HTTP headers, what the browser claims to handle and sending
the page in different encodings) isn't very practical, since popular browsers
fail to provide such information (the Accept-Charset header).
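
For reference, a simplified sketch of what such negotiation would involve,
in Python. The function name pick_charset and its defaults are assumptions
made for illustration, and the header parsing is deliberately minimal. Note
that when the header is missing, which is the practical problem mentioned
above, the function can only fall back to a default.

```python
def pick_charset(accept_charset, offered=("utf-8", "gb2312"), default="utf-8"):
    """Choose a response encoding from an Accept-Charset header value.

    Simplified: a missing header means "no information", so the default
    is returned; malformed q-values are treated as 0.
    """
    if not accept_charset:
        return default
    prefs = {}
    for item in accept_charset.split(","):
        name, _, params = item.strip().partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        prefs[name.strip().lower()] = q

    def weight(charset):
        # An explicit entry wins; otherwise fall back to the "*" wildcard.
        return prefs.get(charset, prefs.get("*", 0.0))

    best = max(offered, key=weight)  # ties go to the first offered charset
    return best if weight(best) > 0 else default
```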

This implies to me that, as of current, copying and pasting into text
documents (which I'm assuming users will do) from say, a word
document, into a browser text field, can create problems.

It can, but the document you quoted discusses problems that arise when some
pasted characters have no representation in the encoding in use. Such things
cannot happen when UTF-8 is used.

That's why I was going to use GB18030 (if the encoding is the same as
for those characters in GB2312).

That would imply serious problems, since e.g. Internet Explorer does not
seem to support GB18030.
