accept-charset in forms

Discussion in 'HTML' started by Taras_96, Jan 7, 2007.

  1. Taras_96

    Taras_96 Guest

    Hi everyone,

    I'm trying to write a webpage in Chinese (with someone that knows
    Chinese doing the appropriate translation). I'm using PHP and MySQL. As
    such, I wish to use UTF-8 (since it is supported by PHP's multibyte
    string functions). However, the official Chinese character encoding is
    GB, and I'm pretty sure that Windows uses GB encoding as well (since it
    is listed as a MAC in the Windows regional options, plus I tried to
    type some characters into notepad2 with the character encoding set on
    UTF-8 and all that came out was boxes).

    Because I want everything internal to the website to be in UTF-8, I
    intend on specifying the accept-charset property in my forms as UTF-8.

    What happens when someone either a) types in Chinese (which I assume is
    stored in memory/RAM as GB) or b) copies and pastes some Chinese
    characters from a document that does not use UTF-8 encoding and posts
    the form? Does the browser somehow convert from GB (or any other
    encoding that was used) to UTF-8 before sending the data to the server?
    If this is the behaviour, then do all (or the majority of) browsers do
    this?

    Thanks

    Taras
    Taras_96, Jan 7, 2007
    #1

  2. Scripsit Taras_96:

    > the official Chinese character encoding is GB,


    You probably mean GB2312, which is a national standard defined in the
    People's Republic of China. If you live under Chinese jurisdiction, you may
    need to check whether the standard or some other rule really imposes that
    encoding on your pages.

    I doubt that, though. Standards are usually not enforced by laws.

    > and I'm pretty sure that Windows uses GB encoding as well


    "Windows" is a trade mark for a wide range of operating systems, which may
    each use different encodings. Why would that matter?

    > (since
    > it is listed as a MAC in the Windows regional options, plus I tried to
    > type some characters into notepad2 with the character encoding set on
    > UTF-8 and all that came out was boxes).


    So? Whatever that means in detail, why would it matter in HTML authoring?

    > Because I want everything internal to the website to be in UTF-8, I
    > intend on specifying the accept-charset property in my forms as UTF-8.


    That would be unsafe, since the accept-charset attribute is poorly
    documented, and there does not seem to be much reliable information on its
    support in browsers.

    The safest way is to make the page (containing the form) UTF-8 encoded,
    expect the data to arrive in UTF-8 encoding, and check this using some
    heuristics like a hidden field containing unusual characters.
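
    A minimal sketch of such a check, written in Python purely for
    illustration (it is not PHP, and not a tested recipe). The hidden field
    name enc_check and the sentinel characters are assumptions made up for
    this example; nothing in a browser or specification defines them. The
    form would carry the sentinel as a hidden value, and the handler checks
    that the raw submitted bytes decode as UTF-8 to the same characters.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sentinel placed in a hidden form field: e-acute, a CJK
# character and a check mark. Any unusual non-ASCII characters would do.
SENTINEL = "\u00e9\u4e2d\u2713"


def parse_raw_form(body: bytes) -> dict:
    """Split an application/x-www-form-urlencoded body into raw bytes,
    without assuming any character encoding yet."""
    fields = {}
    for pair in body.split(b"&"):
        if not pair:
            continue
        name, _, value = pair.partition(b"=")
        # "+" means space in form encoding; %xx is undone byte for byte.
        key = unquote_to_bytes(name.replace(b"+", b" "))
        fields[key] = unquote_to_bytes(value.replace(b"+", b" "))
    return fields


def submitted_as_utf8(fields: dict, sentinel_name: bytes = b"enc_check") -> bool:
    """Return True only if the sentinel field arrived as valid UTF-8
    and round-trips to the expected characters."""
    raw = fields.get(sentinel_name)
    if raw is None:
        return False
    try:
        return raw.decode("utf-8") == SENTINEL
    except UnicodeDecodeError:
        return False
```

    If the check fails, the handler can reject the submission or fall back
    to guessing the encoding; which of those is appropriate depends on the
    application.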

    See also "FORM submission and i18n",
    http://ppewww.physics.gla.ac.uk/~flavell/charset/form-i18n.html

    > What happens when someone either a) types in Chinese (which I assume
    > is stored in memory/RAM as GB)


    Modern Windows systems use UTF-16 internally, no matter what encodings are
    used in particular programs. But this doesn't matter; what matters is what
    the input method produces and how the browser deals with it. In general,
    there is not much you can do about it as an author.

    > or b) copies and pastes some Chinese
    > characters from a document that does not use UTF-8 encoding and posts
    > the form?


    The browser is supposed to do the conversion or, rather, the copy & paste
    functionality should handle this.
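
    As a rough illustration of why the source document's encoding should not
    matter once the text has been converted to Unicode, here is a small
    Python sketch. It only models the idea; it says nothing about how a
    particular browser or clipboard implements it, and the function names
    are made up for the example.

```python
SAMPLE = "\u4e2d\u6587"  # the two characters of the word "Chinese (language)"


def to_text(raw: bytes, source_encoding: str) -> str:
    """Stand-in for the conversion done when text is copied out of a
    document: bytes in the document's encoding become Unicode text."""
    return raw.decode(source_encoding)


def to_form_bytes(text: str) -> bytes:
    """Stand-in for the browser encoding a field of a UTF-8 form."""
    return text.encode("utf-8")


# The same two characters, as stored by a GB2312 document and by a
# UTF-16 document, end up as identical text and identical UTF-8 bytes.
from_gb = to_text(SAMPLE.encode("gb2312"), "gb2312")
from_utf16 = to_text(SAMPLE.encode("utf-16-le"), "utf-16-le")
```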

    One reason for using UTF-8 is that ultimately users can produce _any_
    Unicode character if they just know how to do that. This means that they
    can, for example, insert characters that have no representation in GB2312.
    What happens then if the page's encoding (and hence the form's encoding) is
    GB2312? The specifications are silent. In practice, browsers tend to do odd
    things like insert numeric character references (of the form &#number;).
    You can handle them in your form handler, but it's easier to use UTF-8 so
    that the problem does not arise.
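
    To make the failure mode concrete, here is a hedged Python sketch that
    imitates that fallback with the codec error handler xmlcharrefreplace
    and then undoes it in a handler. It is an imitation for illustration,
    not a description of what any specific browser sends; the function names
    are invented for the example.

```python
import html


def encode_with_reference_fallback(text: str, charset: str) -> bytes:
    """Encode text for submission; characters the charset cannot represent
    are replaced by numeric character references such as &#3585;."""
    return text.encode(charset, errors="xmlcharrefreplace")


def repair_references(raw: bytes, charset: str) -> str:
    """Handler-side repair: decode, then turn references back into
    characters. Caveat: text the user typed literally as a reference
    is converted too, so the repair is ambiguous."""
    return html.unescape(raw.decode(charset))


# A CJK character that GB2312 covers, followed by a Thai letter it lacks.
mixed = "\u4e2d\u0e01"
as_gb2312 = encode_with_reference_fallback(mixed, "gb2312")
as_utf8 = encode_with_reference_fallback(mixed, "utf-8")  # no fallback needed
```

    With UTF-8 the fallback never triggers, which is the point made above.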

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Jan 8, 2007
    #2

  3. Taras_96

    Taras_96 Guest

    Hi,

    Firstly, does the document imply that POST should be used over GET
    because POST can specify the incoming character encoding (although it
    says that some agents might get confused by the specification)? This of
    course takes into account the fact that POST may then end up being used
    incorrectly (it may be used for transactions that are idempotent).

    > You probably mean GB2312, which is a national standard defined in the
    > People's Republic of China. If you live under Chinese jurisdiction, you may
    > need to check whether the standard or some other rule really imposes that
    > encoding on your pages.


    It's not conformance to standards I'm worried about. The ubiquitous
    encoding in China is GB2312 - that's what I'm worried about. I've read
    the page you linked before (a while ago), and I remember that this
    paragraph caught my attention:

    "In addition to these considerations, some users may be typing-in or
    pasting-in text from an application that uses their local character
    coding (practical examples being macRoman on a Mac; or MS-DOS CP850
    being copied out of a DOS window on an MS Windows PC), into a text
    field of a document that used the author's - different - character
    encoding (let's say for the simplest example, iso-8859-1): the user
    might then submit the form, disregarding that what they are seeing in
    the text field is not what they intended to send. From anecdotal
    evidence it appears that some folks analyzing survey responses expected
    %xx-representations of 8-bit-coded characters, but sometimes got
    clusters of %xx-representations which turned out to be utf-8 instead:
    whether this would have been evident or not to the person doing the
    submitting was unclear.

    Another commonly observed behaviour on Windows platforms is using a
    form which is in an iso-8859 coding, but the user pasting in characters
    (such as clever-quotes, trademark, euro sign etc.) which only exist in
    the corresponding Windows coding, e.g for Latin-1 the codings would be
    respectively iso-8859-1 and Windows-1252; in the iso-8859 encodings,
    these character positions do not represent displayable characters (they
    are in a range reserved for control functions). Some browsers disregard
    the mismatch and simply submit the character as the corresponding %xx
    code in the range %80-%9F, as if the browser thought it was handling
    the Windows coding instead: some replace these inappropriate characters
    by some kind of useful (e.g clever-quote replaced by plain quote) or
    useless (e.g all unrepresentable characters replaced by question-mark)
    substitute; for MSIE5's surprising behaviour see later in this page. "

    This implies to me that, at present, copying and pasting text (which I'm
    assuming users will do) from, say, a Word document into a browser text
    field can create problems. To avoid or mitigate these problems, I was
    thinking of matching the form's encoding to the encoding that is used in,
    say, Word documents, to minimise the risk of a conversion mistake.

    This is why I was interested to see what encoding Windows uses. Seeing
    that it wasn't UTF-8, and that GB2312 was mentioned as the standard in
    China (and, as I mentioned, the majority of websites in China are
    delivered using GB2312), I guessed that the encoding used in Windows for
    Chinese characters might be GB2312. Thus, if a Chinese user copied and
    pasted from Word (which, in this hypothetical situation, is using GB2312)
    into a browser whose form is encoded in GB2312, the possibility of some
    kind of error occurring is minimised.

    As you have noted, GB2312 has a couple of problems. Firstly, it doesn't
    cover all of Unicode. For this I was thinking of using GB18030, as it is
    a UTF and is compatible with GBK, which is an extension of GB2312. I am
    not sure about GB18030, as I haven't found a clear reference on whether
    it is a code table or an encoding (many sources refer to GB2312 as an
    encoding, including Mozilla FF, even though it seems to be a code table),
    and the encoding would have to be the same as GB2312 for characters that
    are present in both repertoires, in the same way that the encodings for
    UTF-8 and ASCII are the same for the characters that are present in both
    sets.

    However, another problem with using a GB character set (and associated
    encoding) is that PHP does not support these encodings internally. To fix
    this I was going to use PHP's HTTP input/output conversion functions,
    storing everything internally as UTF-8 and only converting upon output.
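
    The plan in the last paragraph (convert at the edges, keep UTF-8 inside)
    can be sketched as follows. This is Python for illustration only, not
    PHP's own conversion functions, and the names INTERNAL, to_internal and
    to_client are hypothetical.

```python
INTERNAL = "utf-8"  # encoding used for everything stored by the site


def to_internal(raw: bytes, client_charset: str) -> bytes:
    """Input edge: bytes submitted in the client's charset become UTF-8."""
    return raw.decode(client_charset).encode(INTERNAL)


def to_client(stored: bytes, client_charset: str) -> bytes:
    """Output edge: stored UTF-8 is converted for the page being sent.
    Characters the target charset lacks become numeric character
    references, which is acceptable in HTML output."""
    text = stored.decode(INTERNAL)
    return text.encode(client_charset, errors="xmlcharrefreplace")


submitted = "\u4e2d\u6587".encode("gb18030")   # what a GB18030 form might send
stored = to_internal(submitted, "gb18030")     # what the database would hold
page_bytes = to_client(stored, "gb18030")      # what goes back out
```

    The design choice here is that only two functions ever see a non-UTF-8
    encoding, so the rest of the code can assume one encoding throughout.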

    A problem with UTF-8 is that it isn't supported everywhere at the moment,
    for instance on mobile phones. The risk of this is higher in
    China, where the official standard is GB18030, and most people seem to
    be using GB2312.

    >
    > "Windows" is a trade mark for a wide range of operating systems, which may
    > each use different encodings. Why would that matter?
    >

    ....
    >
    > So? Whatever that means in detail, why would it matter in HTML authoring?
    >


    See above

    > That would be unsafe, since the accept-charset attribute is poorly
    > documented, and there does not seem to be much reliable information on its
    > support in browsers.
    >


    OK

    >
    > Modern Windows systems use UTF-16 internally, no matter what encodings are
    > used in particular programs. But this doesn't matter; what matters is what
    > the input method produces and how the browser deals with it. In general,
    > there is not much you can do about it as an author.
    >


    >
    > The browser is supposed to do the conversion or, rather, the copy & paste
    > functionality should handle this.
    >


    So if I copy and paste from a Windows document, or from a text document
    encoded in UTF-16 for example, into a form whose encoding is UTF-8,
    will:
    a) the copy and paste function do the conversion
    b) the browser do the conversion when the data is sent
    c) the conversion not occur
    ?

    > One reason for using UTF-8 is that ultimately users can produce _any_
    > Unicode character if they just know how to do that. This means that they
    > can, for example, insert characters that have no representation in GB2312.
    > What happens then if the page's encoding (and hence the form's encoding) is
    > GB2312? The specifications are silent. In practice, browsers tend to do odd
    > things like insert numeric character references (of the form &#number;).
    > You can handle them in your form handler, but it's easier to use UTF-8 so
    > that the problem does not arise.
    >


    That's why I was going to use GB18030 (if its encoding is the same as
    GB2312 for the shared characters).
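
    One way to probe that "if" is to compare the bytes a codec library
    produces. The sketch below uses Python's built-in gb2312 and gb18030
    codecs on a couple of sample characters; it checks only those samples
    and is not a proof about the whole repertoire. The helper names are
    made up for the example.

```python
def same_bytes(text: str, charset_a: str, charset_b: str) -> bool:
    """True if both charsets encode the text to identical bytes."""
    return text.encode(charset_a) == text.encode(charset_b)


def representable(text: str, charset: str) -> bool:
    """True if every character of the text exists in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


common = "\u4e2d\u6587"   # two characters present in GB2312
outside = "\u0e01"        # a Thai letter, absent from GB2312

shared_ok = same_bytes(common, "gb2312", "gb18030")
ascii_ok = same_bytes("form", "ascii", "utf-8")   # the ASCII/UTF-8 analogy
```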

    I may be on the wrong track with my ideas, but this is what I've pieced
    together from the resources out there.

    Taras
    Taras_96, Jan 10, 2007
    #3
  4. Scripsit Taras_96:

    > Firstly, does the document imply that POST should be used over GET
    > because POST can specify the incoming character encoding


    If we take the HTML specifications at their face value, we should stop using
    the GET method altogether, since its functionality is defined for ASCII data
    only, and we cannot even guarantee that user data does not contain non-ASCII
    characters.

    In practice, people keep using the GET method and get away with it, for
    the most part.
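
    The underlying difficulty is that a URL carries only %xx bytes and no
    label saying which encoding produced them. A small Python sketch of the
    ambiguity, with invented helper names:

```python
from urllib.parse import quote, unquote


def encode_query_value(value: str, charset: str) -> str:
    """Percent-encode a GET parameter value using the given charset."""
    return quote(value, safe="", encoding=charset)


def decode_query_value(encoded: str, charset: str) -> str:
    """Decode a percent-encoded value, guessing the given charset;
    undecodable bytes become U+FFFD replacement characters."""
    return unquote(encoded, encoding=charset, errors="replace")


word = "\u4e2d"  # one CJK character
sent_utf8 = encode_query_value(word, "utf-8")     # three %xx escapes
sent_gb2312 = encode_query_value(word, "gb2312")  # two %xx escapes

# The server has to guess; a wrong guess silently yields other characters.
wrong_guess = decode_query_value(sent_utf8, "gb2312")
```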

    > It's not conformance to standards I'm worried about. The ubiquitous
    > encoding in China is GB2312 - that's what I'm worried about.


    The question is whether people's browsers in China can handle GB2312 but not
    UTF-8. I really can't tell, but I'd be rather surprised if that were the
    case.

    If the browsers can handle UTF-8, too, the only reason for using GB2312 for
    your pages would be efficiency. But then you would have problems with
    browsers (outside China, but used by Chinese people or people who can read
    Chinese) that handle UTF-8 but not GB2312. Using content negotiation (i.e.
    checking, from HTTP headers, what the browser claims to handle and sending
    the page in different encodings) isn't very practical, since popular
    browsers fail to send such information (Accept-Charset header).
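
    For completeness, this is roughly what such negotiation would look like
    on the server, as a Python sketch. The list of supported encodings and
    the UTF-8 default are assumptions for the example; the default is what
    gets used whenever the header is missing, which per the above is the
    common case.

```python
def parse_accept_charset(header: str) -> dict:
    """Parse an Accept-Charset value such as 'gb2312, utf-8;q=0.7, *;q=0.1'
    into a {charset: quality} mapping (quality defaults to 1.0)."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q as "not acceptable"
        prefs[name.strip().lower()] = quality
    return prefs


def choose_charset(header, supported=("utf-8", "gb2312"), default="utf-8"):
    """Pick the supported charset the client rates highest; fall back to
    the default when the header is absent or matches nothing."""
    if not header:
        return default
    prefs = parse_accept_charset(header)
    best, best_q = default, 0.0
    for charset in supported:
        quality = prefs.get(charset, prefs.get("*", 0.0))
        if quality > best_q:  # on ties, the earlier supported entry wins
            best, best_q = charset, quality
    return best
```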

    > This implies to me that, as of current, copying and pasting into text
    > documents (which I'm assuming users will do) from say, a word
    > document, into a browser text field, can create problems.


    It can, but the document you quoted discusses problems that arise when some
    pasted characters have no representation in the encoding in use. Such things
    cannot happen when UTF-8 is used.

    > That's why I was going to use GB18030 (if its encoding is the same as
    > GB2312 for the shared characters).


    That would imply serious problems, since e.g. Internet Explorer does not
    seem to support GB18030.

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Jan 10, 2007
    #4
