Re: short int always 16 bits or not?

Discussion in 'C Programming' started by Shriramana Sharma, Apr 20, 2013.

  1. Hello and thanks people for your clarifications. My mistake for not reading the introductory text correctly.

    BTW despite my name, I am not really that good at Hindi (it's not my mother tongue), so (auto-?)translating to it doesn't really facilitate anything. (Thanks for trying though!)
    Shriramana Sharma, Apr 20, 2013
    #1

  2. Eric Sosman (Guest)

    On 4/19/2013 10:30 PM, Shriramana Sharma wrote:
    > Hello and thanks people for your clarifications. My mistake for not reading the introductory text correctly.
    >
    > BTW despite my name, I am not really that good at Hindi (it's not my mother tongue), so (auto-?)translating to it doesn't really facilitate anything. (Thanks for trying though!)


    Your English is much better than my Hindi! ;-) If you like,
    auto-translate the question from Hindi to Pinyin, or to Thai, or
    to Arabic, or to Russian, or to ... and I think the question about
    wanting a wider-than-8-bit `char' will answer itself.

    Besides: Fashions come, and fashions go, and this is a fashion-
    driven industry. In my first few years of writing programs, I used
    systems with 8-bit, 6-bit, 9-bit, and (yes!) 6.644-bit characters.
    Some of these were bitextual: A machine with 36-bit words holding
    either six 6-bit or four 9-bit characters, another with 48-bit
    words that did eight 6's or six 8's. The 8-bit character is very
    common today, but ... Πάντα ῥεῖ καὶ οὐδὲν μένει ("everything flows and nothing stays"), as the fellow said.

    --
    Eric Sosman
    Eric Sosman, Apr 20, 2013
    #2

  3. James Kuyper (Guest)

    On 04/19/2013 11:11 PM, Eric Sosman wrote:
    > On 4/19/2013 10:30 PM, Shriramana Sharma wrote:
    >> Hello and thanks people for your clarifications. My mistake for not reading the introductory text correctly.
    >>
    >> BTW despite my name, I am not really that good at Hindi (it's not my mother tongue), so (auto-?)translating to it doesn't really facilitate anything. (Thanks for trying though!)

    >
    > Your English is much better than my Hindi! ;-) If you like,
    > auto-translate the question from Hindi to Pinyin, or to Thai, or
    > to Arabic, or to Russian, or to ... and I think the question about
    > wanting a wider-than-8-bit `char' will answer itself.


    Not really. Your message containing Hindi text displayed as actual
    characters on my system, not the rectangles it shows when it doesn't
    know how to display a character. I presume that they were the correct
    characters - Google translate translates them back to English as "Why
    anyone would want to be لما four other 8-bit", which I presume is close
    to what you wanted to say, aside from the character it refused to
    translate. However, it arrived at my newsreader with the following headers:

    Content-Type: text/plain; charset=UTF-8; format=flowed
    Content-Transfer-Encoding: 8bit

    The use of UTF-9, UTF-16, UTF-18, UTF-32, UCS2, or UCS4 would have been
    evidence of a need for chars wider than 8 bits, but UTF-8 is actually an
    argument against that being necessary.
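
    For illustration (a minimal sketch; the Devanagari word is written as raw
    UTF-8 byte escapes so no particular source encoding has to be assumed):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* "chaar" (U+091A U+093E U+0930), Hindi for "four"/"char",
               spelled out as raw UTF-8 bytes */
            const char *s = "\xE0\xA4\x9A\xE0\xA4\xBE\xE0\xA4\xB0";

            /* three code points travel as nine plain 8-bit chars */
            printf("bytes = %zu\n", strlen(s));   /* prints 9 */
            return 0;
        }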
    --
    James Kuyper
    James Kuyper, Apr 20, 2013
    #3
  4. James Kuyper (Guest)

    On 04/20/2013 06:45 AM, James Kuyper wrote:
    ....
    > Not really. Your message containing Hindi text displayed as actual
    > characters on my system, not the rectangles it shows when it doesn't
    > know how to display a character. I presume that they were the correct
    > characters - Google translate translates them back to English as "Why
    > anyone would want to be لما four other 8-bit", which I presume is close
    > to what you wanted to say, aside from the character it refused to
    > translate.


    I just realized that the Hindi character it refused to translate back to
    English is probably the one that was produced to (mis?)translate "char".
    --
    James Kuyper
    James Kuyper, Apr 20, 2013
    #4
  5. Eric Sosman (Guest)

    On 4/20/2013 6:45 AM, James Kuyper wrote:
    > On 04/19/2013 11:11 PM, Eric Sosman wrote:
    >> On 4/19/2013 10:30 PM, Shriramana Sharma wrote:
    >>> Hello and thanks people for your clarifications. My mistake for not reading the introductory text correctly.
    >>>
    >>> BTW despite my name, I am not really that good at Hindi (it's not my mother tongue), so (auto-?)translating to it doesn't really facilitate anything. (Thanks for trying though!)

    >>
    >> Your English is much better than my Hindi! ;-) If you like,
    >> auto-translate the question from Hindi to Pinyin, or to Thai, or
    >> to Arabic, or to Russian, or to ... and I think the question about
    >> wanting a wider-than-8-bit `char' will answer itself.

    >
    > Not really. Your message containing Hindi text displayed as actual
    > characters on my system, not the rectangles it shows when it doesn't
    > know how to display a character. I presume that they were the correct
    > characters - Google translate translates them back to English as "Why
    > anyone would want to be لما four other 8-bit", which I presume is close
    > to what you wanted to say, aside from the character it refused to
    > translate. However, it arrived at my newsreader with the following headers:
    >
    > Content-Type: text/plain; charset=UTF-8; format=flowed
    > Content-Transfer-Encoding: 8bit
    >
    > The use of UTF-9, UTF-16, UTF-18, UTF-32, UCS2, or UCS4 would have been
    > evidence of a need for chars wider than 8 bits, but UTF-8 is actually an
    > argument against that being necessary.


    Do you, yourself, ever find yourself wanting something that
    isn't strictly necessary? Perhaps because something beyond the
    bare minimum might be more convenient, or more pleasant? Those
    white spaces in your source code: Are they all necessary?

    Multi-byte character encodings are possible, but clumsy.
    Library functions like strchr() do not deal well with them,
    programmers who must write (or should have written!) calls to
    wcstombs() and the like do not deal well with them, fseek()
    does not deal well with them, ... Wouldn't It Be Nicer if a
    single "atom" of data could encode an entire character, without
    relying on surrounding context to allow decoding?
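
    For instance (a minimal sketch; the "en_US.UTF-8" locale name is an
    assumption and may differ from system to system):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <wchar.h>

        int main(void)
        {
            const char *s = "caf\xC3\xA9";   /* "cafe" with e-acute, in UTF-8 */
            wchar_t w[16];

            /* strchr() looks for a single byte; there is no single byte
               that *is* the e-acute here, only the pair 0xC3 0xA9 */

            if (!setlocale(LC_CTYPE, "en_US.UTF-8"))
                return 1;                    /* assumed locale not installed */

            if (mbstowcs(w, s, 16) != (size_t)-1) {
                /* one wchar_t per character, so wcschr() can find it */
                printf("found e-acute: %s\n",
                       wcschr(w, L'\u00E9') ? "yes" : "no");
            }
            return 0;
        }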

    I do not offer the existence of rich glyph sets as evidence
    that `char' *must* be made wider, only as evidence that someone
    might have reason to *want* it wider (the O.P.'s question).

    --
    Eric Sosman
    Eric Sosman, Apr 20, 2013
    #5
  6. On Saturday, April 20, 2013 1:48:04 PM UTC+1, Eric Sosman wrote:
    > On 4/20/2013 6:45 AM, James Kuyper wrote:
    >
    > Do you, yourself, ever find yourself wanting something that
    > isn't strictly necessary? Perhaps because something beyond the
    > bare minimum might be more convenient, or more pleasant? Those
    > white spaces in your source code: Are they all necessary?
    >
    > Multi-byte character encodings are possible, but clumsy.
    > Library functions like strchr() do not deal well with them,
    > programmers who must write (or should have written!) calls to
    > wcstombs() and the like do not deal well with them, fseek()
    > does not deal well with them, ... Wouldn't It Be Nicer if a
    > single "atom" of data could encode an entire character, without
    > relying on surrounding context to allow decoding?
    >
    > I do not offer the existence of rich glyph sets as evidence
    > that `char' *must* be made wider, only as evidence that someone
    > might have reason to *want* it wider (the O.P.'s question).
    >
    >

    There are problems with big alphabets, however.

    One is that keyboards won't enter them. Whilst you can fix this with a virtual
    keyboard of some description, it's still very difficult.
    Another is that it becomes impossible to provide glyphs for every character
    unless you are a corporation with massive resources.
    A further problem is that the vast majority of symbols are meaningless to the
    vast majority of programmers. So if there's a spurious squiggly X-like thing
    in the source, the programmer doesn't even know the name of the symbol causing
    the bug, much less its Unicode code point or what it might represent.

    Often it's better to say that computers speak English. If you want another
    language, it's built on top of English, e.g. using escape sequences like
    \alpha to represent non-English letters.
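
    (C itself does something similar with universal character names -- a
    minimal sketch, with pure ASCII source spelling out a Greek letter:)

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            /* \u03B1 is GREEK SMALL LETTER ALPHA; the source stays ASCII */
            wchar_t alpha = L'\u03B1';
            wprintf(L"alpha is U+%04X\n", (unsigned)alpha);
            return 0;
        }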

    --
    Malcolm's website
    http://www.malcommclean.site11.com/www
    Malcolm McLean, Apr 20, 2013
    #6
  7. BGB (Guest)

    On 4/20/2013 7:48 AM, Eric Sosman wrote:
    > On 4/20/2013 6:45 AM, James Kuyper wrote:
    >> On 04/19/2013 11:11 PM, Eric Sosman wrote:
    >>> On 4/19/2013 10:30 PM, Shriramana Sharma wrote:
    >>>> Hello and thanks people for your clarifications. My mistake for not
    >>>> reading the introductory text correctly.
    >>>>
    >>>> BTW despite my name, I am not really that good at Hindi (it's not my
    >>>> mother tongue), so (auto-?)translating to it doesn't really
    >>>> facilitate anything. (Thanks for trying though!)
    >>>
    >>> Your English is much better than my Hindi! ;-) If you like,
    >>> auto-translate the question from Hindi to Pinyin, or to Thai, or
    >>> to Arabic, or to Russian, or to ... and I think the question about
    >>> wanting a wider-than-8-bit `char' will answer itself.

    >>
    >> Not really. Your message containing Hindi text displayed as actual
    >> characters on my system, not the rectangles it shows when it doesn't
    >> know how to display a character. I presume that they were the correct
    >> characters - Google translate translates them back to English as "Why
    >> anyone would want to be لما four other 8-bit", which I presume is close
    >> to what you wanted to say, aside from the character it refused to
    >> translate. However, it arrived at my newsreader with the following
    >> headers:
    >>
    >> Content-Type: text/plain; charset=UTF-8; format=flowed
    >> Content-Transfer-Encoding: 8bit
    >>
    >> The use of UTF-9, UTF-16, UTF-18, UTF-32, UCS2, or UCS4 would have been
    >> evidence of a need for chars wider than 8 bits, but UTF-8 is actually an
    >> argument against that being necessary.

    >
    > Do you, yourself, ever find yourself wanting something that
    > isn't strictly necessary? Perhaps because something beyond the
    > bare minimum might be more convenient, or more pleasant? Those
    > white spaces in your source code: Are they all necessary?
    >
    > Multi-byte character encodings are possible, but clumsy.
    > Library functions like strchr() do not deal well with them,
    > programmers who must write (or should have written!) calls to
    > wcstombs() and the like do not deal well with them, fseek()
    > does not deal well with them, ... Wouldn't It Be Nicer if a
    > single "atom" of data could encode an entire character, without
    > relying on surrounding context to allow decoding?
    >
    > I do not offer the existence of rich glyph sets as evidence
    > that `char' *must* be made wider, only as evidence that someone
    > might have reason to *want* it wider (the O.P.'s question).
    >


    part of the issue with C in these regards is that it aliases characters
    and bytes.

    having separate char and byte types could have made more sense.
    granted... 'wchar_t'...

    in this case, 'char' more typically represents a byte, and it could make
    more sense simply to nail down the byte size as 8 bits, and by
    extension, 'char'.


    many of us have good results using UTF-8 for nearly everything. those
    things which don't work well in UTF-8 can typically use UTF-16 or UTF-32.

    generally, UTF-32 is often unnecessary:
    it is rare to find text using any characters outside the BMP;
    it is also rare to find fonts which support it (hard, actually, even to
    find fonts which effectively support most of the Unicode BMP);
    ....

    so, while some people may object, naive UCS2 may actually work pretty
    well in many cases involving internationalized text.



    however, a person can use 32-bit characters, and treat the high bits as
    formatting data (text color and style).

    for example, in my case I have a tweaked character encoding which uses
    32-bits per character:
    if the character fits in 16 bits, then the high 16-bits are used as
    formatting;
    if the character does not, it gets 20 bits and loses its background color
    (the background color carries over from a prior character).

    some combinations of formatting options are also assumed to be mutually
    exclusive to help save bits (such as
    superscript/subscript/strikethrough, ...).

    in the source-text form (UTF-8), this information is generally
    represented using ANSI escape codes (though other options are possible).
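
    (a rough sketch of how that kind of packing might look -- the exact
    field layout below is invented for illustration, since the bit positions
    aren't spelled out above:)

        #include <stdint.h>

        /* hypothetical layout: bit 31 = "20-bit code point" marker,
           otherwise low 16 bits = BMP code point, bits 16..30 = formatting */
        #define FMT_CP20  0x80000000u

        static uint32_t pack_bmp(uint16_t cp, uint8_t fg, uint8_t bg)
        {
            /* 7-bit fg and bg fields, marker bit left clear */
            return ((uint32_t)(bg & 0x7Fu) << 23) |
                   ((uint32_t)(fg & 0x7Fu) << 16) | cp;
        }

        static uint32_t pack_wide(uint32_t cp, uint8_t fg)
        {
            /* non-BMP: 20-bit code point, background color dropped */
            return FMT_CP20 | ((uint32_t)(fg & 0x7Fu) << 20) | (cp & 0xFFFFFu);
        }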


    FWIW (OT):
    in my own (scripting) language, it more goes the route of making bytes
    and characters semantically different types:
    byte/sbyte/ubyte: bytes, defined as always 8 bits (sbyte = signed byte);
    char: default character (*1);
    cchar: C character, defined as being (by default) 8 bits;
    char8: explicit 8-bit character;
    char16: explicit 16-bit character;
    char32: explicit 32-bit character.

    *1: generally, it is 16 bits in storage (arrays or structs), but 32 bits
    when in 'working' form (in variables or function arguments). elsewhere,
    it will try to align with 'wchar_t'.

    they also differ partly in that they represent different parts of the
    numeric tower (byte and friends are part of the integer tower, with
    'char' and friends as a partially disjoint character tower, where casts
    are used to convert between them).

    within the FFI (C <-> BS):
    'char' <-> 'cchar';
    'unsigned char' <-> 'byte/ubyte';
    'signed char' <-> 'sbyte'.
    'wchar_t' <-> 'char';
    ....

    note that the sizes of byte/short/int/long/... are explicitly defined as
    8/16/32/64 bits. in targets where C differs, they will not line up with
    their C name-equivalents (for example, a hypothetical implementation on
    a 16-bit target would still use a 32-bit 'int', even if C were using a
    16-bit 'int').
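
    (C99 gives much the same guarantee via <stdint.h> -- a minimal sketch,
    assuming an ordinary 8-bit-byte host:)

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            /* exactly 8/16/32/64 bits wherever these types exist,
               regardless of what plain 'int' or 'long' happen to be */
            printf("%zu %zu %zu %zu\n",
                   sizeof(int8_t),  sizeof(int16_t),
                   sizeof(int32_t), sizeof(int64_t));   /* 1 2 4 8 */
            return 0;
        }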

    ....
    BGB, Apr 20, 2013
    #7
  8. Nobody (Guest)

    On Sat, 20 Apr 2013 15:01:59 -0500, BGB wrote:

    > (hard, actually, even to find fonts which effectively support most of the
    > Unicode BMP); ...


    Technically, it's impossible.

    A "font" (or, outside the US, "fount") is a complete set of type in a
    particular style and size (the term "scalable font" is an oxymoron; a set
    of glyphs in a common style without any particular size is a "typeface",
    not a "font").

    The inherent differences between scripts mean that it's impossible for
    a set of glyphs for the Latin script to have the same "style" as a set of
    glyphs for e.g. an Arabic or Han script, so distinct scripts cannot be
    part of the same "font".

    You might get multiple fonts for multiple scripts in a single TTF file,
    but that's not the same thing as a font. It's also not a particularly good
    idea, as it requires making completely arbitrary choices as to which
    typeface to use for each script. It's normally done as a workaround for
    software which expects the user to choose a single "font" for all text
    regardless of the scripts which are used.
    Nobody, Apr 20, 2013
    #8
  9. On Saturday, April 20, 2013 10:25:56 PM UTC+1, Nobody wrote:
    > On Sat, 20 Apr 2013 15:01:59 -0500, BGB wrote:
    >
    > The inherent differences between scripts mean that it's
    > impossible for as set of glyphs for the Latin script to have the
    > same "style" as a set of glyphs for e.g. an Arabic or Han script,
    > so distinct scripts cannot be part of the same "font".
    >

    Also, some languages have multiple scripts for the same alphabet.
    Hebrew has an archaic paleo-Hebrew script, which you might still
    want for scholarly purposes, the Masoretic script which is used for
    modern printed matter and handwritten religious texts, and a
    simpler handwritten script which is used for everyday note-taking.

    In English of course we have upper and lower case letters, but
    computers treat those as different "characters".
    Malcolm McLean, Apr 21, 2013
    #9
  10. On Sat, 20 Apr 2013 08:48:04 -0400, Eric Sosman wrote:

    > Do you, yourself, ever find yourself wanting something that
    > isn't strictly necessary? Perhaps because something beyond the bare
    > minimum might be more convenient, or more pleasant? Those white spaces
    > in your source code: Are they all necessary?
    >
    > Multi-byte character encodings are possible, but clumsy.
    > Library functions like strchr() do not deal well with them, programmers
    > who must write (or should have written!) calls to wcstombs() and the
    > like do not deal well with them, fseek() does not deal well with them,
    > ... Wouldn't It Be Nicer if a single "atom" of data could encode an
    > entire character, without relying on surrounding context to allow
    > decoding?


    Welcome to the world of Unicode, where *all* encodings (even UCS4/UTF-32)
    should be considered to be variable-width (which is even worse than multi-
    byte) thanks to the existence of combining characters and the like.
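
    A small sketch of that point, using plain 32-bit code units:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            /* e-acute as one precomposed code point ... */
            uint32_t one[] = { 0x00E9, 0 };
            /* ... and as 'e' followed by U+0301 COMBINING ACUTE ACCENT */
            uint32_t two[] = { 0x0065, 0x0301, 0 };

            /* same character to the reader, different number of units */
            printf("%zu vs %zu code units\n",
                   sizeof one / sizeof one[0] - 1,
                   sizeof two / sizeof two[0] - 1);   /* prints "1 vs 2" */
            return 0;
        }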

    Bart v Ingen Schenau
    Bart van Ingen Schenau, Apr 23, 2013
    #10
  11. Ken Brody (Guest)

    On 4/19/2013 10:30 PM, Shriramana Sharma wrote:
    > Hello and thanks people for your clarifications. My mistake for not
    > reading the introductory text correctly.
    >
    > BTW despite my name, I am not really that good at Hindi (it's not my
    > mother tongue), so (auto-?)translating to it doesn't really facilitate
    > anything. (Thanks for trying though!)


    I believe his point was that "क्यों किसी को भी चार अन्य 8 बिट होना चाहेगा"
    doesn't fit well into 8-bit characters, which IMHO facilitated answering
    both of your questions:

    Why would anyone want char to be other than 8 bits?
    *Is* char on any platform *not* 8 bits?
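
    (The quick way to check the second on any given implementation -- a
    minimal sketch; CHAR_BIT is 8 on mainstream hosts, but 16 or 32 on some
    DSPs:)

        #include <limits.h>
        #include <stdio.h>

        int main(void)
        {
            printf("CHAR_BIT = %d\n", CHAR_BIT);
            return 0;
        }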
    Ken Brody, Apr 25, 2013
    #11
