Re: New utf8string design may make UTF-8 the superior encoding

Discussion in 'C++' started by Öö Tiib, May 16, 2010.

  1. Öö Tiib

    Öö Tiib Guest

    On 16 mai, 15:34, "Peter Olcott" <> wrote:
    > Since the reason for using other encodings than UTF-8 is
    > speed and ease of use, a string that is as fast and easy to
    > use (as the strings of other encodings) that often takes
    > less space would be superior to alternative strings.


    If you care so much ... perhaps throw together your utf8string and let
    us see it. Perhaps test and profile it first to compare it with
    Glib::ustring. http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html

    I suspect UTF-8 will gradually fade into history, for reasons similar
    to why 256-color video modes and raster-graphic formats went away.
    GUIs are already often written in Java or C# (for lack of C++ devs),
    and these use UTF-16 internally. Note that modern processor
    architectures are already optimized in such a way that byte-level
    operations are often slower.
     
    Öö Tiib, May 16, 2010
    #1

  2. Öö Tiib

    Öö Tiib Guest

    On 16 mai, 17:46, Peter Olcott <> wrote:
    >
    > UTF-8 is the best Unicode data-interchange format because it works
    > exactly the same way across every machine architecture without the need
    > for separate adaptations. It also stores the entire ASCII character set
    > in a single byte per code point.


    Similarly, Portable Network Graphics is a good format for
    interchanging raster graphics. GIMP, Photoshop etc., however, do not
    use such a packed format internally for graphics manipulation. They
    use their own internal formats to achieve manipulation speed and
    flexibility. You insist on using an interchange format for
    manipulation. Whether that is a good or bad idea depends on context.

    > I will put it together because it will become one of my standard tools.
    > The design is now essentially complete. Coding this updated design will
    > go very quickly. I will put it on my website and provide a free license
    > for any use as long as the copyright notice remains in the source code.


    Great.
     
    Öö Tiib, May 16, 2010
    #2

  3. James Kanze

    James Kanze Guest

    On 16 May, 14:51, Öö Tiib <> wrote:
    > On 16 mai, 15:34, "Peter Olcott" <> wrote:


    > I suspect UTF8 fades gradually into history. Reasons are
    > similar like 256 color video-modes and raster-graphic formats
    > went. GUI-s are already often made with java or C# (for lack
    > of C++ devs) and these use UTF16 internally. Notice that
    > modern processor architectures are already optimized in the
    > way that byte-level operations are often slower.


    The network is still 8 bits UTF-8. As are the disks; using
    UTF-16 on an external medium simply doesn't work.

    Also, UTF-8 may result in less memory use, and thus less paging.
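To make the memory argument concrete, here is a minimal sketch (not from the thread) that counts the bytes a sequence of code points needs in each encoding form. The names `EncodedSizes` and `encoded_sizes` are made up for illustration, and the input is assumed to contain only valid code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store the same code points in each encoding form.
struct EncodedSizes { std::size_t utf8, utf16, utf32; };

// Hypothetical helper: assumes every element is a valid Unicode code point.
EncodedSizes encoded_sizes(const std::u32string& text)
{
    EncodedSizes s = {0, 0, 0};
    for (char32_t cp : text) {
        s.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        s.utf16 += cp < 0x10000 ? 2 : 4;   // one unit or a surrogate pair
        s.utf32 += 4;                      // always one 32-bit unit
    }
    return s;
}
```

For `U"Tiib"` this gives 4, 8 and 16 bytes. For a CJK character such as U+4E2D it gives 3, 2 and 4, which is why the saving is a "may" and not a guarantee.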

    If all you're doing are simple operations, searching for a few
    ASCII delimiters and copying the delimited substrings, for
    example, UTF-8 will probably be significantly faster: the CPU
    will always read a word at a time, even if you access it byte by
    byte, and you'll usually get more characters per word using
    UTF-8.
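The delimiter case above can be sketched as follows. This is only an illustration, not code from the thread, and `split_on_ascii` is a hypothetical name. It relies on one property of UTF-8: every byte of a multi-byte sequence has its high bit set, so a plain byte search for an ASCII delimiter can never match in the middle of a character.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a UTF-8 encoded string on an ASCII delimiter, byte by byte.
// No decoding is needed: bytes of multi-byte sequences are all >= 0x80
// and therefore never compare equal to an ASCII delimiter.
std::vector<std::string> split_on_ascii(const std::string& utf8, char delim)
{
    // The technique only holds for delimiters in the ASCII range.
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(utf8.substr(start));   // last field
            return fields;
        }
        fields.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
}
```

The fields come back as UTF-8 byte strings, so nothing is converted or copied more than once.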

    If you need full and complete support, as in an editor, for
    example, UTF-32 is the best general solution. For a lot of
    things in between, UTF-16 is a good compromise.

    But the trade-offs only concern internal representation.
    Externally, the world is 8 bits, and UTF-8 is the only solution.

    --
    James Kanze
     
    James Kanze, May 18, 2010
    #3
  4. Öö Tiib

    Öö Tiib Guest

    On 18 mai, 17:18, James Kanze <> wrote:
    > The network is still 8 bits UTF-8.  As are the disks; using
    > UTF-16 on an external support simply doesn't work.
    >
    > Also, UTF-8 may result in less memory use, and thus less paging.
    >
    > If all you're doing are simple operations, searching for a few
    > ASCII delimiters and copying the delimited substrings, for
    > example, UTF-8 will probably be significantly faster: the CPU
    > will always read a word at a time, even if you access it byte by
    > byte, and you'll usually get more characters per word using
    > UTF-8.
    >
    > If you need full and complete support, as in an editor, for
    > example, UTF-32 is the best general solution.  For a lot of
    > things in between, UTF-16 is a good compromise.
    >
    > But the trade-offs only concern internal representation.
    > Externally, the world is 8 bits, and UTF-8 is the only solution.


    I would honestly be extremely glad if it were the only solution.
    Real-life applications throw in texts in all possible forms, and they
    also expect responses in all possible forms. For example, texts in
    financial transactions in most of Northern Europe assume that
    "/\{}[]" means something like "ÄäÅåÖö" (I do not remember the correct
    order, but something like that).

    I prefer to convert incoming texts into std::wstring. Outgoing texts I
    convert back to whatever the receiver expects (UTF-8 is really a
    relief there, true). All I need is a set of conversion functions. If
    text goes to the user interface, it goes as std::wstring, and it is
    the business of the UI to convert it further into CString or QString
    or whatever they like there and sort it out for the user.
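One of the conversion functions mentioned above might look like the sketch below. It is an assumption-laden illustration, not the poster's code: the name `utf8_to_utf32` is hypothetical, and it decodes into `std::u32string` (C++11) instead of `std::wstring` so that the result does not depend on the platform's `wchar_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points; throws on malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        static const char32_t min_for_len[] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_for_len[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("ill-formed UTF-8 sequence");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Throwing on bad input is a design choice made here for brevity; a production converter might substitute U+FFFD instead.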

    I perhaps have too little experience with sophisticated text
    processing. Simple std::sort(), the wide char literals of C++ and
    boost::wformat plus a full set of conversion functions are really all
    I need. Peter Olcott is making a lot of noise about it, so it has made
    me a bit interested. :)
     
    Öö Tiib, May 19, 2010
    #4
  5. Mihai N.

    Mihai N. Guest


    > I perhaps have too low experience with sophisticated text processing.
    > Simple std::sort(), wide char literals of C++ and boost::wformat plus
    > full set of conversion functions is all i need really.


    It depends a lot what you need.

    Sorting is locale-sensitive (German, Swedish, French, Spanish, all
    have different sorting rules).
    The CRT (and STL, and boost) are pretty dumb when dealing with things
    in a locale-sensitive way (meaning that they usually don't :)
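For what it is worth, standard C++ does offer one hook for this: a `std::locale` object can be passed to `std::sort` as the comparator, because `std::locale::operator()` compares two strings through the locale's `std::collate` facet. The sketch below only shows the mechanism; `sort_with_locale` is a hypothetical helper, and how good the collation is depends entirely on the locale data the platform ships.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Sort wide strings using the collation rules of the given locale.
// The locale object itself acts as the "less than" predicate.
void sort_with_locale(std::vector<std::wstring>& words, const std::locale& loc)
{
    std::sort(words.begin(), words.end(), loc);
}
```

A call such as `sort_with_locale(words, std::locale("sv_SE.UTF-8"))` is where the portability trouble starts: the locale name is platform-specific (this one is only an assumed example), and the constructor throws `std::runtime_error` if that locale is not installed.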


    --
    Mihai Nita [Microsoft MVP, Visual C++]
    http://www.mihai-nita.net
    ------------------------------------------
    Replace _year_ with _ to get the real email
     
    Mihai N., May 19, 2010
    #5
  6. Öö Tiib

    Öö Tiib Guest

    On May 19, 8:24 am, "Mihai N." <> wrote:
    > > I perhaps have too low experience with sophisticated text processing.
    > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
    > > full set of conversion functions is all i need really.

    >
    > It depends a lot what you need.
    >
    > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
    > have different sorting rules).
    > The CRT (and STL, and boost) are pretty dumb when dealing with things
    > in a locale sensitive way (meaning that they usualy don't :)


    Yes, sorting in real alphabetic order for the user is perhaps the
    business of the GUI. The GUI has to display it. The GUI, however,
    usually has its WxStrings or FooStrings anyway. I hate it when someone
    leaks these weirdos into the application mechanics layer. Internal
    application logic is often best made totally locale-agnostic, not
    caring about positioning in the GUI or whether the end users write
    from top to bottom or from right to left.

    So text in the electronic interfaces layer is bytes, text in the
    application layer is wchar_t, and text in the user interface layer is
    whatever weird rules apply there. If a maintainer forgets to convert
    at the interface between layers, he gets compiler warnings or errors.
    That makes life easy, but I suspect my problems with text are more
    trivial than those of some others.
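The compile-time separation described above can be checked directly. The following is a small sketch, not the poster's code; `count_units` is a hypothetical application-layer function used only to show what the compiler accepts and rejects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Bytes from the outside world and text inside the application use
// different character types, so neither converts to the other implicitly.
static_assert(!std::is_convertible<std::string, std::wstring>::value,
              "narrow strings must not silently become wide strings");
static_assert(!std::is_convertible<std::wstring, std::string>::value,
              "wide strings must not silently become narrow strings");
static_assert(!std::is_convertible<const char*, std::wstring>::value,
              "a char buffer cannot be passed where std::wstring is expected");

// Hypothetical application-layer function: it accepts only wide text.
std::size_t count_units(const std::wstring& text) { return text.size(); }

// count_units(std::string("abc"));   // would not compile: no conversion
```

The protection is not complete: a single `char` still converts implicitly to `wchar_t`, so an expression like `wide += 'a'` compiles without complaint.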
     
    Öö Tiib, May 19, 2010
    #6
  7. James Kanze

    James Kanze Guest

    On May 19, 12:01 am, Öö Tiib <> wrote:
    > On 18 mai, 17:18, James Kanze <> wrote:


    [...]
    > > But the trade-offs only concern internal representation.
    > > Externally, the world is 8 bits, and UTF-8 is the only solution.


    > I would be honestly extremely glad if it was the only solution. Real
    > life applications throw in texts in all possible forms also they await
    > responses in all possible forms.


    Yes. I meant it is the only solution if you are choosing
    yourself. In practice, there are a lot of other solutions being
    used; they don't work, except in limited environments, but they
    are being widely used.

    > For example texts in financial transactions done in most
    > Northern Europe assume that "/\{}[]" means something like
    > "ÄäÅåÖö" (i do not remember correct order, but something like
    > that).


    > I prefer to convert incoming texts into std::wstring. Outgoing
    > texts i convert back to whatever they await (UTF-8 is really
    > relaxing news there, true). All what i need is a set of
    > conversion functions. If it is going to user interface then
    > std::wstring goes and it is business of UI to convert it
    > further into CString or QString or whatever they enjoy there
    > and sort it out for user.


    In theory, the conversion should take place in the filebuf,
    using the imbued locale.

    > I perhaps have too low experience with sophisticated text processing.
    > Simple std::sort(), wide char literals of C++ and boost::wformat plus
    > full set of conversion functions is all i need really. Peter Olcott
    > raises lot of noise around it and so it makes me a bit
    > interested. :)


    There can be advantages to using UTF-8 internally, as well as at
    the interface level, and if you're not doing too complicated
    things, it can work quite nicely. But only as long as your
    manipulations aren't too complicated.

    --
    James Kanze
     
    James Kanze, May 19, 2010
    #7
  8. Öö Tiib

    Öö Tiib Guest

    On May 19, 1:21 pm, James Kanze <> wrote:
    > On May 19, 12:01 am, Öö Tiib <> wrote:
    >
    > > On 18 mai, 17:18, James Kanze <> wrote:

    >
    >     [...]
    >
    > > > But the trade-offs only concern internal representation.
    > > > Externally, the world is 8 bits, and UTF-8 is the only solution.

    > > I would be honestly extremely glad if it was the only solution. Real
    > > life applications throw in texts in all possible forms also they await
    > > responses in all possible forms.

    >
    > Yes.  I meant it is the only solution if you are choosing
    > yourself.  In practice, there are a lot of other solutions being
    > used; they don't work, except in limited environments, but they
    > are being widely used.
    >
    > > For example texts in financial transactions done in most
    > > Northern Europe assume that  "/\{}[]" means something like
    > > "ÄäÅåÖö" (i do not remember correct order, but something like
    > > that).
    > > I prefer to convert incoming texts into std::wstring. Outgoing
    > > texts i convert back to whatever they await (UTF-8 is really
    > > relaxing news there, true). All what i need is a set of
    > > conversion functions. If it is going to user interface then
    > > std::wstring goes and it is business of UI to convert it
    > > further into CString or QString or whatever they enjoy there
    > > and sort it out for user.

    >
    > In theory, the conversion should take place in the filebuf,
    > using the imbued locale.


    Yes, if there is a good wfilebuf then my problems are nonexistent.
    In practice there often is not; instead there are strange protocol
    layers and security by obscurity.

    > > I perhaps have too low experience with sophisticated text processing.
    > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
    > > full set of conversion functions is all i need really. Peter Olcott
    > > raises lot of noise around it and so it makes me a bit
    > > interested.  :)

    >
    > There can be advantages to using UTF-8 internally, as well as at
    > the interface level, and if you're not doing too complicated
    > things, it can work quite nicely.  But only as long as your
    > manipulations aren't too complicated.


    My major advantage from using wstring is this:

    Bytes are often too ambiguous as information, even if in exceptional
    cases like UTF-8 the information is fully sufficient. The compiler
    does not distinguish between a byte (char) in a UTF-8 string and a
    byte in a string in some other encoding. wstring ensures that
    compilers and tools can easily frown upon bytes that sneak into the
    application layer, in whatever encoding they are and wherever they
    come from. That draws attention at the right place and for the right
    reason.

    For example there is:
    basic_fstream::basic_fstream(const char* s, ios_base::openmode
    mode);

    If I pass the result of wstring::c_str() as parameter s to that
    constructor, I get an error, so the compiler draws my attention to
    the right place. If I get no error, then there is most likely an
    extension to the STL that most likely works correctly. Passing the
    result of string::c_str() (containing UTF-8) most likely creates a
    garbage-filled file name.
     
    Öö Tiib, May 19, 2010
    #8
  9. Joshua Maurice

    Joshua Maurice Guest

    On May 19, 1:50 am, Öö Tiib <> wrote:
    > On May 19, 8:24 am, "Mihai N." <> wrote:
    >
    > > > I perhaps have too low experience with sophisticated text processing.
    > > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
    > > > full set of conversion functions is all i need really.

    >
    > > It depends a lot what you need.

    >
    > > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
    > > have different sorting rules).
    > > The CRT (and STL, and boost) are pretty dumb when dealing with things
    > > in a locale sensitive way (meaning that they usualy don't :)

    >
    > Yes, sorting in real alphabetic order for user is perhaps business of
    > GUI. GUI has to display it. GUI however usually has its WxStrings or
    > FooStrings anyway. I hate when someone leaks these weirdos to
    > application mechanics layer. Internal application logic is often best
    > made totally locale-agnostic and not caring about positioning in GUI
    > and if the end-users write from up to down or from right to left.
    >
    > So text in electronic interfaces layer are bytes, text in application
    > layer are wchar_t and text in user interface layer are whatever weirdo
    > rules there. If maintainer forgets to convert in interface between
    > layers he gets compiler warnings or errors. That makes life easy, but
    > i suspect my problems with texts are more trivial than these of some
    > others.


    First, as I mentioned in the other current thread on Unicode, please
    stop saying "wchar_t" and "wstring" as though that means something, or
    is at all a useful portable tool. wchar_t is 16 bits on windows, and
    32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
    some more exceptions.) So, either you're suggesting an entirely not
    portable solution with wstring, or you are suggesting that it makes
    sense to use UTF32 on Unix-like computers and UTF16 on windows
    computers, a quite silly statement.
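The width difference is easy to observe. Below is a minimal sketch with hypothetical helper names; `units_for` assumes that the wide encoding is UTF-16 when `wchar_t` is 16 bits and UTF-32 when it is wider, which matches the common Windows and Unix-like cases but is not promised by the standard.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Number of wchar_t units needed for one code point, ASSUMING UTF-16
// for a 16-bit wchar_t and UTF-32 for anything wider.
constexpr std::size_t units_for(char32_t cp)
{
    return (wchar_bits() == 16 && cp > 0xFFFF) ? 2 : 1;
}
```

So a character outside the Basic Multilingual Plane, for example U+1F600, typically occupies two elements of a `std::wstring` on Windows and one on Linux, and any code that indexes or counts wide characters sees that difference.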

    Then, locales in my experience have not been terribly portable, not
    portable enough for my company's product which runs on nearly all
    computer OSs known to man, including windows, win x64, the soon to be
    "desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX
    IPF, and so on. Moreover, it's not terribly practical to tell our
    customers "you have to install these 'x' locales". Moreover, the
    locales of the same name on different OSs have been known to have
    subtly different behavior.

    Finally, I can't think of a useful example off the top of my head
    where sorting based on locale would be required except when
    "printing", to the screen, file, etc., but this doesn't convince me
    that there is no use for it. As a potential example, should you have
    to bring in an entire GUI framework just to implement the Unix utility
    "sort" except with an additional locale option? That seems silly to
    me.
     
    Joshua Maurice, May 19, 2010
    #9
  10. Öö Tiib

    Öö Tiib Guest

    On May 20, 12:02 am, Joshua Maurice <> wrote:
    > On May 19, 1:50 am, Öö Tiib <> wrote:
    > > On May 19, 8:24 am, "Mihai N." <> wrote:

    >
    > > > > I perhaps have too low experience with sophisticated text processing.
    > > > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
    > > > > full set of conversion functions is all i need really.

    >
    > > > It depends a lot what you need.

    >
    > > > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
    > > > have different sorting rules).
    > > > The CRT (and STL, and boost) are pretty dumb when dealing with things
    > > > in a locale sensitive way (meaning that they usualy don't :)

    >
    > > Yes, sorting in real alphabetic order for user is perhaps business of
    > > GUI. GUI has to display it. GUI however usually has its WxStrings or
    > > FooStrings anyway. I hate when someone leaks these weirdos to
    > > application mechanics layer. Internal application logic is often best
    > > made totally locale-agnostic and not caring about positioning in GUI
    > > and if the end-users write from up to down or from right to left.

    >
    > > So text in electronic interfaces layer are bytes, text in application
    > > layer are wchar_t and text in user interface layer are whatever weirdo
    > > rules there. If maintainer forgets to convert in interface between
    > > layers he gets compiler warnings or errors. That makes life easy, but
    > > i suspect my problems with texts are more trivial than these of some
    > > others.

    >
    > First, as I mentioned in the other current thread on Unicode, please
    > stop saying "wchar_t" and "wstring" as though that means something, or
    > is at all a useful portable tool. wchar_t is 16 bits on windows, and
    > 32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
    > some more exceptions.) So, either you're suggesting an entirely not
    > portable solution with wstring, or you are suggesting that it makes
    > sense to use UTF32 on Unix-like computers and UTF16 on windows
    > computers, a quite silly statement.


    Now ... it seems there is a strange misunderstanding. For anyone
    converting from whatever char sequence to whatever wchar_t sequence,
    it is a highly platform-dependent operation anyway. I have in no way
    said that such operations are portable. Since wstring is used for
    holding texts internally, sizeof(wchar_t) does not affect anything.
    The major property of wchar_t for me is that it is different from
    char on all platforms I know, so I get warnings/errors from tools on
    attempts to mechanically assign one to the other.

    > Then, locales in my experience have not been terribly portable, not
    > portable enough for my company's product which runs on nearly all
    > computer OSs known to man, including windows, win x64, the so to be
    > "desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX
    > IPF, and so on.


    You somehow managed to get portability in string-to-string
    conversions? Congrats. I have abandoned all hope there. Different code
    is used for conversions platform by platform. The platform makers (and
    not only they) seemingly fight each other to make their data
    incompatible, so why should I hope there will be peace and portability
    some day? Is there something new? The same goes on with dates, values
    with measurement units and even plain floating-point numbers ... you
    name it. Plain text is no different.

    > Moreover, it's not terribly practical to tell our
    > customers "you have to install these 'x' locales". Moreover, the
    > locales of the same name on different OSs have been known to have
    > subtly different behavior.


    Exactly! So portability and localization are possible only by having
    a converter for each platform that knows the quirks of that platform.
    Whether sizeof(wchar_t) is 2 or 4 does not matter at all, since the
    code that produces it is different anyway.

    > Finally, I can't think of a useful example off the top of my head
    > where sorting based on locale would be required except when
    > "printing", to the screen, file, etc., but this doesn't convince me
    > that there is no use for it.


    No need to nail me. I only confirm that I have not met a need for it,
    but I cannot prove that it does not exist. I fight problems that I
    meet in the field, not theoretical possibilities. ;)

    > As a potential example, should you have
    > to bring in an entire GUI framework just to implement the Unix utility
    > "sort" except with an additional locale option? That seems silly to
    > me.


    No. The GUI sorts if there is a GUI, and printing is part of the GUI
    (if it really deserves to be called a GUI, that is). If the output
    goes elsewhere, then it is not a GUI, so why should I sort without a
    user to see it? As for the GUI, I am optimistic there. The GUI sorts
    based on the things it uses. For example:

    bool QString::operator< ( const QString & other ) const;

    In a theoretical failure on a particular case/platform/locale I would
    get a defect report, could forward a bug to Nokia, and meanwhile
    write some custom operator to be used instead:

    bool hack::broken_platform_name_here::less( const QString & one,
    const QString & another);

    In practice, however, it seems to work, or the problem is classified
    as cosmetic or minor. Such problems do not affect success.
     
    Öö Tiib, May 19, 2010
    #10
  11. Mihai N.

    Mihai N. Guest


    > wchar_t is 16 bits on windows, and
    > 32 bits on most Unix-like systems IIRC.


    To make things worse, wchar_t can (in theory) be 8 bits
    (the C standard allows it) and it is in no way guaranteed
    to be some form of Unicode (in fact there is one system that
    I know of that uses wchar_t for non-Unicode strings).
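There is at least a way to ask the implementation about the second point. The C and C++ standards define the macro `__STDC_ISO_10646__` for implementations that promise `wchar_t` values are ISO 10646 (Unicode) code points. The sketch below wraps that check in a hypothetical helper; note that a missing macro does not prove the encoding is something else, only that no promise is made.

```cpp
#include <cassert>

// True when the implementation promises that wchar_t values are
// ISO 10646 (Unicode) code points, signalled by __STDC_ISO_10646__.
constexpr bool wchar_is_iso10646()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

Code that needs Unicode in `wchar_t` could combine this with a `static_assert` on `sizeof(wchar_t)` to fail the build early on platforms that do not qualify.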


    > Then, locales in my experience have not been terribly portable,


    Agree. And I think that's again the fault of the C standard.
    Which in these areas feels more like a set of guidelines than
    a standard :)


    --
    Mihai Nita [Microsoft MVP, Visual C++]
    http://www.mihai-nita.net
    ------------------------------------------
    Replace _year_ with _ to get the real email
     
    Mihai N., May 20, 2010
    #11
  12. Öö Tiib

    Öö Tiib Guest

    On 20 mai, 09:46, "Mihai N." <> wrote:
    > > wchar_t is 16 bits on windows, and
    > > 32 bits on most Unix-like systems IIRC.

    >
    > To make things worse, wchar_t can (in theory) be 8 bits
    > (the C standard allows it) and there is in no way guaranteed
    > to be some form of Unicode (in fact there is one system the
    > I know of that uses wchar_t for non-Unicode strings).


    None of that matters as long as I cannot implicitly convert chars or
    QChars or whatever I/O crap (read: dirty data) to wchar_t (read: clean
    data). On some platforms there are OS functions (with quirks that I
    have to fix), on others standard library functions help somewhat, and
    on a third group of platforms I can use some open source functions.

    > > Then, locales in my experience have not been terribly portable,

    >
    > Agree. And I think that's again the fault of the C standard.
    > Which is these areas feels more like a set of guidelines than
    > a standard :)


    Misconception. Standard committees are the military negotiation and
    world-dividing tables of companies like Microsoft, Intel, Sun, Google,
    AT&T, HP, Apple etc. They all agree that there should be borders (no
    portability) and that crossing borders (writing portable software)
    should be costly. So I play their game and pay the prices. Can not go
    against them?
     
    Öö Tiib, May 20, 2010
    #12