Re: Best way to handle UTF-8 in C++

Discussion in 'C++' started by Victor Bazarov, May 6, 2010.

  1. On 5/6/2010 9:45 AM, Peter Olcott wrote:
    > I am looking for a way to handle UTF-8 text in my C++
    > application. The ideal case would be an STL class that
    > handles UTF-8. What is the next best thing?


    What do you mean by "handle"? STL class? Don't you have the compiler
    documentation? If there is one, you already have all information you
    need. Want more? Buy a book on the Standard library. There are
    several that many consider decent. Next best thing? Google.

    V
    --
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, May 6, 2010
    #1

  2. On 5/6/2010 10:11 AM, Peter Olcott wrote:
    > "Victor Bazarov"<> wrote in
    > message news:hruhqu$hqt$-september.org...
    >> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    >>> I am looking for a way to handle UTF-8 text in my C++
    >>> application. The ideal case would be an STL class that
    >>> handles UTF-8. What is the next best thing?

    >>
    >> What do you mean by "handle"? STL class? Don't you have
    >> the compiler documentation? If there is one, you already
    >> have all information you need. Want more? Buy a book on
    >> the Standard library. There are several that many
    >> consider decent. Next best thing? Google.

    >
    > I must be able to use UTF-8 strings in my C++ application. I
    > want to know the best way to do this. I prefer an interface
    > that works the same way as the STL interface.


    What do you mean by "use" and in what way can't you "use" the UTF-8
    strings already? There is no such thing as "STL interface", perhaps you
    can explain what you mean by "the same way". I can start guessing, but
    it's much better if you just specify what exactly you're trying to
    accomplish. Try to refrain from using such generic terms as "STL
    interface" or "use". For example, you can say, "I need to be able to
    figure out whether there are uppercase characters in my 'string', like
    the standard function 'isupper' does"...

    V
    --
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, May 6, 2010
    #2

  3. On 5/6/2010 1:39 PM, Peter Olcott wrote:
    > "Victor Bazarov"<> wrote in
    > message news:hruqhc$lo6$-september.org...
    >> On 5/6/2010 10:11 AM, Peter Olcott wrote:
    >>> "Victor Bazarov"<> wrote in
    >>> message news:hruhqu$hqt$-september.org...
    >>>> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    >>>>> I am looking for a way to handle UTF-8 text in my C++
    >>>>> application. The ideal case would be an STL class that
    >>>>> handles UTF-8. What is the next best thing?
    >>>>
    >>>> What do you mean by "handle"? STL class? Don't you
    >>>> have
    >>>> the compiler documentation? If there is one, you
    >>>> already
    >>>> have all information you need. Want more? Buy a book
    >>>> on
    >>>> the Standard library. There are several that many
    >>>> consider decent. Next best thing? Google.
    >>>
    >>> I must be able to use UTF-8 strings in my C++
    >>> application. I
    >>> want to know the best way to do this. I prefer an
    >>> interface
    >>> that works the same way as the STL interface.

    >>
    >> What do you mean by "use" and in what way can't you "use"
    >> the UTF-8 strings already? There is no such thing as "STL
    >> interface", perhaps you can explain what you mean by "the
    >> same way". I can start guessing, but it's much better if
    >> you just specify what exactly you're trying to accomplish.
    >> Try to refrain from using such generic terms as "STL
    >> interface" or "use". For example, you can say, "I need to
    >> be able to figure out whether there are uppercase
    >> characters in my 'string', like the standard function
    >> 'isupper' does"...

    >
    > I want a string class that works exactly the same way as
    > std::string, except implements UTF-8.


    ...as opposed to *what*? UTF-8 is an encoding scheme. 'std::string'
    does *not* have an encoding scheme, it's a mere container of 'char'.
    Nothing more, nothing less. What *exactly* in it doesn't work NOW for
    you? Have you tried making the default 'char' unsigned? If your
    platform has 8-bit chars, and you make them unsigned, you've got
    yourself a UTF-8 storage type. And 'std::string' will provide
    functionality for storing elements of that type (by virtue of being
    defined as 'std::basic_string<char>'), and operations to manipulate
    that storage (append to, erase from, enumerate, etc.)

    So, once again, what do you mean by "implements UTF-8"?
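
    To make the "mere container of 'char'" point concrete, here is a minimal
    sketch. The helper names (cafe_utf8, byte_count, append_euro) are made up
    for illustration, and the bytes are spelled out with escapes so the
    example does not depend on the source-file encoding. It only shows that
    std::string stores and appends UTF-8 bytes without knowing what they mean.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "café" in UTF-8: 'c', 'a', 'f', then U+00E9 as the two bytes 0xC3 0xA9.
inline std::string cafe_utf8() {
    return std::string("caf\xC3\xA9");
}

// std::string::size() counts char elements (bytes here), not characters.
inline std::size_t byte_count(const std::string& s) {
    return s.size();
}

// Appending works on raw bytes too: U+20AC (euro sign) is 0xE2 0x82 0xAC.
inline std::string append_euro(std::string s) {
    s += "\xE2\x82\xAC";
    return s;
}
```

    With these helpers, byte_count(cafe_utf8()) is 5 even though a reader
    sees four characters, which is the gap the rest of the thread is about.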

    > This means that the
    > interface can remain the same, (all of the member functions
    > have the same name and same parameters) but the underlying
    > meaning may be different.


    "May be different"? If I rewrite 'std::string' for you and just make
    all functions return 0 and do nothing, would that be acceptable? That's
    a rhetorical question, BTW. If you just allow the "meaning" to be
    different, you still haven't specified anything. Does it *have to* be
    different? In what way?

    Could it be that you don't know yet what you *need* from your class,
    which you hope will "handle" UTF-8? What *operations* do you hope it
    will help you perform on your "UTF-8" strings?

    V
    --
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, May 6, 2010
    #3
  4. On May 6, 11:09 am, Victor Bazarov <> wrote:
    > On 5/6/2010 1:39 PM, Peter Olcott wrote:
    >
    >
    >
    > > "Victor Bazarov"<>  wrote in
    > > message news:hruqhc$lo6$-september.org...
    > >> On 5/6/2010 10:11 AM, Peter Olcott wrote:
    > >>> "Victor Bazarov"<>   wrote in
    > >>> message news:hruhqu$hqt$-september.org...
    > >>>> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    > >>>>> I am looking for a way to handle UTF-8 text in my C++
    > >>>>> application. The ideal case would be an STL class that
    > >>>>> handles UTF-8. What is the next best thing?

    >
    > >>>> What do you mean by "handle"?  STL class?  Don't you
    > >>>> have
    > >>>> the compiler documentation?  If there is one, you
    > >>>> already
    > >>>> have all information you need.  Want more?  Buy a book
    > >>>> on
    > >>>> the Standard library.  There are several that many
    > >>>> consider decent.  Next best thing?  Google.

    >
    > >>> I must be able to use UTF-8 strings in my C++
    > >>> application. I
    > >>> want to know the best way to do this. I prefer an
    > >>> interface
    > >>> that works the same way as the STL interface.

    >
    > >> What do you mean by "use" and in what way can't you "use"
    > >> the UTF-8 strings already?  There is no such thing as "STL
    > >> interface", perhaps you can explain what you mean by "the
    > >> same way".  I can start guessing, but it's much better if
    > >> you just specify what exactly you're trying to accomplish.
    > >> Try to refrain from using such generic terms as "STL
    > >> interface" or "use".  For example, you can say, "I need to
    > >> be able to figure out whether there are uppercase
    > >> characters in my 'string', like the standard function
    > >> 'isupper' does"...

    >
    > > I want a string class that works exactly the same way as
    > > std::string, except implements UTF-8.

    >
    > ...as opposed to *what*?  UTF-8 is an encoding scheme.  'std::string'
    > does *not* have an encoding scheme, it's a mere container of 'char'.
    > Nothing more, nothing less.  What *exactly* in it doesn't work NOW for
    > you?  Have you tried making the default 'char' unsigned?  If your
    > platform has 8-bit chars, and you make them unsigned, you got yourself
    > UTF-8 storage type.  And 'std::string' will provide functionality for
    > storing elements of that type (by virtue of being defined as
    > 'std::basic_string<char>'), and operations to manipulate that storage
    > (append to, erase from, enumerate, etc.)
    >
    > So, once again, what do you mean by "implements UTF-8"?
    >
    >  >  This means that the
    >
    > > interface can remain the same, (all of the member functions
    > > have the same name and same parameters) but the underlying
    > > meaning may be different.

    >
    > "May be different"?  If I rewrite 'std::string' for you and just make
    > all functions return 0 and do nothing, would that be acceptable?  That's
    > a rhetorical question, BTW.  If you just allow the "meaning" to be
    > different, you still haven't specified anything.  Does it *have to* be
    > different?  In what way?
    >
    > Could it be that you don't know yet what you *need* from your class,
    > which you hope will "handle" UTF-8?  What *operations* do you hope it
    > will help you perform on your "UTF-8" strings?


    Let me try to explain. std::string has member functions like find and
    substr. When it is used to store UTF-8 in 8-bit chars, the indexes are
    in terms of 8-bit encoding units. However, a user generally does not
    want to work with indexes in terms of encoding units. They want to
    work with indexes in terms of encoded Unicode code points, or more
    probably Unicode grapheme clusters.
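
    The difference between the two kinds of index can be seen by counting
    both. The following is a minimal sketch, assuming well-formed UTF-8
    input (no validation is attempted); count_code_points is a hypothetical
    helper, not a standard library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is
// NOT a continuation byte. Continuation bytes have the bit pattern
// 10xxxxxx, so (byte & 0xC0) == 0x80 identifies them.
// Assumption: the input is well-formed UTF-8.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

    For "e" followed by U+0301 (bytes 0x65 0xCC 0x81), size() reports 3
    encoding units while count_code_points reports 2 code points; a user
    would still see a single character.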

    As an example, my company uses a naive Unicode string abstraction. A
    user of my product can specify a transformation to a set of strings,
    such as "Take input rows of UTF-8 string. For each row X, output the
    row X with the first 'char' removed", i.e. substr(1,
    std::string::npos). However, suppose the first two encoded Unicode
    code points were "lowercase Latin letter e" and "combining character
    accent acute". When displayed with a program with good Unicode
    support, this will display as é, the lowercase Latin letter e with the
    accent acute. However, when our program removes the first 'char' from
    the string, it removes only the first encoding unit (UTF-16 actually),
    which leaves the string with a single Unicode code point, the
    combining character accent acute, an invalid string.
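
    The same failure can be reproduced with std::string. This sketch uses
    UTF-8 rather than the UTF-16 mentioned above, and the helper names are
    invented for the example. It shows two outcomes of "remove the first
    char": a stray combining accent for the decomposed form, and a lone
    continuation byte (not valid UTF-8 at all) for the precomposed form.

```cpp
#include <cassert>
#include <string>

// 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: 0xCC 0x81).
inline std::string decomposed_e_acute() {
    return std::string("e\xCC\x81");
}

// Precomposed U+00E9 (UTF-8: 0xC3 0xA9).
inline std::string precomposed_e_acute() {
    return std::string("\xC3\xA9");
}

// The "output row X with the first 'char' removed" transformation,
// applied to encoding units exactly as substr(1) would do it.
inline std::string drop_first_char(const std::string& s) {
    return s.empty() ? s : s.substr(1);
}
```

    drop_first_char(decomposed_e_acute()) leaves only the two bytes of the
    combining accent, and drop_first_char(precomposed_e_acute()) leaves the
    single byte 0xA9, which cannot start a UTF-8 sequence.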

    Preferably, our product would work with a Unicode string abstraction
    which effectively had 3 substring functions, one which works in terms
    of encoding units, one which works in terms of encoded Unicode code
    points, and one which works in terms of grapheme clusters (which is
    required to get the correct result for é). Actually, preferably such
    logic would be applied to all of the functions of the Unicode string
    class, such as size (size in encoding units, size in encoded Unicode
    code points, size in Unicode grapheme clusters), etc.
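
    A rough sketch of what the three levels could look like for "drop the
    first element" follows. All function names are hypothetical. The input
    is assumed to be well-formed UTF-8, and the grapheme-cluster version is
    only an approximation: it treats a base code point plus any following
    marks from the Combining Diacritical Marks block (U+0300 to U+036F) as
    one cluster, which is far less than the real Unicode segmentation rules
    that a library such as ICU implements. The encoding-unit level is plain
    s.substr(1).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte b.
// Assumption: b really is a lead byte of well-formed UTF-8.
inline std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

// Advances from byte offset `from` over `n` code points, clamped to size().
inline std::size_t advance(const std::string& s, std::size_t from,
                           std::size_t n) {
    std::size_t i = from;
    while (i < s.size() && n > 0) {
        i += sequence_length(static_cast<unsigned char>(s[i]));
        --n;
    }
    return i < s.size() ? i : s.size();
}

// substr() whose position and count are measured in code points.
inline std::string substr_code_points(const std::string& s, std::size_t pos,
                                      std::size_t count = std::string::npos) {
    std::size_t first = advance(s, 0, pos);
    if (count == std::string::npos) return s.substr(first);
    std::size_t last = advance(s, first, count);
    return s.substr(first, last - first);
}

// True if a code point in U+0300..U+036F starts at byte offset i.
// In UTF-8 that range is 0xCC 0x80 .. 0xCD 0xAF.
inline bool is_combining_mark_at(const std::string& s, std::size_t i) {
    if (i + 1 >= s.size()) return false;
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    return b0 == 0xCC || (b0 == 0xCD && b1 <= 0xAF);
}

// Approximate "drop the first grapheme cluster": skip the first code
// point and every combining diacritical mark that directly follows it.
inline std::string drop_first_cluster(const std::string& s) {
    if (s.empty()) return s;
    std::size_t i = advance(s, 0, 1);
    while (is_combining_mark_at(s, i)) i += 2;
    return s.substr(i);
}
```

    On "e" + U+0301 + "bc", substr_code_points(s, 1) still returns the
    stray accent followed by "bc", while drop_first_cluster(s) returns
    just "bc", which is the result a user would expect.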


    Then there's also collation, aka sorting and equivalent comparison.


    std::string holding UTF-8 data handles both poorly. AFAIK, the new C++
    standard will do basically nothing to help this situation either,
    forcing developers who need good Unicode support to use ICU, or a
    "hacked" (read: modified) version thereof for decent speed (as my
    company does).
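
    A small sketch of why plain std::string comparison falls short for
    both points. The helper names are invented for the example. The two
    spellings of é are canonically equivalent in Unicode, yet operator==
    only compares bytes; and operator< orders by byte value, which is not
    how any human-language collation works.

```cpp
#include <cassert>
#include <string>

// Precomposed U+00E9 (UTF-8: 0xC3 0xA9).
inline std::string e_acute_precomposed() {
    return std::string("\xC3\xA9");
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: 0xCC 0x81).
inline std::string e_acute_decomposed() {
    return std::string("e\xCC\x81");
}

// What std::string gives you: element-by-element comparison of chars.
inline bool bytewise_equal(const std::string& a, const std::string& b) {
    return a == b;
}

inline bool bytewise_less(const std::string& a, const std::string& b) {
    return a < b;
}
```

    bytewise_equal reports the two forms of é as different, and
    bytewise_less puts "zebra" before "étude" (precomposed) because 0x7A
    is smaller than 0xC3. Fixing either needs normalization and
    locale-aware collation, which is where a library like ICU comes in.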
    Joshua Maurice, May 6, 2010
    #4
  5. On 06-05-2010 20:09, Victor Bazarov wrote:
    > On 5/6/2010 1:39 PM, Peter Olcott wrote:
    >> "Victor Bazarov"<> wrote in
    >> I want a string class that works exactly the same way as
    >> std::string, except implements UTF-8.

    > So, once again, what do you mean by "implements UTF-8"?
    >

    E.g. a UTF-8 character occupies one or more bytes. String functions that
    are based on an index should be aware of that. The simplest way is to
    convert the data in the constructor and provide a utf8_str() function
    similar to c_str(). Simplest doesn't mean the most effective.
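
    One way to read this suggestion is sketched below. It is an
    illustration under several assumptions, not a finished design: the
    class name code_point_string is made up, the input is assumed to be
    well-formed UTF-8, char32_t and std::u32string require C++11, and
    utf8_str() returns a std::string by value rather than a pointer as
    c_str() does. Indexes are code point indexes, so grapheme clusters
    are still not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical class: decodes UTF-8 once in the constructor and stores
// one char32_t per code point.
class code_point_string {
public:
    explicit code_point_string(const std::string& utf8) {
        std::size_t i = 0;
        while (i < utf8.size()) {
            unsigned char b = static_cast<unsigned char>(utf8[i]);
            std::size_t len = 1;
            if ((b & 0xE0) == 0xC0) len = 2;
            else if ((b & 0xF0) == 0xE0) len = 3;
            else if (b >= 0xF0) len = 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
            // Each continuation byte contributes its low six bits.
            for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
                cp = (cp << 6) |
                     (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
            }
            points_.push_back(cp);
            i += len;
        }
    }

    std::size_t size() const { return points_.size(); }
    char32_t operator[](std::size_t i) const { return points_[i]; }

    // Same shape as std::string::substr, but counted in code points.
    code_point_string substr(std::size_t pos,
                             std::size_t n = std::u32string::npos) const {
        code_point_string result;
        result.points_ = points_.substr(pos, n);
        return result;
    }

    // Re-encodes the stored code points as UTF-8.
    std::string utf8_str() const {
        std::string out;
        for (char32_t cp : points_) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

private:
    code_point_string() {}
    std::u32string points_;
};
```

    The cost is the one hinted at above: every code point takes four
    bytes in memory, and each construction and each utf8_str() call is a
    full pass over the data.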

    Regards

    Marek
    Marek Borowski, May 6, 2010
    #5
  6. On 05/ 7/10 10:13 AM, Peter Olcott wrote:

    Peter, please do us all a favour and fix your quoting!

    How can you develop a complex application if you can't fix your
    (admittedly brain dead) news client??

    --
    Ian Collins
    Ian Collins, May 6, 2010
    #6
  7. On 05/ 7/10 11:44 AM, Peter Olcott wrote:
    > "Ian Collins"<> wrote in message
    > news:...
    >> On 05/ 7/10 10:13 AM, Peter Olcott wrote:
    >>
    >> Peter, please do us all a favour and fix your quoting!
    >>
    >> How can you develop a complex application if you can't fix
    >> your (admittedly brain dead) news client??

    >
    > I briefly tried Thunderbird and it was far too sluggish. I
    > tried the recommended patch and it didn't work. I have no
    > more time for these trivial aesthetic things.


    Then I'm sure I'm not the only one who has no more time for deciphering
    your mangled posts.

    Remember, Usenet is a write once, read many medium.

    --
    Ian Collins
    Ian Collins, May 7, 2010
    #7
  8. "Peter Olcott" <> writes:

    > "Ian Collins" <> wrote in message
    > news:...
    >> On 05/ 7/10 11:44 AM, Peter Olcott wrote:
    >>> "Ian Collins"<> wrote in message
    >>> news:...
    >>>> On 05/ 7/10 10:13 AM, Peter Olcott wrote:
    >>>>
    >>>> Peter, please do us all a favour and fix your quoting!
    >>>>
    >>>> How can you develop a complex application if you can't
    >>>> fix
    >>>> your (admittedly brain dead) news client??
    >>>
    >>> I briefly tried Thunderbird and it was far too sluggish.
    >>> I
    >>> tried the recommended patch and it didn't work. I have no
    >>> more time for these trivial aesthetic things.

    >>
    >> Then I'm sure I'm not the only one who has no more time
    >> for deciphering your mangled posts.
    >>
    >> Remember, Usenet is a write once, read many medium.
    >>
    >> --
    >> Ian Collins

    >
    > Which newsgroup reader works the best?
    >
    > Also I have many years worth of newsgroup posts stored on my
    > hard drive using Outlook Express.
    >
    > When I briefly tried Thunderbird it looked like it suffered
    > the same sort of problems as Open Office word. Open Office
    > word, sometimes took several minutes to page up to the
    > previous page. When you add up the cost of this (over a
    > lifetime months of one's life are wasted) the "free" open
    > office is far too expensive.
    >
    > There is also the learning curve cost. I also don't want to
    > spend dozens of hours evaluating alternatives just because
    > of inconsequential aesthetics.
    >
    > I mark all of the threads that I create so that Outlook
    > express filters these messages to sort to the top. I don't
    > want to wade through hundreds of irrelevant posts just to
    > see my replies. This feature is essential to me, not wanting
    > to burn up months of my life doing unnecessary work.


    To drop in late on this thread, I'm a little struck by some of the terms
    you are using here, such as "wasted" lifetime months, "inconsequential"
    aesthetics, "irrelevant" posts and "unnecessary" work. These terms
    strike me particularly because they appear to imply all of the
    following:

    1. that, as long as your time is not "wasted," then it doesn't matter
    about the time that others have to "waste" "deciphering your
    mangled posts" (as Ian put it);
    2. that how a usenet post is presented in terms of factors that
    directly affect ease of readability and maintenance of context
    (who said what) is mere "aesthetics," and inconsequential at that;
    3. that posts by anyone other than yourself are in some sense
    "irrelevant"; and
    4. that you are (again) happy if others have to burn up,
    accumulatively, "many months of [their lives] doing unnecessary
    work," just as long as you don't have to.

    Is this how usenet works? Oddly I'd seen it as more of a community.

    Regards

    Paul Bibbings
    Paul Bibbings, May 7, 2010
    #8
  9. On 5/7/2010 11:34 AM, Peter Olcott wrote:
    > [..]
    > It is a matter of cost-effectiveness. It is unreasonable to
    > expect me to spend many hours just to satisfy the
    > misallocated preconceived notions of others.
    >
    > Some people here have a psychological hang up about top
    > posting, just get over it, it really doesn't make any
    > difference except in your imagination, so simply quit
    > imagining this annoyance and none will arise in your mind.


    How about you drop the misconceptions about "others" and start listening
    to what the "others" are saying? How many preconceived notions do *you*
    have about the "others" and what percentage of those are "misallocated"?

    > If it really was saving me a little time at the expense of
    > costing others more time, then I would change.


    That's exactly the problem. *You* don't need to change. It's difficult
    to change a person. Just a tiny bit of thinking of how to more clearly
    express yourself is not a full-blown "change" you dread. It's like when
    you realize you are getting late to a meeting, you start doing
    everything you have been doing just a bit faster. You don't need to
    *change* for that. Just a tad more effort put in your activities
    already under way.

    > This is not
    > the case. It is only a petty psychological hang up.


    Asking for a clear presentation of your inquiry is "a petty
    psychological hang up"? I don't think so. You mark your posts in your
    newsreader so you can see them better/sooner, so you don't mind making a
    couple of moves with your arm/hand/mouse to organize *your own*
    workspace for *your* convenience. Do others deserve the same or don't
    they? More so or less so? You don't have to answer that.

    And consider another point: a well rendered illustration of the problem
    is half of the solution. I learned it during the first year at college
    and see proofs of that almost every day. So consider that the more
    cluttered your statements are (both semantically and typographically),
    the more time you and others will spend trying to get to the actual
    problem before a solution is attempted.

    And there is no need to argue with me or anybody else about it. Just
    *think* about it, *consider* it. OK? Good luck!

    V
    --
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, May 7, 2010
    #9
  10. On May 6, 3:13 pm, "Peter Olcott" <> wrote:
    > "Joshua Maurice" <> wrote in message
    >
    > news:...
    > On May 6, 11:09 am, Victor Bazarov<> wrote:
    > > On 5/6/2010 1:39 PM, Peter Olcott wrote:
    > > Could it be that you don't know yet what you *need*
    > > from your class,
    > > which you hope will "handle" UTF-8? What *operations* do
    > > you hope it
    > > will help you perform on your "UTF-8" strings?

    >
    > Preferably, our product would work with a Unicode string
    > abstraction
    > which effectively had 3 substring functions, one which works
    > in terms
    > of encoding units, one which works in terms of encoded
    > Unicode code
    > points, and one which works in terms of grapheme clusters
    > (which is
    > required to get the correct result for é). Actually,
    > preferably such
    > logic would be applied to all of the functions of the
    > Unicode string
    > class, such as size (size in encoding units, size in encoded
    > Unicode
    > code points, size in Unicode grapheme clusters), etc.
    >
    > Wouldn't it have been simpler to eliminate grapheme clusters
    > and have one CodePoint for "e" and another different
    > CodePoint for "é" and not have any CodePoint for the accent
    > mark by itself? It seems that this aspect of Unicode has
    > been designed to be much more cumbersome than necessary.
    > This aspect of Unicode is clumsy.
    >
    > From what I recall there already are Unicode CodePoints for
    > accented characters, thus all that it would take to fix this
    > problem is to deprecate the accent character.


    Not all grapheme clusters have an equivalent single Unicode code
    point. I forget which, but some non-Latin scripts have been encoded so
    that most "grapheme clusters" of that language are encoded with 2
    Unicode code points. It's not possible to run a preprocessing step to
    change all grapheme cluster sequences into a single Unicode code point
    because there might not exist an equivalent single Unicode code
    point.

    Besides, I generally want to perform a lot of transformations by
    encoded Unicode code point and not by encoding unit, and std::string's
    interface is not well suited to this, even in this impossible scenario
    of a single Unicode code point per grapheme cluster.
    Joshua Maurice, May 7, 2010
    #10
  11. Peter Olcott wrote:
    > "Victor Bazarov" <> wrote in
    > message news:hruhqu$hqt$-september.org...
    >> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    >>> I am looking for a way to handle UTF-8 text in my C++
    >>> application. The ideal case would be an STL class that
    >>> handles UTF-8. What is the next best thing?

    >>
    >> What do you mean by "handle"? STL class? Don't you have
    >> the compiler documentation? If there is one, you already
    >> have all information you need. Want more? Buy a book on
    >> the Standard library. There are several that many
    >> consider decent. Next best thing? Google.

    >
    > I must be able to use UTF-8 strings in my C++ application. I
    > want to know the best way to do this. I prefer an interface
    > that works the same way as the STL interface.


    "the same way"? Really? What way is that (obfuscation)? Do explain
    please.
    DaveB, May 8, 2010
    #11
  12. Victor Bazarov wrote:
    > On 5/6/2010 10:11 AM, Peter Olcott wrote:
    >> "Victor Bazarov"<> wrote in
    >> message news:hruhqu$hqt$-september.org...
    >>> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    >>>> I am looking for a way to handle UTF-8 text in my C++
    >>>> application. The ideal case would be an STL class that
    >>>> handles UTF-8. What is the next best thing?
    >>>
    >>> What do you mean by "handle"? STL class? Don't you have
    >>> the compiler documentation? If there is one, you already
    >>> have all information you need. Want more? Buy a book on
    >>> the Standard library. There are several that many
    >>> consider decent. Next best thing? Google.

    >>
    >> I must be able to use UTF-8 strings in my C++ application. I
    >> want to know the best way to do this. I prefer an interface
    >> that works the same way as the STL interface.

    >
    > What do you mean by "use" and in what way can't you "use" the UTF-8
    > strings already? There is no such thing as "STL interface", perhaps
    > you can explain what you mean by "the same way". I can start
    > guessing, but it's much better if you just specify what exactly
    > you're trying to accomplish. Try to refrain from using such generic
    > terms as "STL interface" or "use". For example, you can say, "I need
    > to be able to figure out whether there are uppercase characters in my
    > 'string', like the standard function 'isupper' does"...


    I just responded with the same to the OP. My post was much more concise
    though. But then I didn't have the side goal of seeking employment from
    the question that you undoubtedly do Victor Borza? ;)
    DaveB, May 8, 2010
    #12
  13. Peter Olcott wrote:
    > "Victor Bazarov" <> wrote in
    > message news:hruqhc$lo6$-september.org...
    >> On 5/6/2010 10:11 AM, Peter Olcott wrote:
    >>> "Victor Bazarov"<> wrote in
    >>> message news:hruhqu$hqt$-september.org...
    >>>> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    >>>>> I am looking for a way to handle UTF-8 text in my C++
    >>>>> application. The ideal case would be an STL class that
    >>>>> handles UTF-8. What is the next best thing?
    >>>>
    >>>> What do you mean by "handle"? STL class? Don't you
    >>>> have
    >>>> the compiler documentation? If there is one, you
    >>>> already
    >>>> have all information you need. Want more? Buy a book
    >>>> on
    >>>> the Standard library. There are several that many
    >>>> consider decent. Next best thing? Google.
    >>>
    >>> I must be able to use UTF-8 strings in my C++
    >>> application. I
    >>> want to know the best way to do this. I prefer an
    >>> interface
    >>> that works the same way as the STL interface.

    >>
    >> What do you mean by "use" and in what way can't you "use"
    >> the UTF-8 strings already? There is no such thing as "STL
    >> interface", perhaps you can explain what you mean by "the
    >> same way". I can start guessing, but it's much better if
    >> you just specify what exactly you're trying to accomplish.
    >> Try to refrain from using such generic terms as "STL
    >> interface" or "use". For example, you can say, "I need to
    >> be able to figure out whether there are uppercase
    >> characters in my 'string', like the standard function
    >> 'isupper' does"...

    >
    > I want a string class that works exactly the same way as
    > std::string, except implements UTF-8. This means that the
    > interface can remain the same, (all of the member functions
    > have the same name and same parameters) but the underlying
    > meaning may be different.
    >


    One can only hope that there will be support and help for you when the
    war is over. (If you survive it, that is).
    DaveB, May 8, 2010
    #13
  14. Victor Bazarov wrote:
    > On 5/6/2010 1:39 PM, Peter Olcott wrote:
    >> "Victor Bazarov"<> wrote in
    >> message news:hruqhc$lo6$-september.org...
    >>> On 5/6/2010 10:11 AM, Peter Olcott wrote:
    >>>> "Victor Bazarov"<> wrote in
    >>>> message news:hruhqu$hqt$-september.org...
    >>>>> On 5/6/2010 9:45 AM, Peter Olcott wrote:
    >>>>>> I am looking for a way to handle UTF-8 text in my C++
    >>>>>> application. The ideal case would be an STL class that
    >>>>>> handles UTF-8. What is the next best thing?
    >>>>>
    >>>>> What do you mean by "handle"? STL class? Don't you
    >>>>> have
    >>>>> the compiler documentation? If there is one, you
    >>>>> already
    >>>>> have all information you need. Want more? Buy a book
    >>>>> on
    >>>>> the Standard library. There are several that many
    >>>>> consider decent. Next best thing? Google.
    >>>>
    >>>> I must be able to use UTF-8 strings in my C++
    >>>> application. I
    >>>> want to know the best way to do this. I prefer an
    >>>> interface
    >>>> that works the same way as the STL interface.
    >>>
    >>> What do you mean by "use" and in what way can't you "use"
    >>> the UTF-8 strings already? There is no such thing as "STL
    >>> interface", perhaps you can explain what you mean by "the
    >>> same way". I can start guessing, but it's much better if
    >>> you just specify what exactly you're trying to accomplish.
    >>> Try to refrain from using such generic terms as "STL
    >>> interface" or "use". For example, you can say, "I need to
    >>> be able to figure out whether there are uppercase
    >>> characters in my 'string', like the standard function
    >>> 'isupper' does"...

    >>
    >> I want a string class that works exactly the same way as
    >> std::string, except implements UTF-8.

    >
    > ...as opposed to *what*? UTF-8 is an encoding scheme. 'std::string'
    > does *not* have an encoding scheme, it's a mere container of 'char'.
    > Nothing more, nothing less. What *exactly* in it doesn't work NOW for
    > you? Have you tried making the default 'char' unsigned? If your
    > platform has 8-bit chars, and you make them unsigned, you got yourself
    > UTF-8 storage type. And 'std::string' will provide functionality for
    > storing elements of that type (by virtue of being defined as
    > 'std::basic_string<char>'), and operations to manipulate that storage
    > (append to, erase from, enumerate, etc.)
    >
    > So, once again, what do you mean by "implements UTF-8"?
    >
    >> This means that the
    >> interface can remain the same, (all of the member functions
    >> have the same name and same parameters) but the underlying
    >> meaning may be different.

    >
    > "May be different"? If I rewrite 'std::string' for you and just make
    > all functions return 0 and do nothing, would that be acceptable? That's
    > a rhetorical question, BTW. If you just allow the "meaning"
    > to be different, you still haven't specified anything. Does it *have
    > to* be different? In what way?
    >
    > Could it be that you don't know yet what you *need* from your
    > class, which you hope will "handle" UTF-8? What *operations* do you
    > hope it will help you perform on your "UTF-8" strings?
    >


    Victor, why not just "see" what he is asking and give him what he needs:
    the answer! Most things do not need a "dancing around it" style. Assess
    it as best you can in the first message, then blurt out an "answer"! Who
    cares if it is or you are wrong? Make some headway fast. For example, if
    you think UTF-8 sucks, reply "UTF-8 sucks." and move on and let him
    figure it out. This dreary milking of syntactical and even semantical
    nothingness is ... well it sucks!
    DaveB, May 8, 2010
    #14
  15. Joshua Maurice wrote:
    >
    > Let me try to explain. std::string has member functions like find and
    > substr. When used to store UTF-8 for an 8 bit char, the indexes are
    > in terms of 8 bit encoding units. However, generally a user does not
    > want to work with indexes in terms of encoding units. They want to
    > work with indexes in terms of encoded Unicode code points, or more
    > probably Unicode grapheme clusters.


    Good thing I'm not relevant to evaluating your resume. JK, your
    corporate-coding "experience" shows up in the stock tickers though. I
    dunno what to think anymore. I'll be cliche: you get what you pay for,
    and easy come/easy go.
    DaveB, May 8, 2010
    #15
  16. Peter Olcott wrote:
    > "Joshua Maurice" <> wrote in message
    > news:...
    > On May 6, 11:09 am, Victor Bazarov
    > <> wrote:
    >> On 5/6/2010 1:39 PM, Peter Olcott wrote:
    >> Could it be that you don't know yet what you *need*
    >> from your class,
    >> which you hope will "handle" UTF-8? What *operations* do
    >> you hope it
    >> will help you perform on your "UTF-8" strings?

    > Preferably, our product would work with a Unicode string
    > abstraction


    In reality, someone with stake in the actual problem would seek to find
    someone to state their problem if they could not instead of getting into
    their current situation with you? (No offense, I'm sure you mean well,
    but you have to understand your limitations). Your question has all the
    warnings of "programming project gone awry".
    DaveB, May 8, 2010
    #16
  17. Ian Collins wrote:
    > On 05/ 7/10 10:13 AM, Peter Olcott wrote:
    >
    > Peter, please do us all a favour and fix your quoting!
    >
    > How can you develop a complex application if you can't fix your
    > (admittedly brain dead) news client??


    He's not the stakeholder. Duh. Hopefully he is an employee and not a
    consultant, because the latter, he is not!
    DaveB, May 8, 2010
    #17
  18. Peter Olcott wrote:
    > "Ian Collins" <> wrote in message
    > news:...
    >> On 05/ 7/10 11:44 AM, Peter Olcott wrote:
    >>> "Ian Collins"<> wrote in message
    >>> news:...
    >>>> On 05/ 7/10 10:13 AM, Peter Olcott wrote:
    >>>>
    >>>> Peter, please do us all a favour and fix your quoting!
    >>>>
    >>>> How can you develop a complex application if you can't
    >>>> fix
    >>>> your (admittedly brain dead) news client??
    >>>
    >>> I briefly tried Thunderbird and it was far too sluggish.
    >>> I
    >>> tried the recommended patch and it didn't work. I have no
    >>> more time for these trivial aesthetic things.

    >>
    >> Then I'm sure I'm not the only one who has no more time
    >> for deciphering your mangled posts.
    >>
    >> Remember, Usenet is a write once, read many medium.
    >>
    >> --
    >> Ian Collins

    >
    > Which newsgroup reader works the best?
    >
    > Also I have many years worth of newsgroup posts stored on my
    > hard drive using Outlook Express.
    >
    > When I briefly tried Thunderbird it looked like it suffered
    > the same sort of problems as Open Office word. Open Office
    > word sometimes took several minutes to page up to the
    > previous page. When you add up the cost of this (over a
    > lifetime, months of one's life are wasted) the "free" open
    > office is far too expensive.
    >
    > There is also the learning curve cost. I also don't want to
    > spend dozens of hours evaluating alternatives just because
    > of inconsequential aesthetics.
    >
    > I mark all of the threads that I create so that Outlook
    > express filters these messages to sort to the top. I don't
    > want to wade through hundreds of irrelevant posts just to
    > see my replies. This feature is essential to me, not wanting
    > to burn up months of my life doing unnecessary work.


    Wow, this thread is a good case study. Stakeholders, beware!
    DaveB, May 8, 2010
    #18
  19. "Peter Olcott" <> writes:

    > The absence of other answers leads to the answer of build it
    > myself.


    Careful. That means `work', and work requires `time' - /valuable/ time!

    Regards

    Paul Bibbings
    Paul Bibbings, May 8, 2010
    #19
  20. Victor Bazarov

    James Kanze Guest

    On May 6, 6:39 pm, "Peter Olcott" <> wrote:
    > "Victor Bazarov" <> wrote in
    > message news:hruqhc$lo6$-september.org...


    [...]
    > I want a string class that works exactly the same way as
    > std::string, except implements UTF-8.


    I think Victor's point is that std::string does implement UTF-8.
    And ISO 8859-1, and EBCDIC, and any other encoding which uses
    char (as opposed to UTF32, for example, which requires 32 bit
    entities).

    And I think he's only right to a point: in the end, an
    std::string doesn't handle characters, it handles small
    integers. In a single byte encoding, however, those small
    integers are the same as your characters, with one character per
    integer. So to advance one character, you can simply use ++ on
    an std::string::iterator. UTF-8 does require more. And there's
    no support for that "more" in C++ (including, as far as I know,
    C++0x---in C++0x, you can have UTF-8 string literals, but you
    can't take an std::string::iterator and advance it one UTF-8
    character).
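
To make the byte-versus-character distinction concrete, here is a minimal sketch, not anything the standard library provides. The function name count_code_points is made up for illustration, and it assumes the string already holds well-formed UTF-8. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern counts the code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (continuation bytes look like 10xxxxxx).
// Assumes `s` holds well-formed UTF-8; no validation is done here.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i != s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), size() reports 6 because it counts chars, while count_code_points reports 5.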

    > This means that the interface can remain the same, (all of the
    > member functions have the same name and same parameters) but
    > the underlying meaning may be different.


    It's not that easy. You can't simply implement something like
    utf8_string_iterator::operator++()
    {
        underlying_iter += size(*underlying_iter);
    }
    since there might not be enough bytes in the string pointed to
    by underlying_iter.
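
One way around the truncation problem is to pass the end of the string along and refuse to advance past it. The following is only a sketch under stated assumptions: utf8_sequence_length and utf8_advance are hypothetical names, the length is judged from the lead byte's bit pattern alone, and the code does not check the continuation bytes or reject overlong encodings, which a real validator would have to do.

```cpp
#include <string>

// Hypothetical helper: length in bytes of a UTF-8 sequence, judged
// from its lead byte alone. Returns 0 for a byte that cannot start a
// sequence under this simple bit-pattern test (a continuation byte,
// or 0xF8 and above).
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: advances `it` past one UTF-8 character without
// running past `end`. Returns false, leaving `it` untouched, if `it`
// is already at `end`, if the byte at `it` fails the lead-byte test,
// or if fewer bytes remain than the lead byte announces.
bool utf8_advance(std::string::const_iterator& it,
                  std::string::const_iterator end)
{
    if (it == end) return false;
    const int len = utf8_sequence_length(static_cast<unsigned char>(*it));
    if (len == 0 || end - it < len) return false;
    it += len;
    return true;
}
```

The design choice here is to report failure through the return value rather than throw, so the caller decides what a truncated or malformed sequence means. An operator++ has no such channel, which is part of why a drop-in UTF-8 iterator is harder than it looks.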

    --
    James Kanze
    James Kanze, May 8, 2010
    #20