char_traits<char>::compare

Discussion in 'C++' started by Earl Purple, Aug 10, 2005.

  1. Earl Purple

    Earl Purple Guest

    On VC++.NET it is implemented like this

    static int __cdecl compare
    (
        const _Elem *_First1,
        const _Elem *_First2,
        size_t _Count
    )
    {   // compare [_First1, _First1 + _Count) with [_First2, ...)
        return (::memcmp(_First1, _First2, _Count));
    }

    i.e. using memcmp. But memcmp is an unsigned comparison, whereas char
    is a signed character.

    Therefore if I declare a std::string as "\x80" and another std::string
    as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
    although if I compared their first characters then the first character
    of the "\x80" string is "lower".

    Is this behaviour standard? Is it correct? Is there a formal definition
    of what the result of a std::string comparison should return if one or
    more of the characters in one or other of the strings is "negative".
     
    Earl Purple, Aug 10, 2005
    #1

  2. Earl Purple wrote:
    > On VC++.NET it is implemented like this
    >
    > static int __cdecl compare
    > (
    >     const _Elem *_First1,
    >     const _Elem *_First2,
    >     size_t _Count
    > )
    > {   // compare [_First1, _First1 + _Count) with [_First2, ...)
    >     return (::memcmp(_First1, _First2, _Count));
    > }
    >
    > i.e. using memcmp. But memcmp is an unsigned comparison, whereas char
    > is a signed character.


    Whether 'char' is signed is implementation-defined. You can usually
    change it with a compiler command-line switch.

    > Therefore if I declare a std::string as "\x80" and another std::string
    > as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
    > although if I compared their first characters then the first character
    > of the "\x80" string is "lower".
    >
    > Is this behaviour standard?


    Reading the requirements for char_traits, 'compare' should yield 0 if
    'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count);
    yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
    and 'eq' is true for all preceding positions; and yield 1 otherwise.

    There is no requirement in the Standard as to how to implement those.
    The traits essentially govern the sorting, not operator< or operator==,
    which you were probably using when you "compared their first characters".

    > Is it correct? Is there a formal definition
    > of what the result of a std::string comparison should return if one or
    > more of the characters in one or other of the strings is "negative".


    There is no "negative" or "positive" in there. Those are just characters
    for which there are traits, which in turn say how the strings compare.

    V
     
    Victor Bazarov, Aug 10, 2005
    #2

  3. Earl Purple

    Earl Purple Guest

    Victor Bazarov wrote:
    > Reading the requirements for char_traits, 'compare' should yield 0 if
    > 'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count);
    > yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
    > and 'eq' is true for all preceding positions; and yield 1 otherwise.
    >
    > There is no requirement in the Standard as to how to implement those.
    > The traits essentially govern the sorting, not operator< or operator==,
    > which you were probably using when you "compared their first characters".


    from char_traits<char> (on VC .NET)

    static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
    {   // test if _Left precedes _Right
        return (_Left < _Right);
    }

    but (char)0x80 < (char)0x7f because char is signed. Thus when I have my
    strings

    std::string s128( "\x80" );
    std::string s127( "\x7f" );

    then s127 < s128, but s128[0] < s127[0].

    As basic_string (correctly) uses char_traits to do the comparison
    (that's what it's there for, isn't it?), the inconsistency is in
    char_traits.

    VC .NET provides no specialisation for char_traits<unsigned char> and I
    have actually implemented my own traits class for unsigned char (but
    not char_traits because I'm not supposed to extend namespace std),
    which for me guarantees I will get consistent behaviour.

    I just wanted to know if this inconsistency is part of the standard,
    and by your quoting of the standard it is not - it is against the
    standard rule for char_traits::compare.




    > > Is it correct? Is there a formal definition
    > > of what the result of a std::string comparison should return if one or
    > > more of the characters in one or other of the strings is "negative".

    >
    > There is no "negative" or "positive" in there. Those are just characters
    > for which there are traits, which in turn say how the strings compare.
    >
    > V
     
    Earl Purple, Aug 10, 2005
    #3
  4. Earl Purple wrote:
    > Victor Bazarov wrote:
    >
    >>Reading the requirements for char_traits, 'compare' should yield 0 if
    >>'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count);
    >>yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
    >>and 'eq' is true for all preceding positions; and yield 1 otherwise.
    >>
    >>There is no requirement in the Standard as to how to implement those.
    >>The traits essentially govern the sorting, not operator< or operator==,
    >>which you were probably using when you "compared their first characters".

    >
    >
    > from char_traits<char> (on VC .NET)
    >
    > static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
    > {   // test if _Left precedes _Right
    >     return (_Left < _Right);
    > }
    >
    > [...]
    > I just wanted to know if this inconsistency is part of the standard,
    > and by your quoting of the standard it is not - it is against the
    > standard rule for char_traits::compare.
    >


    Yes, it certainly seems so. You should perhaps contact Dinkumware (the
    implementors of the standard library Microsoft ships along with VC++
    compilers) and let them know...

    V
     
    Victor Bazarov, Aug 10, 2005
    #4
  5. Earl Purple

    P.J. Plauger Guest

    "Earl Purple" <> wrote in message
    news:...

    > Victor Bazarov wrote:
    >> Reading the requirements for char_traits, 'compare' should yield 0 if
    >> 'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count);
    >> yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
    >> and 'eq' is true for all preceding positions; and yield 1 otherwise.
    >>
    >> There is no requirement in the Standard as to how to implement those.
    >> The traits essentially govern the sorting, not operator< or operator==,
    >> which you were probably using when you "compared their first characters".

    >
    > from char_traits<char> (on VC .NET)
    >
    > static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
    > {   // test if _Left precedes _Right
    >     return (_Left < _Right);
    > }
    >
    > but 0x80 < 0x7f because char is signed. Thus when I have my strings
    >
    > std::string s128( "\x80" );
    > std::string s127 ("\x7f" );
    >
    > s127 < s128 but s128[0] < s127[0]
    >
    > As basic_string (correctly) uses char_traits to do the comparison
    > (that's what it's there for isn't it?) the inconsistency is in
    > char_traits.
    >
    > VC .NET provides no specialisation for char_traits<unsigned char> and I
    > have actually implemented my own traits class for unsigned char (but
    > not char_traits because I'm not supposed to extend namespace std),
    > which for me guarantees I will get consistent behaviour.


    The template definition works fine for unsigned char. You don't
    need to explicitly specialize it.

    > I just wanted to know if this inconsistency is part of the standard,
    > and by your quoting of the standard it is not - it is against the
    > standard rule for char_traits::compare.


    Once upon a time, the draft C++ Standard spelled out that memcmp
    should be used for char_traits<char>::compare. That got lost
    along the way. Most (or possibly all) implementations still use
    memcmp as a result. I know there has been discussion on the
    C++ library committee reflector about this. IIRC, the consensus
    is that memcmp is the right way to go. Whether there's a Defect
    Report on this topic I don't recall.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Aug 10, 2005
    #5
  6. Earl Purple

    Earl Purple Guest

    P.J. Plauger wrote:
    >
    > The template definition works fine for unsigned char. You don't
    > need to explicitly specialize it.


    Actually it does not work fine when using it for basic_ofstream to
    write binary, but this is caused by another issue. If the character at
    position 0 or any multiple of 8192 happens to be 0xff it rips it out as
    an EOF.

    The templated version for compare "works" but does not take advantage
    of the nature of unsigned char such that memcmp and memcpy can be
    safely used for comparison/copying and are probably more efficient than
    the byte-by-byte versions.

    > Once upon a time, the draft C++ Standard spelled out that memcmp
    > should be used for char_traits<char>::compare. That got lost
    > along the way. Most (or possibly all) implementations still use
    > memcmp as a result. I know there has been discussion on the
    > C++ library committee reflector about this. IIRC, the consensus
    > is that memcmp is the right way to go. Whether there's a Defect
    > Report on this topic I don't recall.


    Thank you for clearing that up. So effectively it's better not to rely
    on it if your strings may contain characters with the sign bit set and
    you want consistent results across all compilers.
     
    Earl Purple, Aug 10, 2005
    #6
  7. Earl Purple

    P.J. Plauger Guest

    "Earl Purple" <> wrote in message
    news:...

    > P.J. Plauger wrote:
    >>
    >> The template definition works fine for unsigned char. You don't
    >> need to explicitly specialize it.

    >
    > Actually it does not work fine when using it for basic_ofstream to
    > write binary, but this is caused by another issue. If the character at
    > position 0 or any multiple of 8192 happens to be 0xff it rips it out as
    > an EOF.


    I'm assuming that's a lower-level C issue. No reason why it should
    happen in the C++ buffering.

    > The templated version for compare "works" but does not take advantage
    > of the nature of unsigned char such that memcmp and memcpy can be
    > safely used for comparison/copying and are probably more efficient than
    > the byte-by-byte versions.


    Until you can demonstrate that your program runs too slow because
    this optimization is missing, it's safe to say that the templated
    version works, period.

    >> Once upon a time, the draft C++ Standard spelled out that memcmp
    >> should be used for char_traits<char>::compare. That got lost
    >> along the way. Most (or possibly all) implementations still use
    >> memcmp as a result. I know there has been discussion on the
    >> C++ library committee reflector about this. IIRC, the consensus
    >> is that memcmp is the right way to go. Whether there's a Defect
    >> Report on this topic I don't recall.

    >
    > Thank you for clearing that up. So effectively it's better not to use
    > it if you are going to have any characters in your string that have the
    > negative bit set if you want consistent results across all compilers.


    The only real issue is the ordering rule used for comparisons. If
    you don't like what you get by default, you can always make your
    own.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Aug 10, 2005
    #7
  8. Earl Purple

    Earl Purple Guest

    P.J. Plauger wrote:
    > > Actually it does not work fine when using it for basic_ofstream to
    > > write binary, but this is caused by another issue. If the character at
    > > position 0 or any multiple of 8192 happens to be 0xff it rips it out as
    > > an EOF.

    >
    > I'm assuming that's a lower-level C issue. No reason why it should
    > happen in the C++ buffering.

    >
    > P.J. Plauger
    > Dinkumware, Ltd.
    > http://www.dinkumware.com


    No, the error comes from this function in basic_streambuf: (I have
    formatted it to make it a bit easier to read)

    virtual streamsize xsputn(const _Elem *_Ptr, streamsize _Count)
    {   // put _Count characters to stream
        streamsize _Size, _Copied;

        for (_Copied = 0; 0 < _Count; )
        {
            if (pptr() != 0
                && 0 < (_Size = (streamsize)(epptr() - pptr())))
            {   // copy to write buffer
                if (_Count < _Size)
                    _Size = _Count;
                _Traits::copy(pptr(), _Ptr, _Size);
                _Ptr += _Size;
                _Copied += _Size;
                _Count -= _Size;
                pbump((int)_Size);
            }
            else if (_Traits::eq_int_type(    // ** ERROR IN THIS SECTION **
                _Traits::eof(),
                overflow(_Traits::to_int_type(*_Ptr))))
            {
                break;  // single character put failed, quit
            }
            else
            {   // count character successfully put
                ++_Ptr;
                ++_Copied;
                --_Count;
            }
        }
        return (_Copied);
    }

    Thus you have assumed that if the first character in our buffer happens
    to be 0xff, it is an end of file (for a binary file this is not the
    case).

    My "fix" in my own version was to make int_type an int, so the
    eq_int_type test always fails for real characters: to_int_type for 0xff
    (unsigned) then produces 0x000000ff, which is not equal to eof()'s
    0xffffffff.

    Here is a test to reproduce the bug.

    #include <fstream>
    #include <string>

    int main()
    {
        std::basic_ofstream< unsigned char > outFile
        (
            "test.dat",
            std::ios_base::binary | std::ios_base::trunc
        );

        std::basic_string<unsigned char> data( 16, '\xff' );
        for ( int iters = 0; iters < 8192; ++iters )
        {
            outFile.write( data.c_str(), 17 );
        }
    }

    So we are writing 17 characters (16 of 0xff followed by the 0
    terminator) 8192 times. That should give us a file length of 139264,
    or 0x22000 in hex. On mine (VC7.1.3088) it is 49 bytes short.
     
    Earl Purple, Aug 11, 2005
    #8
  9. Earl Purple

    P.J. Plauger Guest

    "Earl Purple" <> wrote in message
    news:...

    > P.J. Plauger wrote:
    >> > Actually it does not work fine when using it for basic_ofstream to
    >> > write binary, but this is caused by another issue. If the character at
    >> > position 0 or any multiple of 8192 happens to be 0xff it rips it out as
    >> > an EOF.

    >>
    >> I'm assuming that's a lower-level C issue. No reason why it should
    >> happen in the C++ buffering.

    >
    > No, the error comes from this function in basic_streambuf: (I have
    > formatted it to make it a bit easier to read)
    >
    > virtual streamsize xsputn(const _Elem *_Ptr, streamsize _Count)
    > {   // put _Count characters to stream
    >     streamsize _Size, _Copied;
    >
    >     for (_Copied = 0; 0 < _Count; )
    >     {
    >         if (pptr() != 0
    >             && 0 < (_Size = (streamsize)(epptr() - pptr())))
    >         {   // copy to write buffer
    >             if (_Count < _Size)
    >                 _Size = _Count;
    >             _Traits::copy(pptr(), _Ptr, _Size);
    >             _Ptr += _Size;
    >             _Copied += _Size;
    >             _Count -= _Size;
    >             pbump((int)_Size);
    >         }
    >         else if (_Traits::eq_int_type(    // ** ERROR IN THIS SECTION **
    >             _Traits::eof(),
    >             overflow(_Traits::to_int_type(*_Ptr))))
    >         {
    >             break;  // single character put failed, quit
    >         }
    >         else
    >         {   // count character successfully put
    >             ++_Ptr;
    >             ++_Copied;
    >             --_Count;
    >         }
    >     }
    >     return (_Copied);
    > }
    >
    > thus you have assumed that if the first character in our buffer happens
    > to be 0xff it is an end of file. (For a binary file this is not the
    > case). to_int_type for 0xff (unsigned) produces 0x000000ff which is not
    > equal to 0xffffffff.
    >
    > My "fix" in my own version was to make int_type an int so eq always
    > fails.


    Ah, now I see the problem. We've long since changed the default type
    for the template version of basic_streambuf to long, which is essentially
    the same as your fix. That happened after we delivered the V7.1 library
    to Microsoft. The old default, having int_type the same as char_type,
    is not binary transparent, as you've observed.

    It's fixed in the library we currently license from our web site (thus
    my confusion). Should also work fine in Whidbey (VC++ V8).

    Thanks for the clarification.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Aug 11, 2005
    #9
