char_traits<char>::compare

Earl Purple

On VC++.NET it is implemented like this

static int __cdecl compare(const _Elem *_First1,
                           const _Elem *_First2,
                           size_t _Count)
{   // compare [_First1, _First1 + _Count) with [_First2, ...)
    return (::memcmp(_First1, _First2, _Count));
}

i.e. using memcmp. But memcmp does an unsigned comparison, whereas char
is a signed type here.

Therefore if I declare a std::string as "\x80" and another std::string
as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
although if I compared their first characters then the first character
of the "\x80" string is "lower".

Is this behaviour standard? Is it correct? Is there a formal definition
of what a std::string comparison should return if one or more of the
characters in either string is "negative"?
 
Victor Bazarov

Earl said:
On VC++.NET it is implemented like this

static int __cdecl compare(const _Elem *_First1,
                           const _Elem *_First2,
                           size_t _Count)
{   // compare [_First1, _First1 + _Count) with [_First2, ...)
    return (::memcmp(_First1, _First2, _Count));
}

i.e. using memcmp. But memcmp does an unsigned comparison, whereas char
is a signed type here.

Whether 'char' is signed is implementation-defined. You can change it
usually with some compiler command-line switch.
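
(From memory - so check your compiler's documentation - the switches are
something like 'cl /J' to make plain char unsigned on VC++, and
'-funsigned-char' / '-fsigned-char' on gcc.)
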
Therefore if I declare a std::string as "\x80" and another std::string
as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
although if I compared their first characters then the first character
of the "\x80" string is "lower".

Is this behaviour standard?

Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
and 'eq' holds for all preceding characters, and yield 1 otherwise.
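
Written out naively, that amounts to something like this (just a sketch of
what the requirements describe, with a made-up name - it is not how any
particular library implements it):

#include <cstddef>

template <class Traits>
int compare_per_requirements(const typename Traits::char_type* s1,
                             const typename Traits::char_type* s2,
                             std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (Traits::lt(s1[i], s2[i]))
            return -1;  // first differing position, s1 precedes s2
        if (Traits::lt(s2[i], s1[i]))
            return 1;   // first differing position, s2 precedes s1
        // otherwise eq(s1[i], s2[i]) holds, keep going
    }
    return 0;           // eq held at every position
}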

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".
> Is it correct? Is there a formal definition
of what a std::string comparison should return if one or more of the
characters in either string is "negative"?

There is no "negative" or "positive" in there. Those are just characters
for which there are traits, which in turn say how the strings compare.

V
 
Earl Purple

Victor said:
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
and 'eq' holds for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".


from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{   // test if _Left precedes _Right
    return (_Left < _Right);
}

but (char)0x80 < (char)0x7f because char is signed here (-128 < 127). Thus
when I have my strings

std::string s128( "\x80" );
std::string s127 ("\x7f" );

s127 < s128 but s128[0] < s127[0]

As basic_string (correctly) uses char_traits to do the comparison
(that's what it's there for isn't it?) the inconsistency is in
char_traits.

VC .NET provides no specialisation for char_traits<unsigned char> and I
have actually implemented my own traits class for unsigned char (but
not char_traits because I'm not supposed to extend namespace std),
which for me guarantees I will get consistent behaviour.
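
Something along these lines (a sketch of just the comparison members, with a
made-up name; the real class has to provide the whole char_traits interface):

#include <cstring>

struct uchar_traits_sketch
{
    typedef unsigned char char_type;

    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }  // unsigned <

    // memcmp also compares as unsigned bytes, so compare() agrees with lt(),
    // which is exactly the consistency the requirements call for.
    static int compare(const char_type* s1, const char_type* s2, std::size_t n)
    {
        return std::memcmp(s1, s2, n);
    }

    // ... assign, copy, move, find, length, int_type, eof etc. omitted
};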

I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.
 
Victor Bazarov

Earl said:
Victor said:
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
and 'eq' holds for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".



from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{   // test if _Left precedes _Right
    return (_Left < _Right);
}

[...]
I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.


Yes, it certainly seems so. You should perhaps contact Dinkumware (the
implementors of the standard library Microsoft ships along with VC++
compilers) and let them know...

V
 
P.J. Plauger

Victor said:
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First1[i], _First2[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First1[j], _First2[j])' is true
and 'eq' holds for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".


from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{   // test if _Left precedes _Right
    return (_Left < _Right);
}

but (char)0x80 < (char)0x7f because char is signed here (-128 < 127). Thus
when I have my strings

std::string s128( "\x80" );
std::string s127 ("\x7f" );

s127 < s128 but s128[0] < s127[0]

As basic_string (correctly) uses char_traits to do the comparison
(that's what it's there for isn't it?) the inconsistency is in
char_traits.

VC .NET provides no specialisation for char_traits<unsigned char> and I
have actually implemented my own traits class for unsigned char (but
not char_traits because I'm not supposed to extend namespace std),
which for me guarantees I will get consistent behaviour.


The template definition works fine for unsigned char. You don't
need to explicitly specialize it.
I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.

Once upon a time, the draft C++ Standard spelled out that memcmp
should be used for char_traits<char>::compare. That got lost
along the way. Most (or possibly all) implementations still use
memcmp as a result. I know there has been discussion on the
C++ library committee reflector about this. IIRC, the consensus
is that memcmp is the right way to go. Whether there's a Defect
Report on this topic I don't recall.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Earl Purple

P.J. Plauger said:
The template definition works fine for unsigned char. You don't
need to explicitly specialize it.

Actually it does not work fine when using it for basic_ofstream to
write binary, but this is caused by another issue. If the character at
position 0 or any multiple of 8192 happens to be 0xff it rips it out as
an EOF.

The templated version "works" for compare, but it does not take advantage
of the fact that for unsigned char memcmp and memcpy can safely be used
for comparison and copying, and are probably more efficient than the
byte-by-byte versions.
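
Concretely, the bulk operations for unsigned char could simply forward to
the C memory functions; something like this (a sketch with made-up helper
names, not an actual traits specialisation):

#include <cstring>

typedef unsigned char uchar;

inline int uchar_compare(const uchar* a, const uchar* b, std::size_t n)
{
    return std::memcmp(a, b, n);       // already an unsigned byte comparison
}

inline uchar* uchar_copy(uchar* dst, const uchar* src, std::size_t n)
{
    return static_cast<uchar*>(std::memcpy(dst, src, n));   // non-overlapping
}

inline uchar* uchar_move(uchar* dst, const uchar* src, std::size_t n)
{
    return static_cast<uchar*>(std::memmove(dst, src, n));  // may overlap
}
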
Once upon a time, the draft C++ Standard spelled out that memcmp
should be used for char_traits<char>::compare. That got lost
along the way. Most (or possibly all) implementations still use
memcmp as a result. I know there has been discussion on the
C++ library committee reflector about this. IIRC, the consensus
is that memcmp is the right way to go. Whether there's a Defect
Report on this topic I don't recall.

Thank you for clearing that up. So effectively, if you want consistent
results across all compilers, it's better not to rely on it when your
strings may contain characters with the high bit set.
 
P.J. Plauger

Earl Purple said:
Actually it does not work fine when using it for basic_ofstream to
write binary, but this is caused by another issue. If the character at
position 0 or any multiple of 8192 happens to be 0xff it rips it out as
an EOF.

I'm assuming that's a lower-level C issue. No reason why it should
happen in the C++ buffering.
The templated version for compare "works" but does not take advantage
of the nature of unsigned char such that memcmp and memcpy can be
safely used for comparison/copying and are probably more efficient than
the byte-by-byte versions.

Until you can demonstrate that your program runs too slow because
this optimization is missing, it's safe to say that the templated
version works, period.
Thank you for clearing that up. So effectively, if you want consistent
results across all compilers, it's better not to rely on it when your
strings may contain characters with the high bit set.

The only real issue is the ordering rule used for comparisons. If
you don't like what you get by default, you can always make your
own.
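
For example, if all you want is a guaranteed unsigned-byte ordering for
ordinary std::strings, a comparison function of your own is enough (a
sketch, with made-up names):

#include <algorithm>
#include <string>

inline bool byte_less(char a, char b)
{   // compare as unsigned bytes, whatever the signedness of plain char
    return (unsigned char)a < (unsigned char)b;
}

inline bool less_unsigned(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(a.begin(), a.end(),
                                        b.begin(), b.end(),
                                        byte_less);
}

Such a predicate can then be passed to std::sort, or wrapped in a functor
for std::set / std::map, wherever that ordering is wanted.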

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Earl Purple

P.J. Plauger said:
I'm assuming that's a lower-level C issue. No reason why it should
happen in the C++ buffering.

No, the error comes from this function in basic_streambuf: (I have
formatted it to make it a bit easier to read)

virtual streamsize xsputn(const _Elem *_Ptr, streamsize _Count)
{   // put _Count characters to stream
    streamsize _Size, _Copied;

    for (_Copied = 0; 0 < _Count; )
    {
        if ( ( pptr() != 0 ) &&
             ( 0 < (_Size = (streamsize)(epptr() - pptr())) ) )
        {   // copy to write buffer
            if (_Count < _Size)
            {
                _Size = _Count;
            }
            _Traits::copy(pptr(), _Ptr, _Size);
            _Ptr += _Size;
            _Copied += _Size;
            _Count -= _Size;
            pbump((int)_Size);
        }
        else if // ** ERROR IN THIS SECTION **
            ( _Traits::eq_int_type( _Traits::eof(),
                                    overflow(_Traits::to_int_type(*_Ptr)) ) )
        {
            break;  // single character put failed, quit
        }
        else
        {   // count character successfully put
            ++_Ptr;
            ++_Copied;
            --_Count;
        }
    }
    return (_Copied);
}

Thus you have assumed that if the first character in our buffer happens
to be 0xff it marks an end of file. (For a binary file this is not the
case.)

My "fix" in my own version was to make int_type an int, so the
eq_int_type test always fails: to_int_type for 0xff (unsigned) then
produces 0x000000ff, which is not equal to eof() (0xffffffff).
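
Roughly (a sketch of the relevant members only, with a made-up name - not
the exact code):

struct uchar_traits_fix
{
    typedef unsigned char char_type;
    typedef int           int_type;   // wider than char_type

    static int_type  to_int_type(char_type c)  { return c; }  // 0x00 .. 0xff
    static char_type to_char_type(int_type i)  { return static_cast<char_type>(i); }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof()               { return -1; }        // never a valid byte
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }

    // ... the other char_traits members stay as before
};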

Here is a test to reproduce the bug.

#include <fstream>
#include <string>

int main()
{
    std::basic_ofstream< unsigned char > outFile(
        "test.dat",
        std::ios_base::binary | std::ios_base::trunc );

    std::basic_string< unsigned char > data( 16, '\xff' );
    for ( int iters = 0; iters < 8192; ++iters )
    {
        outFile.write( data.c_str(), 17 );
    }
}

So we are writing 17 characters (16 of 0xff followed by the 0 terminator)
8192 times. That should give a file length of 139264 bytes (0x22000). On
mine (VC7.1.3088) it is 49 bytes short.
 
P.J. Plauger

Earl Purple said:

No, the error comes from this function in basic_streambuf: (I have
formatted it to make it a bit easier to read)

[...]

Thus you have assumed that if the first character in our buffer happens
to be 0xff it marks an end of file. (For a binary file this is not the
case.)

My "fix" in my own version was to make int_type an int, so the
eq_int_type test always fails: to_int_type for 0xff (unsigned) then
produces 0x000000ff, which is not equal to eof() (0xffffffff).

Ah, now I see the problem. We've long since changed the default type
for the template version of basic_streambuf to long, which is essentially
the same as your fix. That happened after we delivered the V7.1 library
to Microsoft. The old default, having int_type the same as char_type,
is not binary transparent, as you've observed.

It's fixed in the library we currently license from our web site (thus
my confusion). Should also work fine in Whidbey (VC++ V8).

Thanks for the clarification.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
