# wtf is happening here @ bitwise comparison

Discussion in 'C++' started by tschmittldk, Dec 22, 2010.

1. ### tschmittldkGuest

Hey guys... I had an issue today at university which I really don't
understand:

char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....

Now I tried to compare in several ways:
if(c == '\xc3')
if((unsigned int)c == 0xc3)
if((int)c == 0xc3)
if((unsigned int)c == (unsigned int)0xc3)

All of them evaluate to false and execution moves on. But when I do a
very stupid bitwise comparison first, it works:

if(((unsigned int)c & 0xc3) == 0xc3)

Can anyone explain that to me? I really don't get the difference
between if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
0xc3).

Best regards
Tobias

tschmittldk, Dec 22, 2010

2. ### Victor BazarovGuest

On 12/22/2010 7:59 AM, tschmittldk wrote:
> Hey guys... I had an issue today in the university which i really dont
> understand:
>
> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>
> now i tried to compare several times:
> if(c == '\xc3')
> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them negate and go on.

Really? Please post the entire program. I'm asking because I just tried

#include <cassert>

int main()
{
char c = '\xc3';

assert(c == '\xc3');
}

And it passed with flying colors (as it should). So, you're either
mistaken about your first case or you're lying intentionally to make
your point. I don't like the latter, and hopefully it's not true.

> But when i do a very stupid bitwise
> comparison before it works:
>
> if(((unsigned int)c& 0xc3) == 0xc3)
>
> can anyone explain that to me? I really don't get the difference
> betweet if(((unsigned int)c& 0xc3) == 0xc3) and if((unsigned int)c ==
> 0xc3).

The trick with the other three initial equality comparisons is that the
explicit promotions and conversions (apparently) have a different effect
from the default ones.

The value of 'c' (which is likely only 8 bits long) is *negative*
according to your initialization (and is -61). The value 0xC3 (an
implicit int) is positive (+ 195). Convert -61 (which undergoes an
implicit conversion to int first) to unsigned, and you get 0xFFFFFFC3
(with a 32-bit unsigned int; 0xFFC3 with a 16-bit one), which
is definitely not equal to 0xC3. Converting to int (your third
comparison) just makes explicit the usual implicit one. In the fourth
comparison casting of 0xC3 to unsigned int makes no difference, the
value does not change.

The problem you have is that your 'c' is *signed* and *negative*.
Please study explicit and implicit integral promotions and arithmetic
conversions to get to the bottom of what's happening.
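
The conversions described above can be sketched roughly like this. It is only an illustration, assuming an 8-bit char and two's complement; the names kLeadByte, equals_as_int, equals_as_unsigned, equals_as_byte and widened are made up for the example and are not from the original program. signed char is used explicitly so the result does not depend on whether plain char happens to be signed.

```cpp
#include <cassert>
#include <climits>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit chars");

// Assumption: the value a signed 8-bit char holds when its bits are 0xC3.
constexpr signed char kLeadByte = -61;

// (int)c == 0xc3: c becomes the int -61, and -61 != 195.
bool equals_as_int(signed char c) {
    return static_cast<int>(c) == 0xC3;
}

// (unsigned int)c == 0xc3: -61 wraps around to a huge value, not 195.
bool equals_as_unsigned(signed char c) {
    return static_cast<unsigned int>(c) == 0xC3u;
}

// Going through unsigned char first reduces the value modulo 256,
// which yields 195 (0xC3) before any widening happens.
bool equals_as_byte(signed char c) {
    return static_cast<unsigned char>(c) == 0xC3;
}

// The value (unsigned int)c actually has.
unsigned int widened(signed char c) {
    return static_cast<unsigned int>(c);
}
```

With a 32-bit unsigned int, widened(kLeadByte) is 0xFFFFFFC3, which is why only the comparison that goes through unsigned char matches.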

V
--

Victor Bazarov, Dec 22, 2010

3. ### SGGuest

On 22 Dec., 13:59, tschmittldk wrote:
>
> char c  = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>
> now i tried to compare several times:
> if(c == '\xc3')

Really? This fails? Weird...

> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them negate and go on. But when i do a very stupid bitwise
> comparison before it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> can anyone explain that to me? I really don't get the difference

A couple of hints:
- integral promotion
- 'char' appears to be a signed type in your case

Before the comparison operator is applied, integral promotion takes
place which converts both operands to a common type that's at least
'int'. Assuming 'char' is a signed 8-bit type, '\xc3' represents a
negative number. Assuming the popular two's complement, its value is
-61. Even (unsigned int)c gives you a value like 0xF...FC3 due to the
rules about converting signed to unsigned values.
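
As a rough sketch of why the masked test from the original post "works" while the plain comparison does not: the & throws away the sign-extended upper bits, but it also accepts any byte that merely has the 0xC3 bits set. The function names masked_test and exact_test are invented for this example, and an 8-bit two's complement signed char is assumed.

```cpp
#include <cassert>

// The test from the original post: ((unsigned int)c & 0xc3) == 0xc3.
// It is true whenever all bits of 0xC3 are set, whatever else is set.
bool masked_test(signed char c) {
    return (static_cast<unsigned int>(c) & 0xC3u) == 0xC3u;
}

// An exact test on the byte value, independent of the sign of c.
bool exact_test(signed char c) {
    return static_cast<unsigned char>(c) == 0xC3;
}
```

For -61 (bits 0xC3) both functions return true. For -1 (bits 0xFF) or -25 (bits 0xE7) masked_test still returns true although the byte is not 0xC3, while exact_test returns false.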

representations) nor correct (false positives).

I'd simply use unsigned char and unsigned types. The C++ standard
allows you to use a pointer of type "unsigned char*" to point to a
char array.
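
A minimal sketch of that suggestion, assuming a NUL-terminated buffer; count_c3 is a hypothetical helper written only for illustration. Reading the bytes through an unsigned char pointer means every value is in the range 0..255, so a comparison against 0xC3 behaves as expected.

```cpp
#include <cassert>
#include <cstddef>

// Counts the bytes equal to 0xC3 in a NUL-terminated string.
// The char array is inspected through a pointer to unsigned char,
// so no byte is ever seen as a negative value.
std::size_t count_c3(const char* text) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    std::size_t count = 0;
    for (; *p != 0; ++p) {
        if (*p == 0xC3) {
            ++count;
        }
    }
    return count;
}
```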

Cheers!
SG

SG, Dec 22, 2010
4. ### tschmittldkGuest

Okay, thanks for all your answers. I'll try it tomorrow and post the code
then (I left my notebook in my student flat...). But it seems clearer
to me now, thanks!

tschmittldk, Dec 22, 2010
5. ### tschmittldkGuest

On 22 Dec., 19:45, tschmittldk wrote:
> Okay thanks for all your answers. I try it tomorrow and post the code
> then (I left my notebook in my student flat...). But it seems more
> clearly to me now, thanks!

Okay, now here's the code:

void codevert(char *ArrayToTransform)
{
    int j = 0;
    char *ptr = ArrayToTransform;
    while (*ptr != '\0') {
        if((*ptr & 0xC0) > 0xbf)
        {
            if(*ptr == '\xc3')
                simplifier_correct(3, ptr++);
            else if(*ptr == '\xc4')
                simplifier_correct(3, ptr++);
            else if(*ptr == '\xc4')
                simplifier_correct(3, ptr++);
            else
                std::cout << "E01";
        }
        ptr++;
    }
}

It runs through the string, checks whether each byte is a lead byte, and
passes it to different map functions, which replace the byte with a plain
ASCII letter, for example turning ó into o or À into A.

Now I just need to kill the lead byte and it's done.

tschmittldk, Dec 23, 2010
6. ### tschmittldkGuest

> This is all very brittle.
Sorry, I'm new to C++.

> I would rewrite this code about like this:
>
>      const unsigned char *ptr = reinterpret_cast<unsigned char*>
> (ArrayToTransform);
>      while (*ptr) {
>           if((*ptr & 0xC0) > 0xbf)
>           {
>                if(*ptr == 0xc3)
>                // ...

I mostly fixed my program with your code. The only thing: I cannot use
*ptr as const, because simplifier_correct gets ptr as a referenced var
and writes into its value.

We have this now:

void unicodevert(char *ArrayToTransform) // works
{
    int j = 0;
    unsigned char *ptr = reinterpret_cast<unsigned char*>(ArrayToTransform);

    //char *ptr = ArrayToTransform;
    while (*ptr)
    {
        if((*ptr & 0xC0) > 0xbf) // is Leadbyte?!
        {
            // Check which Leadbyte and give the right information to simplifier_correct...
            if(*ptr == 0xc3)
                simplifier_correct(3,(ptr+1));
            //....

And

void simplifier_correct(int j, unsigned char *search) // not const: writes through search
{
    unsigned char *buff = search;
    if(j == 4)
    {
        for(int i=0; i<3 ;i++) {
            buff = _mbspbrk(gsC4UCHAR_CONVMAP.MAP, search);
            if(buff != NULL)
                *search = gsC4UCHAR_CONVMAP.REPLACER;
        }
    }
    //... with other cases, but it's all the same code with other maps.

Another thing... I tried to use "memmove" to overwrite the lead byte in
the char array, like this:
"helloworld" should become "hellworld" if o were a lead byte. But I got
access violation errors all the time. So I coded:

unsigned char *ptr3 = ptr;
unsigned char *ptr2 = (ptr+1);
while(*ptr2)
{
    *ptr3 = *ptr2;
    ptr3++;
    ptr2++;
}
*ptr3 = '\0';

I know it only works "to the left", but that is all I need. Do
you think that is okay? I mean... it does mostly the same as memmove
does.
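
For comparison, here is a sketch of how the same left shift might be written with memmove. It assumes the buffer is writable (not a string literal) and that p points at a byte before the terminator; erase_byte and erase_at are names made up for this example. The key detail is the length: the tail after p plus one for the terminating '\0'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Removes the single byte at p from a NUL-terminated, writable string by
// moving the rest of the string, terminator included, one byte to the left.
void erase_byte(char* p) {
    assert(*p != '\0');  // p must not point at the terminator
    std::memmove(p, p + 1, std::strlen(p + 1) + 1);
}

// Convenience wrapper that works on a writable copy of the input.
std::string erase_at(std::string s, std::size_t index) {
    assert(index < s.size());
    erase_byte(&s[index]);
    s.resize(s.size() - 1);  // drop the now duplicated terminator position
    return s;
}
```

For example, erase_at("helloworld", 4) gives "hellworld", which is the same result as the hand-written loop.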

Thanks for help
best regards
Tobias

tschmittldk, Dec 23, 2010
7. ### Paul NGuest

On Dec 22, 12:59 pm, tschmittldk wrote:
> Hey guys... I had an issue today in the university which i really dont
> understand:
>
> char c  = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>
> now i tried to compare several times:
> if(c == '\xc3')
> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them negate and go on. But when i do a very stupid bitwise
> comparison before it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> can anyone explain that to me? I really don't get the difference
> betweet if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
> 0xc3).

Other people have gone into the detail of this but there is one detail
that *might* be causing problems.

In the language C, '\xc3' has type int. In the language C++, '\xc3'
has type char. So the exact same code can give different results,
depending on whether you feed it into a C compiler or a C++ compiler.

For good measure, many C++ compilers actually include a C compiler
which, if told to compile a C program, will compile the code as if it
is a C program. So you need to be sure you are driving the compiler
correctly. It might be a useful test to include something in your
program which is valid C++ but not valid C, just to make sure you are
using the right language.
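
One possible way to do such a check, as a sketch: the snippet below refuses to build as C and also verifies at compile time that a character literal has type char. The function name literal_is_char is invented for the example; __cplusplus, static_assert and std::is_same are standard C++11.

```cpp
#ifndef __cplusplus
#error "This file is being compiled as C, not as C++"
#endif

#include <cassert>
#include <type_traits>

// In C++ the literal '\xc3' has type char; in C it would have type int.
static_assert(std::is_same<decltype('\xc3'), char>::value,
              "character literals should have type char in C++");

// The same check at run time: sizeof(char) is 1 by definition, so this is
// true in C++, whereas sizeof(int) in C is usually larger.
bool literal_is_char() {
    return sizeof('\xc3') == sizeof(char);
}
```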

Hope that helps.
Paul.

Paul N, Dec 23, 2010
8. ### James KanzeGuest

On Dec 23, 9:03 am, Paavo Helde wrote:
> tschmittldk wrote:
> > On 22 Dez., 19:45, tschmittldk wrote:
> >> Okay thanks for all your answers. I try it tomorrow and
> >> post the code then (I left my notebook in my student
> >> flat...). But it seems more clearly to me now, thanks!

> > Okay, now here's the code:

> > void codevert(char *ArrayToTransform)
> > {
> > int j = 0;
> > char *ptr = ArrayToTransform;
> > while (*ptr != '\0') {
> > if((*ptr & 0xC0) > 0xbf)
> > {
> > if(*ptr == '\xc3')
> > simplifier_correct(3, ptr++);
> > else if(*ptr == '\xc4')
> > simplifier_correct(3, ptr++);
> > else if(*ptr == '\xc4')
> > simplifier_correct(3, ptr++);
> > else
> > std::cout << "E01";
> > }
> > ptr++;
> > }
> > }

> This is all very brittle.

Yes, but not for the reasons you imply. It's brittle because
it only handles a very small subset of UTF-8. But presumably,
the poster knows that, and accepts that any but a few specific
two byte sequences will result in "E01". Not to mention the
typo: the last two else-if branches test exactly the same thing.

There's nothing brittle about it at the C++ level.

> *ptr is char, which is most probably a signed
> type and can be negative.

And is probably 8 bits.

> (*ptr & 0xC0) is int and appears to be positive

Not only appears to be: is.

The intermediate values will be unexpected, of course, but the
final result should be correct. (The expression *ptr might be
negative.)

> and of the desired value even if *ptr is negative, this is
> more by chance and not very portable.

Could you name an architecture where it wouldn't work? And
explain why, and what you'd get. (There is, perhaps, a brittle
part in filling the char[]. Formally, at least, it's possible
that the iostream library rejects any negative chars. In
practice, a compiler whose iostream library didn't support this
kind of thing won't be used, so you don't have to worry about it.)

> 0xbf is int and positive, '\xc3' is char and
> negative.

And? In all cases, integral promotion occurs. And when the &
is present, it ensures that the results must be positive.
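
A small sketch of that point, assuming 8-bit chars and two's complement; is_lead_signed is a name made up for the example. Even when the char is negative, the promoted int is ANDed with the positive constant 0xC0, so the result can only be 0x00, 0x40, 0x80 or 0xC0.

```cpp
#include <cassert>

// The test from the posted code, applied to an explicitly signed char.
// c is promoted to int (possibly negative), but (c & 0xC0) is never
// negative because 0xC0 has no sign bit set.
bool is_lead_signed(signed char c) {
    return (c & 0xC0) > 0xBF;
}
```

With c == -61 (bits 0xC3) the masked value is 0xC0 and the test is true. With c == -77 (bits 0xB3, a continuation byte) it is 0x80 and the test is false.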

> I would rewrite this code about like this:

> const unsigned char *ptr = reinterpret_cast<unsigned char*>
> (ArrayToTransform);
> while (*ptr) {
> if((*ptr & 0xC0) > 0xbf)
> {
> if(*ptr == 0xc3)
> // ...

Why bother?

Actually, I'd rewrite the code more fundamentally, to make it
clear what is actually being tested; if nothing else >= 0xC0,
rather than > 0xBF, but more likely with a switch on the results
of *ptr & 0xC0 (with four cases clearly delimiting the
possibilities).
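
Something along these lines, perhaps. This is only a sketch of the switch idea; the ByteKind enum and the classify function are invented for illustration, and the classification looks at the top two bits only, so it does not reject bytes that can never occur in valid UTF-8.

```cpp
#include <cassert>

enum class ByteKind { Ascii, Continuation, Lead };

// Classifies a byte by its two most significant bits.
ByteKind classify(unsigned char byte) {
    switch (byte & 0xC0) {
        case 0x00:                                 // 00xxxxxx
        case 0x40: return ByteKind::Ascii;         // 01xxxxxx
        case 0x80: return ByteKind::Continuation;  // 10xxxxxx
        case 0xC0: return ByteKind::Lead;          // 11xxxxxx
    }
    // Not reachable: the four cases cover every value of (byte & 0xC0).
    return ByteKind::Ascii;
}
```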

--
James Kanze

James Kanze, Dec 26, 2010
9. ### Jorgen GrahnGuest

On Wed, 2010-12-22, Victor Bazarov wrote:
> On 12/22/2010 7:59 AM, tschmittldk wrote:
>> Hey guys... I had an issue today in the university which i really dont
>> understand:
>>
>> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>>
>> now i tried to compare several times:
>> if(c == '\xc3')
>> if((unsigned int)c == 0xc3)
>> if((int)c == 0xc3)
>> if((unsigned int)c == (unsigned int)0xc3)
>>
>> All of them negate and go on.

>
> Really? Please post the entire program. I'm asking because I just tried
>
> #include <cassert>
>
> int main()
> {
> char c = '\xc3';
>
> assert(c == '\xc3');
> }
>
> And it passed with flying colors (as it should). So, you're either
> mistaken about your first case or you're lying intentionally to make
> your point. I don't like the latter, and hopefully it's not true.

Interesting. I read his first line

char c = '\xc3' or '\xc4' ect...

as actually containing the token 'or', the synonym for ||. Then his
problems would make perfect sense.
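
To illustrate that reading, a sketch (or_initialised is a made-up name): with the alternative token, the initializer is a logical OR of two non-zero values, so c ends up as 1 and none of the comparisons against 0xc3 could ever succeed. Note that some compilers only accept the or token in a conforming mode.

```cpp
#include <cassert>

// 'or' is an alternative spelling of ||. Both operands are non-zero,
// so the expression is the bool true, which converts to the char value 1.
char or_initialised() {
    char c = '\xc3' or '\xc4';
    return c;
}
```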

The later postings showed this wasn't what he really meant, though ...

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Jorgen Grahn, Dec 29, 2010