wtf is happening here @ bitwise comparison

tschmittldk · Dec 22, 2010

Hey guys... I had an issue today at the university which I really don't
understand:

char c = '\xc3' or '\xc4' etc... it's about lead bytes in UTF-8....

Now I tried several comparisons:
if(c == '\xc3')
if((unsigned int)c == 0xc3)
if((int)c == 0xc3)
if((unsigned int)c == (unsigned int)0xc3)

All of them evaluate to false and execution goes on. But when I do a very
stupid bitwise comparison first, it works:

if(((unsigned int)c & 0xc3) == 0xc3)

Can anyone explain that to me? I really don't get the difference
between if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
0xc3).

Best regards
Tobias

Victor Bazarov · Dec 22, 2010

> Hey guys... I had an issue today at the university which I really don't
> understand:
>
> char c = '\xc3' or '\xc4' etc... it's about lead bytes in UTF-8....
>
> Now I tried several comparisons:
> if(c == '\xc3')
> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them evaluate to false and execution goes on.

Really? Please post the entire program. I'm asking because I just tried

#include <cassert>

int main()
{
    char c = '\xc3';

    assert(c == '\xc3');
}

And it passed with flying colors (as it should). So, you're either
mistaken about your first case or you're lying intentionally to make
your point. I don't like the latter, and hopefully it's not true.

> But when I do a very stupid bitwise comparison first, it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> Can anyone explain that to me? I really don't get the difference
> between if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
> 0xc3).

The trick with the other three equality comparisons is that the
explicit conversions have a different effect (apparently) than the
default ones.

The value of 'c' (which is likely only 8 bits long) is *negative*
according to your initialization (it is -61). The value 0xC3 (an
implicit int) is positive (+195). Convert -61 to unsigned int and you
get 0xFFFFFFC3 with a 32-bit int (0xFFC3 with a 16-bit one), which is
definitely not equal to 0xC3. Converting to int (your third comparison)
just makes explicit the usual implicit promotion, so you compare -61
with 195. In the fourth comparison, casting 0xC3 to unsigned int makes
no difference; the value does not change.

The problem you have is that your 'c' is *signed* and *negative*.
Please study explicit and implicit integral promotions and arithmetic
conversions to get to the bottom of what's happening.
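
The conversions described above can be checked with a small sketch. It assumes an 8-bit, two's-complement signed char, as the thread does, and uses signed char explicitly so the result does not depend on the compiler's default. The helper names (lead_byte, equals_as_int and so on) are made up for illustration and are not from the original program.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: a signed 8-bit char holding the bit pattern 0xC3.
// On two's-complement machines the stored value is -61.
inline signed char lead_byte() {
    return static_cast<signed char>(0xC3);
}

// (int)c == 0xc3: compares -61 with 195.
inline bool equals_as_int(signed char c) {
    return static_cast<int>(c) == 0xC3;
}

// (unsigned int)c == 0xc3: -61 wraps around to UINT_MAX - 60,
// which is 0xFFFFFFC3 when unsigned int has 32 bits.
inline bool equals_as_unsigned_int(signed char c) {
    return static_cast<unsigned int>(c) == 0xC3u;
}

// Converting to unsigned char first keeps only the low 8 bits: 0xC3.
inline bool equals_as_unsigned_char(signed char c) {
    return static_cast<unsigned char>(c) == 0xC3;
}

// Sanity check of the values quoted in the explanation above.
inline void check_values() {
    const signed char c = lead_byte();
    assert(static_cast<int>(c) == -61);
    assert(static_cast<unsigned int>(c) == UINT_MAX - 60u);
}
```

Only the conversion through unsigned char compares equal to 0xC3; the casts to int and unsigned int both carry the sign along.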

V

SG · Dec 22, 2010

> char c = '\xc3' or '\xc4' etc... it's about lead bytes in UTF-8....
>
> Now I tried several comparisons:
> if(c == '\xc3')

Really? This fails? Weird...

> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them evaluate to false and execution goes on. But when I do a
> very stupid bitwise comparison first, it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> Can anyone explain that to me? I really don't get the difference

A couple of hints:
- integral promotion
- 'char' appears to be a signed type in your case

Before the comparison operator is applied, the usual arithmetic
conversions (starting with integral promotion) convert both operands to
a common type that's at least 'int'. Assuming 'char' is a signed 8-bit
type, '\xc3' represents a negative number. Assuming the popular two's
complement, its value is -61. Even (unsigned int)c gives you a value
like 0xF...FC3 due to the rules about converting signed to unsigned
values.

Btw, your bit mask trick is neither portable (w.r.t. signed value
representations) nor correct (false positives).

I'd simply use unsigned char and unsigned types. The C++ standard
allows you to use a pointer of type "unsigned char*" to point to a
char array.
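
To illustrate the two points above (the false positives of the mask test, and reading a char array through unsigned char*), here is a minimal sketch. The function names mask_test, exact_test, false_positives and count_c3 are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The test from the original post: true whenever all bits of 0xC3 are set.
inline bool mask_test(unsigned char b)  { return (b & 0xC3) == 0xC3; }

// Exact comparison on an unsigned byte.
inline bool exact_test(unsigned char b) { return b == 0xC3; }

// Collect every byte value that passes the mask test but is not 0xC3.
inline std::vector<unsigned> false_positives() {
    std::vector<unsigned> out;
    for (unsigned v = 0; v < 256; ++v) {
        const unsigned char b = static_cast<unsigned char>(v);
        if (mask_test(b) && !exact_test(b)) out.push_back(v);
    }
    return out;
}

// Count 0xC3 bytes in a char array by reading it through unsigned char*.
inline std::size_t count_c3(const char* s) {
    assert(s != nullptr);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t n = 0;
    for (; *p; ++p) {
        if (exact_test(*p)) ++n;
    }
    return n;
}
```

The mask leaves bits 2 to 5 unconstrained, so 0xC7, 0xCB, ... up to 0xFF pass it as well: fifteen byte values besides 0xC3 itself.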

Cheers!
SG

tschmittldk · Dec 22, 2010

Okay, thanks for all your answers. I'll try it tomorrow and post the code
then (I left my notebook in my student flat...). But it seems clearer
to me now, thanks!

tschmittldk · Dec 23, 2010

Okay, now here's the code:

void codevert(char *ArrayToTransform)
{
    int j = 0;
    char *ptr = ArrayToTransform;
    while (*ptr != '\0') {
        if((*ptr & 0xC0) > 0xbf)
        {
            if(*ptr == '\xc3')
                simplifier_correct(3, ptr++);
            else if(*ptr == '\xc4')
                simplifier_correct(3, ptr++);
            else if(*ptr == '\xc4')
                simplifier_correct(3, ptr++);
            else
                std::cout << "E01";
        }
        ptr++;
    }
}

It runs through the string, checks whether each byte is a lead byte, and
passes it to different map functions, which replace the byte with a plain
ASCII letter, for example turning an ó into o or an À into A.

Now I just need to kill the lead byte and it's done.

tschmittldk · Dec 23, 2010

> This is all very brittle.

Sorry, I'm new to C++.

.

> I would rewrite this code about like this:
>
> const unsigned char *ptr =
>     reinterpret_cast<unsigned char*>(ArrayToTransform);
> while (*ptr) {
>     if((*ptr & 0xC0) > 0xbf)
>     {
>         if(*ptr == 0xc3)
>             // ...

I mostly fixed my program with your code. The only thing: I cannot make
ptr a pointer to const, because simplifier_correct receives the pointer
and writes to the byte it points to.

We have this now:

void unicodevert(char *ArrayToTransform) // works
{
    int j = 0;
    unsigned char *ptr =
        reinterpret_cast<unsigned char*>(ArrayToTransform);

    //char *ptr = ArrayToTransform;
    while (*ptr)
    {
        if((*ptr & 0xC0) > 0xbf) // is Leadbyte?!
        {
            // Check which Leadbyte and give the right information to
            // simplifier_correct...
            if(*ptr == 0xc3)
                simplifier_correct(3,(ptr+1));
            //....

And

// 'search' is not const: the function writes through it.
void simplifier_correct(int j, unsigned char *search)
{
    unsigned char *buff = search;
    if(j == 4)
    {
        for(int i=0; i<3 ;i++) {
            buff = _mbspbrk(gsC4UCHAR_CONVMAP.MAP, search);
            if(buff != NULL)
                *search = gsC4UCHAR_CONVMAP.REPLACER;
        }
    }
    //... with other cases, but it's all the same code with other maps.

Another thing... I tried to use "memmove" to overwrite the lead byte in
the char array, like:
"helloworld" should become "hellworld" if o was a lead byte. But I got
access violation errors all the time. So I coded:

unsigned char *ptr3 = ptr;
unsigned char *ptr2 = (ptr+1);
while(*ptr2)
{
    *ptr3 = *ptr2;
    ptr3++;
    ptr2++;
}
*ptr3 = '\0';

I know, it only shifts "to the left", but that's all I need. Do
you think that is okay? I mean... it does mostly the same as memmove
does.
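
For comparison, the same left shift can be written with memmove. This is only a sketch: erase_byte is a hypothetical helper, not part of the posted program, and it assumes the buffer is a writable, NUL-terminated array. The thread does not show the failing memmove call, so the cause of the access violation stays unknown; a read-only buffer such as a string literal, or a wrong byte count, would be typical suspects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Remove the byte at index 'pos' from a NUL-terminated string, in place.
inline void erase_byte(char* s, std::size_t pos) {
    const std::size_t len = std::strlen(s);
    assert(pos < len);
    // Move the tail, including the terminating '\0', one byte to the left.
    // The tail starts at pos + 1 and ends at index len, so it is
    // len - pos bytes long. memmove allows the ranges to overlap.
    std::memmove(s + pos, s + pos + 1, len - pos);
}
```

Called on a writable array holding "helloworld" with pos 4, this leaves "hellworld" in the buffer, which is what the hand-written loop above does as well.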

Thanks for help
best regards
Tobias

Paul N · Dec 23, 2010

> Hey guys... I had an issue today at the university which I really don't
> understand:
>
> char c = '\xc3' or '\xc4' etc... it's about lead bytes in UTF-8....
>
> Now I tried several comparisons:
> if(c == '\xc3')
> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them evaluate to false and execution goes on. But when I do a
> very stupid bitwise comparison first, it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> Can anyone explain that to me? I really don't get the difference
> between if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
> 0xc3).

Other people have gone into the detail of this but there is one detail
that *might* be causing problems.

In the language C, '\xc3' has type int. In the language C++, '\xc3'
has type char. So the exact same code can give different results,
depending on whether you feed it into a C compiler or a C++ compiler.

For good measure, many C++ compilers actually include a C compiler
which, if told to compile a C program, will compile the code as if it
is a C program. So you need to be sure you are driving the compiler
correctly. It might be a useful test to include something in your
program which is valid C++ but not valid C, just to make sure you are
using the right language.
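
One possible way to build such a check into the program is sketched below. The #error guard and the static_asserts are only one option (static_assert needs C++11), and the function names are invented for this example.

```cpp
// Stops a C compiler right away; a C++ compiler defines __cplusplus.
#ifndef __cplusplus
#error "compiled as C: character literals would have type int here"
#endif

#include <cassert>
#include <type_traits>

// In C++ a character literal has type char, so its size is 1.
// In C the same literal has type int, so sizeof('\xc3') == sizeof(int).
static_assert(sizeof('\xc3') == 1, "character literals are char in C++");
static_assert(std::is_same<decltype('\xc3'), char>::value,
              "the literal has type char");

// Mirrors the small test program posted earlier in the thread: both
// operands are char, so they are promoted the same way and compare equal,
// whether plain char is signed or unsigned.
inline bool literal_equals_char() {
    char c = '\xc3';
    return c == '\xc3';
}

inline void run_check() {
    assert(literal_equals_char());
}
```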

Hope that helps.
Paul.

James Kanze · Dec 26, 2010

> This is all very brittle.

Yes, but not for the reasons you imply. It's brittle because
it only handles a very small subset of UTF-8. But presumably,
the poster knows that, and accepts that any but a few specific
two byte sequences will result in "E01". Not to mention the
typo: the last two else if test exactly the same thing.

There's nothing brittle about it at the C++ level.

> *ptr is char, which is most probably a signed
> type and can be negative.

And is probably 8 bits.

> (*ptr & 0xC0) is int and appears to be positive

Not only appears to be: is.

The intermediate values will be unexpected, of course, but the
final result should be correct. (The expression *ptr might be
negative.)

> and of the desired value even if *ptr is negative, this is
> more by chance and not very portable.

Could you name an architecture where it wouldn't work? And
explain why, and what you'd get. (There is, perhaps, a brittle
part in filling the char[]. Formally, at least, it's possible
that the iostream library rejects any negative chars. In
practice, a compiler whose iostream library didn't support this
kind of thing won't be used, so you don't have to worry about it.)

> 0xbf is int and positive, '\xc3' is char and
> negative.

And? In all cases, integral promotion occurs. And when the &
is present, it ensures that the results must be positive.
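
A small sketch of that point, assuming an 8-bit, two's-complement signed char (the function names are made up for illustration): the operand is promoted to int, and masking with the positive int 0xC0 clears every bit above bit 7, so the result cannot be negative.

```cpp
#include <cassert>

// c is promoted to int before the &; the mask keeps only bits 6 and 7,
// so the result is one of 0x00, 0x40, 0x80 or 0xC0.
inline int top_two_bits(signed char c) {
    return c & 0xC0;
}

// The same test as in the posted code.
inline bool looks_like_lead_byte(signed char c) {
    return top_two_bits(c) > 0xBF;
}

inline void check_mask() {
    // -61 has the bit pattern ...1100 0011; masked, it becomes 0xC0 (192).
    assert(top_two_bits(static_cast<signed char>(0xC3)) == 0xC0);
}
```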

> I would rewrite this code about like this:
>
> const unsigned char *ptr =
>     reinterpret_cast<unsigned char*>(ArrayToTransform);
> while (*ptr) {
>     if((*ptr & 0xC0) > 0xbf)
>     {
>         if(*ptr == 0xc3)
>             // ...

Why bother?

Actually, I'd rewrite the code more fundamentally, to make it
clear what is actually being tested; if nothing else >= 0xC0,
rather than > 0xBF, but more likely with a switch on the results
of *ptr & 0xC0 (with four cases clearly delimiting the
possibilities).
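
A switch of that kind might look roughly like the sketch below. ByteKind and classify are hypothetical names, and the sketch only sorts bytes by their top two bits; it does not validate UTF-8 (for instance, 0xC0, 0xC1 and 0xF8 to 0xFF are reported as lead bytes although they never start a valid sequence).

```cpp
#include <cassert>

enum class ByteKind { SingleByte, Continuation, Lead };

// Classify a byte by its two most significant bits.
inline ByteKind classify(unsigned char b) {
    switch (b & 0xC0) {
    case 0x00:                      // 00xxxxxx
    case 0x40:                      // 01xxxxxx
        return ByteKind::SingleByte;
    case 0x80:                      // 10xxxxxx
        return ByteKind::Continuation;
    case 0xC0:                      // 11xxxxxx
        return ByteKind::Lead;
    }
    // Not reachable: the mask leaves only the four values handled above.
    assert(false && "unexpected masked value");
    return ByteKind::SingleByte;
}
```

Taking the parameter as unsigned char keeps sign questions out of the function entirely; the caller converts once at the boundary.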

Jorgen Grahn · Dec 29, 2010

> Really? Please post the entire program. I'm asking because I just tried
>
> #include <cassert>
>
> int main()
> {
>     char c = '\xc3';
>
>     assert(c == '\xc3');
> }
>
> And it passed with flying colors (as it should). So, you're either
> mistaken about your first case or you're lying intentionally to make
> your point. I don't like the latter, and hopefully it's not true.

Interesting. I read his first line

> char c = '\xc3' or '\xc4' ect...

as actually containing the token 'or', the synonym for ||. Then his
problems would make perfect sense.

The later postings showed this wasn't what he really meant, though ...
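
For the record, a sketch of what that reading would do; with_or_token is an invented name. In C++ 'or' is an alternative token for ||, so the initializer is a logical expression whose result is true, and true converts to the char value 1.

```cpp
#include <cassert>
#include <iso646.h>  // no effect on conforming C++ compilers; some MSVC
                     // modes need it for 'or' to be recognized

// The first line of the original post, read literally as C++.
inline char with_or_token() {
    char c = '\xc3' or '\xc4';  // same as '\xc3' || '\xc4', which is true
    return c;                   // true converts to 1
}

inline void check_or() {
    assert(with_or_token() == 1);
}
```

With c equal to 1, none of the equality comparisons against 0xC3 could succeed, and the mask test would fail too, since 1 & 0xC3 is 1.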

/Jorgen
