wtf is happening here @ bitwise comparison

Discussion in 'C++' started by tschmittldk, Dec 22, 2010.

  1. tschmittldk

    tschmittldk Guest

    Hey guys... I had an issue today at the university which I really don't
    understand:

    char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....

    now i tried to compare several times:
    if(c == '\xc3')
    if((unsigned int)c == 0xc3)
    if((int)c == 0xc3)
    if((unsigned int)c == (unsigned int)0xc3)

    All of them evaluate to false and execution goes on. But when I do a
    very stupid bitwise comparison first, it works:

    if(((unsigned int)c & 0xc3) == 0xc3)

    Can anyone explain that to me? I really don't get the difference
    between if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
    0xc3).

    Best regards
    Tobias
     
    tschmittldk, Dec 22, 2010
    #1

  2. On 12/22/2010 7:59 AM, tschmittldk wrote:
    > Hey guys... I had an issue today in the university which i really dont
    > understand:
    >
    > char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
    >
    > now i tried to compare several times:
    > if(c == '\xc3')
    > if((unsigned int)c == 0xc3)
    > if((int)c == 0xc3)
    > if((unsigned int)c == (unsigned int)0xc3)
    >
    > All of them negate and go on.


    Really? Please post the entire program. I'm asking because I just tried

    #include <cassert>

    int main()
    {
    char c = '\xc3';

    assert(c == '\xc3');
    }

    And it passed with flying colors (as it should). So, you're either
    mistaken about your first case or you're lying intentionally to make
    your point. I don't like the latter, and hopefully it's not true.

    > But when i do a very stupid bitwise
    > comparison before it works:
    >
    > if(((unsigned int)c & 0xc3) == 0xc3)
    >
    > can anyone explain that to me? I really don't get the difference
    > betweet if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
    > 0xc3).


    The trick with the other three initial equality comparisons is that the
    explicit promotions and conversions apparently have a different effect
    than the default ones.

    The value of 'c' (which is likely only 8 bits long) is *negative*
    according to your initialization (and is -61). The value 0xC3 (an
    implicit int) is positive (+195). Convert -61 (which undergoes an
    implicit conversion to int first) to unsigned, and you get 0xFFC3 with
    a 16-bit int (0xFFFFFFC3 with a 32-bit int), which is definitely not
    equal to 0xC3. Converting to int (your third comparison) just makes
    explicit the usual implicit one. In the fourth comparison, casting
    0xC3 to unsigned int makes no difference; the value does not change.

    The problem you have is that your 'c' is *signed* and *negative*.
    Please study explicit and implicit integral promotions and arithmetic
    conversions to get to the bottom of what's happening.
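The conversions described above can be checked with a small sketch. It assumes 8-bit bytes and uses signed char (rather than plain char) so the result does not depend on the compiler's choice of char signedness; the function names are made up for illustration.

```cpp
#include <cassert>
#include <climits>

// Assumption: 8-bit bytes. signed char stands in for a plain char that
// happens to be signed, which is implementation-defined.

// Second comparison from the question: c is converted to unsigned int,
// so -61 becomes UINT_MAX - 60, which is not 0xc3.
inline bool equals_as_unsigned_int(signed char c) { return (unsigned int)c == 0xc3; }

// Third comparison: -61 as an int is still -61, not 195.
inline bool equals_as_int(signed char c) { return (int)c == 0xc3; }

// The mask keeps only bits that are set in 0xc3, which hides the
// sign-extension bits above the low byte.
inline bool equals_after_mask(signed char c) { return ((unsigned int)c & 0xc3) == 0xc3; }

// Converting to unsigned char first maps -61 to 195 (modulo 256).
inline bool equals_as_unsigned_char(signed char c) { return static_cast<unsigned char>(c) == 0xc3; }

// The value the comparison actually sees after the cast to unsigned int.
inline unsigned int widened(signed char c) { return static_cast<unsigned int>(c); }
```

Called with c = -61, the first two functions return false and the last two return true, which matches what the original poster observed.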

    V
    --
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Dec 22, 2010
    #2

  3. tschmittldk

    SG Guest

    On 22 Dez., 13:59, tschmittldk wrote:
    >
    > char c  = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
    >
    > now i tried to compare several times:
    > if(c == '\xc3')


    Really? This fails? Weird...

    > if((unsigned int)c == 0xc3)
    > if((int)c == 0xc3)
    > if((unsigned int)c == (unsigned int)0xc3)
    >
    > All of them negate and go on. But when i do a very stupid bitwise
    > comparison before it works:
    >
    > if(((unsigned int)c & 0xc3) == 0xc3)
    >
    > can anyone explain that to me? I really don't get the difference


    A couple of hints:
    - integral promotion
    - 'char' appears to be a signed type in your case

    Before the comparison operator is applied, the usual arithmetic
    conversions (which begin with integral promotion) convert both
    operands to a common type that's at least 'int'. Assuming 'char' is a
    signed 8-bit type, '\xc3' represents a negative number. Assuming the
    popular two's complement, its value is -61. Even (unsigned int)c gives
    you a value like 0xF...FC3 due to the rules about converting signed to
    unsigned values.

    Btw, your bit mask trick is neither portable (w.r.t. signed value
    representations) nor correct (false positives).
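To make the "false positives" remark concrete, here is a minimal sketch comparing the mask test with an exact comparison. It works on unsigned char to keep signedness out of the picture; the two function names are hypothetical.

```cpp
#include <cassert>

// The mask test from the question: true whenever every bit of 0xc3 is
// set in the byte, not only when the byte is exactly 0xc3.
inline bool mask_test(unsigned char byte) { return (byte & 0xc3) == 0xc3; }

// The comparison that was actually intended.
inline bool exact_test(unsigned char byte) { return byte == 0xc3; }
```

Bytes such as 0xC7, 0xE3 and 0xFF pass mask_test although they are not 0xC3, while 0xC4 (the other lead byte mentioned in the question) fails it.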

    I'd simply use unsigned char and unsigned types. The C++ standard
    allows you to use a pointer of type "unsigned char*" to point to a
    char array.
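A minimal sketch of that suggestion, assuming a NUL-terminated char array: the bytes are read through an unsigned char pointer, so every value lies in 0..255 and can be compared with 0xc3 directly. The helper count_c3 is hypothetical and only there to show the pointer cast.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts the bytes equal to 0xC3 in a NUL-terminated
// char array by reading it through an unsigned char pointer.
inline std::size_t count_c3(const char *text)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(text);
    std::size_t n = 0;
    for (; *p != 0; ++p) {
        if (*p == 0xc3)   // *p promotes to an int in 0..255, never negative
            ++n;
    }
    return n;
}
```

For example, count_c3("a\xc3\xb3") finds one such byte regardless of whether plain char is signed on the platform.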

    Cheers!
    SG
     
    SG, Dec 22, 2010
    #3
  4. tschmittldk

    tschmittldk Guest

    Okay, thanks for all your answers. I'll try it tomorrow and post the code
    then (I left my notebook in my student flat...). But it seems
    clearer to me now, thanks!
     
    tschmittldk, Dec 22, 2010
    #4
  5. tschmittldk

    tschmittldk Guest

    On 22 Dez., 19:45, tschmittldk <> wrote:
    > Okay thanks for all your answers. I try it tomorrow and post the code
    > then (I left my notebook in my student flat...). But it seems more
    > clearly to me now, thanks!


    Okay, now here's the code:

    void codevert(char *ArrayToTransform)
    {
        int j = 0;
        char *ptr = ArrayToTransform;
        while (*ptr != '\0') {
            if((*ptr & 0xC0) > 0xbf)
            {
                if(*ptr == '\xc3')
                    simplifier_correct(3, ptr++);
                else if(*ptr == '\xc4')
                    simplifier_correct(3, ptr++);
                else if(*ptr == '\xc4')
                    simplifier_correct(3, ptr++);
                else
                    std::cout << "E01";
            }
            ptr++;
        }
    }

    It runs through and just checks if the byte is a lead byte and passes
    it to different map functions, which replace the byte with a normal
    ASCII letter. For example, turning an ó into o or an À into A.

    Now I just need to kill the lead byte and it's done.
     
    tschmittldk, Dec 23, 2010
    #5
  6. tschmittldk

    tschmittldk Guest

    > This is all very brittle.
    Sorry I'm new to c++ ;).

    > I would rewrite this code about like this:
    >
    >      const unsigned char *ptr = reinterpret_cast<unsigned char*>
    > (ArrayToTransform);
    >      while (*ptr) {
    >           if((*ptr & 0xC0) > 0xbf)
    >           {
    >                if(*ptr == 0xc3)
    >                // ...


    I mostly fixed my program with your code. The only thing: I cannot use
    *ptr as const, because simplifier_correct gets ptr as a referenced
    variable and writes into its value.

    We have this now:

    void unicodevert(char *ArrayToTransform) // works
    {
        int j = 0;
        unsigned char *ptr = reinterpret_cast<unsigned char*>(ArrayToTransform);

        //char *ptr = ArrayToTransform;
        while (*ptr)
        {
            if((*ptr & 0xC0) > 0xbf) // is Leadbyte?!
            {
                // Check which Leadbyte and give the right information to
                // simplifier_correct...
                if(*ptr == 0xc3)
                    simplifier_correct(3,(ptr+1));
                //....




    And

    // search is non-const here because the function writes through it
    void simplifier_correct(int j, unsigned char *search)
    {
        unsigned char *buff = search;
        if(j == 4)
        {
            for(int i=0; i<3 ;i++) {
                buff = _mbspbrk(gsC4UCHAR_CONVMAP.MAP, search);
                if(buff != NULL)
                    *search = gsC4UCHAR_CONVMAP.REPLACER;
            }
        }
        //... with other cases, but it's all the same code with other maps.


    Another thing... I tried to use "memmove" to overwrite the lead byte in
    the char array, like:
    "helloworld" should become "hellworld" if o was a lead byte. But I got
    access violation errors all the time. So I coded:

    unsigned char *ptr3 = ptr;
    unsigned char *ptr2 = (ptr+1);
    while(*ptr2)
    {
        *ptr3 = *ptr2;
        ptr3++;
        ptr2++;
    }
    *ptr3 = '\0';

    I know it only works "to the left", but that is all I need. Do
    you think that is okay? I mean... it does mostly the same as memmove
    does.
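For comparison, a minimal sketch of the same left shift done with memmove. The helper erase_byte is hypothetical, and it assumes p points at a non-NUL byte inside a writable, NUL-terminated buffer. The calling code isn't shown in the thread, so this is only a guess at a cause worth ruling out: passing a string literal, which is not writable, would typically end in an access violation.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: removes the byte at p by moving the rest of the
// string, including its terminating '\0', one position to the left.
// Precondition (assumed): *p != '\0' and the buffer is writable.
inline void erase_byte(char *p)
{
    // strlen(p + 1) counts the bytes after p; the + 1 copies the '\0' too.
    std::memmove(p, p + 1, std::strlen(p + 1) + 1);
}
```

memmove is used rather than memcpy because source and destination overlap. With a buffer declared as char buf[] = "helloworld", calling erase_byte(buf + 4) leaves "hellworld".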


    Thanks for help
    best regards
    Tobias
     
    tschmittldk, Dec 23, 2010
    #6
  7. tschmittldk

    Paul N Guest

    On Dec 22, 12:59 pm, tschmittldk <> wrote:
    > Hey guys... I had an issue today in the university which i really dont
    > understand:
    >
    > char c  = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
    >
    > now i tried to compare several times:
    > if(c == '\xc3')
    > if((unsigned int)c == 0xc3)
    > if((int)c == 0xc3)
    > if((unsigned int)c == (unsigned int)0xc3)
    >
    > All of them negate and go on. But when i do a very stupid bitwise
    > comparison before it works:
    >
    > if(((unsigned int)c & 0xc3) == 0xc3)
    >
    > can anyone explain that to me? I really don't get the difference
    > betweet if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
    > 0xc3).


    Other people have gone into the detail of this but there is one detail
    that *might* be causing problems.

    In the language C, '\xc3' has type int. In the language C++, '\xc3'
    has type char. So the exact same code can give different results,
    depending on whether you feed it into a C compiler or a C++ compiler.

    For good measure, many C++ compilers actually include a C compiler
    which, if told to compile a C program, will compile the code as if it
    is a C program. So you need to be sure you are driving the compiler
    correctly. It might be a useful test to include something in your
    program which is valid C++ but not valid C, just to make sure you are
    using the right language.
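One way to build such a check, as a sketch: __cplusplus is defined only by C++ compilers, and static_assert (assuming C++11 or later) can confirm that a character literal has the size of a char rather than of an int. The function name is made up for illustration.

```cpp
#include <cassert>

#ifndef __cplusplus
#error "This file is being compiled as C, not C++"
#endif

// In C++ a character literal has type char, so its size is 1.
// In C it has type int, so this would not hold there.
static_assert(sizeof('\xc3') == 1, "character literals should have type char in C++");

// Run-time form of the same check, for compilers without static_assert.
inline bool literal_has_char_size() { return sizeof('\xc3') == sizeof(char); }
```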

    Hope that helps.
    Paul.
     
    Paul N, Dec 23, 2010
    #7
  8. tschmittldk

    James Kanze Guest

    On Dec 23, 9:03 am, Paavo Helde <> wrote:
    > tschmittldk <> wrote in news:9139bceb-5be4-
    > :
    > > On 22 Dez., 19:45, tschmittldk <> wrote:
    > >> Okay thanks for all your answers. I try it tomorrow and
    > >> post the code then (I left my notebook in my student
    > >> flat...). But it seems more clearly to me now, thanks!


    > > Okay, now here's the code:


    > > void codevert(char *ArrayToTransform)
    > > {
    > > int j = 0;
    > > char *ptr = ArrayToTransform;
    > > while (*ptr != '\0') {
    > > if((*ptr & 0xC0) > 0xbf)
    > > {
    > > if(*ptr == '\xc3')
    > > simplifier_correct(3, ptr++);
    > > else if(*ptr == '\xc4')
    > > simplifier_correct(3, ptr++);
    > > else if(*ptr == '\xc4')
    > > simplifier_correct(3, ptr++);
    > > else
    > > std::cout << "E01";
    > > }
    > > ptr++;
    > > }
    > > }


    > This is all very brittle.


    Yes, but not for the reasons you imply. It's brittle because
    it only handles a very small subset of UTF-8. But presumably,
    the poster knows that, and accepts that any but a few specific
    two byte sequences will result in "E01". Not to mention the
    typo: the last two else if test exactly the same thing.

    There's nothing brittle about it at the C++ level.

    > *ptr is char, which is most probably a signed
    > type and can be negative.


    And is probably 8 bits.

    > (*ptr & 0xC0) is int and appears to be positive


    Not only appears to be: is.

    The intermediate values will be unexpected, of course, but the
    final result should be correct. (The expression *ptr might be
    negative.)

    > and of the desired value even if *ptr is negative, this is
    > more by chance and not very portable.


    Could you name an architecture where it wouldn't work? And
    explain why, and what you'd get. (There is, perhaps, a brittle
    part in filling the char[]. Formally, at least, it's possible
    that the iostream library rejects any negative chars. In
    practice, a compiler whose iostream library didn't support this
    kind of thing wouldn't be used, so you don't have to worry about it.)

    > 0xbf is int and positive, '\xc3' is char and
    > negative.


    And? In all cases, integral promotion occurs. And when the &
    is present, it ensures that the results must be positive.

    > I would rewrite this code about like this:


    > const unsigned char *ptr = reinterpret_cast<unsigned char*>
    > (ArrayToTransform);
    > while (*ptr) {
    > if((*ptr & 0xC0) > 0xbf)
    > {
    > if(*ptr == 0xc3)
    > // ...


    Why bother?

    Actually, I'd rewrite the code more fundamentally, to make it
    clear what is actually being tested; if nothing else >= 0xC0,
    rather than > 0xBF, but more likely with a switch on the results
    of *ptr & 0xC0 (with four cases clearly delimiting the
    possibilities).
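A sketch of what such a switch might look like. The enum and function names are hypothetical, and the classification only looks at the top two bits, so it does not validate UTF-8 (bytes like 0xFF still land in the "lead" case).

```cpp
#include <cassert>

// Hypothetical classification based only on the top two bits of a byte.
enum ByteKind { Ascii, Continuation, Lead };

inline ByteKind classify(unsigned char byte)
{
    switch (byte & 0xC0) {
    case 0x00:
    case 0x40:
        return Ascii;         // 0xxxxxxx: single-byte character
    case 0x80:
        return Continuation;  // 10xxxxxx: trailing byte of a sequence
    case 0xC0:
        return Lead;          // 11xxxxxx: first byte of a multi-byte sequence
    }
    return Ascii;  // not reached: the four cases cover every masked value
}
```

Taking an unsigned char parameter means a char argument is converted once at the call, and the switch itself never sees a negative value.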

    --
    James Kanze
     
    James Kanze, Dec 26, 2010
    #8
  9. tschmittldk

    Jorgen Grahn Guest

    On Wed, 2010-12-22, Victor Bazarov wrote:
    > On 12/22/2010 7:59 AM, tschmittldk wrote:
    >> Hey guys... I had an issue today in the university which i really dont
    >> understand:
    >>
    >> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
    >>
    >> now i tried to compare several times:
    >> if(c == '\xc3')
    >> if((unsigned int)c == 0xc3)
    >> if((int)c == 0xc3)
    >> if((unsigned int)c == (unsigned int)0xc3)
    >>
    >> All of them negate and go on.

    >
    > Really? Please post the entire program. I'm asking because I just tried
    >
    > #include <cassert>
    >
    > int main()
    > {
    > char c = '\xc3';
    >
    > assert(c == '\xc3');
    > }
    >
    > And it passed with flying colors (as it should). So, you're either
    > mistaken about your first case or you're lying intentionally to make
    > your point. I don't like the latter, and hopefully it's not true.


    Interesting. I read his first line

    char c = '\xc3' or '\xc4' ect...

    as actually containing the token 'or', the synonym for ||. Then his
    problems would make perfect sense.
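A small sketch of that reading, assuming a compiler that accepts the alternative token "or" (standard C++, though some compilers need a conformance switch for it). The function name is hypothetical.

```cpp
#include <cassert>

// Hypothetical illustration of the line read literally as C++.
inline char literal_with_or()
{
    // Parsed as '\xc3' || '\xc4'. Both operands are nonzero, so the
    // result is true, which converts to the char value 1.
    char c = '\xc3' or '\xc4';
    return c;
}
```

With c holding 1, every comparison against '\xc3' or 0xc3 would fail, whatever the signedness of char.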

    The later postings showed this wasn't what he really meant, though ...

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Dec 29, 2010
    #9