matching assembly speed for small string comparison

Discussion in 'C Programming' started by qak, Aug 31, 2013.

  1. qak

    Tim Rentsch Guest

    Actually both sides are undefined behavior regardless of
    alignment, because effective type conditions are not met.
    Tim Rentsch, Aug 31, 2013

  2. (snip, I wrote)
    I would probably do multi-character constants over inline assembler,
    but, yes, memcmp() is the best way.

    -- glen
    glen herrmannsfeldt, Aug 31, 2013

  3. qak

    Ian Collins Guest

    Or strncmp.

    One compiler I tried (Sun cc) generates the same code for strncmp as
    for the OP's slow and long case, but calls memcmp. gcc uses a loop for both.
    Ian Collins, Aug 31, 2013
  4. Specifically, many compilers generate inline code for these.
    There is no function call overhead in that case.

    It is a little less obvious that the inline code is optimal for
    specific known size cases. But only a little.

    -- glen
    glen herrmannsfeldt, Sep 1, 2013
  5. qak

    Eric Sosman Guest

    What's the byte ordering of 0x54657374? *That's* where
    qak's "optimization" is very likely wrong: If his machine is
    Little-Endian (as seems likely from the one line of assembly
    exhibited), then 0x54657374 compares equal to the first four
    characters of "tseT". Not only is his code "hard to change,"
    it's also "hard to get right in the first place."

    It's said that optimization is the art of getting the wrong
    answer sooner, and qak's code is a perfect example. qak: Use
    memcmp(); life's too short to waste on insignificances.

    (An old acquaintance spoke of removing cigarette butts and
    bottle tops from the beach, "so the sand would be nice and clean
    around the whale carcases." That's probably what qak is up to.)
    Eric Sosman, Sep 1, 2013
  6. qak

    qak Guest

    Somehow, I can't reply to any of the other replies (the reply button is
    greyed out and their icons have turned to red diamonds; closed or
    deleted, maybe?). If I post another message, the thread is lost. So here
    is my general reply:
    1) My endianness (in the "hard to change" example) is wrong; it was just
    an example and a quick look-up gave the wrong order.
    2) Some 'experts' don't seem to realize 'cmp' is a single assembly
    instruction; memcmp, strcmp, strncmp... even after optimization all turn
    into many 'cmp'(s): cmp to see if the pointer is null, cmp to see if the
    pointer is aligned, cmp to see if it can compare a word or dword at a time,
    cmp to see if the end is reached... How can they match assembly speed?
    3) Without alignment, the 'cmp' still returns correctly, only slower.
    4) To christian.bau: I'm looking for a better solution, a good macro maybe?
    Thanks to all who participated.
    qak, Sep 1, 2013
  7. qak

    Eric Sosman Guest

    The fact that you got it wrong should be cautionary. If even
    a simple example has a bug, how will you fare when things get more
    complicated?

    First, the "single assemply [sic] instruction" you exhibited
    will not in fact do the job. It will perform the comparison and
    set a flag somewhere -- and then proceed merrily to the next
    instruction, just as if nothing had happened. You need at least
    one further instruction to get the program to do This versus That
    depending on the outcome of the comparison. In short, your "single
    assembly instruction" is not equivalent to *any* C `if' construct;
    it is equivalent to something like

    memcmp(p, "Test", 4);

    .... with no test. Many compilers will recognize that this statement
    has no effect and will optimize it away (typically with a warning),
    leaving *zero* instructions in the code -- *much* faster than your
    "single assembly instruction."

    But that's a quibble: We all know that there'll be a conditional
    branch or some such, and probably an unconditional branch as well
    if there's an `else'. What everyone's trying to tell you is that
    your effort is almost certainly misplaced. How many nanoseconds do
    you hope to save, and how many nanoseconds have you *already* spent
    in hopes of saving them?

    Let's just imagine that it takes you one minute to look up the
    four characters in an ASCII chart, type in their hex values, proofread
    them, and double-check the endianness (since you've already shown that
    endianness is a trap for you, omitting the double-check would be Really
    Dumb). Okay, one minute. Now let's suppose that this effort saves
    ONE HUNDRED NANOSECONDS per comparison, wow! You will *break even*
    after six hundred million comparisons (one minute is 60,000,000,000
    nanoseconds; divided by 100 ns saved per comparison, that's
    600,000,000 comparisons).

    Now, six hundred million is not a Really Enormous number of
    comparisons: You'd expect about that many when sorting twenty-five
    million items, for example. But remember: Your optimization only
    applies when comparing to a *constant*, so scenarios like big sorts
    just aren't applicable. I put it to you that either (1) your code
    will never even come close to making six hundred million string-
    vs-constant comparisons, or (2) you've got an infinite loop.

    What the `experts' are trying to tell you is that there are
    almost certainly other parts of your program whose optimization
    will make much more difference than this one could ever hope to.
    Eric Sosman, Sep 1, 2013
  8. qak

    BartC Guest

    You can get fast short-string compares by forgetting about strings and just
    using ints:

    if (a == 'ABCD') ...

    However C seems to have a problem turning 'ABCD' into an int value in a
    consistent, portable manner. (Maybe there are macros that can be created,
    but if it involves having to write stuff like ('A'<<24)+('B'<<16)... then
    forget it.)

    But if 'ABCD' is handled properly on the systems you're interested in, then
    that's simplest. Endian-ness may or may not be relevant. (How does ASM
    denote 'ABCD' anyway?)

    And are strings always going to be 4 characters? I think shorter strings
    will be OK, but if some will be longer, or you might be comparing short with
    long, then it might be simpler to just stick with strcmp(), or use some
    entirely different scheme depending on what you're trying to do.
    BartC, Sep 1, 2013
  9. The value of such a character constant is implementation-defined. On
    one system I tested, the byte order of 'ABCD' is the opposite of the
    byte order of "ABCD".

    And you'll have problems on systems where sizeof (int) < 4.

    I wouldn't consider using 'ABCD' in any code intended to be portable.
    I wouldn't willingly use it even in non-portable code.
    You only have to write that once. If you want to reject the idea
    because of that, you can.
    I don't know what you mean by "handled properly".
    The OP presented this:
    as the assembly code he's trying to replicate in C, but hasn't told us
    what CPU he's using (I think "esi" is x86-specific, but I'm not
    certain), or how his assembly treats 'Test'.
    Keith Thompson, Sep 2, 2013
  10. qak

    Siri Cruise Guest

    This is an old CDC programming trick. A lot of fields were 1, 2, 3, 7, or 10
    character strings that were compared as integers. It also meant that when you
    dumped memory in characters and octal, you could easily find and decode these
    values.
    Apple also uses this technique from way back in 1984. A compiler that works with
    Macintosh systems will do this consistently.
    You can't do switches on strings.
    Siri Cruise, Sep 2, 2013
  11. qak

    BartC Guest

    That might not matter, if you're only interested in comparing the whole
    string.
    Naturally, you would have to use an int size that is 4 chars long (or even 8).
    It will usually be more than once, as there could be hundreds of these
    constants. And normal development involves deleting, adding and modifying
    these all the time. Maybe there are editor tricks that can be used to
    replicate the pattern (but it still looks terrible). If it was possible to
    have a macro that looked like INTSTR("ABCD") then it wouldn't be so bad (but
    I think not, and it would need different versions for different lengths).
    By ending up with an int value that looks like 0x41424344 or 0x44434241 or
    something along those lines. Just a unique representation of a 4-char string
    that can be stored in a single int value. Shorter strings would need to be
    justified towards the lsb (but I think that happens anyway otherwise 'A'
    would be 0x41000000).

    Clearly treating such multi-char constants as int values would be extremely
    useful. C would have allowed these, but I suspect that a number of
    implementations that did things slightly differently meant it couldn't be
    standardised. (FWIW my gcc stores 'ABCD' as 0x41424344, but it gives a
    warning that I don't know how to disable. And it doesn't like 'ABCDEFGH'.)

    BartC, Sep 2, 2013
  12. qak

    James Kuyper Guest

    I suppose that's true, but only because you used the word "might" -
    which acknowledges that the flip side of your statement is also true: it
    might matter, even if you're only interested in comparing the whole
    string. If the integer constant that you need to use to match "ABCD" in
    this fashion is 'DCBA', it matters very much whether or not you're aware
    of that fact.
    Why would you have to write the macro more than once, just because
    you'll be using it hundreds of times?

    And normal development involves deleting, adding and modifying
    Why do you think that wouldn't be possible?
    #define INTSTR(string) (((string)[0]<<24) + ((string)[1]<<16) + ...

    It depends upon the implementation-specific byte order of 'int', and
    needs rewriting to be useful on systems where INT_MAX is too small, but
    you've already conceded those portability issues. The result wouldn't be
    an integer constant expression, which restricts its potential uses.
    However, I hadn't thought that the intended use would be one that
    requires an ICE (integer constant expression).
    When applied to a string literal, it is an integer expression with a
    value that can be computed at compile time, and I would hope that many
    compilers would actually do so. I haven't bothered checking; I would
    never have any use for something so thoroughly unportable.
    James Kuyper, Sep 2, 2013
  13. qak

    Tim Rentsch Guest

    Let's see if we can reach some useful conclusions on the question.

    I assume you are interested primarily in program speed, and either
    not concerned, or less concerned, with generated code size.

    As others have pointed out, it's likely any performance gain here
    will be down in the noise relative to many other issues.

    Having said that, if you have decided it's important, the only
    sure way to answer the question is try out different approaches
    and measure.

    Different techniques will have different performance results on
    different hardware and under different compilers, so measurements
    should be done on a representative set of platforms.

    Which approach is best (using "best" somewhat guardedly) depends
    on some things you haven't brought up. Some examples: is this
    test going to be an isolated test, or will there be lots of
    comparisons against other word choices? Will the comparisons
    typically succeed or typically fail? If they typically fail,
    what is the breakdown of how many initial characters match?

    Starting at the end, if this is one isolated test, and most
    comparisons have a mismatch on the first letter, the method
    you describe as "slow and long" will likely be fastest or
    nearly fastest. (To clarify - this is what my tests show on
    two different platforms, but don't take my word for it, try
    measuring yourself.)

    If this is one isolated test, but tests typically succeed
    or match on at least two initial characters, you might get
    a slight performance advantage from using one of the full
    word comparison approaches (compared to the "slow and long"
    method). Unlikely to be worthwhile, but again the only
    way to be sure is measure.

    If the test is not to find a single word, but doing lots
    of comparisons against, eg, a table of words, this can be
    done quickly, simply, and safely using unions, like this:

    union char4_unsigned32 {
        char s[4];
        uint32_t u;
    };

    union char4_unsigned32 four_letter_words[] = {
        { "This" }, { "That" }, { "Stop" }, { "Test" }  /* example entries */
    };

    int
    find_word( const char *p ){
        union char4_unsigned32 pu;
        uint32_t u;
        int i;
        int i_limit = sizeof four_letter_words / sizeof four_letter_words[0];

        memcpy( pu.s, p, 4 );
        u = pu.u;
        for( i = 0; i < i_limit; i++ ){
            if( u == four_letter_words[i].u ) break;
        }
        return i;
    }
    The code shown above is safe and portable (assuming a suitable
    typedef for uint32_t), and all comparisons are done using a
    simple 32-bit-word compare. The loop should run very fast,
    and the setup overhead is small.

    Given all the above, it's pretty unlikely that comparing against
    inline integer constants (either as hex/decimal numbers or
    character constants) will be the method of choice. But, if you
    do choose to go down that path, it's probably a good idea to
    generate the constants programmatically, eg, by generating a
    header file, and refer to the constants symbolically in your
    program. Note that using the string literal approach, eg,

    *(int*)p == *(int*)"Test"

    produced (in my trials) performance results very close to
    using straight integer constants, so using string literals
    might be preferable so the code is easier to write and

    In my performance tests, memcmp() and strncmp() were both slower
    than all other methods tested, by factors ranging from two to
    five, depending on platform and number of leading characters
    matching the target word ("Test"). Again, don't take my word
    on any of these results - take measurements yourself to be sure.

    General performance advice: starting out, write the simplest
    and most obvious code you can think of, and worry about these
    kinds of micro-issues only after the program is running. Then
    if you still think performance is a problem, go back and take
    measurements as part of evaluating different approaches. I
    suspect you'll be surprised by the results (I was, even for
    the simple set of tests that I did).

    Remember when doing measurements to take into account the
    relative frequencies of which paths are taken, and also
    the expected mix of different platforms on which the
    program will run. Different platforms often will favor
    different approaches, and different approaches may be
    faster or slower depending on whether the test is likely
    to succeed or likely to fail, especially if it fails early.

    Good luck!
    Tim Rentsch, Sep 2, 2013
  14. qak

    Tim Rentsch Guest

    Actual measurements show otherwise.

    Tests run on two different platforms show memcmp() and strncmp()
    both running slower by factors of at least 2.5 compared to
    copying into an aligned buffer and then doing a single 32-bit compare.
    Tim Rentsch, Sep 2, 2013
  15. qak

    BartC Guest

    That's why I said to think of it as int operations rather than string ones.
    So the implementer has to be aware of the multi-char format and make sure
    both sides of == use the same.
    Well, OK, the << and + will be in the macro. You just have to write or edit
    code such as MACRO('A','B','C','D') many times.
    Writing just "X"[0] in places where a constant is expected gives an error on
    gcc. If gcc doesn't support something then I would call *that* unportable!
    Without this, making such a macro that yields a compile-time constant is not
    going to work.
    BartC, Sep 2, 2013
  16. qak

    James Kuyper Guest

    I suppose if you want to define it to be used that way, it would be
    rather inconvenient, which is one reason why I wouldn't do so.
    As I said in text that you've clipped, it isn't an integer constant
    expression. Therefore, using it in a context where one is required would
    be a constraint violation. But if(a == INTSTR("Test")) is NOT such a context.

    Places where constant expressions are required include the size of a
    bit-field, the value of an enumeration constant, the argument of an
    _Alignas() specifier, in the subscript of an array element designator,
    the initializer for an object of static or thread local storage
    duration, the first argument of a _static_assert(), in a case label, and
    in a #if or #elif directive. The only ones of those that sound like
    plausible contexts for this macro are the initializer and the case label.
    James Kuyper, Sep 2, 2013
  17. qak

    qak Guest

    Even if used only once, we should use the more efficient method.
    Knowledge can be re-applied elsewhere.
    I can't stand code like this:
    if(!strcmp(p, "This"))...
    else if(!strcmp(p, "That"))...
    else if(!strcmp(p, "Other"))...
    Note that 'Other' doesn't fit into an integer, but if I do the first 4
    bytes, eliminating the majority of cases, then one more test to see if it
    is zero terminated, or some short string... with 8 bytes or even 12 bytes
    we still win (2.5 times, per your timing test in the other message), not
    to mention with the advance of 64-bit machines.

    This is brilliant, master. Thank you very much; it never occurred to me
    that we could do that.
    qak, Sep 3, 2013
  18. qak

    BartC Guest

    Watch out for:

    * Alignment (if it might be a problem on the machines you're likely to run
    this on)

    * p pointing to strings shorter than 3 characters, because there will be
    garbage following the nul terminator (but I think an ==/!= compare will
    still work provided the right-hand-side is at least 3 characters so doesn't
    have garbage of its own).

    * (There is also the unlikely situation where p points to a short string at
    the end of a memory block, and accessing a full 4 bytes causes a memory
    fault.)

    * p containing strings longer than 4 characters, because "Testa" will
    wrongly match "Test"

    * Also that the right-hand-side is likely to be a memory reference rather
    than a constant such as 'ABCD', which might adversely affect the timing, if
    it is as critical as you make out. (But if the processor is an x86, it could
    well be faster, because what is fast or slow is unintuitive.)
    BartC, Sep 3, 2013
  19. qak

    Tim Rentsch Guest

    Let's try this again.

    There is no such thing as "the" most efficient method. Depending
    on circumstances different choices, including the "slow and long"
    method, will exhibit better performance than other approaches,
    and may end up being a better choice overall.

    There is no substitute for measuring actual performance.

    Putting in effort for negligible performance gains is a
    waste of time and a mark of a poor programmer.

    In almost all cases coding choices should be limited to code
    that is safe and widely portable, especially when the safe
    and widely portable methods have nearly identical performance
    characteristics as the "faster" versions do.

    If you still feel a need to spend time on such micro-level
    performance questions, read through comments in the thread
    again until the lessons sink in.

    It wasn't my idea, it came from Ben Bacarisse in a direct
    response to your original posting. You haven't been paying
    attention. Please add "pay attention when you read" to
    the list of development practices you should be following.
    Tim Rentsch, Sep 3, 2013
  20. (snip)
    Depending how many there are to test, it is common to use a hash
    table to look up the string values.

    Then again, converting the strings to 32 bit integers is, more or
    less, hashing.

    -- glen
    glen herrmannsfeldt, Sep 3, 2013
