sorting 5 million rows of minute data: choose std::list or std::vector

Discussion in 'C++' started by MM, Feb 2, 2011.

  1. MM

    MM Guest

    I have about 5 million rows like these (day, minute, and 4 doubles):

    YYYYMMDD, HHMM, double, double, double, double

    which I parse from a file and store into

    struct one_line {
        boost::uint32_t day;
        boost::uint32_t minute;
        double d1, d2, d3, d4;
    };

    I may need to sort the container once the data is parsed, sorted by
    day and minute.

    Would sorting a std::list<one_line> be faster than std::vector<one_line>?
    Or would it be best to use std::map<Key, Value> where Key is
    std::pair<boost::uint32_t, boost::uint32_t> (day and minute)?
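
    Something like this is what I have in mind for the vector route (just a
    sketch; the comparator name is made up):

    #include <algorithm>
    #include <vector>
    #include <boost/cstdint.hpp>

    struct one_line {
        boost::uint32_t day;     // YYYYMMDD
        boost::uint32_t minute;  // HHMM
        double d1, d2, d3, d4;
    };

    // strict weak ordering on (day, minute)
    bool by_day_minute(const one_line& a, const one_line& b)
    {
        if (a.day != b.day) return a.day < b.day;
        return a.minute < b.minute;
    }

    // ...
    // std::vector<one_line> rows;  // filled while parsing the file
    // std::sort(rows.begin(), rows.end(), by_day_minute);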

    regards,
     
    MM, Feb 2, 2011
    #1

  2. MM

    MM Guest

    "Leigh Johnston" <> wrote in message
    news:...
    > On 02/02/2011 15:09, Leigh Johnston wrote:
    > On VC++ I find sorting a std::vector of your struct to be twice as fast as
    > sorting a std::list of your struct.
    >
    > /Leigh
    >

    Strange, I guess it depends on how unordered the data was.
    Both sorts are N log(N), aren't they?
    I'd think relinking pointers would be faster than moving 4 doubles and
    2 ints by value?

    thanks,
     
    MM, Feb 2, 2011
    #2

  3. MM <> wrote:
    > Would sorting a std::list<one_line> be faster than std::vector<one_line>?


    Is it really that hard to test it?
     
    Juha Nieminen, Feb 2, 2011
    #3
  4. MM <> wrote:
    > strange, I guess it depends on how unordered the list was.
    > both sorts are Nlog(N) , aren't they?
    > I'd think moving pointers is faster than value moving 4 doubles and 2 ints?


    A doubly-linked list takes significantly more memory than a vector
    (especially for such a relatively small element type), and the elements
    of the list will usually end up being significantly more dispersed in
    RAM while the elements in a vector will be at contiguous memory locations.
    This means that sorting (or even just traversing) the list will usually
    cause significantly more cache misses than a vector.
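
    A rough way to measure it yourself (only a sketch: the random fill and
    clock() timing are purely illustrative, and the struct is redeclared
    with plain unsigned just to keep the sketch free of Boost):

    #include <algorithm>
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <list>
    #include <vector>

    struct one_line { unsigned day, minute; double d1, d2, d3, d4; };

    bool by_day_minute(const one_line& a, const one_line& b)
    {
        return a.day != b.day ? a.day < b.day : a.minute < b.minute;
    }

    int main()
    {
        const std::size_t n = 5000000;
        std::vector<one_line> v(n);
        for (std::size_t i = 0; i < n; ++i) {
            v[i].day = std::rand();        // unordered input
            v[i].minute = std::rand();
            v[i].d1 = v[i].d2 = v[i].d3 = v[i].d4 = 0.0;
        }
        std::list<one_line> l(v.begin(), v.end());

        std::clock_t t0 = std::clock();
        std::sort(v.begin(), v.end(), by_day_minute);
        std::clock_t t1 = std::clock();
        l.sort(by_day_minute);
        std::clock_t t2 = std::clock();

        std::cout << "vector: " << double(t1 - t0) / CLOCKS_PER_SEC << " s\n"
                  << "list:   " << double(t2 - t1) / CLOCKS_PER_SEC << " s\n";
    }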
     
    Juha Nieminen, Feb 2, 2011
    #4
  5. MM

    Bo Persson Guest

    MM wrote:
    > "Leigh Johnston" <> wrote in message
    > news:...
    >> On 02/02/2011 15:09, Leigh Johnston wrote:
    >> On VC++ I find sorting a std::vector of your struct to be twice as
    >> fast as sorting a std::list of your struct.
    >>
    >> /Leigh
    >>

    > strange, I guess it depends on how unordered the list was.
    > both sorts are Nlog(N) , aren't they?
    > I'd think moving pointers is faster than value moving 4 doubles and
    > 2 ints?
    > thanks,


    Note that you have to update 6 pointers to move a node in a doubly
    linked list. :)

    Often std::vector has an edge by being more cache friendly: the elements
    are adjacent, and smaller than list nodes.
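
    To put rough numbers on that (a sketch; the exact node overhead is
    implementation-specific):

    #include <iostream>

    struct one_line { unsigned day, minute; double d1, d2, d3, d4; };

    int main()
    {
        // Typically 40 bytes: 2*4 for the integers plus 4*8 for the
        // doubles, with no padding needed.
        std::cout << "sizeof(one_line) = " << sizeof(one_line) << '\n';

        // A doubly-linked list node adds at least two pointers on top of
        // that, and each node is a separate heap allocation.
        std::cout << "per-node pointer overhead >= " << 2 * sizeof(void*)
                  << " bytes\n";
    }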



    Bo Persson
     
    Bo Persson, Feb 2, 2011
    #5
  6. MM

    Jorgen Grahn Guest

    On Wed, 2011-02-02, Juha Nieminen wrote:
    > MM <> wrote:
    >> Would sorting a std::list<one_line> be faster than std::vector<one_line>?

    >
    > Is it really that hard to test it?


    AOL. Test it and report your results here. We don't really know what
    will work best, or if the difference is big enough to matter.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Feb 3, 2011
    #6
  7. MM

    Jorgen Grahn Guest

    On Wed, 2011-02-02, MM wrote:
    > I have about 5million rows like these (day, minute, and 4 doubles)
    >
    > YYYYMMDD, HHMM, double, double, double, double
    >
    > which I parse from a file and store into
    >
    > struct one_line {
    > boost::uint32_t day;
    > boost::uint32_t minute;
    > double d1, d2, d3, d4;
    > };


    Nitpick: don't call HHMM 'minute', or you'll be surprised one day when
    you forget that 102 doesn't mean 102 minutes but 62 minutes.
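
    If you do keep the HHMM encoding in the file, a trivial helper avoids
    the trap (a sketch; the name is made up):

    // 0102 means 1 hour 2 minutes, i.e. 62 minutes since midnight.
    inline unsigned hhmm_to_minutes(unsigned hhmm)
    {
        return (hhmm / 100) * 60 + hhmm % 100;   // 102 -> 62
    }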

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Feb 3, 2011
    #7
  8. MM

    James Kanze Guest

    On Feb 4, 2:39 pm, (Yannick Tremblay) wrote:
    > In article <>, Jorgen Grahn <> wrote:
    > >On Wed, 2011-02-02, MM wrote:
    > >> I have about 5 million rows like these (day, minute, and 4 doubles)
    > >>
    > >> YYYYMMDD, HHMM, double, double, double, double
    > >>
    > >> which I parse from a file and store into
    > >>
    > >> struct one_line {
    > >>     boost::uint32_t day;
    > >>     boost::uint32_t minute;
    > >>     double d1, d2, d3, d4;
    > >> };
    > >
    > >Nitpick: don't call HHMM 'minute', or you'll be surprised one day when
    > >you forget that 102 doesn't mean 102 minutes but 62 minutes.
    >
    > Additional nitpick: don't use bit-specified integers unless you *need*
    > to know the size of the integers in bits. A normal "unsigned integer"
    > would have been perfectly fine here.


    Normal int would have been perfectly fine here. On the other
    hand, if he's got 5 million of them, short or even signed char
    might be more appropriate (although in practice, alignment
    requirements will probably mean that he won't gain anything by
    using a type smaller than int).
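
    To illustrate the alignment point (typical sizes, nothing guaranteed;
    the narrower encoding is hypothetical):

    #include <iostream>

    struct wide {                   // uint32_t-style fields
        unsigned day, minute;
        double d1, d2, d3, d4;
    };

    struct narrow {                 // smaller integer fields
        unsigned short day_index;   // e.g. days since some epoch
        unsigned char hour, min;
        double d1, d2, d3, d4;
    };

    int main()
    {
        // On a typical implementation both print 40: the doubles force
        // 8-byte alignment, so the bytes saved on the integers come back
        // as padding.
        std::cout << sizeof(wide) << ' ' << sizeof(narrow) << '\n';
    }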

    --
    James Kanze
     
    James Kanze, Feb 4, 2011
    #8
  9. Leigh Johnston <> wrote:
    > Eschewing unsigned integral types is sometimes irrational. *If* it
    > makes no sense to have negative days or minutes then why use a signed
    > integral type?


    Because if you are calculating the difference of two dates, you will
    have to deal with the wrap-around problem if you are using unsigned
    types, usually by explicitly casting the integrals being compared to
    signed (which kind of defeats the whole purpose of them being unsigned
    in the first place).

    There might not be negative days in dates, but there may well be when
    calculating differences between dates.
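
    A minimal illustration (assuming 32-bit unsigned; the values are
    arbitrary):

    #include <iostream>

    int main()
    {
        unsigned a = 102;   // e.g. two HHMM-style values
        unsigned b = 930;

        // Mathematically -828, but unsigned arithmetic wraps modulo 2^32,
        // so this prints 4294966468 with a 32-bit unsigned.
        std::cout << a - b << '\n';

        // The usual "fix" is to cast before subtracting -- exactly the
        // wart complained about above.
        std::cout << static_cast<int>(a) - static_cast<int>(b) << '\n';
    }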
     
    Juha Nieminen, Feb 5, 2011
    #9
  10. MM

    gwowen Guest

    On Feb 4, 5:19 pm, Leigh Johnston <> wrote:

    > Eschewing unsigned integral types is sometimes irrational.  *If* it
    > makes no sense to have negative days or minutes then why use a signed
    > integral type?


    There are sometimes good reasons. Unsigned overflow has defined
    semantics; signed overflow does not. If the bare silicon of the target
    platform's registers does not follow the defined unsigned semantics,
    using unsigned can result in a lot of unnecessary code generation
    (checking for overflow and implementing the correct wrap-around
    behaviour).

    I've written code for compilers/platforms for which

    int find(char foo, const char arr[], size_t sz) {
        for (int q = 0; q < sz; ++q) {
            if (arr[q] == foo) return q;
        }
        return sz;
    }

    produces *considerably* smaller code than

    int find(char foo, const char arr[], size_t sz) {
        for (unsigned q = 0; q < sz; ++q) {
            if (arr[q] == foo) return q;
        }
        return sz;
    }

    Of course, the latter has a larger domain of defined behaviour, but in
    reality you may well know that you're never going to hit either limit.
     
    gwowen, Feb 7, 2011
    #10
  11. MM

    gwowen Guest

    On Feb 7, 1:06 pm, Leigh Johnston <> wrote:

    >> I've written code for compilers/platforms for which [snip]

    > [Your correct code fix snipped]
    > In which case on VC++ I get:


    Then I think we can safely say that VC++ was not the platform to which
    I was referring, can't we? Sheesh, some people are so keen to be thought
    of as smarter than everyone else in the whole world, they end up looking
    like idiots.

    Stick to arguing with Paul, at least you *are* probably smarter than
    him.
     
    gwowen, Feb 7, 2011
    #11
  12. MM

    gwowen Guest

    On Feb 7, 1:42 pm, Leigh Johnston <> wrote:

    > Here the only difference is intel opcode jl versus intel opcode jb which
    > IIRC have the same performance characteristics.


    I'm not talking about an Intel processor, so the efficacy of Intel
    opcodes is utterly irrelevant.
     
    gwowen, Feb 7, 2011
    #12
  13. MM

    gwowen Guest

    On Feb 7, 1:58 pm, Leigh Johnston <> wrote:
    > It was simply a balanced, pertinent counter-example to your claim of
    > multiple platforms


    If you think "one platform agrees with me" provides a counterexample...

    > Like it or not the Intel platform is quite pervasive (not some esoteric platform that nobody cares much about).


    I know that. In fact, that was entirely my point. I was talking
    about an esoteric platform - specifically a custom DSP ASIC whose
    default signed integer arithmetic saturated (because on embedded DSPs,
    that is almost always what you want signed arithmetic to do - clip
    your signal rather than wrap it around).

    So every time you do an unsigned addition on that processor, the C
    standard insists that you check for overflow and correct the result
    if overflow is about to occur. If you use signed arithmetic, the
    compiler just loads the registers and calls the appropriate
    instruction. (In fact, if memory serves, unsigned arithmetic was done
    using a subroutine, while signed arithmetic was a single opcode,
    although I may have misremembered that.)

    As to specific cases where the compiler can use the "overflow is
    undefined" to optimize, with assembler generated, your attention is
    drawn to http://www.airs.com/blog/archives/120
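
    The canonical example of the kind of optimisation that post describes
    (a sketch; what any given compiler actually emits will vary):

    // With signed int the compiler may assume overflow never happens and
    // fold this to "return true".
    bool no_wrap(int x)   { return x + 1 > x; }

    // With unsigned the wrap-around is defined, so the comparison must be
    // kept: wrap(UINT_MAX) is required to return false.
    bool wrap(unsigned x) { return x + 1 > x; }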
     
    gwowen, Feb 7, 2011
    #13
  14. MM

    gwowen Guest

    On Feb 7, 2:42 pm, Leigh Johnston <> wrote:

    > The implementation specific behaviour of some esoteric platform (or any
    > specific platform for that matter) should not necessarily direct us as
    > to what constitutes good C++ coding practice.
    >


    Given that all I originally said was "There are sometimes good
    reasons" -- the thing which you claimed to have disproved with a
    counterexample [not sure how you disprove a "sometimes" with a
    counterexample, but hey ho] -- it's good to see you've come around.

    Unsigned integral types - I use them a lot, they're handy, especially
    when I'm writing code for Intel and other general purpose processors
    [though I bet there are some GPUs that work like DSPs for similar
    reasons]. Eschewing (good word, by the way) them without a good reason
    is pretty silly, but as I originally -- and correctly, as you now
    agree -- said *THERE ARE SOMETIMES GOOD REASONS*.
     
    gwowen, Feb 7, 2011
    #14
  15. MM

    gwowen Guest

    On Feb 7, 3:47 pm, Leigh Johnston <> wrote:

    > My original reply stated more or less what you claim I *now* agree with:


    So what was your "counterexample" meant to demonstrate?
     
    gwowen, Feb 7, 2011
    #15
  16. Leigh Johnston <> wrote:
    > Which is why I said "*If*" in my reply. On the implementation I use
    > static_cast<int> has no overhead and unlike some people I don't mind the
    > verbosity of the C++ style casts.


    It's not about verbosity. It's about clarity and the meaning of your
    code. If you say that a variable is unsigned and then later you cast it
    to signed, you *lied* in your variable declaration. That can only cause
    confusion.
     
    Juha Nieminen, Feb 7, 2011
    #16
  17. Leigh Johnston <> wrote:
    > On 07/02/2011 17:01, Juha Nieminen wrote:
    >> Leigh Johnston<> wrote:
    >>> Which is why I said "*If*" in my reply. On the implementation I use
    >>> static_cast<int> has no overhead and unlike some people I don't mind the
    >>> verbosity of the C++ style casts.

    >>
    >> It's not about verbosity. It's about clarity and the meaning of your
    >> code. If you say that a variable is unsigned and then later you cast it
    >> to signed, you *lied* in your variable declaration. That can only cause
    >> confusion.

    >
    > Wrong. The cast is sometimes required when using the unsigned variable
    > in an expression involving signed variables (as clearly indicated in the
    > example code I posted); a cast is not required in other contexts. It
    > seems to me that the only confusion is your understanding of the issue.


    But that's exactly why you need to make the original variable signed:
    Because it can be used in signed expressions. It makes no sense to make
    it unsigned when it will be used as a signed value anyways.

    Whenever you need to cast an unsigned variable to a signed one in an
    expression, that's a sign of bad design. You are using the variable in
    a way that contradicts its original declaration.
     
    Juha Nieminen, Feb 7, 2011
    #17
  18. MM

    Bo Persson Guest

    Leigh Johnston wrote:
    > On 07/02/2011 16:50, gwowen wrote:
    >> On Feb 7, 3:47 pm, Leigh Johnston<> wrote:
    >>
    >>> My original reply stated more or less what you claim I *now*
    >>> agree with:

    >>
    >> So what was your "counterexample" meant to demonstrate?

    >
    > It was simply to add some balance to your initial reply in which you
    > mention compilers/platforms (plural) producing *considerably*
    > smaller code when not using unsigned which I initially viewed as
    > argument.


    And now we know one platform (singular) where the code is not
    considerably smaller?


    Bo Persson
     
    Bo Persson, Feb 7, 2011
    #18
  19. MM

    Kaba Guest

    Leigh Johnston wrote:
    > Wrong; it is not a sign of bad design at all. If a variable is used as
    > an unsigned value more often than as a signed value and it logically
    > makes sense for it to be unsigned (e.g. it is never negative) then
    > making the variable unsigned makes perfect sense.


    --8x--

    > You might be in the "use int everywhere" camp but I am not. Perhaps
    > Java is a more suitable language for you rather than C++.


    For what it's worth, here's my take on this. This is a piece of text I
    wrote earlier (which explains its form). I am in the 'use int everywhere'
    camp :) Ideas can't and shouldn't be forced, but they can be shared; the
    following text reflects how I reason about my position, not that the
    other camps are wrong.

    In this article I present two problems in using unsigned
    integers in a program. These are the _zero boundary problem_,
    and the _extended positive range problem_.

    Zero boundary problem
    ---------------------

    Most of the time we are assuming that, away from boundaries caused
    by the finite representation, an integer object works like an element
    of the integer ring ZZ. In addition, it seems plausible that most
    integer calculations are concentrated around a neighborhood of zero.

    The _zero-boundary problem_ (of unsigned integers) is that zero is on
    the boundary beyond which the assumption of working with integers falls
    apart. Thus, the probability of introducing errors in computations
    increases greatly. Furthermore, those errors cannot be caught, since
    every value is legal.

    ### Looping with unsigned integers

    Looping over values from n to zero backwards demonstrates the zero
    boundary problem with unsigned integers:

    for (unsigned int i = n; i >= 0; --i)
    {
        // Do something.
    }

    Since there are no negative values to fail the loop test,
    this loop never finishes. In contrast, with signed integers the problem
    does not exist, since the boundaries are located far away from zero:

    for (int i = n; i >= 0; --i)
    {
        // Do something.
    }

    Extended positive range problem
    -------------------------------

    Conversion between an unsigned integer and a signed integer
    is an information destroying process in either direction.
    The only way to avoid this is to use one or the other
    consistently.

    If the normal arithmetic is the intention, then a signed integer
    represents a more general concept than an unsigned integer: the former
    covers both the negative and positive integers, whereas the latter
    only covers non-negative integers. In programming terms, it is
    possible to create a program using signed integers alone, however,
    the same can't be said about the unsigned integers. Therefore, if
    only one type is to be used consistently, the choice should be a
    signed integer. However, let us do some more analysis.

    Since any non-trivial program must use signed integers, the use of
    unsigned integers eventually leads to an unsigned-signed
    conversion. In particular, because `std::size_t` in the Standard
    Library is an unsigned integer, there are few programs that can
    escape the unsigned-signed conversions.
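
    A familiar example of where those conversions bite (a sketch):

    #include <iostream>
    #include <vector>

    int main()
    {
        std::vector<int> v;   // empty

        int n = -1;

        // In the comparison below n is converted to the unsigned type of
        // v.size(), so -1 becomes a huge value and the "wrong" branch runs.
        if (n < v.size())
            std::cout << "smaller\n";
        else
            std::cout << "bigger\n";   // this is what actually prints

        // Similarly, v.size() - 1 does not yield -1 for an empty vector,
        // but wraps to the largest value of std::size_t.
    }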

    Despite these conversions, programs still seem to work. The reason
    for this, I reflect, is that the unsigned integers are normally
    not taken into the extended positive range they allow.
    The _extended positive range problem_ is that if the unsigned integers
    are taken to their extended positive range, then the signed-unsigned
    conversions become a reality. Ensuring correctness under such a threat
    is hard, if not practically impossible.

    Conclusion
    ----------

    Using unsigned integers to model integers decreases the probability
    that a program is correct.

    --
    http://kaba.hilvi.org
     
    Kaba, Feb 7, 2011
    #19
  20. MM

    Kaba Guest

    Leigh Johnston wrote:
    > On 07/02/2011 21:58, Kaba wrote:
    > > The _zero-boundary problem_ (of unsigned integers) is that the zero
    > > is on the boundary beyond which the assumption of working with integers
    > > falls apart. Thus, the probability of introducing errors in computations
    > > increases greatly. Furthermore, those errors can not be catched, since
    > > every value is legal.

    >
    > It is debatable if the probability of introducing errors increases
    > greatly when using unsigned integral types; this is certainly an
    > assertion on your part, do you have any evidence to back it up?


    Consider any function taking an unsigned integer (e.g.
    std::vector::resize()). It is possible that the computation of a new
    size ends up, through a bug, being a negative number, which is then
    wrapped to a large positive number. This bug can't be caught, since
    every value is legal.
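
    For instance (a sketch; the numbers are made up, the mechanism is the
    point):

    #include <iostream>
    #include <stdexcept>
    #include <vector>

    int main()
    {
        std::vector<double> samples(10);

        int kept = 3, dropped = 7;
        int new_size = kept - dropped - 1;   // -5: clearly a bug

        try {
            // The negative value converts silently to a huge unsigned
            // number, so instead of a catchable "size is negative" error
            // we get an attempt to allocate billions of elements.
            samples.resize(new_size);
        }
        catch (const std::exception& e) {
            std::cout << "resize failed: " << e.what() << '\n';
        }
    }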

    Something similar can happen with signed integers: a large negative
    number wraps to a large positive number. But this is much less probable:
    the action is concentrated around zero.

    Correctness is the most important property of a program. In the
    described case using a signed integer makes an often-occurring bug
    visible.

    The problem of unchecked bugs gets worse with increasing levels of
    abstraction. In the worst case you dig through a deep stack of functions
    chasing an unexplained crash, when the actual problem was that an
    argument of the top-most function did not fulfil its precondition (e.g.
    size < 0). It's quite comparable to the thousand-line template errors
    you get when using Boost.Spirit.

    > > ### Looping with unsigned integers
    > >
    > > Looping over values from n to zero backwards demonstrates the zero
    > > boundary problem with unsigned integers:
    > >
    > > for (unsigned int i = n;i>= 0;--i)
    > > {
    > > // Do something.
    > > }

    >
    > for (unsigned int i = n + 1;i-- > 0;)
    > {
    > // Do something.
    > }


    Nice :) However, the point is that to carry out the task you have to be
    careful to avoid a bug. That's something that should not be left to
    humans (because, by Murphy's law...).

    > > Despite these conversions, programs still seem to work. The reason
    > > for this, I reflect, is that the unsigned integers are normally
    > > not taken into the extended positive range they allow.
    > > The _extended positive range problem_ is that if the unsigned integers
    > > are taken to their extended positive range, then the signed-unsigned
    > > conversions become a reality. Ensuring correctness under such a threat
    > > is hard, if not practically impossible.

    >
    > You ensure correctness as you would ensure correctness using any other
    > language feature.


    I don't think that can be afforded, because no one can put in the effort
    to check each unsigned-signed conversion for trouble. The practical way,
    which I also use, is to close one's eyes and avoid exercising a program
    at its extremes (e.g. 2^32 - 1 objects).

    Interestingly, the 64-bit integers help with the extended range problem
    _in practice_: the extreme values are then often avoided naturally (e.g.
    no one creates an std::vector with 2^64 elements). But the zero-boundary
    problem remains.

    > > Conclusion
    > > ----------
    > >
    > > Using unsigned integers to model integers decreases the probability
    > > that a program is correct.
    > >

    >
    > Bugs can be created using any language feature therefore it is, IMO,
    > wrong to eschew a particular language feature based on the possibility
    > of bug creation unless there is clear evidence that the language feature
    > in question has a high probability of bug creation (for some definition
    > of "high").


    The possibility of bug creation is not the problem. The problem is that
    using unsigned integers prevents me from catching bugs, which in turn
    decreases my confidence in program correctness.

    > What you have presented is little more than conjecture IMO.


    Critical thinking appreciated.

    --
    http://kaba.hilvi.org
     
    Kaba, Feb 7, 2011
    #20
