Array#inject to create a hash versus Hash[*array.collect{}.flatten] -- Speed, segfault

Discussion in 'Ruby' started by Anthony Martinez, Jun 11, 2007.

  1. --TRYliJ5NKNqkz5bu
    Content-Type: text/plain; charset=us-ascii
    Content-Disposition: inline
    Content-Transfer-Encoding: quoted-printable

    So, I was tinkering with ways to build a hash out of transforming an
    array, knowing the standard/idiomatic

    id_list =3D [:test1, :test2, :test3]
    id_list.inject({}) { |a,e| a[e]=3De.object_id ; a }

    I also decided to try something like this:

    Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten]

    and further (attempt to) optimize it via
    Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!]
    and=20
    Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!]

    Running this via Benchmark#bmbm gives pretty interesting, and to me,
    unexpected, results (on a 3.2 GHz P4, 1GB of RAM, FC5 with ruby 1.8.4)

    require 'benchmark'
    id_list =3D (1..1_000_000).to_a
    Benchmark::bmbm do |x|
    x.report("inject") { id_list.inject({}) { |a,e| a[e] =3D e.objec=
    t_id ; a} }
    x.report("non-bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.f=
    latten] }
    x.report("bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.f=
    latten!] }
    x.report("two-bang") { Hash[ *id_list.collect! { |e| [e,e.object_id]}.f=
    latten!] }
    end

    Rehearsal --------------------------------------------
    inject 16.083333 0.033333 16.116667 ( 9.670747)
    non-bang 1657.050000 1.800000 1658.850000 (995.425642)
    bang 1593.716667 0.016667 1593.733333 (956.334565)
    two-bang 1604.816667 1.350000 1606.166667 (963.803356)
    -------------------------------- total: 4874.866667sec

    user system total real
    inject 5.183333 0.000000 5.183333 ( 3.102379)
    non-bang zsh: segmentation fault ruby

    Ow?

    Also, I just thought of a similar way to accomplish the same thing:

    x.report("zip") { Hash[ *id_list.zip(id_list.collect {|e| e.object_id}=
    )] }

    Array#collect! won't work right with this, of course, but it seems to
    have equally-bad performance. Is Array#inject just optimized for this,
    or something?

    -- Pi


    --=20
    panic ("No CPUs found. System halted.\n");
    2.4.3 linux/arch/parisc/kernel/setup.c

    --TRYliJ5NKNqkz5bu
    Content-Type: application/pgp-signature
    Content-Disposition: inline

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.6 (GNU/Linux)

    iD8DBQFGbOBsKaiGM/xGKzQRAvjjAJ9FYZabPf88j0GdKq/Y4+Vu1/UfYwCgwPBL
    xng4PjGyzyz00YSTL3jhm6o=
    =qOdB
    -----END PGP SIGNATURE-----

    --TRYliJ5NKNqkz5bu--
     
    Anthony Martinez, Jun 11, 2007
    #1
    1. Advertising

  2. Re: Array#inject to create a hash versus Hash[*array.collect{}.flatten]-- Speed, segfault

    On 11.06.2007 07:41, Anthony Martinez wrote:
    > So, I was tinkering with ways to build a hash out of transforming an
    > array, knowing the standard/idiomatic
    >
    > id_list = [:test1, :test2, :test3]
    > id_list.inject({}) { |a,e| a[e]=e.object_id ; a }
    >
    > I also decided to try something like this:
    >
    > Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten]
    >
    > and further (attempt to) optimize it via
    > Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!]
    > and
    > Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!]
    >
    > Running this via Benchmark#bmbm gives pretty interesting, and to me,
    > unexpected, results (on a 3.2 GHz P4, 1GB of RAM, FC5 with ruby 1.8.4)
    >
    > require 'benchmark'
    > id_list = (1..1_000_000).to_a
    > Benchmark::bmbm do |x|
    > x.report("inject") { id_list.inject({}) { |a,e| a[e] = e.object_id ; a} }
    > x.report("non-bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten] }
    > x.report("bang") { Hash[ *id_list.collect { |e| [e,e.object_id]}.flatten!] }
    > x.report("two-bang") { Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!] }
    > end
    >
    > Rehearsal --------------------------------------------
    > inject 16.083333 0.033333 16.116667 ( 9.670747)
    > non-bang 1657.050000 1.800000 1658.850000 (995.425642)
    > bang 1593.716667 0.016667 1593.733333 (956.334565)
    > two-bang 1604.816667 1.350000 1606.166667 (963.803356)
    > -------------------------------- total: 4874.866667sec
    >
    > user system total real
    > inject 5.183333 0.000000 5.183333 ( 3.102379)
    > non-bang zsh: segmentation fault ruby
    >
    > Ow?
    >
    > Also, I just thought of a similar way to accomplish the same thing:
    >
    > x.report("zip") { Hash[ *id_list.zip(id_list.collect {|e| e.object_id})] }
    >
    > Array#collect! won't work right with this, of course, but it seems to
    > have equally-bad performance. Is Array#inject just optimized for this,
    > or something?


    The reason why you are seeing this (performance as well as timing) is
    most likely caused by the different approach. When you use #inject you
    just create one copy of the Array (the Hash). When you use #collect you
    create at least one additional copy of the large array plus a ton of two
    element arrays. That's way less efficient considering memory usage and
    GC. You'll probably see much different results if your input array is
    much shorter (try with 10 or 100 elements).

    Kind regards

    robert
     
    Robert Klemme, Jun 11, 2007
    #2
    1. Advertising

  3. That segfault shouldn't be happening... I guess you've found a bug. I
    modified your benchmark slightly (reduced id_list to 100_000 elements,
    since I don't want to wait more than an hour for it to finish) and I
    didn't get a segfault, but the results are definitely strange:

    Rehearsal --------------------------------------------
    inject 0.320000 0.350000 0.670000 ( 0.682610)
    non-bang 7.700000 0.590000 8.290000 ( 8.522170)
    bang 7.150000 0.230000 7.380000 ( 7.392471)
    two-bang 6.500000 0.220000 6.720000 ( 6.766079)
    ---------------------------------- total: 23.060000sec

    user system total real
    inject 0.700000 0.310000 1.010000 ( 1.019340)
    non-bang 131.260000 1.440000 132.700000 (132.823762)
    bang 134.610000 1.280000 135.890000 (136.323224)
    two-bang 137.260000 0.750000 138.010000 (138.027025)

    Why is the rehearsal so much faster? Especially for the various
    collect implementations?
    (I'm running Debian Etch, Ruby 1.8.5.)



    I expect that the fastest way to do what you want is like this:

    id_list = [:test1, :test2, :test3]
    result={}
    id_list.each { |e| result[e]=e.object_id }

    Anyway, that's what I usually do.

    I have several times wanted a method on Enumerable that helps you
    transform it into a hash. Something like:

    module Enumerable
    def map_to_hash
    result={}
    each { |e|
    k,v=*yield(e)
    result[k]=v
    }
    return result
    end
    end

    id_list.map_to_hash {|x| [x,x.object_id]}

    I don't know what the performance of this might be.... It's creating a
    2-element hash on each loop now, but maybe that has minimal cost.
     
    Caleb Clausen, Jun 11, 2007
    #3
  4. --rz+pwK2yUstbofK6
    Content-Type: text/plain; charset=us-ascii
    Content-Disposition: inline
    Content-Transfer-Encoding: quoted-printable

    On Mon, Jun 11, 2007 at 04:05:10PM +0900, Robert Klemme wrote:
    > On 11.06.2007 07:41, Anthony Martinez wrote:
    > >So, I was tinkering with ways to build a hash out of transforming an

    [snip]
    > >Also, I just thought of a similar way to accomplish the same thing:
    > >x.report("zip") { Hash[ *id_list.zip(id_list.collect {|e| e.object_=

    id})] }
    > >Array#collect! won't work right with this, of course, but it seems to
    > >have equally-bad performance. Is Array#inject just optimized for this,
    > >or something?

    >=20
    > The reason why you are seeing this (performance as well as timing) is
    > most likely caused by the different approach. When you use #inject
    > you just create one copy of the Array (the Hash). When you use
    > #collect you create at least one additional copy of the large array
    > plus a ton of two element arrays. That's way less efficient


    Ah. This does make sense. Now, I throw in the #zip variant!

    > considering memory usage and GC. You'll probably see much different
    > results if your input array is much shorter (try with 10 or 100
    > elements).


    Well, the general trend is that any of the non-idiomatic solutions are
    pretty horrible, although the two-bang gets vaguely better than 0 or 1 on
    bigger arrays:

    1000 elements:
    Rehearsal --------------------------------------------
    inject 0.010000 0.000000 0.010000 ( 0.007391)
    non-bang 0.000000 0.000000 0.000000 ( 0.006602)
    bang 0.000000 0.000000 0.000000 ( 0.008503)
    two-bang 0.010000 0.000000 0.010000 ( 0.010136)
    zip 0.790000 0.220000 1.010000 ( 1.202124)
    ----------------------------------- total: 1.030000sec

    user system total real
    inject 0.010000 0.010000 0.020000 ( 0.012621)
    non-bang 0.020000 0.000000 0.020000 ( 0.022407)
    bang 0.020000 0.000000 0.020000 ( 0.019274)
    two-bang 0.020000 0.000000 0.020000 ( 0.027521)
    zip 2.240000 0.670000 2.910000 ( 5.804656)


    10000 elements:
    Rehearsal --------------------------------------------
    inject 0.050000 0.010000 0.060000 ( 0.079325)
    non-bang 0.140000 0.010000 0.150000 ( 0.265995)
    bang 0.150000 0.010000 0.160000 ( 0.184713)
    two-bang 0.140000 0.020000 0.160000 ( 0.202579)
    zip 77.510000 24.270000 101.780000 (134.493519)
    --------------------------------- total: 102.310000sec

    user system total real
    inject 0.110000 0.030000 0.140000 ( 0.146650)
    non-bang 0.450000 0.020000 0.470000 ( 0.526816)
    bang 0.470000 0.020000 0.490000 ( 0.583837)
    two-bang 0.440000 0.040000 0.480000 ( 0.520914)
    zip 222.910000 69.370000 292.280000 (389.541023)


    Extra-ouch! This is what I get for trying to be clever.

    >=20
    > Kind regards
    >=20
    > robert
    >=20


    --=20
    Too many errors on one line (make fewer)
    -- Apple MPW C compiler

    --rz+pwK2yUstbofK6
    Content-Type: application/pgp-signature
    Content-Disposition: inline

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.6 (GNU/Linux)

    iD8DBQFGbPheKaiGM/xGKzQRAqDTAJ0S/iGw/h/uJFZfelvmn9fXm44k8QCfZzcM
    j/X79OEPP0+1qoqj4Uz9cw8=
    =lUR4
    -----END PGP SIGNATURE-----

    --rz+pwK2yUstbofK6--
     
    Anthony Martinez, Jun 11, 2007
    #4
  5. Re: Array#inject to create a hash versus Hash[*array.collect{}.flatten]-- Speed, segfault

    On 11.06.2007 09:23, Anthony Martinez wrote:
    > On Mon, Jun 11, 2007 at 04:05:10PM +0900, Robert Klemme wrote:
    >> On 11.06.2007 07:41, Anthony Martinez wrote:
    >>> So, I was tinkering with ways to build a hash out of transforming an

    > [snip]
    >>> Also, I just thought of a similar way to accomplish the same thing:
    >>> x.report("zip") { Hash[ *id_list.zip(id_list.collect {|e| e.object_id})] }
    >>> Array#collect! won't work right with this, of course, but it seems to
    >>> have equally-bad performance. Is Array#inject just optimized for this,
    >>> or something?

    >> The reason why you are seeing this (performance as well as timing) is
    >> most likely caused by the different approach. When you use #inject
    >> you just create one copy of the Array (the Hash). When you use
    >> #collect you create at least one additional copy of the large array
    >> plus a ton of two element arrays. That's way less efficient

    >
    > Ah. This does make sense. Now, I throw in the #zip variant!
    >
    >> considering memory usage and GC. You'll probably see much different
    >> results if your input array is much shorter (try with 10 or 100
    >> elements).

    >
    > Well, the general trend is that any of the non-idiomatic solutions are
    > pretty horrible, although the two-bang gets vaguely better than 0 or 1 on
    > bigger arrays:
    >
    > 1000 elements:
    > Rehearsal --------------------------------------------
    > inject 0.010000 0.000000 0.010000 ( 0.007391)
    > non-bang 0.000000 0.000000 0.000000 ( 0.006602)
    > bang 0.000000 0.000000 0.000000 ( 0.008503)
    > two-bang 0.010000 0.000000 0.010000 ( 0.010136)
    > zip 0.790000 0.220000 1.010000 ( 1.202124)
    > ----------------------------------- total: 1.030000sec
    >
    > user system total real
    > inject 0.010000 0.010000 0.020000 ( 0.012621)
    > non-bang 0.020000 0.000000 0.020000 ( 0.022407)
    > bang 0.020000 0.000000 0.020000 ( 0.019274)
    > two-bang 0.020000 0.000000 0.020000 ( 0.027521)
    > zip 2.240000 0.670000 2.910000 ( 5.804656)
    >
    >
    > 10000 elements:
    > Rehearsal --------------------------------------------
    > inject 0.050000 0.010000 0.060000 ( 0.079325)
    > non-bang 0.140000 0.010000 0.150000 ( 0.265995)
    > bang 0.150000 0.010000 0.160000 ( 0.184713)
    > two-bang 0.140000 0.020000 0.160000 ( 0.202579)
    > zip 77.510000 24.270000 101.780000 (134.493519)
    > --------------------------------- total: 102.310000sec
    >
    > user system total real
    > inject 0.110000 0.030000 0.140000 ( 0.146650)
    > non-bang 0.450000 0.020000 0.470000 ( 0.526816)
    > bang 0.470000 0.020000 0.490000 ( 0.583837)
    > two-bang 0.440000 0.040000 0.480000 ( 0.520914)
    > zip 222.910000 69.370000 292.280000 (389.541023)
    >
    >
    > Extra-ouch! This is what I get for trying to be clever.


    :)

    I'd say figures for the 1000 elements array are not too meaningful. You
    should probably wrap those tests in a repetition. And if you are really
    trying to get the fastest solution I'd also try

    h = {}
    id_list.each {|e| h[e] = e.object_id }

    This is usually the fastest in my experience (unless of course the
    timing is dominated by what you do in the block).

    Kind regards

    robert
     
    Robert Klemme, Jun 11, 2007
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Francis Avila
    Replies:
    0
    Views:
    478
    Francis Avila
    Nov 2, 2003
  2. Replies:
    2
    Views:
    399
    Steve Pope
    Sep 27, 2006
  3. Replies:
    9
    Views:
    588
    SM Ryan
    Oct 20, 2007
  4. Peña, Botp

    inject does not inject last value

    Peña, Botp, Aug 7, 2006, in forum: Ruby
    Replies:
    4
    Views:
    186
    Peña, Botp
    Aug 7, 2006
  5. Michal Suchanek
    Replies:
    6
    Views:
    241
    Nobuyoshi Nakada
    Jun 13, 2007
Loading...

Share This Page