For performance, write it in C - Part 3, Source code now available

Discussion in 'Ruby' started by Peter Hickman, Aug 1, 2006.

  1. Peter Hickman, Aug 1, 2006
    #1
    1. Advertising

  2. Peter Hickman

    Chad Perrin Guest

    Chad Perrin, Aug 1, 2006
    #2
    1. Advertising

  3. Peter Hickman

    Isaac Gouy Guest

    Peter Hickman wrote:
    > The source code is available from
    > http://peterhi.dyndns.org/write_it_in_c/index.html


    There are some details missing from the webpages
    1) which C implementation?
    2) which Java implementation?
    3) what hardware?


    For example, using the code from the webpage

    1) gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)

    gcc -pipe -Wall -O3 -fomit-frame-pointer -funroll-loops -march=pentium4
    latin.c -o latinc
    time ./latinc > /dev/null 2>&1

    user 0m0.820s
    sys 0m0.000s


    2) /sun-jdk-1.5.0.07/bin/javac Latin.java
    time java Latin > /dev/null 2>&1

    user 0m3.800s
    sys 0m0.644s


    3) 2GHz Intel P4
     
    Isaac Gouy, Aug 1, 2006
    #3
  4. Isaac Gouy wrote:
    > Peter Hickman wrote:
    >
    >> The source code is available from
    >> http://peterhi.dyndns.org/write_it_in_c/index.html
    >>

    >
    > There are some details missing from the webpages
    > 1) which C implementation?
    >

    [peterhickman]$ gcc -v
    Using built-in specs.
    Target: powerpc-apple-darwin8
    Configured with: /private/var/tmp/gcc/gcc-5341.obj~1/src/configure
    --disable-checking -enable-werror --prefix=/usr --mandir=/share/man
    --enable-languages=c,objc,c++,obj-c++
    --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/
    --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib
    --build=powerpc-apple-darwin8 --host=powerpc-apple-darwin8
    --target=powerpc-apple-darwin8
    Thread model: posix
    gcc version 4.0.1 (Apple Computer, Inc. build 5341)

    > 2) which Java implementation?
    >

    [peterhickman]$ javac -version
    javac 1.5.0_06

    Additionally

    [peterhickman]$ perl -V
    Summary of my perl5 (revision 5 version 8 subversion 6) configuration:

    [peterhickman]$ ruby -v
    ruby 1.8.4 (2005-12-24) [powerpc-darwin]

    > 3) what hardware?
    >
    >

    Macintosh G4 with 1Gb ram
     
    Peter Hickman, Aug 1, 2006
    #4
  5. Peter Hickman

    Isaac Gouy Guest

    Peter Hickman wrote:
    > Isaac Gouy wrote:
    > > Peter Hickman wrote:
    > >
    > >> The source code is available from
    > >> http://peterhi.dyndns.org/write_it_in_c/index.html



    I recall someone stating "benchmarking without analysis is bogus".

    As a first step, comment out the print statements

    time ./latinc

    user 0m0.492s
    sys 0m0.004


    time java Latin

    user 0m0.992s
    sys 0m0.052s


    With the print statements ~5.4x
    without print statements ~2.1x

    iirc the Java program is shuffling around double byte unicode chars and
    the C program is handling single byte chars.
     
    Isaac Gouy, Aug 1, 2006
    #5
  6. Peter Hickman

    csaba Guest

    Peter Hickman wrote:
    > The source code is available from
    > http://peterhi.dyndns.org/write_it_in_c/index.html


    Hmm, it would look much better for me if you had included Kristof
    Bastiaensen's Curry version
    into the party... Heck, if that code is not smart then nothing is, and
    in a way, it's a much more
    interesting question how a compiled functional language compares to
    compiled imepative ones
    than the thousand time discussed interpreted vs. compiled match-up.

    Yes, Curry is anything but mainstream, but you can't say a decent
    compiler is not at your disposal.
    The Munster CC (which was used by Kristof, and AFAIK that's the most
    commonly used implementation) does even have an IDE for OS X.

    Regards,
    Csaba
     
    csaba, Aug 1, 2006
    #6
  7. On Wed, 02 Aug 2006, Charles O Nutter defenestrated me:
    >
    > First, some notes on benchmarking:
    >
    > - NEVER include IO when benchmarking a numeric algorithm; IO speeds vary
    > greatly from system to system and can vary from run to run depending on what
    > else is happening


    IO can be noisy. I say avoid it for any benchmarking since it can
    greatly influence timings. Usually the IO is not what you want to
    measure so why add this variable into things?

    > - If you're building up a large chunk of strings, write to an internal
    > buffer and then write out at the end; don't write for every little tiny
    > operation. At the very least, use a buffer per-line, rather than a separate
    > write for every element on that line.


    I just informally thought I would measure a few things involving IO.
    I only changed the printing and nothing else:

    Unaltered test: ~3.8s
    Use of StringBuffer to print out a single row: ~2.1s
    Use of StringBuffer for entire run: ~1.5s
    Preallocated StringBuffer for entire run: ~1.4s

    As you can see IO can have a large affect on clock time. I demonstrated
    that in Java's case the IO in the benchmark accounted for over 2/3 of the
    wall clock time (which is interesting because a decent chunk that is
    left over is JVM startup overhead).

    Some stack allocated space will likely improve the C run as well (and in
    this case you can output it in a single write system call).

    -Tom

    --
    + http://www.tc.umn.edu/~enebo +---- mailto: ----+
    | Thomas E Enebo, Protagonist | "Luck favors the prepared |
    | | mind." -Louis Pasteur |
     
    Thomas E Enebo, Aug 1, 2006
    #7
  8. Peter Hickman

    Alex Young Guest

    csaba wrote:
    > Peter Hickman wrote:
    >> The source code is available from
    >> http://peterhi.dyndns.org/write_it_in_c/index.html

    >
    > Hmm, it would look much better for me if you had included Kristof
    > Bastiaensen's Curry version
    > into the party... Heck, if that code is not smart then nothing is, and
    > in a way, it's a much more
    > interesting question how a compiled functional language compares to
    > compiled imepative ones
    > than the thousand time discussed interpreted vs. compiled match-up.
    >
    > Yes, Curry is anything but mainstream, but you can't say a decent
    > compiler is not at your disposal.
    > The Munster CC (which was used by Kristof, and AFAIK that's the most
    > commonly used implementation) does even have an IDE for OS X.


    While I certainly appreciate the efforts that are going into this, I
    can't help feeling it's all completely irrelevant.

    We can engage in cross-implementation pissing contests until the cows
    come home. None of them help make Ruby any faster.

    My question to the community: is there a comprehensive benchmark suite
    *for Ruby alone* that we can use to tweak compilation settings, try out
    different core algorithms, and improve what is currently an improvable
    situation?

    If not, would a port of pybench.py be a suitable start?

    --
    Alex
     
    Alex Young, Aug 1, 2006
    #8
  9. Peter Hickman

    Isaac Gouy Guest

    Thomas E Enebo wrote:
    > On Wed, 02 Aug 2006, Charles O Nutter defenestrated me:
    > >
    > > First, some notes on benchmarking:
    > >
    > > - NEVER include IO when benchmarking a numeric algorithm; IO speeds vary
    > > greatly from system to system and can vary from run to run depending on what
    > > else is happening

    >
    > IO can be noisy. I say avoid it for any benchmarking since it can
    > greatly influence timings. Usually the IO is not what you want to
    > measure so why add this variable into things?
    >
    > > - If you're building up a large chunk of strings, write to an internal
    > > buffer and then write out at the end; don't write for every little tiny
    > > operation. At the very least, use a buffer per-line, rather than a separate
    > > write for every element on that line.

    >
    > I just informally thought I would measure a few things involving IO.
    > I only changed the printing and nothing else:
    >
    > Unaltered test: ~3.8s
    > Use of StringBuffer to print out a single row: ~2.1s
    > Use of StringBuffer for entire run: ~1.5s
    > Preallocated StringBuffer for entire run: ~1.4s
    >
    > As you can see IO can have a large affect on clock time. I demonstrated
    > that in Java's case the IO in the benchmark accounted for over 2/3 of the
    > wall clock time (which is interesting because a decent chunk that is
    > left over is JVM startup overhead).
    >
    > Some stack allocated space will likely improve the C run as well (and in
    > this case you can output it in a single write system call).
    >
    > -Tom
    >
    > --
    > + http://www.tc.umn.edu/~enebo +---- mailto: ----+
    > | Thomas E Enebo, Protagonist | "Luck favors the prepared |
    > | | mind." -Louis Pasteur |



    As you're having so much fun, let me suggest you try converting the
    OutputStrings to byte-arrays, and pre-allocating a byte buffer for
    output like the approach taken with this program
    http://shootout.alioth.debian.org/gp4/benchmark.php?test=fasta&lang=java&id=2
     
    Isaac Gouy, Aug 1, 2006
    #9
  10. Peter Hickman

    Ola Bini Guest

    Charles O Nutter wrote:
    > On 8/1/06, Alex Young <> wrote:
    >>
    >> While I certainly appreciate the efforts that are going into this, I
    >> can't help feeling it's all completely irrelevant.

    >
    >
    > My only purpose in battling these benchmarks is to help dispel the rumors
    > that "Java is slow," "VMs are slow," and so on. If Ruby does move to a real
    > optimizing VM, it will be a good thing...all those folks who continue to
    > think that VMs are inherently bad need to join the 21st century.
    >


    ... Which is extremely funny, since Common Lisp have had wicked fast
    virtual machines for the last 15 years (on par with C in performance).

    They should catch up with the 20th century first of all. =)


    --
    Ola Bini (http://ola-bini.blogspot.com)
    JvYAML, RbYAML, JRuby and Jatha contributor
    System Developer, Karolinska Institutet (http://www.ki.se)
    OLogix Consulting (http://www.ologix.com)

    "Yields falsehood when quined" yields falsehood when quined.
     
    Ola Bini, Aug 1, 2006
    #10
  11. Peter Hickman

    Chad Perrin Guest

    On Wed, Aug 02, 2006 at 05:04:26AM +0900, Charles O Nutter wrote:
    > On 8/1/06, Alex Young <> wrote:
    > >
    > >While I certainly appreciate the efforts that are going into this, I
    > >can't help feeling it's all completely irrelevant.

    >
    >
    > My only purpose in battling these benchmarks is to help dispel the rumors
    > that "Java is slow," "VMs are slow," and so on. If Ruby does move to a real
    > optimizing VM, it will be a good thing...all those folks who continue to
    > think that VMs are inherently bad need to join the 21st century.


    . . but VMs actually are slow (to start, all else being equal).
    There's a trade-off, though, and VMs tend to be faster later on in
    execution for extended operations (again, all else being equal). There
    are other alternatives than VMs to consider, though, and the specifics
    of what one wishes to accomplish should be examined before settling on
    the VM (or any other implementation style) as "the answer".

    I'm kinda just babbling at this point.

    --
    CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
    "The first rule of magic is simple. Don't waste your time waving your
    hands and hopping when a rock or a club will do." - McCloctnick the Lucid
     
    Chad Perrin, Aug 1, 2006
    #11
  12. Charles O Nutter wrote:
    > Ok, so there's a bunch of problems with the Java version.
    >
    > - In addition to the addRow run and the Java startup time you're also
    > benchmarking over 5200 array modifications to set Compared values to true


    That was simple because I couldn't define the array when I declared it
    as I did in C.

    > - Your variable naming is entirely contrary to every Java coding
    > convention
    > published (not a benchmark thing, but it sets off any Java devs warning
    > flags)


    And this affects the performance?

    > - Almost all of the time spent running is spent building and printing
    > strings
    >
    > Benchmarking just the algorithm run itself with no gross string
    > creation and
    > printing, I'm getting in the neighborhood of 370ms per invocation once
    > HotSpot has optimized the code. I'll have more detailed numbers shortly.
    >
     
    Peter Hickman, Aug 2, 2006
    #12
  13. Peter Hickman

    Ola Bini Guest

    Peter Hickman wrote:
    > Charles O Nutter wrote:
    >
    >> - Your variable naming is entirely contrary to every Java coding
    >> convention
    >> published (not a benchmark thing, but it sets off any Java devs warning
    >> flags)

    >
    > And this affects the performance?
    >


    The point Charles made with saying "but it sets off any Java devs
    warning flags" is that your Java coding conventions differ so much from
    regular conventions that your Java coding capacity is put into doubt. In
    plain speak; are you a good enough Java programmer to write an honest
    benchmark version for Java?

    --
    Ola Bini (http://ola-bini.blogspot.com)
    JvYAML, RbYAML, JRuby and Jatha contributor
    System Developer, Karolinska Institutet (http://www.ki.se)
    OLogix Consulting (http://www.ologix.com)

    "Yields falsehood when quined" yields falsehood when quined.
     
    Ola Bini, Aug 2, 2006
    #13
  14. The output of my original C and Java versions are identical. If you
    write the output to a file a straight diff -s should suffice.
     
    Peter Hickman, Aug 2, 2006
    #14
  15. Ola Bini wrote:
    > Peter Hickman wrote:
    >> Charles O Nutter wrote:
    >>
    >>> - Your variable naming is entirely contrary to every Java coding
    >>> convention
    >>> published (not a benchmark thing, but it sets off any Java devs warning
    >>> flags)

    >>
    >> And this affects the performance?
    >>

    >
    > The point Charles made with saying "but it sets off any Java devs
    > warning flags" is that your Java coding conventions differ so much
    > from regular conventions that your Java coding capacity is put into
    > doubt. In plain speak; are you a good enough Java programmer to write
    > an honest benchmark version for Java?
    >

    The question is the code itself, the names can always be renamed. Does
    the code suddenly become better if I had used the regular conventions?

    Does the naming convention affect performance? NO!

    So having brought up such a lame point we are now trying to deflect
    attention from the code by rubbishing the author. Focus on the code, not
    me or whatever fantasy you have about me.
     
    Peter Hickman, Aug 2, 2006
    #15
  16. Christian Neukirchen wrote:
    >>> ... Which is extremely funny, since Common Lisp have had wicked fast
    >>> virtual machines for the last 15 years (on par with C in performance).
    >>>
    >>> They should catch up with the 20th century first of all. =)
    >>>

    >
    > Sorry, but which CL has a "wicked fast" *virtual machine* and is on
    > par with C? AFAICS, they either are one or the other...
    >

    I don't recall which of the "big four" open source CLs have virtual
    machines, if any. IIRC the "standard Lisp benchmarks" are pretty much a
    dead heat on GCL vs. CMUCL, with Clisp coming in third. When SBCL "grows
    up", it will probably be as fast as GCL or CMUCL. My recollection is
    that the CMUCL *compiler* is on a par with C in terms of generating x86
    machine code.

    "scheme48" has a virtual machine, IIRC, and it's regarded highly by the
    folks who did the "vmgen" and "gforth" virtual machines, which are bred
    for speed using some non-ANSI features of GCC. But that's not Common
    Lisp; Scheme is in some ways easier to make "wicked fast".
     
    M. Edward (Ed) Borasky, Aug 2, 2006
    #16
  17. Here's how it goes on my box:

    [Latin]$ time ruby latin2.rb > r5

    real 0m16.283s
    user 0m15.380s
    sys 0m0.498s

    Which means that your computer is about as crap as mine :) I will add it
    to the pages and make it a download. I think I need a table summarising
    the timings.
     
    Peter Hickman, Aug 2, 2006
    #17
  18. On Tue, 01 Aug 2006 17:48:41 +0900, Peter Hickman wrote:

    > The source code is available from
    > http://peterhi.dyndns.org/write_it_in_c/index.html


    Here is another version in Ruby. It uses a more clever algorithm.
    It makes use of the fact that permuting the rows or the columns of
    a latin square still gives a latin square. It also passes a constraint
    on the columns that the given number can be in, to reduce the backtracking
    even more.

    It runs in 15 seconds on my machine, which I find quite acceptable.

    I am sure that this algorithm coded in a compiled functional or logic
    language would be about as fast, or faster than your C code, but with the
    advantages of a higher level language.

    I still think you benchmark is flawed, because
    1) the C program works only for squares of size 5
    2) It shows totally no relevance to the kind of problems that you
    would use Ruby for. If you want raw speed for numerical applications,
    then I would suggest to use a functional language (i.e. ocaml).
    3) If your application runs slowly, it is probably better to think first
    about why it is slow, if you can improve the code with faster
    algorithms or removing bottlenecks.

    Kristof Bastiaensen

    ---------------- latin2.rb --------------------
    require 'permutation'

    $size = (ARGV.shift || 5).to_i
    MaxVal = $size-1

    RowPermutations = Permutation.new($size).map{|p| p.value}
    BaseStr = (2..$size).to_a.join
    StrPermutations = Permutation.new($size-1).map{|p| p.project(BaseStr)}

    StartColumns = (1..MaxVal).to_a
    def init_columns(el)
    a = StartColumns.dup
    a.delete_at(el-1)
    return a
    end

    def insert(sqr, num, row, columns)
    insert(sqr, num, row+1, columns) if (row == num)
    columns.each do |col|
    next if sqr[row][col] != ?.
    sqr[row][col] = num + ?1
    if (row == MaxVal)
    insert(sqr, num+1, 1, init_columns(num+1))
    elsif (num == MaxVal && row == MaxVal - 1)
    print_solution(sqr)
    else
    insert(sqr, num, row+1, columns - [col])
    end
    sqr[row][col] = ?.
    end
    end

    def print_solution(sqr)
    RowPermutations.each do |rp|
    StrPermutations.each do |sp|
    0.upto(MaxVal) do |r|
    print sqr[rp[r]].tr(BaseStr, sp)
    print ":"
    end
    puts
    end
    end
    end

    $square = [("1" + BaseStr)] +
    Array.new($size-1) { |i| (i+2).to_s + "." * ($size - 1) }

    insert($square, 0, 1, StartColumns)
     
    Kristof Bastiaensen, Aug 2, 2006
    #18
  19. For performance, write it in OCaml

    Kristof Bastiaensen wrote:
    > On Tue, 01 Aug 2006 17:48:41 +0900, Peter Hickman wrote:
    >
    > > The source code is available from
    > > http://peterhi.dyndns.org/write_it_in_c/index.html

    >
    > Here is another version in Ruby. It uses a more clever algorithm.
    > It makes use of the fact that permuting the rows or the columns of
    > a latin square still gives a latin square. It also passes a constraint
    > on the columns that the given number can be in, to reduce the backtracking
    > even more.
    >
    > It runs in 15 seconds on my machine, which I find quite acceptable.
    >
    > I am sure that this algorithm coded in a compiled functional or logic
    > language would be about as fast, or faster than your C code, but with the
    > advantages of a higher level language.
    >
    > I still think you benchmark is flawed, because
    > 1) the C program works only for squares of size 5
    > 2) It shows totally no relevance to the kind of problems that you
    > would use Ruby for. If you want raw speed for numerical applications,
    > then I would suggest to use a functional language (i.e. ocaml).


    Here's an OCaml version that runs in about 1.5 seconds when
    output is redirected to a file on my faster computer (3.2 GHz).
    It uses the same algorithm as the C program.
    The C version takes 1.961 seconds when writing to /dev/null on
    the o.p.'s computer. Since this is my first OCaml program, I'm
    certain an expert could improve it.

    (* compile with:
    ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
    *)

    (* permutation code by Eric C. Cooper *)
    let rec distribute elt = function
    | (hd :: tl) as list -> (elt :: list) ::
    (List.map (fun x -> hd :: x) (distribute elt tl))
    | [] -> [ [elt] ]
    let rec permute = function
    | x :: rest -> List.flatten (List.map (distribute x) (permute rest))
    | [] -> [ [] ] ;;

    let list = [ 1; 2; 3; 4; 5 ]
    let size = List.length list
    let perms = permute list

    let n = List.length perms
    let incompat = Array.make_matrix n n true ;;

    for x = 0 to n - 1 do
    for y = 0 to n - 1 do
    incompat.(x).(y) <-
    List.exists2 ( = ) (List.nth perms x) (List.nth perms y)
    done
    done;;

    let join list = String.concat "" (List.map string_of_int list)
    let output_strings = List.map join perms ;;
    let board = ( Array.make size 0 ) ;;

    let rec add_a_row row =
    if row = size then
    print_endline
    ( String.concat ":"
    (List.map (fun i -> List.nth output_strings i)
    (Array.to_list board)) )
    else
    for latest = 0 to n - 1 do
    let prev_row = ref 0 in
    while (!prev_row < row) &&
    (not incompat.(latest).(board.(!prev_row))) do
    incr prev_row
    done;
    if !prev_row = row then
    ( board.(row) <- latest ; add_a_row (row + 1) )
    done
    ;;

    add_a_row 0 ;;
     
    William James, Aug 2, 2006
    #19
  20. Peter Hickman

    Isaac Gouy Guest

    Peter Hickman wrote:
    > Charles O Nutter wrote:
    > > Ok, so there's a bunch of problems with the Java version.
    > >
    > > - In addition to the addRow run and the Java startup time you're also
    > > benchmarking over 5200 array modifications to set Compared values to true

    >
    > That was simple because I couldn't define the array when I declared it
    > as I did in C.



    1) I made some changes to the jen.pl script you used to generate the
    Java program, it now splits apart the array initialization into many
    different methods - and that will let you generate a Java program for
    6x6. I emailed it to you.


    2) By "benchmarking without analysis is bogus" I understand that
    without analysis the programs might not be doing what we think they are
    doing - what we think is an apples to apples comparison might not be an
    apples to apples comparison.

    In this case, I understand Charles O Nutter to be saying that in C all
    those boolean arrays are initialized at compile time, but in Java all
    those boolean arrays are initialized at runtime - so although we're
    writing a similar thing in the code, we aren't doing the same thing
    when the code is run.

    In the same way, we have similar print statements in the C and Java
    code, but they don't do the same thing when the code is run - one deals
    with bytes the other deals with Unicode double bytes.

    So it depends which part of the computation we're interested in - if
    we're interested in the addARow recursive calls and loops, that part of
    the computation might be swamped by the time taken to to initialize the
    arrays at runtime and naively print strings.


    3) There seems to be a bigger problem with using this latin squares
    algorithm for benchmarking - the computation explodes!

    0.496s gcc 5x5 (without print statements)
    30055.098s gcc 6x6 (without print statements)
     
    Isaac Gouy, Aug 2, 2006
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.

Share This Page