For performance, write it in C - Part 3, Source code now available

I

Isaac Gouy

Peter said:

There are some details missing from the webpages
1) which C implementation?
2) which Java implementation?
3) what hardware?


For example, using the code from the webpage

1) gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)

gcc -pipe -Wall -O3 -fomit-frame-pointer -funroll-loops -march=pentium4
latin.c -o latinc
time ./latinc > /dev/null 2>&1

user 0m0.820s
sys 0m0.000s


2) /sun-jdk-1.5.0.07/bin/javac Latin.java
time java Latin > /dev/null 2>&1

user 0m3.800s
sys 0m0.644s


3) 2GHz Intel P4
 
P

Peter Hickman

Isaac said:
There are some details missing from the webpages
1) which C implementation?
[peterhickman]$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin8
Configured with: /private/var/tmp/gcc/gcc-5341.obj~1/src/configure
--disable-checking -enable-werror --prefix=/usr --mandir=/share/man
--enable-languages=c,objc,c++,obj-c++
--program-transform-name=/^[cg][^.-]*$/s/$/-4.0/
--with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib
--build=powerpc-apple-darwin8 --host=powerpc-apple-darwin8
--target=powerpc-apple-darwin8
Thread model: posix
gcc version 4.0.1 (Apple Computer, Inc. build 5341)
2) which Java implementation?
[peterhickman]$ javac -version
javac 1.5.0_06

Additionally

[peterhickman]$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 6) configuration:

[peterhickman]$ ruby -v
ruby 1.8.4 (2005-12-24) [powerpc-darwin]
3) what hardware?
Macintosh G4 with 1Gb ram
 
I

Isaac Gouy

I recall someone stating "benchmarking without analysis is bogus".

As a first step, comment out the print statements

time ./latinc

user 0m0.492s
sys 0m0.004


time java Latin

user 0m0.992s
sys 0m0.052s


With the print statements ~5.4x
without print statements ~2.1x

iirc the Java program is shuffling around double byte unicode chars and
the C program is handling single byte chars.
 
C

csaba

Peter said:

Hmm, it would look much better for me if you had included Kristof
Bastiaensen's Curry version
into the party... Heck, if that code is not smart then nothing is, and
in a way, it's a much more
interesting question how a compiled functional language compares to
compiled imepative ones
than the thousand time discussed interpreted vs. compiled match-up.

Yes, Curry is anything but mainstream, but you can't say a decent
compiler is not at your disposal.
The Munster CC (which was used by Kristof, and AFAIK that's the most
commonly used implementation) does even have an IDE for OS X.

Regards,
Csaba
 
T

Thomas E Enebo

First, some notes on benchmarking:

- NEVER include IO when benchmarking a numeric algorithm; IO speeds vary
greatly from system to system and can vary from run to run depending on what
else is happening

IO can be noisy. I say avoid it for any benchmarking since it can
greatly influence timings. Usually the IO is not what you want to
measure so why add this variable into things?
- If you're building up a large chunk of strings, write to an internal
buffer and then write out at the end; don't write for every little tiny
operation. At the very least, use a buffer per-line, rather than a separate
write for every element on that line.

I just informally thought I would measure a few things involving IO.
I only changed the printing and nothing else:

Unaltered test: ~3.8s
Use of StringBuffer to print out a single row: ~2.1s
Use of StringBuffer for entire run: ~1.5s
Preallocated StringBuffer for entire run: ~1.4s

As you can see IO can have a large affect on clock time. I demonstrated
that in Java's case the IO in the benchmark accounted for over 2/3 of the
wall clock time (which is interesting because a decent chunk that is
left over is JVM startup overhead).

Some stack allocated space will likely improve the C run as well (and in
this case you can output it in a single write system call).

-Tom
 
A

Alex Young

csaba said:
Hmm, it would look much better for me if you had included Kristof
Bastiaensen's Curry version
into the party... Heck, if that code is not smart then nothing is, and
in a way, it's a much more
interesting question how a compiled functional language compares to
compiled imepative ones
than the thousand time discussed interpreted vs. compiled match-up.

Yes, Curry is anything but mainstream, but you can't say a decent
compiler is not at your disposal.
The Munster CC (which was used by Kristof, and AFAIK that's the most
commonly used implementation) does even have an IDE for OS X.

While I certainly appreciate the efforts that are going into this, I
can't help feeling it's all completely irrelevant.

We can engage in cross-implementation pissing contests until the cows
come home. None of them help make Ruby any faster.

My question to the community: is there a comprehensive benchmark suite
*for Ruby alone* that we can use to tweak compilation settings, try out
different core algorithms, and improve what is currently an improvable
situation?

If not, would a port of pybench.py be a suitable start?
 
I

Isaac Gouy

Thomas said:
IO can be noisy. I say avoid it for any benchmarking since it can
greatly influence timings. Usually the IO is not what you want to
measure so why add this variable into things?


I just informally thought I would measure a few things involving IO.
I only changed the printing and nothing else:

Unaltered test: ~3.8s
Use of StringBuffer to print out a single row: ~2.1s
Use of StringBuffer for entire run: ~1.5s
Preallocated StringBuffer for entire run: ~1.4s

As you can see IO can have a large affect on clock time. I demonstrated
that in Java's case the IO in the benchmark accounted for over 2/3 of the
wall clock time (which is interesting because a decent chunk that is
left over is JVM startup overhead).

Some stack allocated space will likely improve the C run as well (and in
this case you can output it in a single write system call).

-Tom


As you're having so much fun, let me suggest you try converting the
OutputStrings to byte-arrays, and pre-allocating a byte buffer for
output like the approach taken with this program
http://shootout.alioth.debian.org/gp4/benchmark.php?test=fasta&lang=java&id=2
 
O

Ola Bini

Charles said:
My only purpose in battling these benchmarks is to help dispel the rumors
that "Java is slow," "VMs are slow," and so on. If Ruby does move to a real
optimizing VM, it will be a good thing...all those folks who continue to
think that VMs are inherently bad need to join the 21st century.

... Which is extremely funny, since Common Lisp have had wicked fast
virtual machines for the last 15 years (on par with C in performance).

They should catch up with the 20th century first of all. =)


--
Ola Bini (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

"Yields falsehood when quined" yields falsehood when quined.
 
C

Chad Perrin

My only purpose in battling these benchmarks is to help dispel the rumors
that "Java is slow," "VMs are slow," and so on. If Ruby does move to a real
optimizing VM, it will be a good thing...all those folks who continue to
think that VMs are inherently bad need to join the 21st century.

. . but VMs actually are slow (to start, all else being equal).
There's a trade-off, though, and VMs tend to be faster later on in
execution for extended operations (again, all else being equal). There
are other alternatives than VMs to consider, though, and the specifics
of what one wishes to accomplish should be examined before settling on
the VM (or any other implementation style) as "the answer".

I'm kinda just babbling at this point.
 
P

Peter Hickman

Charles said:
Ok, so there's a bunch of problems with the Java version.

- In addition to the addRow run and the Java startup time you're also
benchmarking over 5200 array modifications to set Compared values to true

That was simple because I couldn't define the array when I declared it
as I did in C.
- Your variable naming is entirely contrary to every Java coding
convention
published (not a benchmark thing, but it sets off any Java devs warning
flags)

And this affects the performance?
 
O

Ola Bini

Peter said:
And this affects the performance?

The point Charles made with saying "but it sets off any Java devs
warning flags" is that your Java coding conventions differ so much from
regular conventions that your Java coding capacity is put into doubt. In
plain speak; are you a good enough Java programmer to write an honest
benchmark version for Java?

--
Ola Bini (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

"Yields falsehood when quined" yields falsehood when quined.
 
P

Peter Hickman

The output of my original C and Java versions are identical. If you
write the output to a file a straight diff -s should suffice.
 
P

Peter Hickman

Ola said:
The point Charles made with saying "but it sets off any Java devs
warning flags" is that your Java coding conventions differ so much
from regular conventions that your Java coding capacity is put into
doubt. In plain speak; are you a good enough Java programmer to write
an honest benchmark version for Java?
The question is the code itself, the names can always be renamed. Does
the code suddenly become better if I had used the regular conventions?

Does the naming convention affect performance? NO!

So having brought up such a lame point we are now trying to deflect
attention from the code by rubbishing the author. Focus on the code, not
me or whatever fantasy you have about me.
 
M

M. Edward (Ed) Borasky

Christian said:
Sorry, but which CL has a "wicked fast" *virtual machine* and is on
par with C? AFAICS, they either are one or the other...
I don't recall which of the "big four" open source CLs have virtual
machines, if any. IIRC the "standard Lisp benchmarks" are pretty much a
dead heat on GCL vs. CMUCL, with Clisp coming in third. When SBCL "grows
up", it will probably be as fast as GCL or CMUCL. My recollection is
that the CMUCL *compiler* is on a par with C in terms of generating x86
machine code.

"scheme48" has a virtual machine, IIRC, and it's regarded highly by the
folks who did the "vmgen" and "gforth" virtual machines, which are bred
for speed using some non-ANSI features of GCC. But that's not Common
Lisp; Scheme is in some ways easier to make "wicked fast".
 
P

Peter Hickman

Here's how it goes on my box:

[Latin]$ time ruby latin2.rb > r5

real 0m16.283s
user 0m15.380s
sys 0m0.498s

Which means that your computer is about as crap as mine :) I will add it
to the pages and make it a download. I think I need a table summarising
the timings.
 
K

Kristof Bastiaensen


Here is another version in Ruby. It uses a more clever algorithm.
It makes use of the fact that permuting the rows or the columns of
a latin square still gives a latin square. It also passes a constraint
on the columns that the given number can be in, to reduce the backtracking
even more.

It runs in 15 seconds on my machine, which I find quite acceptable.

I am sure that this algorithm coded in a compiled functional or logic
language would be about as fast, or faster than your C code, but with the
advantages of a higher level language.

I still think you benchmark is flawed, because
1) the C program works only for squares of size 5
2) It shows totally no relevance to the kind of problems that you
would use Ruby for. If you want raw speed for numerical applications,
then I would suggest to use a functional language (i.e. ocaml).
3) If your application runs slowly, it is probably better to think first
about why it is slow, if you can improve the code with faster
algorithms or removing bottlenecks.

Kristof Bastiaensen

---------------- latin2.rb --------------------
require 'permutation'

$size = (ARGV.shift || 5).to_i
MaxVal = $size-1

RowPermutations = Permutation.new($size).map{|p| p.value}
BaseStr = (2..$size).to_a.join
StrPermutations = Permutation.new($size-1).map{|p| p.project(BaseStr)}

StartColumns = (1..MaxVal).to_a
def init_columns(el)
a = StartColumns.dup
a.delete_at(el-1)
return a
end

def insert(sqr, num, row, columns)
insert(sqr, num, row+1, columns) if (row == num)
columns.each do |col|
next if sqr[row][col] != ?.
sqr[row][col] = num + ?1
if (row == MaxVal)
insert(sqr, num+1, 1, init_columns(num+1))
elsif (num == MaxVal && row == MaxVal - 1)
print_solution(sqr)
else
insert(sqr, num, row+1, columns - [col])
end
sqr[row][col] = ?.
end
end

def print_solution(sqr)
RowPermutations.each do |rp|
StrPermutations.each do |sp|
0.upto(MaxVal) do |r|
print sqr[rp[r]].tr(BaseStr, sp)
print ":"
end
puts
end
end
end

$square = [("1" + BaseStr)] +
Array.new($size-1) { |i| (i+2).to_s + "." * ($size - 1) }

insert($square, 0, 1, StartColumns)
 
W

William James

Kristof said:
Here is another version in Ruby. It uses a more clever algorithm.
It makes use of the fact that permuting the rows or the columns of
a latin square still gives a latin square. It also passes a constraint
on the columns that the given number can be in, to reduce the backtracking
even more.

It runs in 15 seconds on my machine, which I find quite acceptable.

I am sure that this algorithm coded in a compiled functional or logic
language would be about as fast, or faster than your C code, but with the
advantages of a higher level language.

I still think you benchmark is flawed, because
1) the C program works only for squares of size 5
2) It shows totally no relevance to the kind of problems that you
would use Ruby for. If you want raw speed for numerical applications,
then I would suggest to use a functional language (i.e. ocaml).

Here's an OCaml version that runs in about 1.5 seconds when
output is redirected to a file on my faster computer (3.2 GHz).
It uses the same algorithm as the C program.
The C version takes 1.961 seconds when writing to /dev/null on
the o.p.'s computer. Since this is my first OCaml program, I'm
certain an expert could improve it.

(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)

(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
| (hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
| x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ] ;;

let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list
let perms = permute list

let n = List.length perms
let incompat = Array.make_matrix n n true ;;

for x = 0 to n - 1 do
for y = 0 to n - 1 do
incompat.(x).(y) <-
List.exists2 ( = ) (List.nth perms x) (List.nth perms y)
done
done;;

let join list = String.concat "" (List.map string_of_int list)
let output_strings = List.map join perms ;;
let board = ( Array.make size 0 ) ;;

let rec add_a_row row =
if row = size then
print_endline
( String.concat ":"
(List.map (fun i -> List.nth output_strings i)
(Array.to_list board)) )
else
for latest = 0 to n - 1 do
let prev_row = ref 0 in
while (!prev_row < row) &&
(not incompat.(latest).(board.(!prev_row))) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;

add_a_row 0 ;;
 
I

Isaac Gouy

Peter said:
That was simple because I couldn't define the array when I declared it
as I did in C.


1) I made some changes to the jen.pl script you used to generate the
Java program, it now splits apart the array initialization into many
different methods - and that will let you generate a Java program for
6x6. I emailed it to you.


2) By "benchmarking without analysis is bogus" I understand that
without analysis the programs might not be doing what we think they are
doing - what we think is an apples to apples comparison might not be an
apples to apples comparison.

In this case, I understand Charles O Nutter to be saying that in C all
those boolean arrays are initialized at compile time, but in Java all
those boolean arrays are initialized at runtime - so although we're
writing a similar thing in the code, we aren't doing the same thing
when the code is run.

In the same way, we have similar print statements in the C and Java
code, but they don't do the same thing when the code is run - one deals
with bytes the other deals with Unicode double bytes.

So it depends which part of the computation we're interested in - if
we're interested in the addARow recursive calls and loops, that part of
the computation might be swamped by the time taken to to initialize the
arrays at runtime and naively print strings.


3) There seems to be a bigger problem with using this latin squares
algorithm for benchmarking - the computation explodes!

0.496s gcc 5x5 (without print statements)
30055.098s gcc 6x6 (without print statements)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,756
Messages
2,569,540
Members
45,025
Latest member
KetoRushACVFitness

Latest Threads

Top