For performance, write it in C - Part 3, Source code now available

I

Isaac Gouy

Peter said:
So far the only language that has come close to running as fast as the C
is the Ocaml version and since then I have found other ways to speed up
the C version. Just as I have posted the Ruby and Ocaml version if any
certified Java expert cares to show me how it should be done (*cough*
Charles O Nutter *cough*) then I will be happy to time it and check the
output and report the results.

So you don't know enough Java?
That way we can get away from all this
stupidity like the fact that I didn't use the correct naming convention.
Which we all know hugely affects performance. But remember that the
timings will be based on a program that produces output.
If it was the email I replied to yesterday then reason that it got close
to the speed of the C version was because it didn't do any output. Once
I put the output back in it was only around 5 seconds faster than my own
Java version and around 15 seconds slower than the C version. Hardly
encouragement to drop C and pursue Java. Remember I am running the
program to get the output.

If it didn't produce any output I would have mentioned that, as I have
before.

Check your email - the subject line reads " For performance use ascii
output"
 
W

William James

Jon said:
On my machine (Athlon X2 4400+, Debian Linux, gcc 4.0.4, OCaml 3.09.2, Sun
JDK 1.5):

Compile Run
C 0.137s 0.386s gcc -O3 -Wall
OCaml 0.171s 0.668s ocamlopt
OCaml 0.161s 0.626s ocamlopt -inline 100 -unsafe
OCaml2 0.165s 0.415s ocamlopt
OCaml2 0.165s 0.401s ocamlopt -unsafe
Java 1.565s 1.688s javac

Some niggles:

1. A lot of precalculation has been done in the C and Java, to the extent
that we're breaking Java compilers (always a bad sign).

2. Java is nothing like as slow as the OP claimed.

3. OCaml is only 4% slower than C and only 10% slower with bounds checking.

4. Whether Ruby is fast enough or not, you should be programming in OCaml
and not in Ruby or C. ;-)

Here's my OCaml2 (based upon William's):

let rec fact n = if n=0 then 1 else n*fact(n-1)

let size = 5
let n = fact size

(* Permutation generator *)
let p = Array.make size 0
let xx = Array.init size (fun i -> if i<size-1 then 0 else -1)
let gen_perm() =
let i = ref (size - 1) in
while !i > 0 && xx.(!i) = !i do
xx.(!i) <- 0;
decr i
done;
if !i = 0 then 1 else begin
xx.(!i) <- xx.(!i) + 1;
p.(0) <- 1;
for i=0 to size - 1 do
p.(i) <- p.(i - xx.(i));
p.(i - xx.(i)) <- i+1
done;
0
end

(* Permutations *)
let perms = Array.init n (fun _ -> ignore(gen_perm()); Array.copy p)

let incompat =
let aux x y =
Array.iteri (fun i px -> if px = perms.(y).(i) then raise Exit) perms
(x);
false in
Array.init n (fun x -> Array.init n (fun y -> try aux x y with Exit ->
true))

let join list = String.concat "" (Array.to_list (Array.map string_of_int
list))
let output_strings = Array.map join perms
let board = Array.make size 0

let op = String.make (size*(size+1)-1) ':'

let rec add_a_row row =
if row = size then begin
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
print_string op;
print_string "\n"

A small point.
print_endline op
is shorter. Is it faster?
 
W

wneumann

Jon said:
Here's my OCaml2 (based upon William's):

Just a small note here:
Depending on your goals, you can do even better by stuffing the output
in a buffer then dumping the buffer at the end of computation (see
example code below). Of course this consumes more memory, and isn't
possible on a larger board, but it's yet another tweak to exploit,
since we're benchmarking output in addition to benchmarking work (note
that on my G5 output to stdout is a bit slower than your version, but
this isn't the case on the AMD machines I have access to -- redirecting
to a file is always faster).

let rec add_a_row row =
let buf = Buffer.create 1_000_000 in
let rec addh row =
match row with
| x when x=size ->
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
Buffer.add_string buf op; Buffer.add_char buf '\n'
| _ ->
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then (board.(row) <- latest; addh (row + 1))
done in
addh row;
Buffer.output_buffer stdout buf

let () =
add_a_row 0;
 
C

Chad Perrin

Chad said:
I find that I pick up a language faster if
there's a good community for it whose means of communication suits the
way I think, though. What sort of online OCaml community resources are
available?

Primarily, the online resources come from one of the two mailing lists,
OCaml-Beginners[1] for introductory to intermediate questions, and the
main OCaml list[2] for intermediate to advanced questions, or questions
on the OCaml internals and whatnot. There are a few other places where
info moves, like Richard Jones' cocan wiki [3], but at the moment
things are primarily handled through the lists (though it might be
about time for something like perlmonks to arrive as the community is
growing).

[1] http://groups.yahoo.com/group/ocaml_beginners/
[2] http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
[3] http://wiki.cocan.org/

Thanks. I'll look into it.

You mention it needs something like PerlMonks -- which is interesting,
considering I've noticed there's nothing like that for Ruby, either.
I've found myself wishing a couple of times that there was something
like a RubyMonks kinda thing (though in a more Rubyish idiom than *Monks
would be), but every time I consider the notion I find myself wishing
that PerlMonks were just a bit more multi-language, focusing primarily
on languages I personally find interesting. Bah.

I've considered putting something vaguely like that together myself,
actually, but I'm lazy and too much of a perfectionist. Since I'd be
working toward my own project spec to create such a website, I'd
probably just end up spending months working on what I want the site to
do and never get around to writing a line of code.
 
W

William James

Jon said:
let incompat =
let aux x y =
Array.iteri (fun i px -> if px = perms.(y).(i) then raise Exit) perms
(x);

Perhaps google has mangled you code; at this point ocaml reports

File "latin-squares-H.ml", line 30, characters 4-15:
This function is applied to too many arguments,
maybe you forgot a `;'
 
N

Neumann

William said:
File "latin-squares-H.ml", line 30, characters 4-15:
This function is applied to too many arguments,
maybe you forgot a `;'

It lost a . for array access. It should be (slightly reformatted
trying to lose nothing):

let incompat =
let aux x y =
Array.iteri
(fun i px -> if px = perms.(y).(i) then raise Exit)
perms.(x);
 
W

William James

William said:
Jon Harrop wrote:
....

A small point.
print_endline op
is shorter. Is it faster?

Not with the interpreter; it's slower. Haven't tested the compiler yet.
 
N

Neumann

William said:
Not with the interpreter; it's slower. Haven't tested the compiler yet.

If anything, it'll be slower, as print_endline flushes the buffer every
call, whereas print_string and print_char do not.
 
J

Jon Harrop

Isaac said:
Simply writing Java that converts the double-byte Unicode Java
outputStrings to single-byte /once/ instead of everytime there's a
print statement, and following the usual Java pattern of wrapping
output with a BufferedOutputStream, reduces that to

1.396s user+sys, sun-jdk-1.5.0.07

Post your code here.
 
J

Jon Harrop

Jon said:
On my machine (Athlon X2 4400+, Debian Linux, gcc 4.0.4, OCaml 3.09.2, Sun
JDK 1.5):

Compile Run
C 0.137s 0.386s gcc -O3 -Wall
OCaml 0.171s 0.668s ocamlopt
OCaml 0.161s 0.626s ocamlopt -inline 100 -unsafe
OCaml2 0.165s 0.415s ocamlopt
OCaml2 0.165s 0.401s ocamlopt -unsafe
Java 1.565s 1.688s javac

That was 64-bit. Here are my 32-bit timings (language versions are the
same):

Compile Run
C 0.126s 0.386s gcc -O3 -Wall
OCaml2 0.089s 0.391s ocamlopt
OCaml2 0.010s 10.111s ocamlc
Java 1.615s 12.745s javac

Again, OCaml is basically as fast as C. Note that C and OCaml get slower
moving to 64-bit but Java gets >7x faster. Amazingly, OCaml's interpreted
bytecode is faster than Java on 32-bit, whilst being >160x faster to
compile!

Here's my latest OCaml (I think we could revert to the simpler list-based
permuter because no significant time is spent generating the permutations):

let rec fact n = if n=0 then 1 else n*fact(n-1)

let size = 5

(* Permutation generator *)
let p = Array.make size 0
let xx = Array.init size (fun i -> if i<size-1 then 0 else -1)
let gen_perm() =
let i = ref (size - 1) in
while !i > 0 && xx.(!i) = !i do
xx.(!i) <- 0;
decr i
done;
if !i = 0 then 1 else begin
xx.(!i) <- xx.(!i) + 1;
p.(0) <- 1;
for i=0 to size - 1 do
p.(i) <- p.(i - xx.(i));
p.(i - xx.(i)) <- i+1
done;
0
end

let n = fact size

(* Permutations *)
let perms = Array.init n (fun _ -> ignore(gen_perm()); Array.copy p)

let incompat =
let rec aux (px : int array) py i =
i<size && (px.(i) = py.(i) || aux px py (i+1)) in
Array.init n (fun x -> Array.init n (fun y -> aux perms.(x) perms.(y) 0))

let join list = String.concat "" (Array.to_list (Array.map string_of_int
list))
let output_strings = Array.map join perms
let board = Array.make size 0

let op = String.make (size*(size+1)) ':'

let rec add_a_row row =
if row=size then begin
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
print_string op
end else
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row;
done;
if !prev_row = row then begin
board.(row) <- latest;
add_a_row (row + 1)
end
done

let () =
op.[size*(size+1)-1] <- '\n';
add_a_row 0
 
K

Kristof Bastiaensen

Actually the real problem is the amount of disk space I need to store
the output. 6 x 6 resulted in 32Gb of output. The 7 x 7 will probably
eat my hard disk well before it completes. I'm not sure that the 1Tb I
have will be enough so I am looking at removing the rotational and
mirrored versions of the grids from the output.

Why would you want to store all this data? If you want to store the 7x7
version, you will need 3187.2 TB to store all the squares. To store
only the reduced squares you still need 920 MB. To store the reduced
squares for 8x8 you will need 35.6TB. (I used the wikipedia article
on http://en.wikipedia.org/wiki/Latin_square to compute these numbers).

My program can easily show only the reduced squares by removing the
permutations from the solutions method, and it will be a lot faster too.
If you want to do this for large squares it would be better to rewrite my
algorithm in C. But I am interested how this data could ever be useful.

Regards,
Kristof
 
W

William James

Jon said:
Here's my latest OCaml (I think we could revert to the simpler list-based
permuter because no significant time is spent generating the permutations):

let rec fact n = if n=0 then 1 else n*fact(n-1)

let size = 5

(* Permutation generator *)
let p = Array.make size 0
let xx = Array.init size (fun i -> if i<size-1 then 0 else -1)
let gen_perm() =
let i = ref (size - 1) in
while !i > 0 && xx.(!i) = !i do
xx.(!i) <- 0;
decr i
done;
if !i = 0 then 1 else begin
xx.(!i) <- xx.(!i) + 1;
p.(0) <- 1;
for i=0 to size - 1 do
p.(i) <- p.(i - xx.(i));
p.(i - xx.(i)) <- i+1
done;
0
end

let n = fact size

(* Permutations *)
let perms = Array.init n (fun _ -> ignore(gen_perm()); Array.copy p)

let incompat =
let rec aux (px : int array) py i =
i<size && (px.(i) = py.(i) || aux px py (i+1)) in
Array.init n (fun x -> Array.init n (fun y -> aux perms.(x) perms.(y) 0))

let join list = String.concat "" (Array.to_list (Array.map string_of_int
list))
let output_strings = Array.map join perms
let board = Array.make size 0

let op = String.make (size*(size+1)) ':'

let rec add_a_row row =
if row=size then begin
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
print_string op
end else
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row;
done;
if !prev_row = row then begin
board.(row) <- latest;
add_a_row (row + 1)
end
done

let () =
op.[size*(size+1)-1] <- '\n';
add_a_row 0

Here's a faster version of my program.

Eliminated a "not" in a loop by replacing the array of booleans
that shows incompatibility of two rows with an array that shows
compatibility.

Borrowed you idea of creating an output line ahead of time and
modifying it in place.

Timings:
1.11 yours
1.04 mine


(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)


(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
| (hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
| x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ] ;;

let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list

let perms = Array.of_list (permute list)
let n = Array.length perms

(* Used to determine if one row is compatible with another. *)
let compatible = Array.make_matrix n n true ;;
for x = 0 to n - 1 do
for y = 0 to n - 1 do
compatible.(x).(y) <-
List.for_all2 ( <> ) perms.(x) perms.(y)
done
done

let join list = String.concat "" (List.map string_of_int list)
let output_strings = Array.map join perms

(* For speed, create a string that's the length of the lines
that we'll print; the :'s that aren't needed as separators
will later be overwritten. *)
let output_line = String.make (size*(size+1)-1) ':' ^ "\n"
let board = Array.make size 0

let rec add_a_row row =
if row = size then
( for i=0 to size-1 do
String.blit
output_strings.(board.(i)) 0 (* source *)
output_line (i*(size+1)) (* dest *)
size
done;
print_string output_line
)
else
for latest = 0 to n - 1 do
(* Create a changeable thing (variable). *)
let prev_row = ref 0 in
(* The ! below fetches the variable's value. *)
while (!prev_row < row) &&
(compatible.(latest).(board.(!prev_row))) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;

add_a_row 0
 
J

Jon Harrop

Isaac said:
That version is on Peter Hickman's website, perhaps he'll replace it
with the newer version on Friday.

Curiously the two Java implementations on Peter's site perform the same on
my machine in x86 and they are >32x slower than OCaml. I know Java is slow
but that is ridiculous...
 
I

Isaac Gouy

Jon said:
Curiously the two Java implementations on Peter's site perform the same on
my machine in x86 and they are >32x slower than OCaml. I know Java is slow
but that is ridiculous...

You're probably doing something wrong.
 
J

Jon Harrop

William said:
Here's a faster version of my program.

Eliminated a "not" in a loop by replacing the array of booleans
that shows incompatibility of two rows with an array that shows
compatibility.

Heh, should've thought of that. :)
Borrowed you idea of creating an output line ahead of time and
modifying it in place.

Timings:
1.11 yours
1.04 mine

I get the opposite order:

0.394s yours
0.390s mine

but your timings are probably more accurate because your computer is so
slow. ;-)

You can make it slightly faster by factoring out the CSE "compatible
(latest)", giving:

0.382s

That's actually faster than the C here. Woohoo! :)

Using bitvectors may also improve things, depending if skipping a bunch of
comparisons in that loop is significant. However, I think it is time we
searched for a new task to optimise. Ray tracer anyone? ;-)
 
J

Jon Harrop

Jon said:
Here are my 32-bit timings (language versions are the same):

Compile Run
C 0.126s 0.386s gcc -O3 -Wall
OCaml2 0.089s 0.391s ocamlopt
OCaml2 0.010s 10.111s ocamlc
Java 1.615s 12.745s javac

Here's a C++ implementation:

C++ 0.652s 0.608s g++ -O3 -Wall

(95% of time is spent in addRow())

#include <iostream>
#include <algorithm>
#include <vector>

using namespace std;

int fact(int n) { return n == 0 ? 1 : n*fact(n-1); }

int size=5;
vector<int> list(size);
int n = fact(size);
vector< vector<int> > perms(n);
vector< vector<bool> > compat(n);
vector<int> board(size);
string out(size*(size+1), ':');

void print() {
for (int i=0; i<size; ++i)
for (int j=0; j<size; ++j)
out[i*(size+1) + j] = '1' + perms[board][j];
cout << out;
}

int next(vector<bool> &compat, int row) {
int prev_row = 0;
while (prev_row < row && compat[board[prev_row]])
++prev_row;
return prev_row;
}

void addRow(int row) {
if (row == size) print(); else
for (int latest=0; latest<n; ++latest) {
if (next(compat[latest], row) == row) {
board[row] = latest;
addRow(row+1);
}
}
}

int main() {
for (int i=0; i<size; ++i) list = i;

for (int i=0; i<n; ++i) {
perms = vector<int>(list.begin(), list.end());
next_permutation(list.begin(), list.end());
}

for (int i=0; i<n; ++i) {
compat = vector<bool>(n);
for (int j=0; j<n; ++j) {
compat[j] = true;
for (int k=0; k<size; ++k)
if (perms[k] == perms[j][k]) {
compat[j] = false;
break;
}
}
}

out[size*(size+1)-1] = '\n';

addRow(0);
}
 
W

William James

Jon said:
Heh, should've thought of that. :)


I get the opposite order:

0.394s yours
0.390s mine

but your timings are probably more accurate because your computer is so
slow. ;-)

It's 800MHz. I ran each program 4 times and took the average.
You can make it slightly faster by factoring out the CSE "compatible
(latest)", giving:

0.382s

That's actually faster than the C here. Woohoo! :)

Good point. That cuts my time to 1.01 seconds.
Also cleaned up the code a bit.
Here's the final version.

(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)

(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
(hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ]

let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list

let perms = Array.of_list (permute list)
let n = Array.length perms

(* Boolean array used to determine if one row is
compatible with another. *)
let compatible = Array.make_matrix n n true ;;
Array.iteri (fun x ex ->
Array.iteri (fun y ey ->
compatible.(x).(y) <- List.for_all2 (<>) ex ey) perms ) perms

let join list = String.concat "" (List.map string_of_int list)
let output_strings = Array.map join perms

(* For speed, create a string that's the length of the lines
that we'll print; the :'s that aren't needed as separators
will later be overwritten. *)
let output_line = String.make (size*(size+1)-1) ':' ^ "\n"
let board = Array.make size 0

(* A recursive function. *)
let rec add_a_row row =
if row = size then
( for i=0 to size-1 do
String.blit
output_strings.(board.(i)) 0 (* source *)
output_line (i*(size+1)) (* dest *)
size
done;
print_string output_line
)
else
for latest = 0 to n - 1 do
let compatible_slice = compatible.(latest) in
(* Create a changeable thing (variable). *)
let prev_row = ref 0 in
(* The ! below fetches the variable's value. *)
while !prev_row < row &&
compatible_slice.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;

add_a_row 0
 
J

Jon Harrop

Isaac said:
You're probably doing something wrong.

Other people have posted not entirely dissimilar timings for Java and I'm
doing exactly the same thing on x86_64 where is runs much faster.

Has anyone managed to get Java close to OCaml or C?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top