statistic of integer values


Stefan Wallentowitz

Hello everyone!

I'm looking for statistics on the value ranges of integers in a set
of "standard" C programs.
I'm specifically interested in the average ratio of assignments etc. of
integers where the value is of (un)signed 16-bit or 32-bit size.
Has anybody ever come across something similar?

Bye,
Stefan
 

Malcolm

Stefan Wallentowitz said:
I'm looking for statistics on the value ranges of integers in a set
of "standard" C programs.
I'm specifically interested in the average ratio of assignments etc. of
integers where the value is of (un)signed 16-bit or 32-bit size.
Has anybody ever come across something similar?
No. It will be rather tricky to define exactly what you mean.

For instance, a loop in which a variable, i, takes values between 0 and N-1,
is an extremely common construct. However, the counting variable usually
isn't held for more than a few machine cycles. N, on the other hand, may
well keep its value for the life of the program.

However, if the loop is in a subroutine, variable N may be passed to that
routine, copied from a "Nemployees" member of a data structure. So does N
now count as one or two integers?

Also, because of C semantics, i will actually be set to N at the end of the
for loop. Probably, however, i will never be accessed outside of the loop.
So has i taken the value N, or not?
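
A minimal C sketch of the situation (the function name is only illustrative):

void count_things(int N)
{
    int i;
    for (i = 0; i < N; i++) {
        /* i takes the values 0 .. N-1 inside the loop */
    }
    /* because of C's for-loop semantics, i == N here,
       even though i is usually never read again */
}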
 

Stefan Wallentowitz

On Sat, 10 Dec 2005 21:25:38 +0000, Malcolm wrote:

Hi!
No. It will be rather tricky to define exactly what you mean.

OK, I thought it might be a bit imprecise.
Also, because of C semantics, i will actually be set to N at the end of
the for loop. Probably, however, i will never be accessed outside of the
loop. So has i taken the value N, or not?

Yes, N is one of them, but so is the initialisation of i. In fact I'm
interested in the ratio of common integer ranges being loaded into
registers. A few examples:
a=1, a=b+4, a=0x0feff0c4 etc.
It is related to a compiler optimization, and it is interesting how many
instructions are saved on average; e.g. when there is a high occurrence
rate of 16-bit signed integers, the optimization brings more success. But
some measurement data would be great.
I also know that there is a machine dependency, but I envision some
overall data from a more abstract view of C code.

Greets,
Stefan
 

Thad Smith

Stefan said:
I'm looking for statistics on the value ranges of integers in a set
of "standard" C programs.

You will probably have to generate your own.
I'm specifically interested in the average ratio of assignments etc. of
integers where the value is of (un)signed 16-bit or 32-bit size.

You need to be more precise about what you need.
"ratio of assignments etc. of integers where the value is either of
(un)signed 16 bit or 32 bit size."

Does that mean you want to count the number of integer assignments
executed in which the value has exactly 16 sigificant bits and exactly
32 significant bits and compute their ratio?

Do you want to know how many assignment statements are made to 16-bit
vs. 32-bit integer types without regard to value and without regard to
the number of times executed? Or something else?
 

slebetman

Stefan said:
It is related to a compiler optimization, and it is interesting how many
instructions are saved on average; e.g. when there is a high occurrence
rate of 16-bit signed integers, the optimization brings more success.

Not a good idea. The compiler should be consistent, or the programmer
will be surprised later on if the run-time code suddenly overflows at
65536. If the compiler implements 16-bit ints, that's fine. If it
implements 32-bit ints, that's fine too. But if it is sometimes 32 and
sometimes 16, that's not a good thing. An example of a compiler where
this is configurable is the HI-TECH C compiler for PIC microcontrollers:
the programmer gets to decide via compiler options or #pragma
statements whether ints are 16 or 32 bits.
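
A minimal C sketch of that surprise, where the widths in the comment are
assumptions (the actual width of int is implementation-defined):

#include <stdio.h>

int main(void)
{
    unsigned int n = 65535;
    n = n + 1;   /* wraps to 0 if unsigned int is 16 bits wide,
                    becomes 65536 if it is 32 bits wide */
    printf("%u\n", n);
    return 0;
}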
 

Walter Roberson

Not a good idea. The compiler should be consistent, or the programmer
will be surprised later on if the run-time code suddenly overflows at
65536. If the compiler implements 16-bit ints, that's fine. If it
implements 32-bit ints, that's fine too. But if it is sometimes 32 and
sometimes 16, that's not a good thing.

Ummm, it looked to me like the OP wasn't trying to say anything like
that.

If I understand correctly, the question is about integral literals
appearing in source code -- though it should probably instead be about
integral literals appearing after pre-processing and constant folding.

The question is about the relative rates at which 16 bit literals
appear, compared to 32 bit literals. The point, if I understand
correctly, is not about computing at those widths, but rather about
the instructions involved in coding the literals into the machine
code.

There are architectures which offer several different ways of loading
the same literal.

For example, there are architectures which have (say) "Add Quick", an
instruction which operates on registers of whatever size is specified
in other parts of the instruction (e.g., ADDQ.W, ADDQ.L) -- but no
matter what the register size, the value being added is restricted to a
particular range, such as -128 to +127. For example, ADDQ.L A3,#16
might be used to increment address register A3 by 16. Such instructions
could hypothetically crop up quite a bit in loops over arrays.

The same architecture might have an ADDI.W and ADDI.L (Add a 16 bit
constant to a 16 bit register; add a 16 bit constant to a 32 bit
register) and might have a regular ADD instruction in which
one of the source operand modes was "immediate 32 bit constant".

When these instructions are available, they can result in smaller
machine code, which is a benefit for fitting more into
cache; they may also end up operating faster because of the
reduced number of cycles required to read the constant from
memory. There may also turn out to be interesting tradeoffs
in such architectures, such as it potentially being faster to
load a byte value in one instruction and sign-extend it to 32
bits in a second instruction, rather than using a single
instruction to load a 32 bit constant -- the dual-instruction
version might, for example, hypothetically pipeline better,
or the "sign extend" instruction might be only a two-byte
instruction, with the "load quick" being 4 bytes long, for
a total of 6 bytes where the architecture would need 8 bytes
(4 for the instruction, 4 for the constant) to load a 32 bit
constant.


Suppose one is working on a compiler for an architecture
that offers different instructions for the same basic task,
with some of the variations being shorter or faster for particular
ranges of literal constants. If one were doing that, then one
might wonder whether it is worth the effort to put in a bunch
of compiler optimization work to get the absolute best possible
memory fit or absolute best possible instruction speed. Such work
could get particularly messy if the less-memory variations
turn out to take longer, as one would then have to analyze to see
whether the surrounding code is such that it is better to
optimize for space or time. In a tight loop that was likely to
stay in cache for a while, the answer could depend on the exact
size of the loop...


If this is what the OP is trying to talk about, then I would
make some suggestions:

1) That possibly this kind of study was undertaken back when RISC
theory was being developed;

2) That not just the explicitly coded (or constant folded) literals
are important, since there can be many implicit literals for
pointer/address manipulation

3) That it matters not only how often various sizes of literals appear,
but also what kind of circumstances they appear in. There might
be (hypothetically) several 32 bit constants per "for" loop
(initialization, termination test, increment), but those would tend
to get loaded into registers, and the literal that might be
most crucial to the speed of the loop might be (say) the 4 in
"add #4, %eax" where %eax is being used to index arrays. Or
depending on the architecture and generated code, it might be the 2
in the sequence INC R3; MOV R3, R2; SHL R2, #2; LOAD R4, $1BE00C(R2)
i.e., increment, shift-left to get a byte offset, load from that
byte offset relative to the constant location of a static array...
 

slebetman

Walter said:
Ummm, it looked to me like the OP wasn't trying to say anything like
that.

If I understand correctly, the question is about integral literals
appearing in source code -- though it should probably instead be about
integral literals appearing after pre-processing and constant folding.

In this case there would be no need to ask about statistics. It is
always safe to do this because the compiler knows there is ONLY one
value for each literal - not a range. This is implemented by almost all
modern compilers. Not only that, if for example on a RISC machine you
use the literal 1 or 0, the compiler would simply generate code to read
the one and zero registers.
 

Walter Roberson

In this case there would be no need to ask about statistics. It is
always safe to do this because the compiler knows there is ONLY one
value for each literal - not a range.

There is more than one meaning of "safe" that might apply here.

There is the "safe" of "Is there a certainty that I can find
a valid instruction to use for this particular literal, that the
literal will not overflow the range allowed for the instruction?".
I suspect that's the meaning of "safe" that you are using.

There is another meaning, though, when it comes to generating
instructions: "Is there a certainty that if I chose to generate
this particular one of the instruction sequences that I could
generate, that this choice will not result in worse performance
than could be obtained with a different sequence?" And the answer
to that is "It depends on the instruction set, upon the
internal details of multiple dispatch or pipelining, and upon
the cache characteristics of the surrounding code."


Also, as I indicated earlier, the point might not be "can it
safely be done", but rather, "Is it worth putting in the programming
effort to do this potential optimization? Do small literals occur
often enough that the savings would be appreciable? Is it worth
the time to find good algorithms for the boundary cases such as
the short load taking longer to execute but the extra cost being worth
it if it allows a tight loop to fit in cache when it would not otherwise
fit?" And that's where the statistics come in.

You know the usual refrain about "premature optimization" -- well,
it looks to me as if the OP is trying to get a feel for whether
this kind of optimization is premature, or is worthwhile.
 

Stefan Wallentowitz

On Sun, 11 Dec 2005 03:31:09 +0000, Walter Roberson wrote:

Hi Walter!

Many thanks, you got what I was trying to point out.
If this is what the OP is trying to talk about, then I would
make some suggestions:
Yes, that's what I had in mind. In fact it is about designing a
compiler for a simple RISC architecture, where loading a 32-bit value
requires two instructions, but a value that fits in 16 bits can be
loaded with a single instruction. It is quite a simple
"optimization", but in a small presentation I wanted to describe its
possible impact with such a study.
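
A minimal sketch of the two cases in C, where the instruction counts in the
comments assume a two-instruction constant load (e.g. load-upper-half plus
or-immediate) on such a RISC machine; the exact instructions are an
assumption, not taken from any particular ISA:

int small = 0x1234;      /* fits in 16 bits: one load-immediate instruction */
int large = 0x0feff0c4;  /* needs all 32 bits: two instructions on this ISA */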
1) That possibly this kind of study was undertaken back when RISC theory
was being developed;

2) That not just the explicitly coded (or constant folded) literals
are important, since there can be many implicit literals for
pointer/address manipulation
In general this is true, but in my case only literals are of interest,
because addresses/pointers are not handled in this way.
3) That it matters not only how often various sizes of literals appear,
but also what kind of circumstances they appear in. There might be
(hypothetically) several 32 bit constants per "for" loop
(initialization, termination test, increment), but those would tend to
get loaded into registers, and the literal that might be most crucial to
the speed of the loop might be (say) the 4 in "add #4, %eax" where %eax
is being used to index arrays. Or depending on the architecture and
generated code, it might be the 2 in the sequence INC R3; MOV R3, R2;
SHL R2, #2; LOAD R4, $1BE00C(R2) i.e., increment, shift-left to get a
byte offset, load from that byte offset relative to the constant location
of a static array...
Yes, I was thinking about some similar issues. Perhaps someone has done
work along these lines, e.g. profiling some "reference code" for the exact
values (or ranges) of the values being loaded. Doing it on my own for just
a small chart doesn't make sense. But perhaps someone reading this has
come across related studies and knows a source for raw data.

Thanks so far,
Stefan
 

Stefan Wallentowitz

On Sat, 10 Dec 2005 23:35:03 -0800, (e-mail address removed) wrote:
In this case there would be no need to ask about statistics. It is
always safe to do this because the compiler knows there is ONLY one
value for each literal - not a range. This is implemented by almost all
modern compilers. Not only that, if for example on a RISC machine you
use the literal 1 or 0, the compiler would simply generate code to read
the one and zero registers.

Yes, but writing my own compiler leads me to this question regarding the
impact of a quite simple optimization.
I am looking for statistics on how often a loaded integer (whether in a
register or as part of an immediate instruction etc.) is, in general:
- in the 16-bit range
- above it
(might be extended to "is zero", 8-bit and so on)
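
A minimal C sketch of one such classification, taking "16-bit range" to mean
anything representable as a signed or unsigned 16-bit integer; the boundary
choice and the function name are assumptions for illustration only:

/* returns 0 if the value fits in 16 bits (signed or unsigned), 1 otherwise */
int needs_wide_load(long long v)
{
    return v < -32768 || v > 65535;
}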

Bye, Stefan
 

Stefan Wallentowitz

On Sun, 11 Dec 2005 08:21:25 +0000, Walter Roberson wrote:
Also, as I indicated earlier, the point might not be "can it
safely be done", but rather, "Is it worth putting in the programming
effort to do this potential optimization? Do small literals occur
often enough that the savings would be appreciable? Is it worth
the time to find good algorithms for the boundary cases such as
the short load taking longer to execute but the extra cost being worth
it if it allows a tight loop to fit in cache when it would not otherwise
fit?" And that's where the statistics come in.

You know the usual refrain about "premature optimization" -- well,
it looks to me as if the OP is trying to get a feel for whether
this kind of optimization is premature, or is worthwhile.

Exactly.
I want to illustrate the possible impact of a simple optimization I made.
In fact this optimization usually speeds up execution, but a simple figure
for the optimization ratio, reached in such an easy way, would be nice.
It isn't meant to be very scientific, just to get a feeling for the gain.

Thanks for your detailed statements,
Stefan
 

slebetman

Stefan said:
On Sat, 10 Dec 2005 23:35:03 -0800, (e-mail address removed) wrote:


Yes, but writing my own compiler leads me to this question regarding the
impact of a quite simple optimization.
I am looking for statistics on how often a loaded integer (whether in a
register or as part of an immediate instruction etc.) is, in general:
- in the 16-bit range
- above it
(might be extended to "is zero", 8-bit and so on)

For what it's worth, I find the following values very common in code:

0, 1, 0xFF

Also, from experience, I find the majority of integer literals to be
below 100, usually in the range of 0 to 10. The value 1 is common
enough that most architectures have either increment/decrement
instructions or a hardwired 1 register. The value 0 is common in a lot
of data initialisations. The value 255 (0xFF) is very common in
protocol-handling code, usually for byte masking.
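
A one-line C illustration of the byte masking meant here (the function name
is just illustrative):

/* extract the low byte of a word, the typical use of the 0xFF literal */
unsigned char low_byte(unsigned int word)
{
    return (unsigned char)(word & 0xFF);
}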
 

Stefan Wallentowitz

On Sun, 11 Dec 2005 08:20:37 -0800, (e-mail address removed) wrote:
For what it's worth, I find the following values very common in code:

0, 1, 0xFF

Also, from experience, I find the majority of integer literals to be
below 100, usually in the range of 0 to 10. The value 1 is common
enough that most architectures have either increment/decrement
instructions or a hardwired 1 register. The value 0 is common in a lot
of data initialisations. The value 255 (0xFF) is very common in
protocol-handling code, usually for byte masking.

For sure, but a number says more than a thousand words...
 

slebetman

Stefan said:
On Sun, 11 Dec 2005 08:20:37 -0800, (e-mail address removed) wrote:


For sure, but a number says more than a thousand words...

Download lots of source code, preferably large projects like the Linux
kernel, Apache, XFree86 and Mozilla. Then find all literals. I did the
following on my company's source code:

grep -r -e "= *[0-9]" SONaccessGateway/ > stat.txt

This gets all lines containing an equal sign followed by a number.
Then, if you have Tcl installed, run the following script on the file
above:

================== begin script
#! /usr/bin/env tclsh
if {$argc < 1} {
    set input stdin
} else {
    set input [open [lindex $argv 0] "r"]
}
array set numbers {}
while {[eof $input] == 0} {
    set line [gets $input]
    # extract right hand side:
    set line [lindex [split $line "="] 1]
    if {$line != ""} {
        # extract first word:
        set line [split [string trim $line] " "]
        set n [lindex $line 0]
        # get and count the number:
        if {[string is integer -strict $n]} {
            set n [expr {$n}] ;# normalise the number
            if {[info exists numbers($n)]} {
                incr numbers($n)
            } else {
                set numbers($n) 1
            }
        }
    }
}
if {$input != "stdin"} {
    close $input
}
# now we are done processing, print out stats:
puts "Literals found:"
foreach n [lsort -integer [array names numbers]] {
    puts " $n - $numbers($n) times"
}
===================== end script

Save the script as sumstat.tcl. To run it, you can type the following:

tclsh sumstat.tcl stat.txt

where stat.txt is the file we generated earlier. The following are the
statistics for my employer's source code:

Literals found:
-774712364 - 7 times
-505224220 - 7 times
-235736076 - 7 times
-60 - 1 times
-1 - 3 times
0 - 1039 times
1 - 723 times
2 - 119 times
3 - 159 times
4 - 159 times
5 - 87 times
6 - 10 times
7 - 16 times
8 - 15 times
9 - 1 times
10 - 17 times
11 - 2 times
12 - 41 times
13 - 1 times
14 - 4 times
15 - 6 times
16 - 20 times
17 - 7 times
20 - 172 times
23 - 6 times
24 - 2 times
25 - 2 times
27 - 1 times
28 - 2 times
30 - 10 times
31 - 3 times
32 - 29 times
36 - 2 times
38 - 2 times
39 - 3 times
40 - 5 times
42 - 10 times
44 - 2 times
48 - 19 times
50 - 6 times
54 - 1 times
56 - 5 times
57 - 2 times
60 - 21 times
64 - 46 times
67 - 2 times
71 - 2 times
78 - 2 times
80 - 3 times
85 - 1 times
90 - 35 times
96 - 3 times
100 - 7 times
112 - 3 times
120 - 6 times
127 - 4 times
128 - 12 times
144 - 3 times
160 - 3 times
164 - 1 times
167 - 3 times
168 - 4 times
169 - 2 times
176 - 3 times
192 - 4 times
200 - 9 times
208 - 3 times
210 - 6 times
211 - 8 times
212 - 8 times
224 - 3 times
226 - 1 times
253 - 1 times
255 - 7 times
256 - 14 times
271 - 1 times
299 - 1 times
300 - 1 times
384 - 23 times
400 - 3 times
420 - 4 times
493 - 10 times
500 - 35 times
513 - 1 times
600 - 10 times
636 - 1 times
647 - 1 times
648 - 4 times
691 - 16 times
730 - 1 times
800 - 4 times
802 - 1 times
995 - 1 times
1000 - 8 times
1024 - 15 times
1200 - 1 times
1234 - 3 times
1257 - 1 times
1500 - 3 times
1800 - 1 times
2000 - 1 times
2020 - 1 times
2048 - 3 times
2054 - 2 times
2345 - 2 times
3600 - 1 times
4096 - 3 times
5000 - 3 times
10000 - 3 times
16384 - 9 times
28800 - 3 times
32224 - 1 times
32768 - 6 times
36000 - 4 times
65535 - 3 times
86400 - 2 times
90000 - 1 times
123456 - 3 times
1000000 - 2 times
50331652 - 2 times
67452301 - 1 times
305441741 - 17 times
592100561 - 17 times
806429235 - 10 times
823206451 - 10 times
839983667 - 10 times
883674386 - 17 times
1073741824 - 2 times
2147483647 - 2 times


Hope this helps. Of course, I'm assuming you're on a platform that has
grep and tclsh installed. If not, just use my results. But note that
the statistics above cover C, C++ and Perl code. I was too lazy to
separate them.
 

slebetman

<snip>
Save the script as sumstat.tcl. To run it, you can type the following:

tclsh sumstat.tcl stat.txt
<snip>
Hope this helps. Of course, I'm assuming you're on a platform that has
grep and tclsh installed. If not, just use my results. But note that
the statistics above cover C, C++ and Perl code. I was too lazy to
separate them.

I've modified the sumstat.tcl script to further analyse results:

============================ begin sumstat.tcl script
#! /usr/bin/env tclsh
if {$argc < 1} {
    set input stdin
} else {
    set input [open [lindex $argv 0] "r"]
}
array set numbers {}
while {[eof $input] == 0} {
    set line [gets $input]
    # extract words:
    set line [split $line " "]
    foreach n $line {
        # get and count the number:
        if {[string is integer -strict $n]} {
            set n [expr {$n}] ;# normalise the number
            if {[info exists numbers($n)]} {
                incr numbers($n)
            } else {
                set numbers($n) 1
            }
        }
    }
}
if {$input != "stdin"} {
    close $input
}

# now we are done processing, print out stats:
set total 0
set bits8 0
set bits16 0
set bits32 0
puts "Literals found:"
foreach n [lsort -integer [array names numbers]] {
    puts " $n - $numbers($n) times"
    if {$n < 256 && $n > -128} {
        incr bits8 $numbers($n)
    } elseif {$n < 65536 && $n > -32768} {
        incr bits16 $numbers($n)
    } else {
        incr bits32 $numbers($n)
    }
    incr total $numbers($n)
}
set percent8 [format "%0.2f" [expr {$bits8*100.0/$total}]]
set percent16 [format "%0.2f" [expr {$bits16*100.0/$total}]]
set percent32 [format "%0.2f" [expr {$bits32*100.0/$total}]]
puts "\nStatistics:"
puts " 8bit numbers - $bits8 ($percent8%)"
puts " 16bit numbers - $bits16 ($percent16%)"
puts " 32bit numbers - $bits32 ($percent32%)"

============================ end script

Also, earlier I used: grep -r -e "= *[0-9]". I now realise that this only
captures assignments and not integers used for other purposes, such as
a = b + 12. I also modified the script to get all numbers in the source
code instead of just numbers after the = sign.

I ran this new script with:

grep -r -E "[0-9]+" code_dir/ | tclsh sumstat.tcl

piping the result of grep directly into the script. This simply gets
all lines containing numbers. The new result is obviously much larger
than before and is probably too long to post. The following is a
snipped version of the result:

Literals found:
-2147483648 - 5 times
-976170042 - 2 times
-823302554 - 1 times
<snip>
-1 - 285 times
0 - 6704 times
1 - 5408 times
2 - 3600 times
3 - 2194 times
4 - 1735 times
<snip>
1936484395 - 1 times
1936484398 - 1 times
2147483647 - 2 times

Statistics:
8bit numbers - 34984 (83.66%)
16bit numbers - 6346 (15.18%)
32bit numbers - 486 (1.16%)
 

Stefan Wallentowitz

On Sun, 11 Dec 2005 20:22:47 -0800, (e-mail address removed) wrote:

Thank you very, very much. I will use this approach and apply it to some
code specific to embedded systems. Although it isn't profiling in the
strict sense, it gives a good feeling for the subject matter.

Bye and thanks,
Stefan
 

Michael Wojcik

Download lots of source code, preferably large projects like the Linux
kernel, Apache, XFree86 and Mozilla. Then find all literals. I did the
following on my company's source code:

grep -r -e "= *[0-9]" SONaccessGateway/ > stat.txt

As Walter Roberson suggested, it would probably be better to do this
on the result of preprocessing the source (i.e., the output of
translation phase 4, 5, or 6). That at least will catch literals that
are expressed using object-like macros, though not necessarily folded
constants (which could be computed during phase 7, or even not until
runtime).

Most implementations have an option to write the result of
preprocessing to a file.
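
For example, assuming gcc (other compilers have an equivalent switch, and
foo.c is only a placeholder name here):

gcc -E foo.c > foo.i

The grep/sumstat pipeline above can then be run on foo.i instead of the
raw source.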

It might be possible to get even more accurate results by analyzing
an intermediate result from a later translation phase, such as the
assembly-language output available from many implementations. I
recently used such analysis to check for large auto-class objects in
a large code base. (Details are implementation-dependent and OT,
obviously.)
 
