Histogram of character frequencies


Richard Heathfield

(e-mail address removed) said:
James said:
Johannes Bauer wrote:
(e-mail address removed) schrieb:

int x[256]; // frequencies
Global.

It's completely acceptable to have variables defined at file scope in
C!

What's acceptable is not always a good idea. Global objects have many
disadvantages; they should be avoided except when necessary; they aren't
necessary in this case.

In this case they help simplify the code - the array gets initialized
to 0 at compile-time, instead of needing extra code for an
initialization loop - bad for efficiency!

int main(void)
{
int x[256] = {0};
return 0;
}

This code takes advantage of the rules for partially-initialised aggregate
objects to set all x's ints to 0.

But no "conforming implementation" on Windows rejects it!

Wrong. (And even if you were right, which you are not, the fact that an
implementation accepts faulty code does not imply that it must interpret
that code in the way you expect.)
I don't
believe any C compiler anywhere would reject it.

Your beliefs don't enter into it. Borland C rejects it when invoked in
conforming mode. So does gcc.
OK you're right I should remember that. However I don't think it's the
end of the world - the standard library is always linked in so the
right functions will be found in the end by the linker.

Headers aren't about finding the functions. They are for making sure that
the function is being called in the right way.

I don't really understand the problem with feof - it just checks if
the EOF indicator is set in a given FILE * struct. Anyway I'll read
about it.

The normal way to use feof incorrectly is to assume that it is predictive.
It is not.
 

Keith Thompson

Thanks again for all the suggestions, though I think some people are a
bit fussy in their answers.

No, you're entirely too fussy in your responses to the valuable
answers you're getting.
Here is a solution to Exercise 1.14. It deals well with control
characters too.

I don't believe that the code you posted is the same as the code
you're actually using. In most implementations you can get away with
calling an external function with no visible declaration (the compiler
will assume that the function returns int, and if it does, you just
might be ok). But they will *not* let you get away with referring to
external variables with no declaration.

Your calls to feof(), getchar(), and printf() just might happen to
work. Your reference to stdin would almost certainly cause the
compiler to reject your program.

I suspect that your actual code has the mandatory "#include <stdio.h>"
at the top, but you copy-and-pasted only part of the program when you
posted it.
// make histogram of character frequencies

int x[256]; // frequencies

This doesn't need to be global. You can declare

int x[256] = { 0 };

inside main().

Where does the number 256 come from? You seem to be assuming that
type char is 8 bits (not guaranteed), and either that it's
unsigned or that you'll never encounter a character with a negative
value. When I added the "#include <stdio.h>" to your program and fed
it its own executable (a binary file), it died with a segmentation
fault, probably because it read a character with a negative value.

See the constant UCHAR_MAX, declared in <limits.h>. (getchar(), when
it doesn't return EOF, gives you a character value represented as an
unsigned char and then converted to int, so storing the result of
getchar() in an int would avoid the problem of negative values.)
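A minimal sketch of the counting step along those lines. The helper name count_bytes and its FILE * parameter are illustrative assumptions, not part of the posted program. The table is sized from UCHAR_MAX, and the value read is kept in an int:

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: tallies every byte read from `in` into `freq`,
 * which must have UCHAR_MAX + 1 elements. Returns the largest count. */
unsigned long count_bytes(FILE *in, unsigned long freq[])
{
    unsigned long max = 0;
    int c; /* int, not char: must hold every unsigned char value plus EOF */

    while ((c = getc(in)) != EOF) {
        /* c is in the range 0..UCHAR_MAX here, so the index is in bounds */
        if (++freq[c] > max) {
            max = freq[c];
        }
    }
    return max;
}
```

A caller would declare `unsigned long freq[UCHAR_MAX + 1] = { 0 };` inside main() and pass stdin as the first argument.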

The following references are to the comp.lang.c FAQ, available at
void main()

Questions 11.12a, 11.12b, 11.14a, 11.14b, 11.15.
{
char c;

Question 12.1. (getchar() returns a result of type int; you're
storing the result in an object of type char, potentially losing
information.)
int i, y=0, z;
while(! feof(stdin) )

Question 12.2.
if(++x[c=getchar()]>y)

If c is negative, you attempt to modify an element outside the bounds
of the array x. If c exceeds 255 (as it can do on an implementation
with CHAR_BIT>8), the same thing can happen; you're not likely to run
into such an implementation, but it's easy enough to write portable
code.
y=x[z=c];
do {
for(i=0; i<256; i++)

Again, 256.
if(x[i]>0)
printf("%s", x[i]>y ? " * " : " ");
printf("\n");
} while(y--);
for(i=0; i<256; i++)


256 again.
if(x[i]>0)
if(i>32)


Where do you get the value 32? It happens to be the code for the ' '
character in ASCII. Your code will be much clearer if you use a
character constant ' ' rather than an integer constant 32.

Presumably the point is to determine whether i is a printable character.
printf(" %c ", i);
else
printf("%02x ", i);

The use of "%02x" means that your output could be misaligned on a
system with characters bigger than 8 bits. That's probably not a big
problem.
printf("\n");

I recommend adding a "return 0;" here.

Finally, C allows multi-character identifiers for a very good reason.
Your use of x, y, z, i, and c makes your code difficult to read. The
names x, y, and z imply that these three variables are related in
some coherent manner (perhaps as coordinates), but they're not.
Consider naming your array something like "freq" or "frequencies".
I'm not going to take the time to figure out what "y" and "z" are
supposed to be.

You seem to value terseness over readability. This is not a good
choice. For example, you use braces only where they're absolutely
necessary; I prefer to use braces for all control statements, even
when only one statement is being controlled. For example, where you
write:

for(i=0; i<256; i++)
if(x[i]>0)
if(i>32)
printf(" %c ", i);
else
printf("%02x ", i);

I might write:

for (i = 0; i < 256; i++) {
if (x[i] > 0) {
if (i > 32) {
printf(" %c ", i);
}
else {
printf("%02x ", i);
}
}
}

It's a habit I picked up from Perl, which requires this, but it works
well in C. For one thing, it makes it easier to add an additional
statement if the braces are already there.
 

santosh

Keith said:
(e-mail address removed) writes:
if(x[i]>0)
if(i>32)


Where do you get the value 32? It happens to be the code for the ' '
character in ASCII. Your code will be much clearer if you use a
character constant ' ' rather than an integer constant 32.

Presumably the point is to determine whether i is a printable
character. Use the isgraph() function, declared in <ctype.h>, for
this.


<snip>

Why not isprint() instead of isgraph()?
 

santosh

O_TEXT said:
// make histogram of character frequencies

int x[256]; // frequencies

This can only give a histogram of byte frequencies, because there are
far more than 256 characters (see www.unicode.org).

A C "byte" may be wider than 8 bits. In any case the behaviour of OP's
code is implementation defined, as soon as he goes beyond the basic
source and execution character set guaranteed by the C standard.
 

O_TEXT

santosh a écrit :
O_TEXT said:
// make histogram of character frequencies

int x[256]; // frequencies
This can only give histogram of byte frequencies.
Because there are far more than 256 characters. (see www.unicode.org)

A C "byte" may be wider than 8 bits.

By usual definition, a byte is exactly 8 bits.
In any case the behaviour of OP's
code is implementation defined, as soon as he goes beyond the basic
source and execution character set guaranteed by the C standard.

Most platforms work with the ASCII character set.

Do you know many platforms which work with a character set not
compatible with ASCII?
How do you program with a variety of distinct characters on distinct
platforms in standard C?
 

Chris Dollin

O_TEXT said:
santosh a écrit :
O_TEXT said:
// make histogram of character frequencies

int x[256]; // frequencies
This can only give histogram of byte frequencies.
Because there are far more than 256 characters. (see www.unicode.org)
A C "byte" may be wider than 8 bits.

By usual definition, a byte is exactly 8 bits.

C's definition isn't the "usual" one; C's "bytes" are chars, and the
width of a `char` is implementation-defined.
Most platforms works with ASCII character set.

Some don't. The Standard doesn't exile such platforms (at least,
not for that reason).
Do you know many platforms which work with a character set not
compatible with ASCII?

EBCDIC; surely there are still EBCDIC implementations of C.
How do you program with a variety of distincts characters on distincts
platforms with the standard C?

Carefully, [locale] sensitively, and likely more and more unicodely.
It's just stuff.
 

James Kuyper

O_TEXT said:
santosh a écrit : ....

By usual definition, a byte is exactly 8 bits.

The usual definition is off-topic in this newsgroup, except insofar as
it needs to be mentioned as a contrast to the C standard's definition.
 

O_TEXT

santosh a écrit :
O_TEXT said:
santosh a écrit :
O_TEXT wrote:

// make histogram of character frequencies

int x[256]; // frequencies
This can only give histogram of byte frequencies.
Because there are far more than 256 characters. (see
www.unicode.org)
A C "byte" may be wider than 8 bits.
By usual definition, a byte is exactly 8 bits.

The exact meaning of the term byte depends on the context. A DSP's byte
may be 32 bits, Java's byte is 8 bits, and so on.

In C, a byte is synonymous with the "char" data type, which must be of
at least 8 bits, but may be anything greater. For a particular compiler
the macro CHAR_BIT in limits.h gives this value.
Most platforms works with ASCII character set.

Do you know many platforms which work with a character set not
compatible with ASCII?

Mainframes generally use a different character set.
How do you program with a variety of distincts characters on distincts
platforms with the standard C?

Within the C standard, you don't. You have to go beyond the C standard
if you need to process and display characters not covered by C's basic
character set.

Most major systems have semi-standardised libraries and routines for
this like iconv() etc. You'll have to consult the appropriate standards
and your system documentation for this.

ICU works on most platforms, doesn't it?
 

santosh

O_TEXT said:
santosh a écrit :
O_TEXT said:
// make histogram of character frequencies

int x[256]; // frequencies
This can only give histogram of byte frequencies.
Because there are far more than 256 characters. (see
www.unicode.org)
A C "byte" may be wider than 8 bits.

By usual definition, a byte is exactly 8 bits.

The exact meaning of the term byte depends on the context. A DSP's byte
may be 32 bits, Java's byte is 8 bits, and so on.

In C, a byte is synonymous with the "char" data type, which must be of
at least 8 bits, but may be anything greater. For a particular compiler
the macro CHAR_BIT in limits.h gives this value.
Most platforms works with ASCII character set.

Do you know many platforms which work with a character set not
compatible with ASCII?

Mainframes generally use a different character set.
How do you program with a variety of distincts characters on distincts
platforms with the standard C?

Within the C standard, you don't. You have to go beyond the C standard
if you need to process and display characters not covered by C's basic
character set.

Most major systems have semi-standardised libraries and routines for
this like iconv() etc. You'll have to consult the appropriate standards
and your system documentation for this.
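As a small illustration of the CHAR_BIT point above, the following sketch reports the byte width and the largest unsigned char value for whatever implementation compiles it. The function name print_char_limits is an assumption made for this example:

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: prints the implementation's byte width and the
 * largest value an unsigned char can hold. */
void print_char_limits(void)
{
    printf("CHAR_BIT  = %d\n", CHAR_BIT);
    printf("UCHAR_MAX = %u\n", (unsigned)UCHAR_MAX);
}
```

A table indexed by byte value needs UCHAR_MAX + 1 elements, which is 256 only when CHAR_BIT is 8.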
 

John Bode

Johannes said:
(e-mail address removed) schrieb:
int x[256]; // frequencies

It's completely acceptable to have variables defined at file scope in
C!

But it's often not a good idea. You shouldn't define a variable at
file scope unless it *needs* to be at file scope.
Why does everyone have this hangup about this? I took a class in C a
while back and my teacher always used void main() { ... }. I can
confirm that it works fine with both the Microsoft and Borland compilers.

The document which defines the C language provides two possible
definitions for main():

int main(void)
int main(int argc, char **argv) /* or char *argv[], which is
equivalent */

Many implementations will load and execute the program with no
*apparent* problem if main() is typed void, but that's no guarantee
that the system has not been left in bad or inconsistent state. There
is at least one platform out there (Acorn?) that will not load the
program *at all* if main() is not typed int.

Using int main() is guaranteed to work everywhere (at least on every
hosted implementation). Using void main() is *not* guaranteed to work
everywhere.
I read the answers but mostly people only comment on trivial things
that aren't even errors! I'll be glad to have substantial comments on
my code.

void main() is an error unless your particular compiler explicitly
supports it (which it probably doesn't).
 

John Bode

Hello everyone,

Thanks again for all the suggestions, though I think some people are a
bit fussy in their answers.

Here is a solution to Exercise 1.14. It deals well with control
characters too.

// make histogram of character frequencies

int x[256]; // frequencies

void main()

int main(void)
{
char c;

int c; /* getchar() returns int */
int i, y=0, z;
while(! feof(stdin) )
if(++x[c=getchar()]>y)
y=x[z=c];

feof() will not return true until *after* you try to read past the end
of file. You should check against the result of getchar() first:

while ((c = getchar()) != EOF)
{
if (++x[c] > y)
y = x[c]; /* z never gets used again, so we don't bother
with it */
}
if (c == EOF)
{
if (feof(stdin))
{
/* reached end-of-file condition */
}
else
{
/* Some other read error */
}
}

Also, years of experience have taught me that it's best to use
compound statements (i.e., braces) for all but the innermost scope of
a nested statement like that; it just makes things easier to read.
do {
for(i=0; i<256; i++)
if(x[i]>0)
printf("%s", x[i]>y ? " * " : " ");
printf("\n");
} while(y--);
for(i=0; i<256; i++)
if(x[i]>0)
if(i>32)
printf(" %c ", i);
else
printf("%02x ", i);


Again, I'd suggest using compound statements for all but the innermost
scope; it will just make things easier to follow in the future.
 

Keith Thompson

santosh said:
Keith said:
(e-mail address removed) writes:
if(x[i]>0)
if(i>32)


Where do you get the value 32? It happens to be the code for the ' '
character in ASCII. Your code will be much clearer if you use a
character constant ' ' rather than an integer constant 32.

Presumably the point is to determine whether i is a printable
character. Use the isgraph() function, declared in <ctype.h>, for
this.


<snip>

Why not isprint() instead of isgraph()?


Because the original program prints the hex code, not the character
itself, for the space character. isgraph() tests for printing
characters other than ' '.
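A rough sketch of that labelling step using isgraph(). The helper name format_label and its buffer parameters are hypothetical, and it assumes i is in the range 0 to UCHAR_MAX, as isgraph() requires:

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: writes a column label for byte value `i` into
 * `buf`. Printing characters other than space are shown as themselves;
 * everything else, including space, is shown as a hex code. */
void format_label(int i, char *buf, size_t size)
{
    if (isgraph(i)) {
        snprintf(buf, size, " %c ", i);
    } else {
        snprintf(buf, size, "%02x ", (unsigned)i);
    }
}
```

Swapping isgraph() for isprint() would print the space character itself, which is the difference being discussed here.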
 

Chris Torek

santosh a écrit :
By usual definition, a byte is exactly 8 bits.

See said:
Most platforms works with ASCII character set.

Do you know many platforms which work with a character set not
compatible with ASCII?

Sure: a number of IBM mainframes use EBCDIC.
How do you program with a variety of distincts characters on
distincts platforms with the standard C?

It is mostly a matter of finding assumptions about character sets
(such as the assumption that "all alphabetic characters are
contiguous") and cleaning them up. There is usually a tradeoff
involved -- you may need an extra lookup table here or there, for
instance -- but in general the cost of accommodating different
"native text" encodings is relatively low, especially when compared
with the cost of accommodating UTF-8, UTF-16, Unicode, and/or
"internationalization".
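One way to picture the "extra lookup table" idea is the sketch below, which finds a letter's position without assuming that 'a' through 'z' have consecutive codes (they do not in EBCDIC). The function name letter_index is made up for this example:

```c
#include <string.h>

/* Hypothetical helper: returns 0..25 for a lowercase letter of the basic
 * Latin alphabet, or -1 otherwise. It searches a lookup string instead
 * of computing c - 'a', so it does not rely on contiguous letter codes. */
int letter_index(int c)
{
    static const char letters[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0') {
        return -1; /* strchr() would otherwise match the terminator */
    }
    p = strchr(letters, c);
    return p ? (int)(p - letters) : -1;
}
```

The search costs a little more than a subtraction, which is the kind of small tradeoff described above.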
 

Barry Schwarz

On Sat, 1 Dec 2007 09:21:15 -0800 (PST),
Hello everyone,

Thanks again for all the suggestions, though I think some people are a
bit fussy in their answers.

Here is a solution to Exercise 1.14. It deals well with control
characters too.

For some version of solution and deals well.
// make histogram of character frequencies

int x[256]; // frequencies

void main()

Trolling are you?
{
char c;
int i, y=0, z;
while(! feof(stdin) )

And again?
if(++x[c=getchar()]>y)

If char defaults to signed, this will invoke undefined behavior just
after reading the last char (if you can generate an end of file
condition on stdin).
y=x[z=c];

What do you use z for?
do {
for(i=0; i<256; i++)
if(x[i]>0)
printf("%s", x[i]>y ? " * " : " ");


Your viewing device supports 256 characters on a single line?
printf("\n");
} while(y--);
for(i=0; i<256; i++)
if(x[i]>0)
if(i>32)


What do you think is magical about 32? My system has numerous
unprintable characters above that value.
printf(" %c ", i);
else
printf("%02x ", i);
printf("\n");
}


Remove del for email
 

Barry Schwarz

Johannes Bauer wrote:
(e-mail address removed) schrieb:

int x[256]; // frequencies
Global.

It's completely acceptable to have variables defined at file scope in
C!

What's acceptable is not always a good idea. Global objects have many
disadvantages; they should be avoided except when necessary; they aren't
necessary in this case.

In this case they help simplify the code - the array gets initialized
to 0 at compile-time, instead of needing extra code for an
initialization loop - bad for efficiency!

But you could achieve the same effect without any of the problems that
global variables cause simply by declaring the array static inside the
function.
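A minimal sketch of what a function-local static array could look like. The helper name tally is an assumption for illustration, not something from the original program:

```c
#include <limits.h>

/* Hypothetical helper: records one byte value and returns its running
 * count. The static array is zero-initialised before the program starts,
 * just like a file-scope array, but is visible only inside this function. */
unsigned long tally(unsigned char byte)
{
    static unsigned long freq[UCHAR_MAX + 1]; /* implicitly all zeros */

    return ++freq[byte];
}
```

The counts persist between calls and the function is not reentrant, so some of the drawbacks of a global remain.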

snip
OK you're right I should remember that. However I don't think it's the
end of the world - the standard library is always linked in so the
right functions will be found in the end by the linker.

The headers have very little to do with what the linker will link in
with your code and everything to do with the compiler generating
correct code. Leaving out stdlib.h and calling malloc introduces
undefined behavior. Leaving out string.h and passing anything other
than a void* or char* to memcpy or memset introduces undefined
behavior.
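For illustration, here is a small sketch with both headers included so that the compiler sees the correct declarations of malloc() and memset(). The helper name make_counters is hypothetical:

```c
#include <stdlib.h> /* declares malloc() and free() */
#include <string.h> /* declares memset() */

/* Hypothetical helper: allocates `n` counters and zeroes them.
 * Returns NULL if the allocation fails; the caller frees the result. */
unsigned long *make_counters(size_t n)
{
    unsigned long *p = malloc(n * sizeof *p);

    if (p != NULL) {
        memset(p, 0, n * sizeof *p);
    }
    return p;
}
```

Without the two #include lines, a C90 compiler would assume both functions return int, which is exactly the mismatch described above.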

snip


Remove del for email
 

Peter 'Shaggy' Haywood

Groovy hepcat (e-mail address removed) was jivin' in comp.lang.c
Johannes Bauer wrote:
(e-mail address removed) schrieb:

int x[256]; // frequencies
Global.

It's completely acceptable to have variables defined at file scope
in C!

What's acceptable is not always a good idea. Global objects have many
disadvantages; they should be avoided except when necessary; they
aren't necessary in this case.

In this case they help simplify the code - the array gets initialized
to 0 at compile-time, instead of needing extra code for an
initialization loop - bad for efficiency!

Others have shown you how to initialise block scope arrays. The
generated object code may simply be a loop in which elements are
assigned a given value. In that case initialisation may be no more
efficient than a loop you write yourself. This is also true of the
implicit initialisation of a file scope array.
Code that makes extensive use of things like global variables is often
called "spaghetti code". Only meat ball programmers write code like
that.
But no "conforming implementation" on Windows rejects it! I don't
believe any C compiler anywhere would reject it.

Let's test that assertion, shall we? I reboot to Windoze (because I'm
using Linux), open a console window and enter these lines:

-----------------------------------------------------------------------
copy con testing.c
#include <stdio.h>

void main(void)
{
puts("Hello, World!");
}
^z
bcc32 -A -etesting.exe -w testing.c
-----------------------------------------------------------------------

The resulting output from Borland Builder is as follows:

-----------------------------------------------------------------------
Borland C++ 5.3 for Win32 Copyright (c) 1993, 1998 Borland International
testing.c:
Error testing.c 4: main must have a return type of int in function main
*** 1 errors in Compile ***
-----------------------------------------------------------------------

That's an error message (which halts compilation), not merely a warning
(which allows compilation to continue). When invoked with the -A
command line option (which forces it to be standard compliant) the
Borland compiler rejects void as a return type for main(). Not only has
"any C compiler anywhere" rejected it, but a "'conforming
implementation' on Windows" has rejected it.
Clearly, therefore, you are wrong.
OK you're right I should remember that. However I don't think it's the
end of the world - the standard library is always linked in so the
right functions will be found in the end by the linker.

Who says? The library may not be linked without the compiler magic
contained in the headers. Or it may be linked, but the functions may
not be called properly. The point is that failing to include the proper
headers is a very serious error, and you must understand this.
I don't really understand the problem with feof - it just checks if
the EOF indicator is set in a given FILE * struct. Anyway I'll read
about it.

Many newbies think feof()'s purpose is to indicate when the end of a
file is reached by a read function. This is incorrect. Its purpose is
to indicate when a file stream's end of file indicator is set. This
only happens when you try to read from a stream that has *already*
reached the end. To explain this more clearly, consider the following
situation.
Suppose you have a stream containing three bytes, and you are reading
one byte at a time, using getchar(), in a loop, like so:

while(!feof(stdin))
{
c = getchar();
putchar(c);
}

On the first iteration of the loop you test the end of file indicator
for the input stream (stdin in this example), and it is not set, so you
then read the first byte and write this out. On the second iteration of
the loop you test the end of file indicator again, and it is not set,
so you then read the second byte and write this out. On the third
iteration of the loop you test the end of file indicator again, and it
is not set, so you then read the third byte and write this out. So far
so good. But the end of file indicator is still *not* set, and feof()
will return false. So you iterate a fourth time and try another read.
The read fails, of course, because there are no more bytes in the
stream. This is the perfect time to exit the loop; and since getchar()
returns EOF to indicate that it failed to read a byte, you have the
perfect way to detect this situation. However, you ignore this value
and simply continue processing the (now invalid) data you think you've
read from the file: you pass EOF to putchar(), which writes a spurious
byte to stdout. The failed read has set the end of file indicator, so
feof() now returns true. But it's too late. You don't test for this
until the beginning of the fifth iteration, *after* you've used the
invalid data. You've read in three valid bytes and written out four
bytes, one of which is not valid.
What you should be doing is this:

int c; /* c must be an int so we can detect EOF. */

while(EOF != (c = getchar()))
{
putchar(c);
}
if(!feof(stdin)) /* Or we could use if(ferror(stdin)). */
{
/* File read error: handle it somehow. */
}

Here we attempt to read a byte with getchar(), and only enter the loop
if the return value does not indicate a failure to read a byte. After
the failure code (EOF) has been detected, the loop is exited, and we
then attempt to determine whether the failure occurred due to an error
or an end of file condition. Here's a breakdown of how it works (using
the same 3 byte example input as before).
On the first iteration we read the first byte and test whether the
read was successful. It was, so we output the byte. On the second
iteration we read the second byte and test whether the read was
successful. It was, so we output the byte. On the third iteration we
read the third byte and test whether the read was successful. It was,
so we output the byte. On the fourth iteration we read a byte, but the
stream is exhausted and the read fails; getchar() returns EOF. We
detect this and exit the loop. We've read in three valid bytes and
written out three bytes, all of which are valid. *Now* we call feof()
to test whether the failure was due to an end of file condition, and,
if so, skip the error handling code. However, if feof() returns false,
then the read failure must have been due to an error, in which case we
handle the error somehow (perhaps by emitting a diagnostic message and
quitting).
 

Flash Gordon

Barry Schwarz wrote, On 06/12/07 02:14:
(e-mail address removed) wrote:
Johannes Bauer wrote:
(e-mail address removed) schrieb:

int x[256]; // frequencies
Global.
It's completely acceptable to have variables defined at file scope in
C!
What's acceptable is not always a good idea. Global objects have many
disadvantages; they should be avoided except when necessary; they aren't
necessary in this case.
In this case they help simplify the code - the array gets initialized
to 0 at compile-time, instead of needing extra code for an
initialization loop - bad for efficiency!

But you could achieve the same effect without any of the problems that
global variables cause simply by declaring the array static inside the
function.

That still has a number of the problems of global variables, just not
all of them.
The headers have very little to do with what the linker will link in
with your code and everything to do with the compiler generating
correct code. Leaving out stdlib.h and calling malloc introduces
undefined behavior. Leaving out string.h and passing anything other
than a void* or char* to memcpy or memset introduces undefined
behavior.

Actually, any call to memcpy or memset without a declaration in scope
invokes undefined behaviour even if you pass void* parameters. This is
because the return type is void*, and the return type being wrong causes
undefined behaviour even if you do not use the value returned. I can
even think of ways it could cause problems!
 

RoS

In data Fri, 07 Dec 2007 00:52:19 +1100, Peter 'Shaggy' Haywood
scrisse:
while(!feof(stdin))
{
c = getchar();
putchar(c);
}

On the first iteration of the loop you test the end of file indicator
for the input stream (stdin in this example), and it is not set, so you
then read the first byte and write this out. On the second iteration of
the loop you test the end of file indicator again, and it is not set,
so you then read the second byte and write this out. On the third
iteration of the loop you test the end of file indicator again, and it
is not set, so you then read the third byte and write this out. So far
so good. But the end of file indicator is still *not* set, and feof()
will return false. So you iterate a fourth time and try another read.
The read fails, of course, because there are no more bytes in the
stream. This is the perfect time to exit the loop; and since getchar()
returns EOF to indicate that it failed to read a byte, you have the
perfect way to detect this situation. However, you ignore this value
and simply continue processing the (now invalid) data you think you've
read from the file. You send EOF to stdout. *Now* the end of file
indicator is set, and feof() returns true. But it's too late. You don't
test for this until the beginning of the fifth iteration, *after*
you've used the invalid data. You've read in three valid bytes and
written out four bytes, one of which is not valid.
What you should be doing is this:

int c; /* c must be an int so we can detect EOF. */

while(EOF != (c = getchar()))
{
putchar(c);
}
if(!feof(stdin)) /* Or we could use if(ferror(stdin)). */
{
/* File read error: handle it somehow. */
}

so what is right?

while(1)
{
c = getchar();
if( feof(stdin) || ferror(stdin) ) break;
putchar(c);
}

or

while(1)
{
c = getchar();
if( feof(stdin) ) break;
putchar(c);
}

while(1)
{
c = getchar();
if( ferror(stdin) ) break;
putchar(c);
}

or no one of above

thank you
 
