taking a "word" as input


Nick Keighley

C takes input character by character.

nope. It can read lines (fgets()) or arbitrary blocks (fread())
I did not find any Standard Library
function that can take a word as input.

correct, there aren't any.

So I want to write one of my own
to be used with "Self Referential Structures" of section 6.5 of K&R2. K&R2
has their own version of <getword> which, I think,  is quite different
from what I need:

<getword> will have following properties:

 1.) If the word contains any number like "beauty1" or "win2e" it will
 discard it, K&R2's <getword> does not. My <getword> will only take
 pure-words like "beauty", "wine" etc.

take a look at isalpha()
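
A minimal sketch of the isalpha() idea, assuming the candidate word is already in a string. The name is_pure_word is hypothetical and not from K&R2 or from this thread:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is non-empty and every character
   is a letter according to isalpha(), otherwise 0. */
int is_pure_word(const char *s)
{
    assert(s != NULL);

    if (*s == '\0')
        return 0;               /* an empty string is not a word */

    for (; *s != '\0'; s++) {
        /* The cast avoids undefined behaviour for negative char values. */
        if (!isalpha((unsigned char)*s))
            return 0;
    }
    return 1;
}
```

With this, "beauty" and "wine" would be accepted while "beauty1" and "win2e" would be rejected. Note that isalpha() depends on the current locale; in the default "C" locale it accepts only the 26 upper-case and 26 lower-case letters.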

2.) we can store each word by using <array of pointers> pointing to those
words and since words themselves are  strings, which in
reality, are <arrays of chars>, so we will have <array of pointers> to
those <arrays of chars>.

char *word_table [100];

or you think using a 2D array is a better idea ?

are all your words the same size?
If you use the array of pointers you'll have to get the memory
for each word from somewhere (eg. malloc())
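
A rough sketch of the array-of-pointers approach, where each word gets its own malloc()ed copy. The names store_word, free_words and word_count, and the MAXWORDS capacity, are assumptions made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { MAXWORDS = 100 };        /* assumed table capacity */

char *word_table[MAXWORDS];     /* pointers to the stored words */
int   word_count = 0;           /* number of slots in use */

/* Copies word into freshly allocated memory and records the pointer.
   Returns 1 on success, 0 if the table is full or malloc() fails. */
int store_word(const char *word)
{
    char *copy;

    assert(word != NULL);

    if (word_count >= MAXWORDS)
        return 0;

    copy = malloc(strlen(word) + 1);    /* +1 for the terminating '\0' */
    if (copy == NULL)
        return 0;

    strcpy(copy, word);
    word_table[word_count++] = copy;
    return 1;
}

/* Releases every stored word and empties the table. */
void free_words(void)
{
    while (word_count > 0)
        free(word_table[--word_count]);
}
```

Each word then takes only as much memory as it needs, at the price of one allocation per word and having to free them all at the end.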
 

arnuld

C takes input character by character. I did not find any Standard Library
function that can take a word as input. So I want to write one of my own
to be used with "Self Referential Structures" of section 6.5 of K&R2. K&R2
has their own version of <getword> which, I think, is quite different
from what I need:

<getword> will have following properties:


1.) If the word contains any number like "beauty1" or "win2e" it will
discard it, K&R2's <getword> does not. My <getword> will only take
pure-words like "beauty", "wine" etc.


2.) we can store each word by using <array of pointers> pointing to those
words and since words themselves are strings, which in
reality, are <arrays of chars>, so we will have <array of pointers> to
those <arrays of chars>.


or you think using a 2D array is a better idea ?
 

santosh

arnuld said:
C takes input character by character. I did not find any Standard
Library function that can take a word as input. So I want to write one
of my own to be used with "Self Referential Structures" of section 6.5
of K&R2. K&R2
has their own version of <getword> which, I think, is quite different
from what I need:

<getword> will have following properties:


1.) If the word contains any number like "beauty1" or "win2e" it will
discard it, K&R2's <getword> does not. My <getword> will only take
pure-words like "beauty", "wine" etc.

What about words with other characters like hyphen? What about
constructs like "get_name"? Will you discard them too. What about words
that end with a ; or ...? What about words that contain symbols like #@
etc? Or words that end with an exclamation mark? Or words within
parenthesis or braces?

Just giving you some food for thought as to what exactly you are going
to consider a word and what you will reject. This can be far trickier
than one first imagines.

2.) we can store each word by using <array of pointers> pointing to
those words and since words themselves are strings, which in
reality, are <arrays of chars>, so we will have <array of pointers> to
those <arrays of chars>.

That's one way yes, suitable when you don't know the lengths of words in
advance, or you don't want to possibly waste storage with statically
allocated arrays.

or you think using a 2D array is a better idea ?

Depends on your requirements really, and the type and frequency of input
you expect. Will you put an upper limit on the length of words? It
hardly makes sense to accept words longer than about 64 characters if
you are dealing with normal English text. Static 2D arrays are
undoubtedly easier to work with but are less flexible than dynamically
allocated arrays. Since statically allocated arrays are of fixed size
it's possible for some elements to remain unused and hence wasted. OTOH
a large number of small allocations may lead to memory fragmentation
and also some wastage due to malloc bookkeeping and possibly also a
slowdown in speed if you'll be reading a very large number of words
from a file. For input from a human it will not matter.

One efficient method is to use a single dynamically allocated array in
which words are stored sequentially. The length of each word could be
specified by either one or two bytes prefixing the word itself. This
results in very efficient storage, but is grossly inefficient if you
want to insert and delete words at random. For this a hash table based
approach is probably the best. OTOH a tree is very convenient for quick
searching and sorting.
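
One possible reading of the single-array idea, sketched with a one-byte length prefix (so words are limited to 255 characters here). The struct and function names are invented for this example:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pool: words stored back to back, each preceded by one
   length byte, with no terminating '\0'. */
struct wordpool {
    unsigned char *data;    /* the single dynamically allocated array */
    size_t used;            /* bytes in use */
    size_t cap;             /* bytes allocated */
};

/* Appends word to the pool. Returns 1 on success, 0 if the word is
   longer than 255 characters or memory runs out. */
int pool_add(struct wordpool *p, const char *word)
{
    size_t len;

    assert(p != NULL && word != NULL);

    len = strlen(word);
    if (len > 255)
        return 0;

    if (p->used + 1 + len > p->cap) {
        size_t ncap = p->cap ? p->cap * 2 : 64;
        unsigned char *nd;

        while (ncap < p->used + 1 + len)
            ncap *= 2;
        nd = realloc(p->data, ncap);
        if (nd == NULL)
            return 0;           /* the old block is still valid */
        p->data = nd;
        p->cap = ncap;
    }

    p->data[p->used++] = (unsigned char)len;    /* length prefix */
    memcpy(p->data + p->used, word, len);       /* the letters */
    p->used += len;
    return 1;
}

/* Counts the stored words by hopping from prefix to prefix. */
size_t pool_count(const struct wordpool *p)
{
    size_t i = 0, n = 0;

    assert(p != NULL);

    while (i < p->used) {
        i += 1 + p->data[i];
        n++;
    }
    return n;
}
```

Start with a pool initialised to { NULL, 0, 0 } and free(p.data) when done. Appending is cheap, but as noted above, deleting or inserting a word in the middle means moving everything after it.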

If you tell us more details about the type and volume of input you
expect and the facilities (like searching, insertion, etc.) you plan to
implement, perhaps a tailored approach can be suggested.
 

Nick Keighley

On Tue, 29 Apr 2008 02:47:18 -0700, Nick Keighley wrote:

It depends on the user, what he likes to input at run-time.

in other words, no.

santosh has pointed out some of the design drivers for this.
So decide do you want a fixed size (limits word size and wastes space)
or a variable size (harder to program).


Note Well

yes. I came up with this code and as you can see it does not do what I
want. I want to take every word into the input but it only takes 1st for
obvious reasons. I am not able to think of the way to take all the words
of the input:

1. after you read a word you need to skip to the next word.

eg. read until you get a letter

2. you need somewhere to store the words. Either a 2D array or
use malloc().
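
A sketch of point 1, assuming the simplest rule: skip everything that is not a letter, then collect letters. The name next_word and the FILE * parameter are assumptions (pass stdin to read from the keyboard). Unlike the rule discussed elsewhere in this thread, this version does not reject tokens such as "like3"; it simply returns "like".

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Reads the next run of letters from fp into word, storing at most
   max - 1 characters plus the terminating '\0'.
   Returns 1 if a word was stored, 0 at end of input. */
int next_word(FILE *fp, char *word, int max)
{
    int c, i = 0;

    assert(fp != NULL && word != NULL && max > 0);

    /* Skip to the next word: read until we get a letter. */
    while ((c = getc(fp)) != EOF && !isalpha(c))
        ;
    if (c == EOF)
        return 0;

    /* Collect letters; letters beyond max - 1 are read but dropped. */
    do {
        if (i < max - 1)
            word[i++] = (char)c;
    } while ((c = getc(fp)) != EOF && isalpha(c));

    word[i] = '\0';
    return 1;
}
```

Calling it in a loop, while (next_word(stdin, buffer, MAXWORD)), would visit every word of the input rather than only the first. Each word still has to be copied somewhere before the next call overwrites the buffer, which is point 2.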

#include <stdio.h>
#include <ctype.h>

enum MAXSIZE { MAXWORD = 100 };

char *getword( char *, int );

int main(void) {

  char buffer[MAXWORD];

this only holds one word

char buffer[MAXNUMWORDS][MAXWORD];
OR char* buffer [MAXWORD]

  getword( buffer, MAXWORD );

pass the appropriate argument

<snip>
 

Nick Keighley

all of them will be discarded. Only words containing letters like
"santosh" will be considered, nothing else.


yes, exactly, input will be at run-time only.

I don't understand what you mean here

ok, make the upper limit to 64 :) , I usually take it 100 as my style.


you want to say that there will be 2 types of implementations if
efficiency is my concern:

  1.) input from human
  2.) input from a text-file

 ??

couldn't there be a single implementation for both types of inputs ?

yes. But file or human might influence your design. People type
v e r y s l o w l y so a human input only program doesn't need to
be fast (for this problem). The file input one should work just fine
with people.

The basic problem is to sort, count and print the sorted words.  We are
not going to save a word in an array if it has already appeared, we will
just increase the count for that word.  

that didn't really answer the question...
 

arnuld

by accident, it is actually exercise 6-4 of K&R2 :)


How about this code. It works fine:


/* A program that takes a single word as input. It will discard
* the whole input if it contains anything other than the 26 alphabets
* of English. If the input word contains more than 30 letters then only
* the extra letters will be discarded . For general purpose usage of
* English it does not make any sense to use a word larger than this size.
* Nearly every general purpose word can be expressed in a word with less
* than or equal to 30 letters.
*
* version 1.1
*
*/


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>


enum MAXSIZE { WORDSIZE = 30 };

int getword( char *, int );


int main( void )
{
    char ac[WORDSIZE];

    if( getword( ac, WORDSIZE ) )
    {
        printf("%s\n", ac);
    }

    return EXIT_SUCCESS;
}


int getword( char *word, int max_length )
{
    int c;
    char *w = word;


    while( isspace( c = getchar() ) )
    {
        ;
    }

    while( --max_length )
    {
        if( isalpha( c ) )
        {
            *w++ = c;
        }
        else if( c == '\n' || c == EOF || isspace( c ) )
        {
            *w = '\0';
            break;
        }
        else
        {
            return 0;
        }

        c = getchar();
    }

    /* I can simply ignore the if condition and directly write the '\0'
       onto the last element because in worst case it will only rewrite
       the '\n' that is put in there by else if clause.

       or in else if clause, I could replace break with return word[0].

       I thought these 2 ideas will be either inefficient or
       a bad programming practice, so I did not do it.
    */
    if( *w != '\0' )
    {
        *w = '\0';
    }

    return word[0];
}


========== OUTPUT ============
Welcome to the Emacs shell

/home/arnuld/programs/C $ gcc -ansi -pedantic -Wall -Wextra getword.c
/home/arnuld/programs/C $ ./a.out
like this
like
/home/arnuld/programs/C $ ./a.out
like3
/home/arnuld/programs/C $ ./a.out
9like
/home/arnuld/programs/C $ ./a.out
like ll
like
/home/arnuld/programs/C $
 

Ben Bacarisse

I don't have K&R2 so I don't know the end point of this exercise, so I
may have this wrong...
How about this code. It works fine:

/* A program that takes a single word as input. It will discard
* the whole input if it contains anything other than the 26 alphabets
* of English. If the input word contains more than 30 letters then only
* the extra letters will be discarded . For general purpose usage of
* English it does not make any sense to use a word larger than this size.
* Nearly every general purpose word can be expressed in a word with less
* than or equal to 30 letters.
*
* version 1.1
*
*/


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>


enum MAXSIZE { WORDSIZE = 30 };

int getword( char *, int );


int main( void )
{
char ac[WORDSIZE];

if( getword( ac, WORDSIZE ) )
{
printf("%s\n", ac);
}

return EXIT_SUCCESS;

}


int getword( char *word, int max_length )
{
int c;
char *w = word;


while( isspace( c = getchar() ) )
{
;
}

I find { ; } a messy way of saying nothing, but that is a style
point. More important, if this will be used to read more than one
word (eventually) you need to skip anything that you don't count as a
word character, not just spaces.
while( --max_length )
{
if( isalpha( c ) )
{
*w++ = c;
}
else if( c == '\n' || c == EOF || isspace( c ) )
{
*w = '\0';
break;
}
else
{
return 0;

When the word ends because of this condition, why do you return 0
rather than the word you have read? You do have a word to return.
}

c = getchar();
}

/* I can simply ignore the if condition and directly write the '\0'
onto the last element because in worst case it will only rewrite
the '\n' that is put in there by else if clause.

I think the comment is confusing. Without the if below, you re-write
a 0 that is already there. A \n is never put into the buffer.
or in else if clause, I could replace break with return word[0].

I thought these 2 ideas will be either inefficient or
a bad programming practice, so I did not do it.
*/
if( *w != '\0' )
{
*w = '\0';
}

I'd just write *w = '\0';
return word[0];

That's a char. Given what you said about conversions and clarity, you
should really write return word[0] != '\0'; or maybe return !!word[0];
 

arnuld

On Tue, 29 Apr 2008 02:47:18 -0700, Nick Keighley wrote:

are all your words the same size?

It depends on the user, what he likes to input at run-time.

If you use the array of pointers you'll have to get the memory
for each word from somewhere (eg. malloc())

yes. I came up with this code and as you can see it does not do what I
want. I want to take every word into the input but it only takes 1st for
obvious reasons. I am not able to think of the way to take all the words
of the input:




#include <stdio.h>
#include <ctype.h>


enum MAXSIZE { MAXWORD = 100 };

char *getword( char *, int );


int main(void) {

    char buffer[MAXWORD];

    getword( buffer, MAXWORD );

    printf("--------------------\n");
    printf("%s\n", buffer);

    return 0;
}



char *getword( char *word, int max )
{
    int c, i;

    i = 0;

    while( isalpha(c = getchar()) && i < max - 1 )
    {
        word[i++] = c;
    }

    /* terminate the string; "word = '\0'" would only null the local
       pointer and leave the buffer unterminated */
    word[i] = '\0';

    return word;
}

============= OUTPUT =================
/home/arnuld/programs/C $ gcc -ansi -pedantic -Wall -Wextra test.c
/home/arnuld/programs/C $ ./a.out
like that
 

arnuld

What about words with other characters like hyphen? What about
constructs like "get_name"? Will you discard them too. What about words
that end with a ; or ...? What about words that contain symbols like #@
etc? Or words that end with an exclamation mark? Or words within
parenthesis or braces?

all of them will be discarded. Only words containing letters like
"santosh" will be considered, nothing else.



That's one way yes, suitable when you don't know the lengths of words in
advance, or you don't want to possibly waste storage with statically
allocated arrays.

yes, exactly, input will be at run-time only.

Depends on your requirements really, and the type and frequency of input
you expect. Will you put an upper limit on the length of words? It
hardly makes sense to accept words longer than about 64 characters if
you are dealing with normal English text.

ok, make the upper limit to 64 :) , I usually take it 100 as my style.


Static 2D arrays are
undoubtedly easier to work with but are less flexible than dynamically
allocated arrays. Since statically allocated arrays are of fixed size
it's possible for some elements to remain unused and hence wasted. OTOH
a large number of small allocations may lead to memory fragmentation and
also some wastage due to malloc bookkeeping and possibly also a slowdown
in speed if you'll be reading a very large number of words from a file.
For input from a human it will not matter.


you want to say that there will be 2 types of implementations if
efficiency is my concern:

1.) input from human
2.) input from a text-file

??

couldn't there be a single implementation for both types of inputs ?

One efficient method is to use a single dynamically allocated array in
which words are stored sequentially. The length of each word could be
specified by either one or two bytes prefixing the word itself. This
results in very efficient storage, but is grossly inefficient if you
want to insert and delete words at random. For this a hash table based
approach is probably the best. OTOH a tree is very convenient for quick
searching and sorting.
If you tell us more details about the type and volume of input you
expect and the facilities (like searching, insertion, etc.) you plan to
implement, perhaps a tailored approach can be suggested.


The basic problem is to sort, count and print the sorted words. We are
not going to save a word in an array if it has already appeared, we will
just increase the count for that word.

K&R2 seems to suggest that a doubly-linked list using binary search is
the most efficient method to use, described in section 6.5 and is already
solved. Actually I am not able to understand the <getword> function of the
authors which actually is different from what I want, hence I need to
create one of my own.
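
For reference, section 6.5 of K&R2 builds a binary tree of word nodes rather than a linked list. The following is an independent sketch of that idea, not the book's code; the names struct node, add_word, print_words and free_tree are made up here. A word that has already appeared is not stored again, only its count goes up.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char *word;                 /* malloc()ed copy of the word */
    int count;                  /* how many times it has appeared */
    struct node *left, *right;  /* smaller words left, larger right */
};

/* Adds word to the tree rooted at root, or increments its count if it
   is already there. Returns the root of the updated tree. If memory
   runs out, the new word is silently dropped. */
struct node *add_word(struct node *root, const char *word)
{
    int cmp;

    assert(word != NULL);

    if (root == NULL) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        n->word = malloc(strlen(word) + 1);
        if (n->word == NULL) {
            free(n);
            return NULL;
        }
        strcpy(n->word, word);
        n->count = 1;
        n->left = n->right = NULL;
        return n;
    }

    cmp = strcmp(word, root->word);
    if (cmp == 0)
        root->count++;
    else if (cmp < 0)
        root->left = add_word(root->left, word);
    else
        root->right = add_word(root->right, word);
    return root;
}

/* In-order traversal prints the words in sorted order. */
void print_words(const struct node *root)
{
    if (root != NULL) {
        print_words(root->left);
        printf("%4d %s\n", root->count, root->word);
        print_words(root->right);
    }
}

void free_tree(struct node *root)
{
    if (root != NULL) {
        free_tree(root->left);
        free_tree(root->right);
        free(root->word);
        free(root);
    }
}
```

The tree keeps the words sorted as they arrive, so no separate sorting pass is needed before printing. Sorting by count instead of alphabetically would need an extra step.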
 

Ben Bacarisse

arnuld said:
It is for the leading spaces, any white-space, that comes before the
word.

Yes I know what it is for. I was suggesting that you could do
better. If this is all you need, then fine, but the usual goal is
to make flexible functions.
The word does not end here. If the next character we read is something
other than a letter, whitespace or EOF, as in "Ben2" or "usen@et", then
I am going to discard the whole word.

Again, I know that. My reading of the exercise is that the program
would take the input:

Can you count these words?
"Yes, I can".

and report eight words none occurring more than once (Can != can for
the moment). If you want to just stop on punctuation, fine, but that
seems an odd choice. That is all I was saying.
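
To make that reading concrete, here is a small sketch that treats every maximal run of letters as one word and anything else as a separator. The function name count_words is hypothetical, and it works on a string rather than on stdin to keep the example short:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts maximal runs of letters in text; punctuation, digits and
   whitespace all act as separators. */
size_t count_words(const char *text)
{
    size_t n = 0;
    int in_word = 0;        /* 1 while we are inside a run of letters */

    assert(text != NULL);

    for (; *text != '\0'; text++) {
        if (isalpha((unsigned char)*text)) {
            if (!in_word) {
                n++;        /* first letter of a new word */
                in_word = 1;
            }
        } else {
            in_word = 0;
        }
    }
    return n;
}
```

Applied to the two lines of input above, this gives eight: Can, you, count, these, words, Yes, I, can.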
That's a char. Given what you said about conversions and clarity, you
should really write return word[0] != '\0'; or maybe return !!word[0];

I don't understand your point. word[0] is char but the function is
supposed to return an integer and hence there is an implicit conversion
from char to int.

I should have added a smiley. You stated in another message that you
wanted all conversions to be explicit. In that case I'd used an int
where a char was needed. Here, you do the reverse quite happily!

C's implicit conversions are good and there is no need to make them all
explicit. Your return statement is fine just as it is. I'll remember
to make my jokes stand out more!
 

arnuld

I don't have K&R2 so I don't know the end point of this exercise, so I
may have this wrong...

I knew this ;)


I find { ; } a messy way of saying nothing, but that is a style
point. More important, if this will be used to read more than one
word (eventually) you need to skip anything that you don't count as a
word character, not just spaces.

It is for the leading spaces, any white-space, that comes before the
word.



When the word ends because of this condition, why do you return 0 rather
than the word you have read? You do have a word to return.

The word does not end here. If the next character we read is something
other than a letter, whitespace or EOF, as in "Ben2" or "usen@et", then
I am going to discard the whole word.



I'd just write *w = '\0';

ok, fine, will do that.

That's a char. Given what you said about conversions and clarity, you
should really write return word[0] != '\0'; or maybe return !!word[0];


I don't understand your point. word[0] is char but the function is
supposed to return an integer and hence there is an implicit conversion
from char to int. This conversion is useful in the while loop that I am
writing as part of a doubly-linked list program. For the full program see
my other thread titled: "sorting using a doubly-linked list"
 
