Text processing

jacob navia

Following the series about commenting code, here is an installment about
text processing.

What do you think?

Thanks for your attention.

-----------------------------------------------------------------------
Text manipulation

Text files are a widely used format for storing data. They are usually
quite compact (they carry no formatting instructions such as bold,
italics, or other font-related markup) and they are widely portable if
written in the ASCII subset of text data.

A widely used application of text files is program files. Most
programming languages (and here C is not an exception) store the
program in text format.

So let's see a simple application of a text manipulating program.
The task at hand is to prepare a C program text to be translated
into several languages. Obviously, the character string:

"Please enter the file name"

will not be readily comprehensible to a Spanish user. It would
be better if, in Spain, the program showed the character string:

"Entre el nombre del fichero por favor"

To prepare this translation, we need to extract all character
strings from the program text and store them in a table.
Instead of referencing a character string directly, the program
will reference an index into our table. In the above
example the character string would be replaced by

StringTable[6]

To do this transformation we will write into the first line
of our program:

static char *StringTable[];

Then, in each line where a character string appears we will
replace it with an index into the string table.

printf("Please enter the file name");

will become

printf(StringTable[x]);

where "x" will be the index for that string in our table.

At the end of the file we will append the definition of our
string table with:

static char *StringTable[] = {
...,
...,
"Please enter the file name",
...,

NULL
};
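
To see why the table helps with translation, here is a minimal sketch,
assuming one table per language and a small lookup helper. The names
Language, EnglishTable, SpanishTable, Tables and Message are hypothetical
and are not part of the program presented in this article.

```c
#include <stddef.h>

/* Hypothetical language codes, for illustration only. */
enum Language { LANG_EN, LANG_ES, LANG_COUNT };

/* One table per language; the same index means the same message. */
static const char *const EnglishTable[] = {
    "Please enter the file name",
    NULL
};
static const char *const SpanishTable[] = {
    "Entre el nombre del fichero por favor",
    NULL
};

static const char *const *const Tables[LANG_COUNT] = {
    EnglishTable, SpanishTable
};

/* Return message number `index` in language `lang`. */
static const char *Message(enum Language lang, size_t index)
{
    return Tables[lang][index];
}
```

With this layout, the program text only ever mentions an index, and
choosing a language amounts to choosing which table is consulted.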

After some hours of work, we come up with the following solution. We test
it a bit, and it seems to work.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Copies a character constant to stdout. Called after the opening
// quote has been read; returns the first character after the constant.
static int ReadCharConstant(FILE *infile)
{
    int c;
    c = fgetc(infile);
    putchar('\'');
    while (c != EOF && c != '\'') {
        putchar(c);
        if (c == '\\') {
            c = fgetc(infile);
            if (c == EOF)
                return EOF;
            putchar(c);
        }
        c = fgetc(infile);
    }
    if (c != EOF) {
        putchar(c);
        c = fgetc(infile);
    }
    return c;
}

// Copies a /* ... */ comment to stdout. Returns the first character
// after the comment.
static int ReadLongComment(FILE *infile)
{
    int c;
    putchar('/');
    putchar('*');
    c = fgetc(infile);

    do {
        while (c != '*' && c != EOF) {
            putchar(c);
            c = fgetc(infile);
        }
        if (c == '*') {
            putchar(c);
            c = fgetc(infile);
        }
    } while (c != '/' && c != EOF); /* Problem 2 */
    if (c == '/') {
        putchar(c);
        c = fgetc(infile);
    }
    return c;
}

// Copies a // comment to stdout. Returns the newline (or EOF) that
// ends it, without printing it.
static int ReadLineComment(FILE *infile)
{
    int c = fgetc(infile);

    putchar('/'); putchar('/');
    while (c != EOF && c != '\n') {
        putchar(c);
        c = fgetc(infile);
    }
    return c;
}

static char *stringBuffer;
static char *stringBufferPointer;
static char *stringBufferEnd;
static size_t stringBufferSize;
static unsigned stringCount;

#define BUFFER_SIZE 1024

// Writes the definition of the string table and frees the buffer.
static void OutputStrings(void)
{
    char *p = stringBuffer;
    printf("\nstatic char *StringTable[]={\n");
    while (*p) {
        printf("\t\"%s\",\n",p);
        p += strlen(p)+1;
    }
    printf("\tNULL\n};\n");
    free(stringBuffer);
    stringBuffer = NULL;
}

// Appends one character to the string buffer, growing it if needed.
static void PutCharInBuffer(int c)
{
    if (stringBufferPointer == stringBufferEnd) {
        size_t newSize = stringBufferSize + BUFFER_SIZE;
        char *tmp = realloc(stringBuffer,newSize);
        if (tmp == NULL) {
            fprintf(stderr,"Memory exhausted\n");
            exit(EXIT_FAILURE);
        }
        stringBuffer = tmp;
        stringBufferPointer = tmp+stringBufferSize;
        stringBufferSize += BUFFER_SIZE;
        stringBufferEnd = tmp + stringBufferSize;
    }
    *stringBufferPointer++ = c;
}

// Stores a string literal in the buffer and prints a table reference
// in its place. Returns the first character after the closing quote.
static int ReadString(FILE *infile)
{
    int c;
    if (stringBuffer == NULL) {
        stringBuffer = malloc(BUFFER_SIZE);
        if (stringBuffer == NULL)
            return EOF;
        stringBufferPointer = stringBuffer;
        stringBufferEnd = stringBufferPointer+BUFFER_SIZE;
        stringBufferSize = BUFFER_SIZE;
    }
    c = fgetc(infile);
    while (c != EOF && c != '"') {
        PutCharInBuffer(c);
        if (c == '\\') {
            c = fgetc(infile);
            if (c != '\n')
                PutCharInBuffer(c);
        }
        c = fgetc(infile);
    }
    if (c == EOF)
        return EOF;
    PutCharInBuffer(0);
    printf("StringTable[%u]",stringCount);
    stringCount++;
    return fgetc(infile);
}

// Handles the token that starts with c. Returns the next character
// to process.
static int ProcessChar(int c,FILE *infile)
{
    switch (c) {
    case '\'':
        c = ReadCharConstant(infile);
        break;
    case '"':
        c = ReadString(infile);
        break;
    case '/':
        c = fgetc(infile);
        if (c == '*')
            c = ReadLongComment(infile);
        else if (c == '/')
            c = ReadLineComment(infile);
        else
            putchar('/'); // a lone '/': c is handled by the next call
        break;
    case '#':
        // Copy the preprocessor directive verbatim
        putchar(c);
        c = fgetc(infile);
        while (c != EOF && c != '\n') {
            putchar(c);
            c = fgetc(infile);
        }
        if (c == '\n') {
            putchar(c);
            c = fgetc(infile);
        }
        break;
    default:
        putchar(c);
        c = fgetc(infile);
        break;
    }
    return c;
}

int main(int argc,char *argv[])
{
    FILE *infile;

    if (argc < 2) {
        fprintf(stderr,"Usage: strings <file name>\n");
        return EXIT_FAILURE;
    }
    if (!strcmp(argv[1],"-")) {
        infile = stdin;
    } else {
        infile = fopen(argv[1],"r");
        if (infile == NULL) {
            fprintf(stderr,"Can't open %s\n",argv[1]);
            return EXIT_FAILURE;
        }
    }
    int c = fgetc(infile);
    printf("static char *StringTable[];\n");
    while (c != EOF) {
        c = ProcessChar(c,infile);
    }
    // Two extra zeroes mark the end of the buffer
    PutCharInBuffer(0);
    PutCharInBuffer(0);
    OutputStrings();
    return EXIT_SUCCESS;
}


The general structure of this program is simple:
o We open the given file to process.
o We process each character.
o We are interested only in the following tokens:
    Char constants
    Comments
    Character strings
    Preprocessor directives

Why those?
Char constants could contain double quotes, which would lead the other
parts of our program to see strings where there aren't any. For instance:

case'"':

would be misunderstood as the start of a never ending string.

Comments are necessary since we should not process strings in comments.
Preprocessor directives should be ignored since we do NOT want to
translate

#include "myfile.h"

Our string parsing routine stores the contents of each string in a buffer
that is grown if needed, printing into standard output only the

StringTable[x]

instead of the stored string. Each string is finished with a zero, and
after the last string we store additional zeroes to mark the end of
the buffer.

After the whole file is processed we write the contents of the buffer
in the output (written to stdout) and that was it. We have extracted
the strings into a table.
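
The buffer layout described above can be sketched as follows. This is
only an illustration of the "zero after each string, extra zero at the
end" convention; the function name CountPackedStrings is hypothetical.
Note one consequence of the layout: an empty string looks exactly like
the end marker.

```c
#include <stddef.h>
#include <string.h>

/* Count the strings in a buffer laid out as "s1\0s2\0...\0\0".
   The walk stops at the first empty string, so a stored "" cannot
   be told apart from the end-of-buffer marker. */
static size_t CountPackedStrings(const char *buffer)
{
    size_t count = 0;
    const char *p = buffer;

    while (*p != '\0') {
        count++;
        p += strlen(p) + 1;   /* skip the string and its terminator */
    }
    return count;
}
```

For a buffer holding "ab" and "cd" the function reports two strings; for
a buffer holding "ab", "" and "cd" it reports only one.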

Analysis
-------
Our program seems to work, but there are several corner cases that
it doesn't handle at all.

For instance it is legal in C to write:

"String1" "String2"

and this will be understood as

"String1String2"

by the compiler. Our translation makes this into:

StringTable[0] StringTable[1]

which is a syntax error.
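
A small sketch of the rule involved: adjacent string literals are joined
in translation phase 6, so they behave as a single literal. Two adjacent
expressions such as two array references have no such rule, which is why
the replacement does not parse. The names Joined and
JoinedIsConcatenated are made up for this illustration.

```c
#include <string.h>

/* Adjacent string literals are concatenated during translation,
   so this array holds the 14 characters plus the terminating zero. */
static const char Joined[] = "String1" "String2";

/* Returns 1 when the joined literal equals "String1String2". */
static int JoinedIsConcatenated(void)
{
    return strcmp(Joined, "String1String2") == 0;
}
```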

Another weak point is that a string can be present several times in our
table, since we do not check whether the string is already present before
storing it.

And there are many corner cases that are just ignored. For instance you can
continue a single line comment with a backslash, a very bad idea of course
but a legal one. We do not follow comments like these:

// This is a comment \
and this line is a comment too

And, due to the low level of testing, there could be a lot of hidden bugs
in it.

But this should be a simple utility to quickly extract the strings from a
file without too much manual work. We know we do not use the features it
doesn't support, and it will serve our purposes well.

What is important to know is that there is always a point where we stop
developing and decide to move on to something else, either because
we get fed up or because our boss tells us that we should do xxx instead of
continuing the development of an internal utility.

In this case we stop the first development now. See the exercises for
the many ways in which we could improve this simple program.

Exercises:

1: This filter can read from stdin and write to stdout. Add a command line
option to specify the name of an output file. How many changes would you
need to make in the code to implement that?

2: The program can store a string several times. What would be needed to
avoid that? What data structure would you recommend?

3: Implement the concatenation of strings, i.e.

"String1" "String2" --> "String1String2"

4: Seeing in the code

printf(StringTable[21]);

is not very easy to follow. Implement the change so that we would have
instead in the output:

// StringTable[21]--> "Please enter the file name"
printf(StringTable[21]);

i.e. each line would be preceded by one or several comment lines that
describe the strings being used.

5: Add an option so that the name of the string table can be changed from
"StringTable" to some other name. The reason is that a user complained
that the "new" string table destroyed her program: she had a
"StringTable" variable in her program!
How could you do this change automatically?

6: The program needs to be part of an IDE where the IDE will need to
call the program as a routine (not as an independent program).
What would be needed to do that? What do you think about the global
variables used in the original program?
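
As a possible starting point for exercises 2 and 6, here is a minimal
sketch that keeps each distinct string once and holds all state in a
structure instead of global variables. The names StringPool and
InternString are hypothetical. The linear search is only reasonable for
small inputs; a hash table would be one alternative for larger ones.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical table that stores each distinct string only once.
   Initialise it with: struct StringPool pool = {0}; */
struct StringPool {
    char **items;
    size_t count;
    size_t capacity;
};

/* Return the index of `s`, adding a copy if it is not yet present.
   Returns -1 if memory cannot be allocated. */
static long InternString(struct StringPool *pool, const char *s)
{
    size_t i;
    char *copy;

    for (i = 0; i < pool->count; i++) {
        if (strcmp(pool->items[i], s) == 0)
            return (long)i;           /* already stored */
    }
    if (pool->count == pool->capacity) {
        size_t newCap = pool->capacity ? pool->capacity * 2 : 16;
        char **tmp = realloc(pool->items, newCap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        pool->items = tmp;
        pool->capacity = newCap;
    }
    copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, s);
    pool->items[pool->count] = copy;
    return (long)pool->count++;
}
```

Because the table keeps an explicit count, an empty string is stored like
any other string.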
 
Mark Storkamp

jacob navia said:
A widely used application of text files are program files. Most
programming languages (and here C is not an exception) store the
program in text format.

I'm not sure who your target audience is, but I find this to be less
precise than I believe it should be. I think most people, when they
think of programs, are thinking of the executable binaries, not the
source code. While some languages, like BASIC, and other interpreted
languages do indeed save their programs as text files, C and most others
do not.
 
Kleuskes & Moos

Following the series about commenting code, here is an installment about
text processing.

What do you think?

Thanks for your attention.
<snip>

Without actually delving into the code, i notice the attempt is pretty much
useless since many, nay, most languages do limit themselves to the ascii-
standard. Even for an example program, this is a serious omission if you're
claiming to do machine translations.

-------------------------------------------------------------------------------
_______________________________________
/ Yow! Did something bad happen or am I \
\ in a drive-in movie?? /
---------------------------------------
\
\
___
{~._.~}
( Y )
()~*~()
(_)-(_)
-------------------------------------------------------------------------------
 
BartC

Mark Storkamp said:
I'm not sure who your target audience is, but I find this to be less
precise than I believe it should be. I think most people, when they
think of programs, are thinking of the executable binaries, not the
source code. While some languages, like BASIC, and other interpreted
languages do indeed save their programs as text files, C and most others
do not.

I think he means program source code.
 
jacob navia

On 26/09/11 16:47, BartC wrote:
I think he means program source code.

Well, even a cursory reading of the article would let you know
that I am aiming at C source code text...
 
jacob navia

On 26/09/11 15:34, Mark Storkamp wrote:
I'm not sure who your target audience is, but I find this to be less
precise than I believe it should be. I think most people, when they
think of programs, are thinking of the executable binaries, not the
source code. While some languages, like BASIC, and other interpreted
languages do indeed save their programs as text files, C and most others
do not.

Yes, I will add "source code" explicitly.

Thanks
 
jacob navia

On 26/09/11 14:18, Ben Bacarisse wrote:
> Typo: string.h
>
> There are several other than just adjacent string literals:
>
> 1. wide strings

True, they are not supported. I will mention that.

> 2. strings used in initialisers (two cases: char s[] = "a"; and
> char *s = "a"; at file scope)

The second case would work

char *a = StringTable[12];

The first not. I will mention that. The solution would be to do this
only in functions.

> 3. escaped newlines

They are supported (modulo bugs...)

> 4. _Pragma("abc")

Completely forgot. Thanks

> It's probably reasonable to ignore trigraphs and digraphs (maybe even
> _Pragma) but the others are significant, I think.

You are right

> You also have two bugs. Empty strings don't work (do you test? that was
> my second test case) to the extent that they can cause the resulting
> program to access beyond the end of the string table.

I will look at it. I missed that case. I tested with all
the source of the container library...

> And, second, the
> output does not compile because the initial
>
> static char *StringTable[];
>
> is a tentative definition of an object with incomplete type (that's a
> constraint violation).

The output WILL compile under gcc/Macintosh OSX.

What system are you using?

> From a high-level design point of view, printf strings present a
> problem. The order in which components are printed may have to change
> from language to language, but printf's arguments are in a fixed order.

<snip>


Thanks Ben
 
BartC

Ben Bacarisse said:
> There are several other than just adjacent string literals:

And fopen("file") perhaps. Maybe other cases where you don't want to
translate.

> From a high-level design point of view, printf strings present a
> problem. The order in which components are printed may have to change
> from language to language, but printf's arguments are in a fixed order.

That's one problem. My experience is that translation of messages in a
program is always a lot trickier than it seems at first.
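
To make the ordering problem concrete, here is a minimal sketch that
assumes a POSIX system, where printf-family functions accept numbered
conversions such as %1$s (this is a POSIX extension, not ISO C). The
second format is only a reordered English stand-in for a translation,
and the names FormatEnglish, FormatReordered and FormatCopyMessage are
hypothetical.

```c
#include <stdio.h>

/* Two format strings for the same message. The second one prints
   its arguments in the opposite order. */
static const char FormatEnglish[]   = "Copying %1$s to %2$s";
static const char FormatReordered[] = "To %2$s: copying %1$s";

/* Format the message into `out`. The arguments are always passed in
   the same order; only the format string decides the printed order.
   Numbered conversions ("%n$") are a POSIX extension, not ISO C. */
static int FormatCopyMessage(char *out, size_t size, const char *format,
                             const char *file, const char *dir)
{
    return snprintf(out, size, format, file, dir);
}
```

With plain %s conversions the call site itself would have to change for
each language, which a simple table of strings cannot express.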

And the approach is simplistic. For example, when the same string occurs in
several places, is it repeated in the string table? That will be more work
to translate. But then identical strings can also have different
translations depending on context. Or abbreviations such as "R", (meaning
Red), or "s" to be added to make a plural, which in isolation in a table
will be quite puzzling!

But I got the impression this is just a programming tutorial. Coders who are
at this level probably will not have internationalisation of their programs
as a priority!
 
Nobody

I'm not sure who your target audience is, but I find this to be less
precise than I believe it should be. I think most people, when they
think of programs, are thinking of the executable binaries, not the
source code. While some languages, like BASIC, and other interpreted
languages do indeed save their programs as text files, C and most others
do not.

Another counter-example is that many older dialects of BASIC *didn't*
store their programs as text files, but used a tokenised format for the
sake of space and performance.
 
Ben Bacarisse

jacob navia said:
> On 26/09/11 14:18, Ben Bacarisse wrote:
>> Typo: string.h
>>
>> There are several other than just adjacent string literals:
>>
>> 1. wide strings
>
> True, they are not supported. I will mention that.
>
>> 2. strings used in initialisers (two cases: char s[] = "a"; and
>> char *s = "a"; at file scope)
>
> The second case would work
>
> char *a = StringTable[12];

I don't think so. Does it not contravene 6.6 p9?

> The first not. I will mention that. The solution would be to do this
> only in functions.

That's a solution for the second case not the first -- and you don't
think there's a problem with it.

> They are supported (modulo bugs...)

OK, then I found bugs in every case I tried. Maybe we are talking at
cross-purposes. I mean a \ at the end of a source line.

>> 4. _Pragma("abc")
>
> Completely forgot. Thanks
>
>> It's probably reasonable to ignore trigraphs and digraphs (maybe even
>> _Pragma) but the others are significant, I think.
>
> You are right
>
>> You also have two bugs. Empty strings don't work (do you test? that was
>> my second test case) to the extent that they can cause the resulting
>> program to access beyond the end of the string table.
>
> I will look at it. I missed that case. I tested with all
> the source of the container library...
>
>> And, second, the
>> output does not compile because the initial
>>
>> static char *StringTable[];
>>
>> is a tentative definition of an object with incomplete type (that's a
>> constraint violation).
>
> The output WILL compile under gcc/Macintosh OSX.
>
> What system are you using?

gcc 4.5.2 but I am not sure that really matters. Is what I said not
right about it being a tentative definition of an object with incomplete
type?

<snip>
 
Ben Pfaff

jacob navia said:
Following the series about commenting code, here is an installment about
text processing.

What do you think?

There is existing software that does a pretty good job with
string translations (e.g. GNU gettext). I don't know whether you
are actually writing new software that also handles this, or
whether you are just pointing out a way that it can be done to
people new to the topic. If it's the former, it seems somewhat
wasteful (what's inadequate about current attempts?); if it's the
latter, that makes some sense to me.
 
jacob navia

On 26/09/11 20:11, Ben Pfaff wrote:
There is existing software that does a pretty good job with
string translations (e.g. GNU gettext). I don't know whether you
are actually writing new software that also handles this, or
whether you are just pointing out a way that it can be done to
people new to the topic. If it's the former, it seems somewhat
wasteful (what's inadequate about current attempts?); if it's the
latter, that makes some sense to me.

In the context of my IDE I have written better software than that.
The objective here is to present simple code that does some
filtering, and to discuss its problems and ways to improve it with the
help of actual code.

This improves the S/N ratio of this group.
 
Harald van Dijk

And, second, the
output does not compile because the initial

  static char *StringTable[];

is a tentative definition of an object with incomplete type (that's a
constraint violation).

6.9.2p3 says the declared type of a tentative definition with internal
linkage must not be an incomplete type, but it isn't a constraint,
which matters because it means compilers are not required to issue any
diagnostics. And I wonder if that is really meant to apply to the
declared type, rather than the composite type for the final implicit
definition mentioned in p2. Compilers are already required to accept
"int array[]; int array[20] = {1};" -- without the static keyword --
and they would surely need to treat the static keyword specially to
reject it if present.
 
Keith Thompson

Kleuskes & Moos said:
Without actually delving into the code, i notice the attempt is pretty much
useless since many, nay, most languages do limit themselves to the ascii-
standard.
[...]

I think you accidentally a word there.
 
Harald van Dijk

And, second, the
output does not compile because the initial
  static char *StringTable[];
is a tentative definition of an object with incomplete type (that's a
constraint violation).

6.9.2p3 says the declared type of a tentative definition with internal
linkage must not be an incomplete type, but it isn't a constraint,
which matters because it means compilers are not required to issue any
diagnostics. And I wonder if that is really meant to apply to the
declared type, rather than the composite type for the final implicit
definition mentioned in p2. Compilers are already required to accept
"int array[]; int array[20] = {1};" -- without the static keyword --
and they would surely need to treat the static keyword specially to
reject it if present.

I feel I should add that the standard does clearly and unambiguously
disallow this, so regardless of the intent, programs should avoid this
construct.
 
jacob navia

On 26/09/11 22:33, Harald van Dijk wrote:
And, second, the
output does not compile because the initial
static char *StringTable[];
is a tentative definition of an object with incomplete type (that's a
constraint violation).

6.9.2p3 says the declared type of a tentative definition with internal
linkage must not be an incomplete type, but it isn't a constraint,
which matters because it means compilers are not required to issue any
diagnostics. And I wonder if that is really meant to apply to the
declared type, rather than the composite type for the final implicit
definition mentioned in p2. Compilers are already required to accept
"int array[]; int array[20] = {1};" -- without the static keyword --
and they would surely need to treat the static keyword specially to
reject it if present.

I feel I should add that the standard does clearly and unambiguously
disallow this, so regardless of the intent, programs should avoid this
construct.

You are right.

One way of getting rid of the problem is to drop the static storage-class
specifier, but that would surely mean a link error when the table is used
in many files.

Since I write into stdout there isn't the possibility of rewinding to
add the size of the table after we know how many strings there are.

Mmm, there must be a solution but it doesn't come immediately. I think
I have to sleep, I had a hell of a day today at the job.

jacob
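
One possible way around the problem, not proposed in the thread, is to
have the filter emit a forward declaration of a static accessor function
instead of the incomplete array, and to reference strings through that
function. A static function may be declared before it is defined, so no
incomplete object type is involved. The names GetString and Prompt are
hypothetical, and the sketch shows what the generated output could look
like rather than the filter itself. File-scope initialisers would still
not work, since a function call is not a constant expression.

```c
#include <stddef.h>

/* Emitted at the top of the output: a plain function declaration,
   which does not depend on the size of the table. */
static const char *GetString(size_t index);

/* ... the translated program text, using GetString(n) ... */
static const char *Prompt(void)
{
    return GetString(0);
}

/* Emitted at the end of the output, once all strings are known. */
static const char *const StringTable[] = {
    "Please enter the file name",
    NULL
};

static const char *GetString(size_t index)
{
    return StringTable[index];
}
```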
 
Ben Bacarisse

Harald van Dijk said:
And, second, the
output does not compile because the initial

  static char *StringTable[];

is a tentative definition of an object with incomplete type (that's a
constraint violation).

6.9.2p3 says the declared type of a tentative definition with internal
linkage must not be an incomplete type, but it isn't a constraint,
which matters because it means compilers are not required to issue any
diagnostics. And I wonder if that is really meant to apply to the
declared type, rather than the composite type for the final implicit
definition mentioned in p2. Compilers are already required to accept
"int array[]; int array[20] = {1};" -- without the static keyword --
and they would surely need to treat the static keyword specially to
reject it if present.

Yes, it's very odd. I assume there is some advantage in knowing the
complete type of objects with internal linkage at the get go. I can't
think of one, though.

However, I don't think 6.9.2p3 makes much sense if the final composite
type is assumed to be the intended meaning, because I don't think the
final composite type *can* be incomplete. For example, a translation
unit with nothing other than

int array[];

is fine and causes the type of array to be int [1] by the end.
 
Harald van Dijk

However, I don't think 6.9.2p3 makes much sense if the final composite
type is assumed to be the intended meaning, because I don't think the
final composite type *can* be incomplete. For example, a translation
unit with nothing other than

  int array[];

is fine and causes the type of array to be int [1] by the end.

I was thinking of incomplete structure and union types.
 
Ben Bacarisse

Harald van Dijk said:
However, I don't think 6.9.2p3 makes much sense if the final composite
type is assumed to be the intended meaning, because I don't think the
final composite type *can* be incomplete. For example, a translation
unit with nothing other than

  int array[];

is fine and causes the type of array to be int [1] by the end.

I was thinking of incomplete structure and union types.

I'd considered that and ruled them out. Composite types must be
compatible, and an incomplete struct type is not compatible with a
complete one declared in the same translation unit (I think!).
 
Harald van Dijk

Harald van Dijk said:
However, I don't think 6.9.2p3 makes much sense if the final composite
type is assumed to be the intended meaning, because I don't think the
final composite type *can* be incomplete. For example, a translation
unit with nothing other than
  int array[];
is fine and causes the type of array to be int [1] by the end.
I was thinking of incomplete structure and union types.

I'd considered that and ruled them out.  Composite types must be
compatible, and an incomplete struct type is not compatible with a
complete one declared in the same translation unit (I think!).

An incomplete struct type is completed by a struct definition with the
same tag in the same scope, see 6.7.2.3p4 for the official wording and
p12 for an example.

struct X x; /* tentative definition with incomplete type */
struct X { int a; }; /* completed here */
 
