The infamous ^Z problem

Eigenvector

I've been surfing the FAQ and Google for about a week and haven't quite
figured out this one.

I have a file that changes on a periodic basis and every once in a while
^Zs will appear in the file for reasons I don't want to get into. I need to
get rid of those ^Zs and need to do it in C code, as it is the only tool
available to me that can handle the file size.

So I cooked up some code and tried it out on one platform, where it works
great. It doesn't work so great on another, and I am trying to understand
why. I did my best to code to the standard, but perhaps that is where I'm
failing.

#include <stdio.h>
int main(int argc, char *argv[])
{
FILE *infile, *outfile;
int c; /*picked that up from the FAQ */
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary part up
from this google group*/
{
printf("Cannot open file\n");
exit(1);
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)
{
printf("Cannot open output file\n");
exit(1);
}
while ((c=fgetc(infile)) != EOF )
{
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than one
character defined for type char */
{
c='_'; /*replace bad control char with something innocuous */
}
fputs(c,outfile);
}
fclose(infile);
fclose(outfile);
}
Yeah, it's pretty primitive code, but I'm more interested in getting the
basics working before I go in and optimize the way it handles the input
file. This compiles on xlC and HP's ANSI C compilers.

In the first if statement dealing with the ^Z, the program doesn't detect
the control characters in the file, in the second statement the compiler
complains about syntax. If I set c as typecast char, it finds the control
characters, replaces them, but then blows away the EOF character and nukes
the file.

I have the suspicion that it's the way I'm defining the c==\0x1a that is
leading me astray here. I can't find any good consistent documentation on
exactly how to represent hex or octal in C code or string/character
operations.
 
Terminal Crazy

if(c == 0x1a) /* This is where I'm having a problem */

/* if(c == '\0x1a') This fails with compiler error - more than one
character defined for type char */

don't use the single quotes around the value...
0x1a is an int not a char
c is an int also.

HTH
 
Keith Thompson

Terminal Crazy said:
/* if(c == '\0x1a') This fails with compiler error - more than one
character defined for type char */

don't use the single quotes around the value...
0x1a is an int not a char
c is an int also.

You're using the wrong syntax for a hexadecimal escape in a character
literal. You want '\x1a', not '\0x1a'. 0x1a (without the quotation
marks) will also work, but '\x1a' makes it clearer that you're dealing
with a character.
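
To make the equivalence concrete, here is a minimal sketch. The helper names is_ctrl_z and check_ctrl_z_spellings are made up for illustration and are not from the original program. It only checks that the different spellings all denote the value 26; that 26 means ^Z is an ASCII assumption.

```c
#include <assert.h>

/* Hypothetical helper: true when c is the Ctrl-Z byte (value 26). */
int is_ctrl_z(int c)
{
    return c == '\x1a';
}

/* Self-check: every spelling below denotes the same value, 26. */
void check_ctrl_z_spellings(void)
{
    assert('\x1a' == 0x1a);    /* hex escape vs. hex integer constant */
    assert('\x1a' == '\032');  /* hex escape vs. octal escape */
    assert('\x1a' == 26);      /* hex escape vs. decimal constant */
    assert(is_ctrl_z(0x1a));
    assert(!is_ctrl_z('\0'));  /* a lone '\0' is the null character, not ^Z */
}
```

Calling check_ctrl_z_spellings() should return silently; an assert would only fire if one of the spellings denoted a different value.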
 
Old Wolf

if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary part up
from this google group*/
{
printf("Cannot open file\n");
exit(1);
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)

Open mode should be "w+b". You want to write it the
same way you read it.
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than one
character defined for type char */

There are four characters in that constant: '\0', 'x',
'1', and 'a'. I think you mean '\x1a', although the
uncommented code is also correct and does the same thing.
Yeah it's a pretty primitive code, but I'm more interested in getting the
basics working before I go in and optimize the way it handles the input
file. This compiles on xlC and HP's ANSI C compilers.

Out of interest, how were you planning on optimising
this? (I think you'll find that reading in a block
at a time won't gain you anything).
In the first if statement dealing with the ^Z, the program doesn't detect
the control characters in the file, in the second statement the compiler
complains about syntax. If I set c as typecast char, it finds the control
characters, replaces them, but then blows away the EOF character and nukes
the file.

It doesn't seem possible that your posted code won't
find the 0x1a characters. There must be some other
problem, e.g. this isn't your real code, or the
non-binary output is munging up.
I can't find any good consistent documentation on exactly
how to represent hex or octal in c code or string/character
operations.

Try the C Standard, or "The C Programming Language"
by Kernighan & Ritchie.
 
Keith Thompson

Eigenvector said:
I've been surfing the FAQ and Google for about a week and haven't
quite figured out this one.

I have a file that changes on a periodic basis and every once and a
while ^Zs will appear in the file for reasons I don't want to get
into. I need to get rid of those ^Z's and need to do it via a C code
as it is the only tool available to me that can handle the file size.

So I cooked up some code, tried it out on one platform - and it works
great, it doesn't work so great on another and I am trying to
understand why. I did my best to code standard but perhaps that is
where I'm failing.

#include <stdio.h>
int main(int argc, char *argv[])
{
FILE *infile, *outfile;
int c; /*picked that up from the FAQ */
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary
part up from this google group*/

What if argv[1] doesn't exist? Check the value of argc.
{
printf("Cannot open file\n");

Error messages are traditionally written to stderr rather than stdout.

The only portable values for the argument to exit() are 0,
EXIT_SUCCESS, and EXIT_FAILURE. In this case, I'd recommend using
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)

You opened the input file in binary mode, "rb", which seems correct,
but you opened the output file in text mode *and* update mode, even
though you only write to it. For consistency, use "wb" (write-only,
binary mode).
{
printf("Cannot open output file\n");
exit(1);
}
while ((c=fgetc(infile)) != EOF )
{
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than
one character defined for type char */

0x1a should work. '\x1a' is equivalent and probably clearer.

(A compiler *could* accept '\0x1a', but it doesn't mean what you think
it means. The \0 represents a null character, and it's followed by
characters 'x', '1', and 'a'. Multi-character character literals are
legal, but their meaning is implementation-defined; they're hardly
ever useful.)
{
c='_'; /*replace bad control char with something innocuous */
}
fputs(c,outfile);
}
fclose(infile);
fclose(outfile);

"return 0;" or "exit(0);".
}
Yeah it's a pretty primitive code, but I'm more interested in getting
the basics working before I go in and optimize the way it handles the
input file. This compiles on xlC and HP's ANSI C compilers.

In the first if statement dealing with the ^Z, the program doesn't
detect the control characters in the file,

I don't know why it would cause that problem. I suspect you may be
misinterpreting the symptoms, but it's hard to tell.
in the second statement the
compiler complains about syntax. If I set c as typecast char, it
finds the control characters, replaces them, but then blows away the
EOF character and nukes the file.

I have the suspicion that its the way I'm defining the c==\0x1a that
is leading my astray here. I can't find any good consistent
documentation on exactly how to represent hex or octal in c code or
string/character operations.

Really? Any decent C reference should explain that. If nothing else,
you can get the latest draft of the C standard at
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf>; see
sections 6.4.4.4 and 6.4.5.
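
Pulling the points above together, the copy loop could be sketched roughly like this. The function name replace_ctrl_z and its return convention are assumptions for illustration, not the original poster's code. It expects streams that the caller has already opened in binary mode ("rb" for input, "wb" for output).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy in to out, replacing each Ctrl-Z byte with '_'.
   Returns the number of replacements, or -1 on a read or write error. */
long replace_ctrl_z(FILE *in, FILE *out)
{
    long replaced = 0;
    int c;  /* int, not char: EOF must stay distinct from every byte value */

    assert(in != NULL && out != NULL);  /* caller passes already-open streams */

    while ((c = fgetc(in)) != EOF) {
        if (c == '\x1a') {
            c = '_';
            replaced++;
        }
        if (fputc(c, out) == EOF)  /* fputc writes one character; fputs takes a string */
            return -1;
    }
    return ferror(in) ? -1 : replaced;
}
```

A main() built on this would check argc before touching argv[1], report failures on stderr, and return EXIT_FAILURE (from <stdlib.h>) when either fopen fails or the helper returns -1.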
 
Eigenvector

Keith Thompson said:
Eigenvector said:
I've been surfing the FAQ and Google for about a week and haven't
quite figured out this one.

I have a file that changes on a periodic basis and every once and a
while ^Zs will appear in the file for reasons I don't want to get
into. I need to get rid of those ^Z's and need to do it via a C code
as it is the only tool available to me that can handle the file size.

So I cooked up some code, tried it out on one platform - and it works
great, it doesn't work so great on another and I am trying to
understand why. I did my best to code standard but perhaps that is
where I'm failing.

#include <stdio.h>
int main(int argc, char *argv[])
{
FILE *infile, *outfile;
int c; /*picked that up from the FAQ */
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary
part up from this google group*/

What if argv[1] doesn't exist? Check the value of argc.
{
printf("Cannot open file\n");

Error messages are traditionally written to stderr rather than stdout.

The only portable values for the argument to exit() are 0,
EXIT_SUCCESS, and EXIT_FAILURE. In this case, I'd recommend using
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)

You opened the input file in binary mode, "rb", which seems correct,
but you opened the output file in text mode *and* update mode, even
though you only write to it. For consistency, use "wb" (write-only,
binary mode).

I won't argue the advantages of reading and writing cleanly, although you
are certainly correct here. I'm just trying to pound out something that
will work - more concept than production code. Although I will take your
suggestions to heart.
0x1a should work. '\x1a' is equivalent and probably clearer.

Okay, I see now where I went wrong. \0 is for octal representation rather
than hex. Let me go back and try the '\x1a' and see if I do better.

(A compiler *could* accept '\0x1a', but it doesn't mean what you think
it means. The \0 represents a null character, and it's followed by
characters 'x', '1', and 'a'. Multi-character character literals are
legal, but their meaning is implementation-defined; they're hardly
ever useful.)


"return 0;" or "exit(0);".


<OT>Since you're using Unix-like systems, "man tr".</OT>

Actually `tr` absolutely doesn't work here, the ^Z is its death (same with
sed, batch VI, and a host of other shell related commands), but I won't
discuss that here. Besides I will at some point need to port this to
Windoze.
I don't know why it would cause that problem. I suspect you may be
misinterpreting the symptoms, but it's hard to tell.

Agreed it's hard to diagnose code over a newsgroup. In the code I have
working I put a puts() statement in the if branch to output whenever the
conditional was met. When I use the 0x1a notation the if conditional is
never accessed, although the program completes normally.
 
Eigenvector

Old Wolf said:
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary part
up
from this google group*/
{
printf("Cannot open file\n");
exit(1);
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)

Open mode should be "w+b". You want to write it the
same way you read it.
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than
one
character defined for type char */

There are four characters in that constant: '\0', 'x',
'1', and 'a'. I think you mean '\x1a', although the
uncommented code is also correct and does the same thing.
Yeah it's a pretty primitive code, but I'm more interested in getting the
basics working before I go in and optimize the way it handles the input
file. This compiles on xlC and HP's ANSI C compilers.

Out of interest, how were you planning on optimising
this? (I think you'll find that reading in a block
at a time won't gain you anything).
In the first if statement dealing with the ^Z, the program doesn't detect
the control characters in the file, in the second statement the compiler
complains about syntax. If I set c as typecast char, it finds the
control
characters, replaces them, but then blows away the EOF character and
nukes
the file.

It doesn't seem possible that your posted code won't
find the 0x1a characters. There must be some other
problem, e.g. this isn't your real code, or the
non-binary output is munging up.

This is the real code, albeit definitely primitive. I would have thought it
would have found the ^Zs too, but on a SINGLE platform it doesn't. I trust
that the platform is ANSI compliant, so that tells me that my problems lie
with the code ultimately.
 
Keith Thompson

Eigenvector said:
Actually `tr` absolutely doesn't work here, the ^Z is its death (same with
sed, batch VI, and a host of other shell related commands), but I won't
discuss that here. Besides I will at some point need to port this to
Windoze.
[...]

Since your original program *should* have worked, and since the "tr"
command does work for me, I'm beginning to suspect that the characters
in your file aren't what you think they are.

The "^Z" character is 26 decimal, '\032' octal, or '\x1a' hexadecimal
(that's ASCII-specific). How do you know that's what's in your file?

This works for me on a Unix system:

tr '\032' _ < tmp.txt

(the quotation marks are necessary).

Try writing a program that prints (say, in decimal) the value of any
non-printable character; you can use the isprint() function, declared
in <ctype.h>. You can also exclude '\n' characters. If you get
values other than 26 (154, maybe?), that's probably your problem.
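
A minimal sketch of such a diagnostic follows. The function name report_nonprintable and the "offset N: value" output format are assumptions for illustration. It reads from a stream the caller has opened in binary mode and reports every byte that is neither printable nor '\n'.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: print the offset and decimal value of every byte
   that is neither printable nor '\n'. Returns the number of bytes reported. */
long report_nonprintable(FILE *in, FILE *out)
{
    long offset = 0;
    long reported = 0;
    int c;  /* fgetc returns an unsigned char value or EOF, so isprint(c) is safe */

    assert(in != NULL && out != NULL);  /* caller passes already-open streams */

    while ((c = fgetc(in)) != EOF) {
        if (c != '\n' && !isprint(c)) {
            fprintf(out, "offset %ld: %d\n", offset, c);
            reported++;
        }
        offset++;
    }
    return reported;
}
```

Passing stdout as the second argument prints the report directly. If the suspect bytes show up as 26, they really are ^Z characters.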
 
Joachim Schmitz

Eigenvector said:
Old Wolf said:
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary
part up
from this google group*/
{
printf("Cannot open file\n");
exit(1);
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)

Open mode should be "w+b". You want to write it the
same way you read it.
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than
one
character defined for type char */

There are four characters in that constant: '\0', 'x',
'1', and 'a'. I think you mean '\x1a', although the
uncommented code is also correct and does the same thing.
Yeah it's a pretty primitive code, but I'm more interested in getting
the
basics working before I go in and optimize the way it handles the input
file. This compiles on xlC and HP's ANSI C compilers.

Out of interest, how were you planning on optimising
this? (I think you'll find that reading in a block
at a time won't gain you anything).
In the first if statement dealing with the ^Z, the program doesn't
detect
the control characters in the file, in the second statement the compiler
complains about syntax. If I set c as typecast char, it finds the
control
characters, replaces them, but then blows away the EOF character and
nukes
the file.

It doesn't seem possible that your posted code won't
find the 0x1a characters. There must be some other
problem, e.g. this isn't your real code, or the
non-binary output is munging up.

This is the real code, albeit definitely primitive. I would have thought
it would have found the ^Zs too, but on a SINGLE platform it doesn't. I
trust that the platform is ANSI compliant, so that tells me that my
problems lie with the code ultimately.
^Z is typically used for job control, in shells that support it, to suspend
the foreground job, so the shell would consume it before your program gets a
chance... check the output of 'stty -a'

This is of course OT here and also I might be utterly wrong...

Bye, Jojo
 
Keith Thompson

Joachim Schmitz said:
^Z is typically used for job control, in shells that support it, to suspend
the foreground job, so the shell would consume it before your program gets a
chance... check the output of 'stty -a'

This is of course OT here and also I might be utterly wrong...

That doesn't apply when reading from a file, as the OP is doing.
 
Eigenvector

Keith Thompson said:
Eigenvector said:
Actually `tr` absolutely doesn't work here, the ^Z is its death (same
with
sed, batch VI, and a host of other shell related commands), but I won't
discuss that here. Besides I will at some point need to port this to
Windoze.
[...]

Since your original program *should* have worked, and since the "tr"
command does work for me, I'm beginning to suspect that the characters
in your file aren't what you think they are.

The "^Z" character is 26 decimal, '\032' octal, or '\x1a' hexadecimal
(that's ASCII-specific). How do you know that's what's in your file?

This works for me on a Unix system:

tr '\032' _ < tmp.txt

(the quotation marks are necessary).

Try writing a program that prints (say, in decimal) the value of any
non-printable character; you can use the isprint() function, declared
in <ctype.h>. You can also exclude '\n' characters. If you get
values other than 26 (154, maybe?), that's probably your problem.


Well thanks for the tips all. I believe the solution was how I was defining
\x1a. Once I got the syntax on it correct using '\x1a' the code on that
remaining system worked.

Frankly, I'm stunned at how fast the program works on the huge files I
have to process - much, much faster than the built-in OS code.
 
Keith Thompson

Eigenvector said:
Well thanks for the tips all. I believe the solution was how I was defining
\x1a. Once I got the syntax on it correct using '\x1a' the code on that
remaining system worked.

That's very surprising. The code you originally posted used the
integer constant 0x1a (without quotation marks), which should have
worked. <OT>The "tr" command should also have worked for you.</OT>
There were some other problems in your code (which were already
pointed out), but I don't think any of them should have prevented it
from working.

But in any case, I'm glad your problem is solved.
 
Bill Latvin

I've been surfing the FAQ and Google for about a week and haven't quite
figured out this one.

I have a file that changes on a periodic basis and every once and a while
^Zs will appear in the file for reasons I don't want to get into. I need to
get rid of those ^Z's and need to do it via a C code as it is the only tool
available to me that can handle the file size.

So I cooked up some code, tried it out on one platform - and it works great,
it doesn't work so great on another and I am trying to understand why. I
did my best to code standard but perhaps that is where I'm failing.

#include <stdio.h>
int main(int argc, char *argv[])
{
FILE *infile, *outfile;
int c; /*picked that up from the FAQ */
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary part up
from this google group*/
{
printf("Cannot open file\n");
exit(1);
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)
{
printf("Cannot open output file\n");
exit(1);
}
while ((c=fgetc(infile)) != EOF )
{
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than one
character defined for type char */
{
c='_'; /*replace bad control char with something innocuous */
}
fputs(c,outfile);
}
fclose(infile);
fclose(outfile);
}
Yeah it's a pretty primitive code, but I'm more interested in getting the
basics working before I go in and optimize the way it handles the input
file. This compiles on xlC and HP's ANSI C compilers.

In the first if statement dealing with the ^Z, the program doesn't detect
the control characters in the file, in the second statement the compiler
complains about syntax. If I set c as typecast char, it finds the control
characters, replaces them, but then blows away the EOF character and nukes
the file.

I have the suspicion that its the way I'm defining the c==\0x1a that is
leading my astray here. I can't find any good consistent documentation on
exactly how to represent hex or octal in c code or string/character
operations.

It didn't compile cleanly with xlC for me until I changed this:

if ( (infile = fopen(argv[1], "rb") == NULL)
to this:
if ( (infile = fopen(argv[1], "rb") ) == NULL)

and this:
fputs(c,outfile);
to this:
fputc(c,outfile);

Then it compiled, and worked correctly.

Bill
 
Eigenvector

Bill Latvin said:
I've been surfing the FAQ and Google for about a week and haven't quite
figured out this one.

I have a file that changes on a periodic basis and every once and a while
^Zs will appear in the file for reasons I don't want to get into. I need
to
get rid of those ^Z's and need to do it via a C code as it is the only
tool
available to me that can handle the file size.

So I cooked up some code, tried it out on one platform - and it works
great,
it doesn't work so great on another and I am trying to understand why. I
did my best to code standard but perhaps that is where I'm failing.

#include <stdio.h>
int main(int argc, char *argv[])
{
FILE *infile, *outfile;
int c; /*picked that up from the FAQ */
if ( (infile = fopen(argv[1], "rb") == NULL) /*picked the binary part
up
from this google group*/
{
printf("Cannot open file\n");
exit(1);
}
if ( (outfile = fopen("Clean_file", "w+")) == NULL)
{
printf("Cannot open output file\n");
exit(1);
}
while ((c=fgetc(infile)) != EOF )
{
if(c == 0x1a) /* This is where I'm having a problem */
/* if(c == '\0x1a') This fails with compiler error - more than
one
character defined for type char */
{
c='_'; /*replace bad control char with something innocuous */
}
fputs(c,outfile);
}
fclose(infile);
fclose(outfile);
}
Yeah it's a pretty primitive code, but I'm more interested in getting the
basics working before I go in and optimize the way it handles the input
file. This compiles on xlC and HP's ANSI C compilers.

In the first if statement dealing with the ^Z, the program doesn't detect
the control characters in the file, in the second statement the compiler
complains about syntax. If I set c as typecast char, it finds the control
characters, replaces them, but then blows away the EOF character and nukes
the file.

I have the suspicion that its the way I'm defining the c==\0x1a that is
leading my astray here. I can't find any good consistent documentation on
exactly how to represent hex or octal in c code or string/character
operations.

It didn't compile cleanly with xlC for me until I changed this:

if ( (infile = fopen(argv[1], "rb") == NULL)
to this:
if ( (infile = fopen(argv[1], "rb") ) == NULL)

and this:
fputs(c,outfile);
to this:
fputc(c,outfile);

Then it compiled, and worked correctly.

Bill


I apologize for that, I typed it in incorrectly. My code sheet says fputc
not fputs - sorry for any confusion that caused. I don't have the ability
to cut and paste code from this particular system, I have to rely on
transcription.
 
Walter Roberson

CBFalconer wrote:
Then you should probably be using EOF. The Unix/Linux equivalent
of ^Z is ^D. Depends on where your input originates.

Unix/Linux are not hardwired to ^D; that's merely the most common
default. The technical details of adjusting the end-of-file
character are off-topic for this newsgroup though.
 
Keith Thompson

CBFalconer said:
Then you should probably be using EOF. The Unix/Linux equivalent
of ^Z is ^D. Depends on where your input originates.

Eigenvector was talking about ^Z characters (ASCII 26) in a file.
Since he was opening the input file in binary mode, the character used
to signal an end-of-file on interactive input should be irrelevant.

I still don't know why he was having the problems he described, but
he's since solved them.
 
