Extracting text from a PowerPoint file

code_wrong

Hi,
I decided to extract the text from some PowerPoint files. The results have
thrown up some questions.

When I use the 'char *valid' character array (in the program below) to
choose which characters to write to the new file, the result is totally
different from when I use the line with isalpha() and isdigit().

Yes, there are more valid characters in the valid array, but that is not
the problem. Using it, I see extra spaces in the new file and it is more
difficult to read (in Notepad there appears to be a space between each
character; in WordPad there are boxes between characters). Why?

Would anyone care to investigate and enlighten me? The code is below; all you
need to do is comment and uncomment the two if lines to reproduce the
differences I am talking about.

To use the program (on MS Windows), just drag the file you want to process
onto the .exe file.

cheers
cw

the program:
############

#include <stdio.h>
#include <stdlib.h> /* exit(), system() */
#include <string.h> /* strchr() */
#include <ctype.h>

void writeFile(FILE *infile, FILE *outfile);

int main(int argc, char *argv[])
{
    FILE *outfile = NULL; //the file to write to
    FILE *infile = NULL; //the file to read

    /* argv[1] only exists if a file was dropped onto the .exe */
    if (argc < 2 || ((infile = fopen(argv[1], "rb")) == NULL) || ((outfile = fopen("new.txt", "wb")) == NULL))
    {
        printf("error opening file - fatal error - goodbye");
        getchar();
        exit(1);
    }
    writeFile(infile, outfile);
    fclose(infile);
    fclose(outfile);
    fflush(stdout);
    system("pause");
    return 0;
}

void writeFile(FILE *infile, FILE *outfile)
{
    /* two adjacent literals are joined by the compiler into one string */
    const char *valid =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
        "\n.;:<>?/|\\!\"£$%^&*()_-=+,#~[]{}";

    int byte;

    while (1)
    {
        byte = fgetc(infile); /*read one byte*/
        if (byte == EOF) { break; } /*break from while at end of file or on error*/

        /*if(strchr(valid,byte))*/
        if ((isalpha(byte)) || (isdigit(byte)) || (byte == ' ') || (byte == '\n'))
        {
            fputc(byte, outfile);
        }
        else
        { }
    }
}

############
 
Irrwahn Grausewitz

code_wrong said:
When I use the 'char *valid' character array (in the program below) to
choose which characters to write to the new file, the result is totally
different from when I use the line with isalpha() and isdigit().

Yes, there are more valid characters in the valid array, but that is not
the problem. Using it, I see extra spaces in the new file and it is more
difficult to read (in Notepad there appears to be a space between each
character; in WordPad there are boxes between characters). Why?
void writeFile(FILE *infile,FILE *outfile)
{
char *valid =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
"\n.;:<>?/|\\!\"£$%^&*()_-=+,#~[]{}";

You'd be better off declaring the array static, but that's not the
problem.
int byte;

while(1)
{
byte = fgetc(infile);/*read one byte*/
if(feof(infile)){break;}/*break from while at end of file*/

/*if(strchr(valid,byte))*/

I've only skimmed your code and won't comment on style flaws, but the
line above (the one giving you trouble when uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string. You want something
like:

if( byte && strchr(valid,byte))
{
fputc(byte,outfile);
}
else
{ }

}
}
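
To make the point about strchr concrete: strchr(valid, 0) returns a pointer to the string's terminator rather than NULL, so the unguarded test accepts every zero byte read from the file. The sketch below contrasts the two checks. The helper names is_valid_naive and is_valid_guarded are made up for illustration and are not part of the original program.

```c
#include <stddef.h>
#include <string.h>

/* Unguarded test (hypothetical helper): strchr() treats the terminating
   '\0' as part of the string, so this returns 1 for byte == 0 as well as
   for every character listed in valid. */
int is_valid_naive(const char *valid, int byte)
{
    return strchr(valid, byte) != NULL;
}

/* Guarded test (hypothetical helper): reject the zero byte first, then
   consult strchr() for everything else. */
int is_valid_guarded(const char *valid, int byte)
{
    return byte != 0 && strchr(valid, byte) != NULL;
}
```

With valid set to "abc", the naive version accepts a byte value of 0 and the guarded version rejects it; both agree on every other byte.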

Best regards
 
code_wrong

snip
I've only skimmed your code and won't comment on style flaws, but the
line above (the one giving you trouble when uncommented, right?) does
not check for 0 bytes. In the strchr function, the terminating null
character is considered to be part of the string. You want something
like:

if( byte && strchr(valid,byte))

snip

Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file: if(strchr(valid,byte)).
So I guess the program reads a null character from the file and writes it to
the output file.

I wonder why there are so many null characters in the PowerPoint file (every
second character). Interesting.

cheers
cw
 
Mike Wahler

code_wrong said:
snip

Thanks, you have identified the line of code that was producing the
boxes/spaces in the output file: if(strchr(valid,byte)).
So I guess the program reads a null character from the file and writes it to
the output file.

I wonder why there are so many null characters in the PowerPoint file (every
second character). Interesting.

Well, it's a 'binary' file (as opposed to 'plain text'), in which embedded
zero characters are common. Your remark about 'every second character'
makes me guess that perhaps (at least part of) the data might be stored
as multibyte or 'wide' characters (e.g. Unicode). You might want to look
into that possibility.
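
If that guess is right, the pattern is easy to picture: in a 16-bit little-endian encoding, "Hi" is stored as the bytes 48 00 69 00, so every ASCII character is followed by a zero byte. Here is a rough sketch of reading such data two bytes at a time instead of one. It assumes the text really is 16-bit little-endian and that the buffer starts on a character boundary, neither of which has been verified for these files. The function name utf16le_ascii_to_text is invented for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies the ASCII characters found in a buffer of
   16-bit little-endian code units into out, which must hold at least
   len / 2 + 1 bytes. Code units outside printable ASCII (other than
   newline) are skipped. Returns the number of characters written,
   not counting the terminator. */
size_t utf16le_ascii_to_text(const unsigned char *in, size_t len, char *out)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        unsigned char lo = in[i];     /* low byte comes first */
        unsigned char hi = in[i + 1]; /* zero for ASCII characters */

        if (hi == 0 && (isprint(lo) || lo == '\n')) {
            out[n++] = (char)lo;
        }
    }
    out[n] = '\0';
    return n;
}
```

Reading in pairs keeps only the low byte of ASCII code units, so no stray zero bytes reach the output, and non-ASCII code units are dropped rather than split into two unrelated bytes.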

-Mike
 
