extracting text from powerpoint file

Discussion in 'C Programming' started by code_wrong, Sep 12, 2005.

  1. code_wrong

    code_wrong Guest

    hi,
    I decided to extract the text from some powerpoint files. The results have
    thrown up some questions.

    When I use the 'char *valid' character array (in the program below) to
    choose the characters to write in the new file... the result is totally
    different to when I use the line with isalpha() and isdigit().

    Yes .. There are more valid characters in the valid array but this is not
    the problem .. Using it, I see extra spaces in the new file and it is more
    difficult to read (in notepad there appears to be a space between each
    character .. in wordpad there are boxes between characters).. why?

    anyone care to investigate and enlighten me? .. the code is below all you
    need to do is comment and uncomment to achieve the differences I am talking
    about

    To use the program (with MS Windows) all you need to do is drag the file you
    want to process onto the .exe file

    cheers
    cw

    the program:
    ############

    #include<stdio.h>
    #include<stdlib.h> /* exit(), system() */
    #include<string.h> /* strchr() */
    #include<ctype.h>

    void writeFile(FILE *infile,FILE *outfile);

    int main(int argc, char *argv[])
    {
    FILE *outfile = NULL; //the file to write to
    FILE *infile = NULL; //the file to read

    if((argc<2)||((infile=fopen(argv[1],"rb"))==NULL)||((outfile=fopen("new.txt","wb"))==NULL))
    {
    printf("error opening file - fatal error - goodbye");
    getchar();
    exit(1);
    }
    writeFile(infile,outfile);
    fflush(stdout);
    system("pause");
    return 0;
    }

    void writeFile(FILE *infile,FILE *outfile)
    {
    char *valid =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
    "\n.;:<>?/|\\!\"£$%^&*()_-=+,#~[]{}";

    int byte;

    while(1)
    {
    byte = fgetc(infile);/*read one byte*/
    if(feof(infile)){break;}/*break from while at end of file*/

    /*if(strchr(valid,byte))*/
    if((isalpha(byte))||(isdigit(byte))||(byte==' ')||(byte == '\n'))
    {
    fputc(byte,outfile);
    }
    else
    { }

    }
    }

    ############
    code_wrong, Sep 12, 2005
    #1

  2. "code_wrong" <> wrote:
    <snip>
    >When I use the 'char *valid' character array (in the program below) to
    >choose the characters to write in the new file... the result is totally
    >different to when I use the line with isalpha() and isdigit().
    >
    >Yes .. There are more valid characters in the valid array but this is not
    >the problem .. Using it, I see extra spaces in the new file and it is more
    >difficult to read (in notepad there appears to be a space between each
    >character .. in wordpad there are boxes between characters).. why?

    <snip>
    >void writeFile(FILE *infile,FILE *outfile)
    >{
    > char *valid =
    >"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
    >\n.;:<>?/|\\!\"£$%^&*()_-=+,#~[]{}";


    You'd be better off declaring the array static, but that's not the
    problem.

    > int byte;
    >
    > while(1)
    > {
    > byte = fgetc(infile);/*read one byte*/
    > if(feof(infile)){break;}/*break from while at end of file*/
    >
    > /*if(strchr(valid,byte))*/


    I've only skimmed over your code, and won't comment on style flaws, but
    the line above (the one giving you trouble, if uncommented, right?) does
    not check for 0 bytes. In the strchr function, the terminating null
    character is considered to be part of the string, so strchr(valid,0)
    succeeds and every 0 byte gets written out. You want something like:

    if( byte && strchr(valid,byte))
    > {
    > fputc(byte,outfile);
    > }
    > else
    > { }
    >
    > }
    >}


    Best regards
    --
    Irrwahn Grausewitz ()
    welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
    clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
    clc frequent answers: http://benpfaff.org/writings/clc
    Irrwahn Grausewitz, Sep 12, 2005
    #2

  3. code_wrong

    code_wrong Guest

    "Irrwahn Grausewitz" <> wrote in message
    news:...

    snip

    > I've only skimmed over your code, and won't comment style flaws, but
    > above line (the one giving you troubles, if uncommented, right?) does
    > not check for 0 bytes. In the strchr function, the terminating null
    > character is considered to be part of the string. You want something
    > like:
    >
    > if( byte && strchr(valid,byte))


    snip

    Thanks, you have identified the line of code that was producing the
    boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
    So I guess the program reads a null character in the file and writes it to
    the output file ...

    wonder why there are so many null characters in the powerpoint file (every
    second character) ....interesting

    cheers
    cw
    code_wrong, Sep 12, 2005
    #3
  4. Mike Wahler

    Mike Wahler Guest

    "code_wrong" <> wrote in message
    news:4325d6c4$...
    >
    > "Irrwahn Grausewitz" <> wrote in message
    > news:...
    >
    > snip
    >
    >> I've only skimmed over your code, and won't comment style flaws, but
    >> above line (the one giving you troubles, if uncommented, right?) does
    >> not check for 0 bytes. In the strchr function, the terminating null
    >> character is considered to be part of the string. You want something
    >> like:
    >>
    >> if( byte && strchr(valid,byte))

    >
    > snip
    >
    > Thanks, you have identified the line of code that was producing the
    > boxes/spaces in the output file. .... this one: if(strchr(valid,byte)) ...
    > So I guess the program reads a null character in the file and writes it to
    > the output file ...
    >
    > wonder why there are so many null characters in the powerpoint file (every
    > second character) ....interesting


    Well, it's a 'binary' file (as opposed to 'plain text'), in which embedded
    zero characters are common. Your remark about 'every second character'
    makes me guess that perhaps (at least part of) the data might be stored
    as multibyte or 'wide' characters (e.g. Unicode). You might want to look
    into that possibility.

    -Mike
    Mike Wahler, Sep 12, 2005
    #4