help in C/C++ to read .DOC & PDF files


steve

Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?
I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and
/* print the same text in EXCEL as the output */
so that I can implement it in my program.
Any suggestions would be of great help.
Thanks in advance for the help.
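For the "output in EXCEL" part, one common approach (an assumption about
what is wanted, not something stated above) is to write a comma-separated
values (.csv) file, which Excel opens directly. A minimal sketch, with
placeholder numbers standing in for the real counts:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Write the statistics as CSV; Excel opens .csv files directly.
       The numbers below are placeholders for the program's real counts. */
    FILE *out = fopen("stats.csv", "w");
    if (out == NULL) {
        perror("stats.csv");
        return EXIT_FAILURE;
    }

    fprintf(out, "metric,value\n");
    fprintf(out, "lines,%ld\n", 120L);
    fprintf(out, "words,%ld\n", 2345L);
    fprintf(out, "characters,%ld\n", 15678L);
    fprintf(out, "\nword,frequency\n");
    fprintf(out, "the,97\n");            /* one row per distinct word */

    fclose(out);
    return 0;
}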
 

Ian Collins

steve said:
Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?

Why all the shouting?

I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and

Simple? You'll be lucky.

Have a look at the xpdf or openoffice source to see why.
 

osmium

steve said:
Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?
I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and
/* print the same text in EXCEL as the output */
so that I can implement it in my program.
Any suggestions would be of great help.
Thanks in advance for the help.

This could really eat up the time if you insist that it be highly automated.
The easiest, and highly manual, way is to export the .doc file as a text file
and then operate on that file. I don't know for sure, but it seems
reasonable that you might be able to do something similar with the .pdf
file. There may also be third-party conversion programs - I don't mean to
exclude freeware or shareware.
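As a rough illustration of that two-step approach, here is a sketch that
shells out to a converter and then runs the usual counting code on the
resulting text file. The tool names (antiword for .doc, pdftotext for PDF)
and the file names are assumptions about the environment, and
system()/external converters are outside what standard C guarantees:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Step 1: convert the .doc (or .pdf) to plain text with an external
       tool.  antiword and pdftotext are examples; any converter that can
       write a text file will do. */
    if (system("antiword report.doc > report.txt") != 0) {
        fprintf(stderr, "conversion failed\n");
        return EXIT_FAILURE;
    }

    /* Step 2: feed the converted file to the existing text-counting code. */
    FILE *fp = fopen("report.txt", "r");
    if (fp == NULL) {
        perror("report.txt");
        return EXIT_FAILURE;
    }

    long chars = 0;
    int c;
    while ((c = fgetc(fp)) != EOF)
        chars++;                 /* word/line/frequency counting goes here */

    printf("%ld characters after conversion\n", chars);
    fclose(fp);
    return 0;
}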
 

ben

Ian Collins said:
Why all the shouting?


Simple? You'll be lucky.

Have a look at the xpdf or openoffice source to see why.

steve,

The internal format of PDF is *really* complex, in my opinion. Even if
you're a brilliant programmer there are still an awful lot of hoops to
jump through, so I think it would require a lot of work; the PDF format
spec is over 1000 pages long. xpdf, as mentioned above, includes a
command-line utility called pdftotext, which takes a PDF as input and
outputs plain text; it works on its own and doesn't require all the GUI
stuff around it. I suggest you get hold of pdftotext, which comes with
xpdf and whose source is available for free, and make use of that. For
the .doc format I don't know.
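Following that suggestion, here is a minimal sketch of piping a PDF through
pdftotext and reading the extracted text directly, assuming pdftotext is
installed and on the PATH ("input.pdf" is a placeholder name, and popen()
is POSIX rather than standard C):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask pdftotext to write the extracted text to stdout ("-") and
       read it through a pipe, so no temporary file is needed. */
    FILE *pipe = popen("pdftotext input.pdf -", "r");
    if (pipe == NULL) {
        perror("popen");
        return EXIT_FAILURE;
    }

    long words = 0, lines = 0, chars = 0;
    int c, in_word = 0;

    while ((c = fgetc(pipe)) != EOF) {
        chars++;
        if (c == '\n')
            lines++;
        if (c == ' ' || c == '\t' || c == '\n') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }

    printf("lines=%ld words=%ld chars=%ld\n", lines, words, chars);
    pclose(pipe);
    return 0;
}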
 

Malcolm

steve said:
Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?
I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and
/* print the same text in EXCEL as the output */

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.
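Purely as an illustration of the kind of scoring such a model relies on (a
toy stand-in, not the design described above, and the probabilities below
are invented rather than trained), here is a character-transition
log-likelihood that rates how "English-like" a run of bytes is:

#include <stdio.h>
#include <math.h>
#include <ctype.h>

/* Toy Markov-style score: transitions between letters/spaces are treated
   as likely, everything else as unlikely.  A real model would estimate
   these transition probabilities from a large training corpus. */
static double english_score(const unsigned char *s, size_t n)
{
    double logp = 0.0;
    for (size_t i = 1; i < n; i++) {
        int a = isalpha(s[i - 1]) || s[i - 1] == ' ';
        int b = isalpha(s[i]) || s[i] == ' ';
        logp += (a && b) ? log(0.9) : log(0.05);
    }
    return logp;
}

int main(void)
{
    const unsigned char text[] = "counting words in a file";
    const unsigned char junk[] = "%PDF-1.4 \x05\xff 9 0 obj<</F 2>>";

    printf("text score: %f\n", english_score(text, sizeof text - 1));
    printf("junk score: %f\n", english_score(junk, sizeof junk - 1));
    return 0;
}

The higher the score, the more the byte run looks like running text; a
classifier would compare this against a similar score from a "gibberish"
model.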
 

Richard Heathfield

Malcolm said:
The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.

Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.
 

osmium

Richard Heathfield said:


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.

I couldn't tell if that was a serious post or not. I thought it might be
"humour".
 

Malcolm

Richard Heathfield said:


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.

We then apply the Viterbi algorithm. Essentially this runs the models over
the input and finds the most likely sequence of hidden states, determining
which state is more likely to have generated each part of the sequence, and
where the transition points must be.

Now if the embedded English strings are reasonably long, so that the chance
of a transition to gibberish is not too high, the algorithm will regard
short gibberish-looking stretches such as C identifiers as English, on the
balance of probability. That is not to say that it will be perfect - if a
string ends with a C identifier then the algorithm might well assign the
identifier to the gibberish. But it should do a reasonable job.

I'll try to find time to implement one.
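No implementation has been posted, but as a rough sketch of the two-state
Viterbi decoding described above (with invented emission and transition
probabilities standing in for trained models), something along these lines
labels each byte as English or gibberish and keeps only the English-labelled
bytes:

#include <stdio.h>
#include <math.h>
#include <ctype.h>

enum { ENGLISH, GIBBERISH, NSTATES };

/* Toy emission model: how likely each state is to emit a given byte.
   A real system would plug trained Markov models in here. */
static double emit_logp(int state, unsigned char c)
{
    int texty = isalpha(c) || c == ' ' || c == '.' || c == ',';
    if (state == ENGLISH)
        return texty ? log(0.90) : log(0.10);
    return texty ? log(0.30) : log(0.70);
}

int main(void)
{
    const unsigned char obs[] =
        "<</Len 42>>stream the quick brown fox endstream";
    size_t n = sizeof obs - 1;

    double stay = log(0.95), sw = log(0.05);   /* transition log-probs */
    double v[2][NSTATES];                      /* rolling Viterbi scores */
    int back[128][NSTATES];                    /* backpointers; n <= 128 here */
    int path[128];

    for (int s = 0; s < NSTATES; s++)
        v[0][s] = log(0.5) + emit_logp(s, obs[0]);

    for (size_t t = 1; t < n; t++) {
        for (int s = 0; s < NSTATES; s++) {
            double same  = v[(t - 1) & 1][s] + stay;
            double other = v[(t - 1) & 1][1 - s] + sw;
            back[t][s]   = (same >= other) ? s : 1 - s;
            v[t & 1][s]  = ((same >= other) ? same : other)
                           + emit_logp(s, obs[t]);
        }
    }

    /* Trace back the most likely state sequence. */
    path[n - 1] = (v[(n - 1) & 1][ENGLISH] >= v[(n - 1) & 1][GIBBERISH])
                      ? ENGLISH : GIBBERISH;
    for (size_t t = n - 1; t > 0; t--)
        path[t - 1] = back[t][path[t]];

    /* Keep only the bytes labelled as English. */
    for (size_t t = 0; t < n; t++)
        if (path[t] == ENGLISH)
            putchar(obs[t]);
    putchar('\n');
    return 0;
}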
 

Keith Thompson

Malcolm said:
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.
[...]

And is this supposed to work if the content is encrypted?
 

Jordan Abel

Keith Thompson said:
Malcolm said:
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.
[...]

And is this supposed to work if the content is encrypted?

Or (even more likely than encrypted) compressed?
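To put a number on that: most of a PDF's page content sits in
FlateDecode-compressed streams, so a raw byte scan finds very little
readable text. A quick illustrative check (the file name is a placeholder)
is to measure what fraction of a file's bytes are printable ASCII - near
100% for plain text, far lower for a typical PDF:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "input.pdf";
    FILE *fp = fopen(name, "rb");
    if (fp == NULL) {
        perror(name);
        return EXIT_FAILURE;
    }

    long total = 0, printable = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        total++;
        if (isprint(c) || c == '\n' || c == '\t' || c == '\r')
            printable++;
    }
    fclose(fp);

    if (total > 0)
        printf("%s: %.1f%% printable bytes\n",
               name, 100.0 * printable / total);
    return 0;
}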
 

Malcolm

Keith Thompson said:
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.
[...]

And is this supposed to work if the content is encrypted?

Depends on the quality of the encryption.
By Markov modelling of English / non-English encrypted texts you might be
able to distinguish between them. It would obviously work if the encryption
were a substitution cipher.
I suspect that decent encryption would require a much more sophisticated
attack.
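As an aside on the substitution-cipher case: one classical statistic in
this spirit (much simpler than a Markov model, and offered here only as an
illustration) is the index of coincidence, which a monoalphabetic
substitution leaves unchanged - roughly 0.066 for English letters versus
about 0.038 for uniformly random letters. A sketch:

#include <stdio.h>
#include <ctype.h>

/* Index of coincidence over the letters of a string: the probability
   that two letters drawn at random from it are equal.  English text
   scores around 0.066; uniformly random letters about 1/26 = 0.038.
   A monoalphabetic substitution cipher does not change the value. */
static double index_of_coincidence(const char *s)
{
    long count[26] = {0};
    long n = 0;

    for (; *s; s++) {
        if (isalpha((unsigned char)*s)) {
            count[tolower((unsigned char)*s) - 'a']++;
            n++;
        }
    }
    if (n < 2)
        return 0.0;

    double sum = 0.0;
    for (int i = 0; i < 26; i++)
        sum += (double)count[i] * (count[i] - 1);
    return sum / ((double)n * (n - 1));
}

int main(void)
{
    const char *english = "the quick brown fox jumps over the lazy dog "
                          "and then counts every word in the file";
    /* The same sentence enciphered with the Atbash substitution (a<->z). */
    const char *cipher  = "gsv jfrxp yildm ulc qfnkh levi gsv ozab wlt "
                          "zmw gsvm xlfmgh vevib dliw rm gsv urov";

    printf("plain  IoC: %.3f\n", index_of_coincidence(english));
    printf("cipher IoC: %.3f\n", index_of_coincidence(cipher));
    return 0;
}

The two printed values are identical, because a substitution cipher only
relabels letters; that is the property that lets language-like text be
spotted even after such "encryption".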
 
