help in C/C++ to read .DOC & PDF files


steve

Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?
I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and
/* print the same text in EXCEL as the output */
so that I can implement it in my program.
Any suggestions would be of great help.
Thanks in advance for the help.
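For the "output in EXCEL" part, one common approach (an assumption about
what is wanted, not something stated above) is to write a comma-separated
values (.csv) file, which Excel opens directly. A minimal sketch, with
placeholder numbers standing in for the real counts:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Write the statistics as CSV; Excel opens .csv files directly.
       The numbers below are placeholders for the program's real counts. */
    FILE *out = fopen("stats.csv", "w");
    if (out == NULL) {
        perror("stats.csv");
        return EXIT_FAILURE;
    }

    fprintf(out, "metric,value\n");
    fprintf(out, "lines,%ld\n", 120L);
    fprintf(out, "words,%ld\n", 2345L);
    fprintf(out, "characters,%ld\n", 15678L);
    fprintf(out, "\nword,frequency\n");
    fprintf(out, "the,97\n");            /* one row per distinct word */

    fclose(out);
    return 0;
}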
 

Ian Collins

steve said:
Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?

Why all the shouting?

I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and

Simple? You'll be lucky.

Have a look at the xpdf or openoffice source to see why.
 

osmium

steve said:
Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?
I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and
/* print the same text in EXCEL as the output */
so that I can implement it in my program.
Any suggestions would be of great help.
Thanks in advance for the help.

This could really eat up the time if you insist that it be highly automated.
The easiest, and highly manual, way is to export the .doc file as a text file
and then operate on that file. I don't know for sure, but it seems
reasonable that you might be able to do something similar with the .pdf
file. There may also be third-party conversion programs - I don't mean to
exclude freeware or shareware.
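As a rough illustration of that two-step approach, here is a sketch that
shells out to a converter and then runs the usual counting code on the
resulting text file. The tool names (antiword for .doc, pdftotext for PDF)
and the file names are assumptions about the environment, and
system()/external converters are outside what standard C guarantees:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Step 1: convert the .doc (or .pdf) to plain text with an external
       tool.  antiword and pdftotext are examples; any converter that can
       write a text file will do. */
    if (system("antiword report.doc > report.txt") != 0) {
        fprintf(stderr, "conversion failed\n");
        return EXIT_FAILURE;
    }

    /* Step 2: feed the converted file to the existing text-counting code. */
    FILE *fp = fopen("report.txt", "r");
    if (fp == NULL) {
        perror("report.txt");
        return EXIT_FAILURE;
    }

    long chars = 0;
    int c;
    while ((c = fgetc(fp)) != EOF)
        chars++;                 /* word/line/frequency counting goes here */

    printf("%ld characters after conversion\n", chars);
    fclose(fp);
    return 0;
}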
 

ben

Ian Collins said:
Why all the shouting?


Simple? You'll be lucky.

Have a look at the xpdf or openoffice source to see why.

steve,

The internal format of PDF is *really* complex, in my opinion. Even if
you're a brilliant programmer there are still an awful lot of hoops to
jump through, so I think it would require a lot of work; the PDF format
spec is over 1000 pages long. xpdf, as mentioned above, includes a
command-line utility called pdftotext, which takes a PDF as input and
outputs plain text; it works on its own and doesn't require all the GUI
stuff around it. I suggest you get hold of pdftotext, which comes with
xpdf and whose source is available for free, and make use of that. For
the .doc format I don't know.
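Following that suggestion, here is a minimal sketch of piping a PDF through
pdftotext and reading the extracted text directly, assuming pdftotext is
installed and on the PATH ("input.pdf" is a placeholder name, and popen()
is POSIX rather than standard C):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Ask pdftotext to write the extracted text to stdout ("-") and
       read it through a pipe, so no temporary file is needed. */
    FILE *pipe = popen("pdftotext input.pdf -", "r");
    if (pipe == NULL) {
        perror("popen");
        return EXIT_FAILURE;
    }

    long words = 0, lines = 0, chars = 0;
    int c, in_word = 0;

    while ((c = fgetc(pipe)) != EOF) {
        chars++;
        if (c == '\n')
            lines++;
        if (c == ' ' || c == '\t' || c == '\n') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }

    printf("lines=%ld words=%ld chars=%ld\n", lines, words, chars);
    pclose(pipe);
    return 0;
}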
 

Malcolm

steve said:
Hi,
I am writing a C program which can read TEXT, PDF and .DOC files.
The program is supposed to:
count the number of words,
lines,
and characters, plus the frequency of each word and the phrase count in the
file, and give the output in EXCEL.
THIS program is working very well for TXT (text) files,
but I need some help: how do I RUN this program to read PDF and .DOC
files?
I can't paste the source code as it's too big,
so please give me a sample C program which can read a DOC file or
a PDF file and
/* print the same text in EXCEL as the output */

The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.
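Purely as an illustration of the kind of scoring such a model relies on (a
toy stand-in, not the design described above, and the probabilities below
are invented rather than trained), here is a character-transition
log-likelihood that rates how "English-like" a run of bytes is:

#include <stdio.h>
#include <math.h>
#include <ctype.h>

/* Toy Markov-style score: transitions between letters/spaces are treated
   as likely, everything else as unlikely.  A real model would estimate
   these transition probabilities from a large training corpus. */
static double english_score(const unsigned char *s, size_t n)
{
    double logp = 0.0;
    for (size_t i = 1; i < n; i++) {
        int a = isalpha(s[i - 1]) || s[i - 1] == ' ';
        int b = isalpha(s[i]) || s[i] == ' ';
        logp += (a && b) ? log(0.9) : log(0.05);
    }
    return logp;
}

int main(void)
{
    const unsigned char text[] = "counting words in a file";
    const unsigned char junk[] = "%PDF-1.4 \x05\xff 9 0 obj<</F 2>>";

    printf("text score: %f\n", english_score(text, sizeof text - 1));
    printf("junk score: %f\n", english_score(junk, sizeof junk - 1));
    return 0;
}

The higher the score, the more the byte run looks like running text; a
classifier would compare this against a similar score from a "gibberish"
model.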
 

Richard Heathfield

Malcolm said:
The way I would solve this problem is to build a Hidden Markov Model and
train it to distinguish English-language text from formatting gibberish.
Then you just apply the model to the data, which will be a stream of bytes
with ASCII strings embedded in it, and extract the text.

Unfortunately this isn't a simple program to write from scratch.

Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.
 

osmium

Richard Heathfield said:


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.

I couldn't tell if that was a serious post or not. I thought it might be
"humour".
 

Malcolm

Richard Heathfield said:


Nor would it work if you tried it.

Take the C Standard PDF, run it through strings(1), and then grep for any
standard C library function you like. Tell me how many hits you get.
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.

We then apply the Viterbi algorithm. Essentially this runs the models over
the input and finds the most likely sequence of hidden states, determining
which state is more likely to have generated each part of the sequence, and
where the transition points must be.

Now if the embedded English strings are reasonably long, so that the chance
of a transition to gibberish is not too high, the algorithm will regard
short gibberish-looking stretches such as C identifiers as English, on the
balance of probability. That is not to say that it will be perfect - if a
string ends with a C identifier then the algorithm might well assign the
identifier to the gibberish. But it should do a reasonable job.

I'll try to find time to implement one.
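No implementation has been posted, but as a rough sketch of the two-state
Viterbi decoding described above (with invented emission and transition
probabilities standing in for trained models), something along these lines
labels each byte as English or gibberish and keeps only the English-labelled
bytes:

#include <stdio.h>
#include <math.h>
#include <ctype.h>

enum { ENGLISH, GIBBERISH, NSTATES };

/* Toy emission model: how likely each state is to emit a given byte.
   A real system would plug trained Markov models in here. */
static double emit_logp(int state, unsigned char c)
{
    int texty = isalpha(c) || c == ' ' || c == '.' || c == ',';
    if (state == ENGLISH)
        return texty ? log(0.90) : log(0.10);
    return texty ? log(0.30) : log(0.70);
}

int main(void)
{
    const unsigned char obs[] =
        "<</Len 42>>stream the quick brown fox endstream";
    size_t n = sizeof obs - 1;

    double stay = log(0.95), sw = log(0.05);   /* transition log-probs */
    double v[2][NSTATES];                      /* rolling Viterbi scores */
    int back[128][NSTATES];                    /* backpointers; n <= 128 here */
    int path[128];

    for (int s = 0; s < NSTATES; s++)
        v[0][s] = log(0.5) + emit_logp(s, obs[0]);

    for (size_t t = 1; t < n; t++) {
        for (int s = 0; s < NSTATES; s++) {
            double same  = v[(t - 1) & 1][s] + stay;
            double other = v[(t - 1) & 1][1 - s] + sw;
            back[t][s]   = (same >= other) ? s : 1 - s;
            v[t & 1][s]  = ((same >= other) ? same : other)
                           + emit_logp(s, obs[t]);
        }
    }

    /* Trace back the most likely state sequence. */
    path[n - 1] = (v[(n - 1) & 1][ENGLISH] >= v[(n - 1) & 1][GIBBERISH])
                      ? ENGLISH : GIBBERISH;
    for (size_t t = n - 1; t > 0; t--)
        path[t - 1] = back[t][path[t]];

    /* Keep only the bytes labelled as English. */
    for (size_t t = 0; t < n; t++)
        if (path[t] == ENGLISH)
            putchar(obs[t]);
    putchar('\n');
    return 0;
}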
 

Keith Thompson

Malcolm said:
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.
[...]

And is this supposed to work if the content is encrypted?
 

Jordan Abel

Keith Thompson said:
Malcolm said:
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.
[...]

And is this supposed to work if the content is encrypted?

Or (even more likely than encrypted) compressed?
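To put a number on that: most of a PDF's page content sits in
FlateDecode-compressed streams, so a raw byte scan finds very little
readable text. A quick illustrative check (the file name is a placeholder)
is to measure what fraction of a file's bytes are printable ASCII - near
100% for plain text, far lower for a typical PDF:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "input.pdf";
    FILE *fp = fopen(name, "rb");
    if (fp == NULL) {
        perror(name);
        return EXIT_FAILURE;
    }

    long total = 0, printable = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        total++;
        if (isprint(c) || c == '\n' || c == '\t' || c == '\r')
            printable++;
    }
    fclose(fp);

    if (total > 0)
        printf("%s: %.1f%% printable bytes\n",
               name, 100.0 * printable / total);
    return 0;
}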
 

Malcolm

Keith Thompson said:
The way the program would work is to build two Markov models, one of PDF /
doc gibberish, and one of English language. Then we also give a probability
of transitioning from English to gibberish and back again.
[...]

And is this supposed to work if the content is encrypted?

Depends on the quality of the encryption.
By Markov modelling of English / non-English encrypted texts you might be
able to distinguish between them. It would obviously work if the encryption
were a substitution cipher.
I suspect that decent encryption would require a much more sophisticated
attack.
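As an aside on the substitution-cipher case: one classical statistic in
this spirit (much simpler than a Markov model, and offered here only as an
illustration) is the index of coincidence, which a monoalphabetic
substitution leaves unchanged - roughly 0.066 for English letters versus
about 0.038 for uniformly random letters. A sketch:

#include <stdio.h>
#include <ctype.h>

/* Index of coincidence over the letters of a string: the probability
   that two letters drawn at random from it are equal.  English text
   scores around 0.066; uniformly random letters about 1/26 = 0.038.
   A monoalphabetic substitution cipher does not change the value. */
static double index_of_coincidence(const char *s)
{
    long count[26] = {0};
    long n = 0;

    for (; *s; s++) {
        if (isalpha((unsigned char)*s)) {
            count[tolower((unsigned char)*s) - 'a']++;
            n++;
        }
    }
    if (n < 2)
        return 0.0;

    double sum = 0.0;
    for (int i = 0; i < 26; i++)
        sum += (double)count[i] * (count[i] - 1);
    return sum / ((double)n * (n - 1));
}

int main(void)
{
    const char *english = "the quick brown fox jumps over the lazy dog "
                          "and then counts every word in the file";
    /* The same sentence enciphered with the Atbash substitution (a<->z). */
    const char *cipher  = "gsv jfrxp yildm ulc qfnkh levi gsv ozab wlt "
                          "zmw gsvm xlfmgh vevib dliw rm gsv urov";

    printf("plain  IoC: %.3f\n", index_of_coincidence(english));
    printf("cipher IoC: %.3f\n", index_of_coincidence(cipher));
    return 0;
}

The two printed values are identical, because a substitution cipher only
relabels letters; that is the property that lets language-like text be
spotted even after such "encryption".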
 
