Extract text from a PDF file with Java

Sergio

Hi everyone, this is my first post to the newsgroup.
I'm trying to retrieve text from PDF files with Java.
I'm looking for example code, a tutorial, a guide or something similar.

For the moment I'm using the PDFBox library, but I've noticed a lot of errors
in its PDF parsing.
So I tried the "Pjx" library, and I found good example code on
this site:
http://www.jguru.com/faq/view.jsp?EID=1074237
...but I can't find a way to call the "PdfParser.getContents()" method.

I would appreciate any advice.
Thanks in advance.

Sergio.
 
Lars Enderin

Oliver Wong wrote:
How can you "not find a way" to call a specific method? What did you
type and what error message was produced?

The method is declared private. It's not supposed to be called from
outside the class.
 
Sergio

Lars Enderin wrote:
The method is declared private. It's not supposed to be called from
outside the class.

First of all, thanks for the answers.
I made that method public before calling it.
My calling code is this (very simple):

File f = new File("sample.pdf");
String text = new String();
PdfParser p = new PdfParser();
Document doc = p.parse(f);
text = p.getContents();


These are the errors displayed on the console:

Exception in thread "main" java.lang.ClassCastException: java.lang.String
    at com.etymon.pj.PdfParser.parse(PdfParser.java:427)
    at com.etymon.pj.PdfParser.getNextXref(PdfParser.java:67)
    at com.etymon.pj.PdfParser.getXref(PdfParser.java:57)
    at com.etymon.pj.PdfParser.getObjects(PdfParser.java:12)
    at com.etymon.pj.Pdf.readFromFile(Pdf.java:1227)
    at com.etymon.pj.Pdf.<init>(Pdf.java:32)
    at PdfParser.getContents(PdfParser.java:82)
    at PdfParser.parse(PdfParser.java:47)
    at PdfParser.parse(PdfParser.java:29)
    at Prova.main(Prova.java:31)

Thanks in advance for your interest.

Sergio.
 
Oliver Wong

Please show the parse method of the file com.etymon.pj.PdfParser. Be
sure to include line 427.

- Oliver
 
Sergio

As you requested, here is the parse method of the file
com.etymon.pj.PdfParser.
It's quite long. Line 427 is the return statement at the end of the
method.
Thanks again.

public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][] xref, byte[] data, int start)
        throws IOException, PjException {
    PdfParserState state = new PdfParserState();
    state._data = data;
    state._pos = start;
    state._stream = -1;
    Stack stack = new Stack();
    boolean endFlag = false;
    while ( ( ! endFlag ) && (getToken(state)) ) {
        if (state._stream != -1) {
            stack.push(state._streamToken);
            state._stream = -1;
        }
        else if (state._token.equals("startxref")) {
            endFlag = true;
        }
        else if (state._token.equals("endobj")) {
            endFlag = true;
        }
        else if (state._token.equals("%%EOF")) {
            endFlag = true;
        }
        else if (state._token.equals("endstream")) {
            byte[] stream = (byte[])(stack.pop());
            PjStreamDictionary pjsd = new PjStreamDictionary(
                ((PjDictionary)(stack.pop())).getHashtable());
            PjStream pjs = new PjStream(pjsd, stream);
            stack.push(pjs);
        }
        else if (state._token.equals("stream")) {
            // get length of stream
            PjObject obj = ((PjObject)(
                (((PjDictionary)(stack.peek())).
                    getHashtable().
                    get(new PjName("Length")))));
            if (obj instanceof PjReference) {
                obj = getObject(pdf, raf, xref,
                    ((PjReference)(obj)).getObjNumber().getInt());
            }
            state._stream =
                ((PjNumber)(obj)).getInt();

            // the following if() clause added to
            // handle the case of "Length" being
            // incorrect (larger than the actual
            // stream length)
            if ( state._stream >
                 (state._data.length - state._pos)
                 ) {
                state._stream =
                    state._data.length -
                    state._pos - 17;
            }

            if (state._pos < state._data.length) {
                if ((char)(state._data[state._pos]) == '\r') {
                    state._pos++;
                }
                if ( (state._pos < state._data.length) &&
                     ((char)(state._data[state._pos]) ==
                      '\n') ) {
                    state._pos++;
                }
            }
        }
        else if (state._token.equals("null")) {
            stack.push(new PjNull());
        }
        else if (state._token.equals("true")) {
            stack.push(new PjBoolean(true));
        }
        else if (state._token.equals("false")) {
            stack.push(new PjBoolean(false));
        }
        else if (state._token.equals("R")) {
            // we ignore the generation number
            // because all objects get reset to
            // generation 0 when we collapse the
            // incremental updates
            stack.pop(); // the generation number
            PjNumber obj = (PjNumber)(stack.pop());
            stack.push(new PjReference(obj, PjNumber.ZERO));
        }
        else if ( (state._token.charAt(0) == '<') &&
                  (state._token.startsWith("<<") == false) ) {
            stack.push(new PjString(PjString.decodePdf(state._token)));
        }
        else if (
            (Character.isDigit(state._token.charAt(0)))
            || (state._token.charAt(0) == '-')
            || (state._token.charAt(0) == '.') ) {
            stack.push(new PjNumber(new Float(state._token).floatValue()));
        }
        else if (state._token.charAt(0) == '(') {
            stack.push(new PjString(PjString.decodePdf(state._token)));
        }
        else if (state._token.charAt(0) == '/') {
            stack.push(new PjName(state._token.substring(1)));
        }
        else if (state._token.equals(">>")) {
            boolean done = false;
            Object obj;
            Hashtable h = new Hashtable();
            while ( ! done ) {
                obj = stack.pop();
                if ( (obj instanceof String) &&
                     (((String)obj).equals("<<")) ) {
                    done = true;
                } else {
                    h.put((PjName)(stack.pop()),
                          (PjObject)obj);
                }
            }
            // figure out what kind of dictionary we have
            PjDictionary dictionary = new PjDictionary(h);
            if (PjPage.isLike(dictionary)) {
                stack.push(new PjPage(h));
            }
            else if (PjPages.isLike(dictionary)) {
                stack.push(new PjPages(h));
            }
            else if (PjFontType1.isLike(dictionary)) {
                stack.push(new PjFontType1(h));
            }
            else if (PjFontDescriptor.isLike(dictionary)) {
                stack.push(new PjFontDescriptor(h));
            }
            else if (PjResources.isLike(dictionary)) {
                stack.push(new PjResources(h));
            }
            else if (PjCatalog.isLike(dictionary)) {
                stack.push(new PjCatalog(h));
            }
            else if (PjInfo.isLike(dictionary)) {
                stack.push(new PjInfo(h));
            }
            else if (PjEncoding.isLike(dictionary)) {
                stack.push(new PjEncoding(h));
            }
            else {
                stack.push(dictionary);
            }
        }
        else if (state._token.equals("]")) {
            boolean done = false;
            Object obj;
            Vector v = new Vector();
            while ( ! done ) {
                obj = stack.pop();
                if ( (obj instanceof String) &&
                     (((String)obj).equals("[")) ) {
                    done = true;
                } else {
                    v.insertElementAt((PjObject)obj, 0);
                }
            }
            // figure out what kind of array we have
            PjArray array = new PjArray(v);
            if (PjRectangle.isLike(array)) {
                stack.push(new PjRectangle(v));
            }
            else if (PjProcSet.isLike(array)) {
                stack.push(new PjProcSet(v));
            }
            else {
                stack.push(array);
            }
        }
        else if (state._token.startsWith("%")) {
            // do nothing
        }
        else {
            stack.push(state._token);
        }
    }
    /*line 427*/ return (PjObject)(stack.pop());
}
 
Oliver Wong

[OP has a ClassCastException on line 427; the actual class type is String]

public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][]
xref, byte[] data, int start) [...]
Stack stack = new Stack(); [...]
stack.push(state._streamToken);
[...]
byte[] stream = (byte[])(stack.pop());
PjStreamDictionary pjsd = new PjStreamDictionary(
((PjDictionary)(stack.pop())).getHashtable());
PjStream pjs = new PjStream(pjsd, stream);
stack.push(pjs); [...]
/*line 427*/ return (PjObject)(stack.pop());

This code is extremely messy in that it pushes objects of all sorts of different
types onto the stack. I wouldn't be surprised if this were
generated code rather than handwritten.
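
The failure mode can be reproduced without the Pjx library. The sketch below is not Pjx code: MixedStackDemo, Node and parseTokens are hypothetical names, and the tokenizer is reduced to a string array. It assumes only what the posted method shows, namely that unrecognised tokens are pushed onto an untyped stack as plain String objects while the final return casts whatever happens to be on top.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class MixedStackDemo {
    // Hypothetical stand-in for the library's PjObject hierarchy.
    static class Node {
        final float value;
        Node(float value) { this.value = value; }
    }

    // Mimics the shape of the posted parse(): numeric tokens become Node
    // objects, anything unrecognised is pushed as a raw String.
    static Node parseTokens(String[] tokens) {
        Deque<Object> stack = new ArrayDeque<>();
        for (String token : tokens) {
            if (Character.isDigit(token.charAt(0))) {
                stack.push(new Node(Float.parseFloat(token)));
            } else {
                stack.push(token);
            }
        }
        // Same pattern as line 427: an unchecked downcast of the top element.
        return (Node) stack.pop();
    }

    public static void main(String[] args) {
        // Last token is a number, so the cast succeeds.
        System.out.println(parseTokens(new String[] {"xref", "42"}).value);
        try {
            // Last token is unrecognised, so a String is on top of the stack.
            parseTokens(new String[] {"42", "xref"});
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: top of stack was a String");
        }
    }
}
```

If the real parser behaves like this sketch, the exception would mean that the last token read before the loop ended was one the parser did not recognise, which may point at the input data or the start offset rather than at the cast itself.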

If this is your code, you've got a bug and you need to fix it. If it's
someone else's code, then you should write up an SSCCE demonstrating the bug
and submit it to them. See http://mindprod.com/jgloss/sscce.html

- Oliver
 
Sergio

Oliver Wong wrote:
This code is extremely messy in that it pushes objects of all sorts of different
types onto the stack. I wouldn't be surprised if this were
generated code rather than handwritten.

If this is your code, you've got a bug and you need to fix it. If it's
someone else's code, then you should write up an SSCCE demonstrating the bug
and submit it to them. See http://mindprod.com/jgloss/sscce.html

The code of the parse method is from the Pjx library. The only code I wrote
is the calling method, and I think the problem is in that procedure.
Thanks for your help.
Sergio.
 
Chris Uppal

Sergio said:
i've made that method public before calling it.

And you are surprised to find that it doesn't work?

Presumably the author made that method private for a reason -- for instance it
may depend on certain kinds of initialisation being done first. Why not
explore the library for the /correct/ way to use it for what you want. If you
find there isn't a way, then you could drop a line to the author suggesting an
enhancement -- which would probably be more welcome if you can supply /working/
code too.
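
One way a private method can depend on earlier initialisation is sketched below. This is an illustration only: TextExtractor, open and extract are hypothetical names and do not come from Pjx or from the jGuru example.

```java
public class TextExtractor {
    private String source; // set only by open()

    // Public entry point: performs the initialisation, then delegates.
    public String open(String data) {
        this.source = data;
        return extract();
    }

    // Private helper: assumes open() has already stored the data.
    // Making it public would let callers reach it while source is still null.
    private String extract() {
        if (source == null) {
            throw new IllegalStateException("open() must run before extract()");
        }
        return source.trim();
    }

    public static void main(String[] args) {
        System.out.println(new TextExtractor().open("  some text  "));
    }
}
```

In a design like this, the public method is the supported way in, and the private one is an implementation detail that is safe only when reached through it.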

-- chris
 
