Parsing pdf files


Arun Kumar

Hello all,
Does anyone know a good PDF parser that retains formatting
in the text it extracts? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on Linux,
the formatting is retained. I'm looking for something like that.

Thanks for any help.

regards,
Arun Kumar M S


--
|| श्री जानकीरघुनाथो विजयते ||
 

Gregory Brown

This doesn't exist in Ruby, unfortunately.

-greg
 

Arun Kumar

That's really very sad :(

 

Arun Kumar

Yeah, seeing what can be done :)

On Sun, Aug 23, 2009 at 2:41 AM, Gregory Brown wrote:
> Looks like you better roll up your sleeves :)
 

Hannes Wyss

Arun,

there is another Ruby PDF extractor: http://scm.ywesee.com/?p=rpdf2txt;a=summary

However, it's largely undocumented, slow, fragile, and its column
detection algorithm is basic at best. If that does not faze you, give
it a try and contact me if you have questions. Look at the included
commandline-tool in ./bin/rpdf2txt for an example.

cheers,
Hannes

 
