Parsing pdf files

Arun Kumar · Aug 22, 2009

Hello all,
Does anyone know a good PDF parser that retains formatting
in the extracted text? I used PDF::Reader, but when you extract text you
just get a stream of characters that is not at all intelligible. When I
copy a PDF's contents from a PDF reader into the Gedit text editor on Linux,
the formatting is retained. I'm looking for something like that.

Thanks for any help.

regards,
Arun Kumar M S

--
|| श्री जानकीरघुनाथो विजयते ||

Gregory Brown · Aug 22, 2009

This doesn't exist in Ruby, unfortunately.

-greg

Arun Kumar · Aug 22, 2009

That's really very sad

Gregory Brown · Aug 22, 2009

Looks like you better roll up your sleeves

Arun Kumar · Aug 23, 2009

Yeah, seeing what can be done :)

Hannes Wyss · Aug 24, 2009

Arun,

there is another Ruby PDF extractor: http://scm.ywesee.com/?p=rpdf2txt;a=summary

However, it's largely undocumented, slow, and fragile, and its column
detection algorithm is basic at best. If that does not faze you, give
it a try and contact me if you have questions. See the included
command-line tool in ./bin/rpdf2txt for an example.

cheers,
Hannes

Erik Terpstra · Aug 24, 2009

You can use http://pdftohtml.sourceforge.net or use my Ruby wrapper for
this tool:

http://github.com/eterps/pdf-struct/tree/master
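
For anyone who wants to try the pdftohtml route by hand, here is a minimal sketch. It is not the pdf-struct API. It assumes the `pdftohtml` binary is on your PATH, that it accepts `-xml`, `-i` and `-stdout`, and that its XML output is a set of `<page>` elements holding `<text top="..." left="...">` fragments. The helper names (`pdf_to_xml`, `node_text`, `layout_lines`) and the `units_per_char` scale are made up for illustration. On Ruby 3.0 and later, REXML ships as a bundled gem rather than as part of the core standard library.

```ruby
require 'open3'
require 'rexml/document'

# Hypothetical helper: run pdftohtml and return its XML output as a string.
# Assumes the installed pdftohtml supports -xml, -i (skip images) and -stdout.
def pdf_to_xml(path)
  out, status = Open3.capture2('pdftohtml', '-xml', '-i', '-stdout', path)
  raise "pdftohtml failed for #{path}" unless status.success?
  out
end

# Collect the text of an element, including text nested in <b>, <i>, etc.
def node_text(element)
  element.children.map do |child|
    case child
    when REXML::Text    then child.value        # unescaped text
    when REXML::Element then node_text(child)   # descend into inline tags
    else ''
    end
  end.join
end

# Hypothetical helper: rebuild rough text lines from positioned fragments.
# Fragments on a page that share a `top` value are treated as one line,
# ordered by `left`, and padded so columns land near left / units_per_char.
def layout_lines(xml, units_per_char: 6)
  doc = REXML::Document.new(xml)
  lines = []
  doc.elements.each('//page') do |page|
    rows = Hash.new { |hash, top| hash[top] = [] }
    page.elements.each('text') do |el|
      rows[el.attributes['top'].to_i] << [el.attributes['left'].to_i, node_text(el)]
    end
    rows.sort.each do |_top, fragments|
      lines << fragments.sort_by(&:first).reduce(+'') do |line, (left, text)|
        column = left / units_per_char
        min_gap = line.empty? ? 0 : 1           # keep fragments from touching
        line << ' ' * [column - line.length, min_gap].max
        line << text
      end
    end
  end
  lines
end
```

Usage would look something like `puts layout_lines(pdf_to_xml('report.pdf'))`. Grouping on an exact `top` match is crude: fragments that sit a unit or two apart vertically end up on separate lines, so real documents will likely need some tolerance there.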

Gregory Brown · Aug 24, 2009

Very interesting, thanks for posting this.

-greg