Puzzling PDF

F.R.

Hi all,

Struggling to parse bank statements that are unavailable in sensible
data-transfer formats, I use pdftotext, which solves part of the
problem. The other day I encountered a strange thing: one single
figure out of many was erroneously converted into letters. Adobe Reader
displays the figure 50'000 correctly, but pdftotext turns it into
"SO'OOO" (the letters "S" as in Susan and "O" as in Otto). One would
expect such a mistake from OCR. However, the statement is not a scan
but is made up of text. Because malfunctions like this put a damper on
the hope of ever having a reliable reader that doesn't require
time-consuming manual verification, I played around a bit and ended up
even more confused: when I lift the figure off the Adobe display (mark,
copy) and paste it into a Python IDLE window, it is again letters (ASCII
83 and 79), even though the Adobe display shows it correctly as digits. How
can that be?
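
For reference, the check in IDLE can be sketched roughly as follows. The
helper name describe_chars is only an illustration, and the string is typed
in by hand here to stand for what the clipboard delivered:

```python
def describe_chars(text):
    """Return (character, code point, is_digit) for each character."""
    return [(ch, ord(ch), ch.isdigit()) for ch in text]


# Typed in by hand to stand for the pasted clipboard contents.
pasted = "SO'OOO"

for ch, code, is_digit in describe_chars(pasted):
    # "S" reports 83 and "O" reports 79, so these are letters, not digits.
    print(repr(ch), code, is_digit)
```

A real digit "5" would report code 53 and is_digit True, so the pasted text
really does contain letters.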

Frederic
 

Roy Smith

Maybe it's an intentional effort to keep people from screen-scraping
data out of the PDFs (or perhaps trace when they do). Is it possible
the document includes a font where those codepoints are drawn exactly
the same as the digits they resemble?
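
If that is what is going on, the extracted text cannot be trusted as it
stands, but the damaged figures can at least be flagged for manual review.
A minimal sketch, assuming amounts are grouped with apostrophes as in
50'000 and that only "S" and "O" stand in for digits (the thread shows only
those two; the names LOOKALIKES, GROUPED and find_suspects are made up for
this example):

```python
import re

# Assumed look-alike table; only S -> 5 and O -> 0 come from the thread.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# An apostrophe-grouped amount whose "digits" may be digits or look-alikes.
GROUPED = re.compile(
    r"(?<![A-Za-z0-9'])[0-9SO]{1,3}(?:'[0-9SO]{3})+(?![A-Za-z0-9'])"
)


def find_suspects(text):
    """Return (token, guessed_digits) for grouped amounts containing letters."""
    suspects = []
    for match in GROUPED.finditer(text):
        token = match.group()
        if any(ch in "SO" for ch in token):
            suspects.append((token, token.translate(LOOKALIKES)))
    return suspects


print(find_suspects("Balance SO'OOO CHF, fee 1'200"))
```

The guessed digits are only a suggestion for whoever checks the statement by
hand; a clean amount such as 1'200 is left alone.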

Keep in mind that PDF is not a data transmission format, it's a document
format. When you try to scrape data out of a PDF, you've made a pact
with the devil.

Unclear what any of this has to do with Python. Maybe the tie-in is
that in the old Snake video game, the snake was drawn as Soooooo?

Anyway, it's S as in Sierra, and O as in Oscar.
 
