OCR of ancient periodicals. What about XML outputs?

Daniel

Hi All,

I'm working on OCR of ancient periodicals. The issue is this: I can't access
the layout data encoded in the OCR PDF files and use it independently of the
original file format. There is an appropriate XML standard, ALTO, which matches
each text character with its corresponding graphic zone. But I don't know how
to generate ALTO output. Do you know of any software with such output? Any clue
about this?

Thanks a lot.

Daniel
Paris
 
Andy Dingley

I'm working on OCR of ancient periodicals.

That's OK, you're making me dig through some pretty ancient memories!
AFAIR, ALTO was a layout-specific extension to the Metadata Encoding
and Transmission Standard (METS) work that came from the Library of
Congress (LoC) back in the last century. I only worked with METS, but
AFAIR there was a published XML Schema for ALTO and it was pretty
simple to generate - nothing weird about it.

Searching around METS & LoC ought to be useful.
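
To give an idea of how little is involved in generating ALTO, here is a minimal Python sketch that writes word boxes into an ALTO-style file. Several things here are assumptions rather than anything confirmed in this thread: the v3 namespace URI, the function name `build_alto`, the `(text, x, y, width, height)` input tuples, and the single page and single text block. The output has not been validated against the published schema, so treat it as a starting point only.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v3 namespace; adjust to the schema version you target.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def build_alto(page_width, page_height, lines):
    """Return an ALTO-style XML string.

    lines: list of text lines, each a list of (text, x, y, width, height)
    word boxes in pixels (a hypothetical input format for this sketch).
    """
    ET.register_namespace("", ALTO_NS)

    def q(tag):
        # Qualify a tag name with the ALTO namespace.
        return f"{{{ALTO_NS}}}{tag}"

    root = ET.Element(q("alto"))
    description = ET.SubElement(root, q("Description"))
    ET.SubElement(description, q("MeasurementUnit")).text = "pixel"

    layout = ET.SubElement(root, q("Layout"))
    page = ET.SubElement(layout, q("Page"), ID="P1", PHYSICAL_IMG_NR="1",
                         WIDTH=str(page_width), HEIGHT=str(page_height))
    print_space = ET.SubElement(page, q("PrintSpace"))
    block = ET.SubElement(print_space, q("TextBlock"), ID="TB1")

    for words in lines:
        if not words:
            continue
        # The line's zone is the bounding box of its word boxes.
        left = min(x for _, x, _, _, _ in words)
        top = min(y for _, _, y, _, _ in words)
        right = max(x + w for _, x, _, w, _ in words)
        bottom = max(y + h for _, _, y, _, h in words)
        line = ET.SubElement(block, q("TextLine"), HPOS=str(left), VPOS=str(top),
                             WIDTH=str(right - left), HEIGHT=str(bottom - top))
        for text, x, y, w, h in words:
            # Each String element ties a piece of text to its zone on the image.
            ET.SubElement(line, q("String"), CONTENT=text, HPOS=str(x),
                          VPOS=str(y), WIDTH=str(w), HEIGHT=str(h))

    return ET.tostring(root, encoding="unicode")
```

Calling `build_alto(2000, 3000, [[("Le", 100, 200, 40, 30), ("Figaro", 150, 200, 120, 30)]])` returns one `TextLine` holding two `String` elements. The harder part is getting word coordinates out of the OCR engine in the first place.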
 
Daniel

You're right, nothing weird about this. But do you know of any OCR software
with ALTO output that is currently available?
Thanks

Daniel
 
Peter Flynn

Daniel said:
You're right, nothing weird about this. But do you know of any OCR software
with ALTO output that is currently available?

I think Optopus from Makrolog GmbH (Wiesbaden?) might have done this,
but it was a long time ago.

///Peter
 
