Suggestion for converting PDF files to HTML/txt files

Discussion in 'Python' started by srinivasan srinivas, Aug 11, 2008.

  1. srinivasan srinivas, Aug 11, 2008
  2. srinivasan srinivas

    brad Guest

    Unless there has been some recent development, the answer is no, it's not
    possible. Getting text out of a PDF is difficult (to say the least) and at
    times impossible, e.g. a PDF can be an image that contains some text.
    brad, Aug 11, 2008
  3. srinivasan srinivas

    alex23 Guest

    PDFMiner is a set of CLI tools written in Python, one of which
    converts PDF to text, HTML and more:
    alex23, Aug 12, 2008
  4. srinivasan srinivas

    brad Guest

    Very neat program. It would be cool if it could easily integrate into
    other Python apps instead of being a standalone CLI tool.
    brad, Aug 12, 2008
  5. srinivasan srinivas

    alex23 Guest

    Perhaps, but I think you could get a long way using os.system().
    alex23, Aug 12, 2008
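
For readers following the os.system() suggestion, here is a minimal sketch of driving a CLI converter from Python. It uses subprocess.run rather than os.system() so that no shell quoting is needed. It assumes PDFMiner's converter script is installed on PATH under the name pdf2txt.py and accepts `-o OUTPUT` before the input file; the helper names and file names are hypothetical and only for illustration.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, out_path, script="pdf2txt.py"):
    """Build the argument list for the converter script.

    Assumption: the script is called pdf2txt.py, is on PATH, and takes
    `-o OUTPUT` followed by the input PDF.
    """
    return [script, "-o", out_path, pdf_path]


def run_command(args):
    """Run a command without a shell and return its exit status."""
    completed = subprocess.run(args, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_pdf2txt_command("report.pdf", "report.html")
    print("would run:", " ".join(cmd))
    # Harmless stand-in command so the sketch runs without PDFMiner installed.
    print("exit status:", run_command([sys.executable, "-c", "pass"]))
```

Passing the arguments as a list keeps file names with spaces intact. A non-zero exit status would indicate that the conversion failed.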
  6. srinivasan srinivas

    brad Guest

    Yes, that is possible, but unfortunately there's a lot of overhead when
    doing that. Also, if using os.system() is the answer, then one could
    just use the xpdf pdftotext program. A native Python solution that could
    be called naturally from other Python apps would be awesome.
    brad, Aug 12, 2008
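
The pdftotext route mentioned above can at least be made to look native from the caller's side by hiding the external process behind a function. This is only a rough sketch under stated assumptions: a pdftotext binary is on PATH, it accepts `-enc UTF-8`, and it writes to standard output when the output file is given as `-`. The names pdf_to_text and ConverterNotFound are hypothetical. It still pays the process start-up overhead on every call.

```python
import shutil
import subprocess


class ConverterNotFound(RuntimeError):
    """Raised when the external converter is not installed."""


def pdf_to_text(pdf_path, tool="pdftotext", timeout=60):
    """Return the text of a PDF by shelling out to an external tool.

    Assumption: the tool behaves like pdftotext, i.e. it takes
    `-enc UTF-8 INPUT -` and prints the extracted text on stdout.
    """
    exe = shutil.which(tool)
    if exe is None:
        raise ConverterNotFound(f"{tool!r} was not found on PATH")
    result = subprocess.run(
        [exe, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,   # collect stdout and stderr
        encoding="utf-8",      # decode the output as UTF-8 text
        timeout=timeout,       # give up on a hung conversion
        check=True,            # raise CalledProcessError on failure
    )
    return result.stdout


if __name__ == "__main__":
    try:
        # Hypothetical input file, for illustration only.
        print(pdf_to_text("report.pdf")[:200])
    except (ConverterNotFound, subprocess.CalledProcessError) as exc:
        print("conversion failed:", exc)
```

Image-only PDFs will still come back empty or nearly so, since there is no text layer for the tool to extract.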
