Re: Html or Pdf to Rtf (Linux) with Python

Discussion in 'Python' started by Axel Straschil, Dec 16, 2004.

  1. Hello!

    > However, our company's product, PDFTextStream does do a phenomenal job of
    > extracting text and metadata out of PDF documents. It's crazy-fast, has a
    > clean API, and in general gets the job done very nicely. It presents two
    > points of compromise from your ideal situation:
    > 1. It only produces text, so you would have to take the text it provides and
    > write it out as an RTF yourself (there are tons of packages and tools that do
    > this). Since the RTF format has pretty weak formatting capabilities compared


    I've got the input source in HTML; the problem is converting from anything
    to RTF. Please give me a hint where those tons of packages are.

    Thanks,
    AXEL.
    --
    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
    "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
    interpreted as described in RFC 2119 [http://ietf.org/rfc/rfc2119.txt]
    Axel Straschil, Dec 16, 2004
    #1

  2. Axel Straschil writes:

    > Hello!
    >
    >> However, our company's product, PDFTextStream does do a phenomenal
    >> job of extracting text and metadata out of PDF documents. It's
    >> crazy-fast, has a clean API, and in general gets the job done very
    >> nicely. It presents two points of compromise from your ideal
    >> situation:
    >> 1. It only produces text, so you would have to take the text it
    >> provides and write it out as an RTF yourself (there are tons of
    >> packages and tools that do this). Since the RTF format has pretty
    >> weak formatting capabilities compared

    >
    > I've got the input source in HTML; the problem is converting from anything
    > to RTF. Please give me a hint where those tons of packages are.


    That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    via COM using the python win32all (I think that's what it's now
    called) package.

    <mike
    --
    Mike Meyer http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
    Mike Meyer, Dec 16, 2004
    #2

  3. Hello!

    > That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    > via COM using the python win32all (I think that's what it's now
    > called) package.


    As I wrote in my posting and in the subject: Linux ;-)
    I could try to do this with OpenOffice, but I'm afraid that would not
    perform well ;-(
    I really spent hours on this; the only thing I found was a
    specification for Rich Text Format, maybe a good starting point for a project ...

    Regards,
    AXEL.
    Axel Straschil, Dec 16, 2004
    #3
  4. Axel Straschil writes:

    > Hello!
    >
    >> That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    >> via COM using the python win32all (I think that's what it's now
    >> called) package.

    > As I wrote in my posting and in the subject: Linux ;-)
    > I could try to do this with OpenOffice, but I'm afraid that would not
    > perform well ;-(
    > I really spent hours on this; the only thing I found was a
    > specification for Rich Text Format, maybe a good starting point for a project ...


    Sorry. I forgot the original subject.

    You might take a look at PyRTF in PyPI. It's still in beta,
    though. But coupled with HTMLParser.py, it might be enough to
    get you where you need to go.
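
    As a rough illustration of the parser-plus-RTF-writer idea, here is a
    minimal sketch that uses only the standard library. It does not use PyRTF
    (its API is not shown in this thread); instead it writes a few RTF control
    words by hand. html.parser is the Python 3 name for the old HTMLParser
    module. The names rtf_escape, HtmlToRtf and html_to_rtf are made up for
    this example, and only a handful of tags are handled.

```python
import re
from html.parser import HTMLParser


def rtf_escape(text):
    """Escape backslashes, braces and non-ASCII characters for RTF."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF \uN takes signed 16-bit UTF-16 code units; "?" is the fallback.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


class HtmlToRtf(HTMLParser):
    """Hypothetical converter: maps a few HTML tags to RTF control words."""

    INLINE = {"b": "\\b", "strong": "\\b", "i": "\\i", "em": "\\i", "u": "\\ul"}
    BLOCK = {"p", "div", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            # Open an RTF group so the formatting ends at the closing brace.
            self.parts.append("{" + self.INLINE[tag] + " ")
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append("}")
        elif tag in self.BLOCK:
            self.parts.append("\\par\n")

    def handle_data(self, data):
        # HTML collapses runs of whitespace; do the same before escaping.
        self.parts.append(rtf_escape(re.sub(r"\s+", " ", data)))


def html_to_rtf(html):
    """Return an RTF document string for a simple, well-formed HTML fragment."""
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n"
    return header + "".join(parser.parts) + "}"


if __name__ == "__main__":
    print(html_to_rtf("<p>Hello <b>world</b>!</p>"))
```

    Unbalanced inline tags would produce unbalanced braces, so real-world
    HTML needs more care (tables, lists, images and styles are ignored here).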

    <mike
    Mike Meyer, Dec 17, 2004
    #4
  5. On Thu, 16 Dec 2004 19:30:37 +0000 (UTC), Axel Straschil wrote:
    > > That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    > > via COM using the python win32all (I think that's what it's now
    > > called) package.

    >
    > As I wrote in my posting and in the subject: Linux ;-)
    > I could try to do this with OpenOffice, but I'm afraid that would not
    > perform well ;-(
    > I really spent hours on this; the only thing I found was a
    > specification for Rich Text Format, maybe a good starting point for a project ...


    I've been able to successfully get Konqueror to generate a PDF from an
    HTML file via dcop. It's something along the lines of:
    % dcop konqueror-25827 html-widget1 print 1
    You can launch Konqueror in Xvfb (X Virtual Framebuffer), then communicate
    via dcop to send commands to the browser (load this URL, print this
    page, etc).
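
    Driving that from Python might look roughly like the sketch below. It
    only wraps the dcop command line quoted above; the function names are
    invented for this example, the meaning of the trailing "1" is not
    explained in this post, and it assumes a KDE 3 era system where dcop
    exists (DCOP was replaced by D-Bus in KDE 4). Running dcop with no
    arguments lists the registered applications, which is how the
    konqueror-<pid> id can be found.

```python
import shutil
import subprocess


def find_konqueror_ids(dcop_listing):
    """Pick konqueror application ids out of the output of a bare `dcop` call."""
    ids = []
    for line in dcop_listing.splitlines():
        name = line.strip()
        if name.startswith("konqueror-"):
            ids.append(name)
    return ids


def print_command(app_id, widget="html-widget1"):
    """Build the argv for the dcop call quoted in the post."""
    return ["dcop", app_id, widget, "print", "1"]


def print_page(app_id):
    """Ask a running konqueror instance to print its current page."""
    if shutil.which("dcop") is None:
        raise RuntimeError("dcop not found; this needs a KDE 3 era desktop")
    return subprocess.run(print_command(app_id), check=True)


if __name__ == "__main__":
    # Example listing; on a real system this would come from running `dcop`.
    listing = "kded\nkonqueror-25827\nklauncher\n"
    for app_id in find_konqueror_ids(listing):
        print(" ".join(print_command(app_id)))
```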

    I've been investigating doing the same feat using JS/XUL/etc in
    Mozilla. It probably is possible. There's lots of documentation about
    the XPCOM API available from http://xulplanet.com/

    As for converting to RTF, someone has already pointed out PyRTF.

    Regards,
    Stephen Thorne
    Stephen Thorne, Dec 17, 2004
    #5
  6. Hello!

    > I've been able to successfully get Konqueror to generate a PDF from an
    > HTML file via dcop. It's something along the lines of:


    For that stuff, I'm using htmldoc (http://www.htmldoc.org/).

    Regards,
    AXEL.
    Axel Straschil, Dec 17, 2004
    #6
  7. Hello!

    > You might take a look at PyRTF in PyPI. It's still in beta,


    I think PyRTF is the right choice, thanks. I just had a short look
    at it.

    Regards,
    AXEL.
    Axel Straschil, Dec 17, 2004
    #7
  8. On Fri, 17 Dec 2004 07:55:10 +0000 (UTC), Axel Straschil wrote:
    > Hello!
    >
    > > I've been able to successfully get Konqueror to generate a PDF from an
    > > HTML file via dcop. It's something along the lines of:

    >
    > For that stuff, I'm using htmldoc (http://www.htmldoc.org/).


    I found htmldoc and every other open-source, purpose-built HTML-to-PDF
    converter to be deficient enough to discourage us from using them. For
    our requirements, only web browsers had the rendering quality we
    needed.

    Stephen.
    Stephen Thorne, Dec 18, 2004
    #8
