Extract Content from HTML ?

Discussion in 'HTML' started by mark4, Feb 28, 2005.

  1. mark4

    mark4 Guest

    Hello,

    Are there any utilities to help me extract Content from HTML ?

    I'd like to store this data in a database.

    The HTML consists of about 10,000 files with a total size of
    about 160 MB. Each file is a thread from a message forum. Each
    thread has several contributions. The threads are in linear
    order of date posted with filenames such as 000125633.html. The
    HTML is marked up with <table> and similar tags. This HTML is very
    badly formed with crucial tags missing (such as <TR>, <BODY>,
    etc.). There is no coherence to this; no system - sometimes tags
    are missing and sometimes they are present. Despite this, the
    threads seem to render correctly; such is the forgiving nature
    of modern browsers.

    Fields for each post are usually identified by an attribute
    (usually an attribute of a <TD> or <SPAN>).

    Sometimes I need to actually store HTML with the content (for
    instance when a post includes a link, colored writing or text
    formatted with <PRE> tags).

    My purpose in storing this in a database is to make the content
    (a) easier to search and (b) use a more efficient storage
    medium.

    The original database from which these web-forum posts were
    taken is no longer available on the web nor does it look like it
    ever will be again. Nor can I contact the person who 'owns' it.
    If I did contact them, they would be unlikely to release the
    data.

    Despite this, there are no copyright issues here. Every single
    post made to the forum was made using an alias and no forum
    poster wants to be identified, nor do any posters wish to claim
    "ownership" of their contributions.
     
    mark4, Feb 28, 2005
    #1

  2. Toby Inkster

    Toby Inkster Guest

    mark4 wrote:

    > Are there any utilities to help me extract Content from HTML ?
    > I'd like to store this data in a database.


    Looks to me like you'd have to write your own customised program to
    extract the data.

    To do that, I recommend using Perl. Perl has a module called HTML::Parser
    which is apparently pretty good at extracting information from malformed
    HTML files. What's more, it is generally very good at text handling and has
    decent database modules too.
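
    For comparison, the same event-driven idea can be sketched in Python with
    the standard-library html.parser module, which reports tags as it meets
    them and does not require them to balance. The attribute name ("class"),
    the field names (author, date, body) and the extract_posts helper are all
    assumptions made for illustration; the thread does not say how the real
    files mark their fields.

```python
from html.parser import HTMLParser

FIELD_ATTR = "class"                      # assumed attribute name
FIELD_NAMES = {"author", "date", "body"}  # assumed field markers


class PostFieldParser(HTMLParser):
    """Collect the text of each marked <td>/<span> without needing balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.posts = []      # one dict per post
        self._field = None   # name of the field being collected
        self._tag = None     # tag that opened the current field
        self._chunks = []

    def _flush(self):
        if self._field is None:
            return
        text = " ".join("".join(self._chunks).split())
        # A repeated field name is taken to mean that a new post has begun.
        if not self.posts or self._field in self.posts[-1]:
            self.posts.append({})
        self.posts[-1][self._field] = text
        self._field, self._tag, self._chunks = None, None, []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "span"):
            name = dict(attrs).get(FIELD_ATTR)
            if name in FIELD_NAMES:
                self._flush()  # the previous field may never have been closed
                self._field, self._tag = name, tag

    def handle_endtag(self, tag):
        if tag == self._tag or tag in ("tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()  # keep a field left open at end of file


def extract_posts(html):
    parser = PostFieldParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

    This keeps only text. A body that must retain links or <PRE> formatting
    would need the raw markup re-assembled inside the handlers instead, and a
    field nested inside another field of the same tag name is not handled.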

    > Nor can I contact the person who 'owns' it. If I did contact them, they
    > would be unlikely to release the data.
    >
    > Despite this, there are no copyright issues here. Every single post made
    > to the forum was made using an alias and no forum poster wants to be
    > identified, nor do any posters wish to claim "ownership" of their
    > contributions.


    Sounds to me like there are *major* copyright issues!

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Feb 28, 2005
    #2

  3. mark4

    mark4 Guest

    On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster wrote:

    >mark4 wrote:
    >
    >> Are there any utilities to help me extract Content from HTML ?
    >> I'd like to store this data in a database.

    >
    >Looks to me like you'd have to write your own customised program to
    >extract the data.


    I expected as much.

    >To do that, I recommend using Perl. Perl has a module called HTML::Parser
    >which is apparently pretty good at extracting information from malformed
    >HTML files. What's more, it is generally very good at text handling and has
    >decent database modules too.


    Thanks. Being a microserf, I don't normally code in Perl but I
    may look into this. It's either that or WSH Javascript with
    its regular expressions. Fortunately I already have a top
    level design and it looks pretty simple. I may look into this
    Perl module but it will probably be easier to use microserf
    technology with which I'm intimate. I shall probably store
    it in MSSQL.
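
    A minimal sketch of the storage step, using Python's built-in sqlite3
    module purely as a stand-in for MSSQL. The table layout, the column names
    and the open_store, save_thread and search helpers are hypothetical, and
    each post is assumed to arrive as a dict with optional "author", "date"
    and "body" keys.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the post table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS post (
               thread_file TEXT    NOT NULL,  -- e.g. 000125633.html
               position    INTEGER NOT NULL,  -- order of the post in its thread
               author      TEXT,
               posted      TEXT,
               body_html   TEXT,              -- body, with any markup worth keeping
               PRIMARY KEY (thread_file, position)
           )"""
    )
    return conn


def save_thread(conn, thread_file, posts):
    """Store every post of one thread, replacing rows from an earlier run."""
    rows = [
        (thread_file, i, p.get("author"), p.get("date"), p.get("body"))
        for i, p in enumerate(posts, start=1)
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO post VALUES (?, ?, ?, ?, ?)", rows
        )


def search(conn, term):
    """Return (thread_file, position, author) for posts whose body contains term."""
    cur = conn.execute(
        "SELECT thread_file, position, author FROM post "
        "WHERE body_html LIKE ? ORDER BY thread_file, position",
        (f"%{term}%",),
    )
    return cur.fetchall()
```

    The substring search here is only the simplest thing that works; a
    full-text index would be the more likely choice for 160 MB of posts.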

    >> Nor can I contact the person who 'owns' it. If I did contact them, they
    >> would be unlikely to release the data.
    >>
    >> Despite this, there are no copyright issues here. Every single post made
    >> to the forum was made using an alias and no forum poster wants to be
    >> identified, nor do any posters wish to claim "ownership" of their
    >> contributions.

    >
    >Sounds to me like there are *major* copyright issues!


    I can't see what those issues are. Who owns the data? Not the
    original forum provider. The data posted to a forum is copyright
    of the original author - no matter what ToS may be specified in
    the forum. All those original authors have an alias and don't
    actually want to be identified. What I'm doing is no more a
    violation of copyright than someone keeping newspaper clippings.

    So long as I don't republish it.
     
    mark4, Feb 28, 2005
    #3
  4. mark4 wrote:

    > On Mon, 28 Feb 2005 07:24:15 +0000, Toby Inkster wrote:
    >
    >>To do that, I recommend using Perl. Perl has a module called HTML::Parser
    >>which is apparently pretty good at extracting information from malformed
    >>HTML files. What's more, it is generally very good at text handling and has
    >>decent database modules too.


    Mark's right. I don't do the whole "language cheerleader" thing - but for
    this particular problem, Perl's an ideal fit.

    > Thanks. Being a microserf, I don't normally code in Perl but I
    > may look into this. It's either that or WSH Javascript with
    > its regular expressions.


    There's Perl for Windows, you know. It integrates nicely with WSH too.

    <http://www.activestate.com>

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
     
    Sherm Pendley, Feb 28, 2005
    #4
  5. Access can link to HTML (direct from the web) and will recognise tables.
    You might be lucky! It would make a very quick solution. File > Get
    External Data > Link... and then choose HTML. I was surprised how well it
    worked when I tried it on a table I'd created in FrontPage.

    --
    ####################
    ## PH, London
    ####################
    "mark4" <mark4asp@#notthis#ntlworld.com> wrote in message
    news:...
    > Are there any utilities to help me extract Content from HTML ?
     
    Philip Herlihy, Feb 28, 2005
    #5
  6. Jim Royal

    Jim Royal Guest

    In article <>, mark4
    <mark4asp@#notthis#ntlworld.com> wrote:

    > Are there any utilities to help me extract Content from HTML ?


    BBEdit has a simple menu command to remove markup from an HTML page,
    leaving only the content. You could then perform whatever regex
    operations you need to massage the data before saving it.

    To process all those files, it should be a pretty simple matter to
    write an AppleScript to automate this procedure.

    However, this solution is Macintosh-only.
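
    A rough cross-platform equivalent of that strip-then-massage workflow can
    be sketched in Python's standard library. The strip_markup and
    strip_folder names, the choice of tags treated as line breaks and the
    latin-1 file encoding are assumptions for illustration, not anything
    BBEdit itself does.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

BREAK_TAGS = {"br", "p", "div", "tr", "table", "pre"}  # assumed line-break tags


class _TextOnly(HTMLParser):
    """Keep character data, drop tags, and skip <script>/<style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BREAK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_markup(html):
    """Return the text of an HTML fragment with tags removed and spacing tidied."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)  # trim around line breaks
    return text.strip()


def strip_folder(folder):
    """Yield (filename, text) for every .html file in folder, in name order."""
    for path in sorted(Path(folder).glob("*.html")):
        # The encoding is a guess; errors are replaced rather than raised.
        raw = path.read_text(encoding="latin-1", errors="replace")
        yield path.name, strip_markup(raw)
```

    Note that this throws away the per-field structure, so it suits full-text
    search better than loading separate author, date and body columns.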

    --
    Jim Royal
    "Understanding is a three-edged sword"
    http://JimRoyal.com
     
    Jim Royal, Feb 28, 2005
    #6
  7. On Mon, 28 Feb 2005 08:32:19 GMT, mark4 wrote:

    >>> Nor can I contact the person who 'owns' it. If I did contact them, they
    >>> would be unlikely to release the data.
    >>>
    >>> Despite this, there are no copyright issues here. Every single post made
    >>> to the forum was made using an alias and no forum poster wants to be
    >>> identified, nor do any posters wish to claim "ownership" of their
    >>> contributions.

    >>
    >>Sounds to me like there are *major* copyright issues!

    >
    > I can't see what those issues are.


    By law, those posts are copyrighted and owned by the posters.
     
    Chrissy Cruiser, Feb 28, 2005
    #7
  8. On Mon, 28 Feb 2005 06:06:36 GMT, mark4
    <mark4asp@#notthis#ntlworld.com> wrote:

    >Hello,


    >Are there any utilities to help me extract Content from HTML ?


    < snip >

    NoteTab? Modify - Strip HTML Tags?

    http://www.notetab.com/

    Not sure whether that is in the freeware version or not.

    Regards, John.
     
    John Fitzsimons, Feb 28, 2005
    #8
  9. Toby Inkster

    Toby Inkster Guest

    mark4 wrote:

    > Thanks. Being a microserf, I don't normally code in Perl but I
    > may look into this.


    I am told ActiveState's Windows port of Perl is pretty good. Alternatively
    there is also a Cygwin version of Perl.

    > I can't see what those issues are. Who owns the data?


    Its original authors, unless they explicitly signed away the copyright.

    > All those original authors have an alias and don't actually want to be
    > identified.


    Publishing anonymously or under a pseudonym does not mean you forgo
    copyright.

    > So long as I don't republish it.


    If you are keeping the database for private use, then you can probably
    "get away with it", but the natural assumption on alt.html is that posters
    are wanting to publish their efforts to the web, unless it's explicitly
    stated otherwise.

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Feb 28, 2005
    #9
  10. mark4

    Guest

    > >To do that, I recommend using Perl. Perl has a module called
    > >HTML::Parser which is apparently pretty good at extracting
    > >information from malformed HTML files. What's more, it is generally
    > >very good at text handling and has decent database modules too.

    >
    >
    > Thanks. Being a microserf, I don't normally code in Perl but I
    > may look into this. It's either that or WSH Javascript with
    > its regular expressions. Fortunately I already have a top
    > level design and it looks pretty simple. I may look into this
    > Perl module but it will probably be easier to use microserf
    > technology with which I'm intimate. I shall probably store
    > it in MSSQL.


    You could use the InternetExplorer.Application COM object.
    That would let Internet Explorer's own parser handle the HTML,
    so you would not need regexps. It should therefore be more
    robust, and it is readily doable in your favorite language.
    Try Google for examples.
     
    , Mar 1, 2005
    #10
  11. mbstevens

    mbstevens Guest

    mark4 wrote:

    > Hello,
    >
    > Are there any utilities to help me extract Content from HTML ?


    lynx -dump http://whateverTheHeck.com > temp.txt

    ... is the shortest program I know of for this kind of thing.
    The '>' redirection to temp.txt may vary somewhat between operating systems.
    --
    mbstevens http://www.mbstevens.com
     
    mbstevens, Mar 1, 2005
    #11