Strategies for modifying marked-up text?

Discussion in 'Perl Misc' started by Thomas Baetzler, Feb 17, 2011.

  1. Hi,

    I'm looking for input on how to run search/replace operations on
    paragraphs of HTML text without having to worry about the surrounding
    markup.

    So far I'm using HTML::TreeBuilder to parse an HTML document and identify
    the individual paragraphs in the text. By recursively using the
    content_list method, I can locate the individual text chunks that make
    up the paragraph text.

    What I'd like to do is merge these chunks into a single string, run some
    search/replace regexes on it, then update the individual text chunks
    with the changes.

    Is there a better way to do this than stopping after each change to see
    what's changed and keeping track of chunk borders that way?

    I could probably work on individual chunks in turn, but taking care of
    all the edge cases where I'd have to do lookahead/lookback in adjoining
    chunks could be, well, tedious ;-)

    TIA for any suggestion you might have!

    Cheers,
    Thomas
    Thomas Baetzler, Feb 17, 2011
    #1

  2. On 2011-02-17 09:45, Thomas Baetzler <> wrote:
    > I'm looking for input on how to run search/replace operations on
    > paragraphs of HTML text without having to worry about the surrounding
    > markup.


    Replace fixed strings or regexps?


    > So far I'm using HTML::TreeBuilder to parse an HTML document and identify
    > the individual paragraphs in the text. By recursively using the
    > content_list method, I can locate the individual text chunks that make
    > up the paragraph text.
    >
    > What I'd like to do is merge these chunks into a single string, run some
    > search/replace regexes on it, then update the individual text chunks
    > with the changes.
    >
    > Is there a better way to do this than stopping after each change to see
    > what's changed and keeping track of chunk borders that way?


    You could write a custom matcher which walks your HTML tree. That's
    probably a lot of work and quite slow if you need the whole power of
    perl regexps, but might work if you need only some subset (fixed strings
    in the extreme case).

    Other than that, I think keeping the offsets of the start and end of
    each element and readjusting them after each replacement is probably the
    easiest way.
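
    A rough sketch of that offset-tracking idea, for fixed replacement
    strings only. Plain strings stand in for the text nodes of one
    paragraph; the helper name replace_across_chunks and the rule for a
    chunk border that falls inside a match (snap it to the start of the
    replacement) are assumptions made for illustration, not anything
    HTML::TreeBuilder provides.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a fixed-string replacement for every match of
# $regex across the text chunks of one paragraph, then re-split the result.
sub replace_across_chunks {
    my ($chunks, $regex, $replacement) = @_;

    my $joined = join '', @$chunks;

    # End offset of each chunk within the joined string.
    my @ends;
    my $pos = 0;
    for my $chunk (@$chunks) {
        $pos += length $chunk;
        push @ends, $pos;
    }

    # Collect all match spans first, then apply them right to left so
    # that the spans still to be processed keep their original offsets.
    my @spans;
    while ($joined =~ /$regex/g) {
        push @spans, [ $-[0], $+[0] ];
    }

    for my $span (reverse @spans) {
        my ($start, $end) = @$span;
        my $delta = length($replacement) - ($end - $start);
        substr($joined, $start, $end - $start, $replacement);

        for my $e (@ends) {
            if    ($e >= $end)  { $e += $delta }  # border at or after the match: shift it
            elsif ($e > $start) { $e = $start }   # border inside the match: snap to its start (assumed policy)
        }
    }

    # Cut the modified string back into chunks at the adjusted borders.
    my @out;
    my $prev = 0;
    for my $e (@ends) {
        push @out, substr($joined, $prev, $e - $prev);
        $prev = $e;
    }
    return @out;
}

my @chunks = replace_across_chunks(
    [ 'Here is some ', 'italicized text' ],
    qr/ some italicized /,
    ' a bit of emphasized ',
);
print join('|', @chunks), "\n";   # Here is| a bit of emphasized text
```

    Snapping an inner border to the start of the replacement puts the
    whole replacement into the chunk where the match ended. That is only
    one possible policy, and replacements that use capture groups would
    need more bookkeeping than this sketch does.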


    > I could probably work on individual chunks in turn, but taking care of
    > all the edge cases where I'd have to do lookahead/lookback in adjoining
    > chunks could be, well, tedious ;-)


    Here is one case which comes immediately to mind and for which I don't
    have a good solution:

    If we have the HTML fragment

    <p>Here is some <em>italicized text</em></p>

    and you do a

    s/ some italicized / a bit of emphasized /

    what should the result be? The em element must start somewhere within
    the replacement text, but where?

    hp
    Peter J. Holzer, Feb 17, 2011
    #2

  3. ccc31807

    On Feb 17, 4:45 am, Thomas Baetzler <> wrote:
    > I'm looking for input on how to run search/replace operations on
    > paragraphs of HTML text without having to worry about the surrounding
    > markup.


    Depending on the particular search and replace operations, it would
    probably be easiest to slurp the entire file in memory and do the
    search and replace just once. This is by far the best way to make
    global changes to a document, provided it will fit into memory. The
    format of the document (HTML, XML, TXT, CSV, etc.) does not matter.

    If you had the entire document in memory, in a variable named $html,
    and you wanted to change all occurrences of 'George W. Bush' to
    'Barack H. Obama', you could do this:

    $html =~ s/George W\. Bush/Barack H. Obama/g;

    You might also want to look at 'Perl slurp mode'.

    CC.
    ccc31807, Feb 17, 2011
    #3
  4. On 2011-02-17 18:28, ccc31807 <> wrote:
    > On Feb 17, 4:45 am, Thomas Baetzler <> wrote:
    >> I'm looking for input on how to run search/replace operations on
    >> paragraphs of HTML text without having to worry about the surrounding
    >> markup.

    >
    > Depending on the particular search and replace operations, it would
    > probably be easiest to slurp the entire file in memory and do the
    > search and replace just once. This is by far the best way to make
    > global changes to a document, provided it will fit into memory. The
    > format of the document (HTML, XML, TXT, CSV, etc.) does not matter.
    >
    > If you had the entire document in memory, in a variable named $html,
    > and you wanted to change all occurrences of 'George W. Bush' to
    > 'Barack H. Obama', you could do this:
    >
    > $html =~ s/George W\. Bush/Barack H. Obama/g;


    One of us completely misunderstood what Thomas is trying to achieve.

    As I understood it, he wants the substitution to succeed even if the
    text in the file is

    ... George W. <span class="lastname">Bush</span> ...
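
    A quick check of why a plain substitution on the raw markup does not
    fire for that input. This is only an illustration; the variable names
    are arbitrary.

```perl
use strict;
use warnings;

my $html = '... George W. <span class="lastname">Bush</span> ...';

# s///g returns the number of substitutions made, or a false value if
# nothing matched; "|| 0" turns that false value into a printable zero.
my $count = ($html =~ s/George W\. Bush/Barack H. Obama/g) || 0;

print "substitutions: $count\n";
print "$html\n";
```

    The span tag sits between "W. " and "Bush" in the raw string, so the
    pattern never matches and the text is left unchanged.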

    hp
    Peter J. Holzer, Feb 18, 2011
    #4
  5. ccc31807

    On Feb 18, 1:45 pm, "Peter J. Holzer" <> wrote:
    > One of us completely misunderstood what Thomas is trying to achieve.


    Could be me. I'm real good at that. ;-)

    > As I understood it, he wants the substitution to succeed even if the
    > text in the file is
    >
    >     ... George W. <span class="lastname">Bush</span> ...


    Yeah, but the OP wrote:

    >> I'm looking for input on how to run search/replace
    >> operations on paragraphs of HTML text without having
    >> to worry about the surrounding markup.


    If all you are doing is searching and replacing for specific patterns,
    the surrounding text doesn't matter, whether or not it's HTML markup.

    CC.
    ccc31807, Feb 18, 2011
    #5
  6. Guest

    On Thu, 17 Feb 2011 10:45:11 +0100, Thomas Baetzler <> wrote:

    >Hi,
    >
    >I'm looking for input on how to run search/replace operations on
    >paragraphs of HTML text without having to worry about the surrounding
    >markup.
    >
    >So far I'm using HTML::TreeBuilder to parse an HTML document and identify
    >the individual paragraphs in the text. By recursively using the
    >content_list method, I can locate the individual text chunks that make
    >up the paragraph text.
    >
    >What I'd like to do is merge these chunks into a single string, run some
    >search/replace regexes on it, then update the individual text chunks
    >with the changes.
    >
    >Is there a better way to do this than stopping after each change to see
    >what's changed and keeping track of chunk borders that way?
    >
    >I could probably work on individual chunks in turn, but taking care of
    >all the edge cases where I'd have to do lookahead/lookback in adjoining
    >chunks could be, well, tedious ;-)
    >
    >TIA for any suggestion you might have!
    >
    >Cheers,
    >Thomas


    You can't. At what granularity, letters? Yes, letters. That's about it.
    That means even a word is not safe, let alone a phrase.

    A human put all that together in a rule-less way. That means only a
    human can modify it.

    It's sometimes easy for the mind to rationalize that these things can be
    done. After all, a human did it. Oh, it could probably be guessed with
    natural language processing, but it's just a guess.

    Nice try though.

    -sln
    , Feb 20, 2011
    #6
