Strategies for modifying marked-up text?

Thomas Baetzler · Feb 17, 2011

Hi,

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

So far I'm using HTML::TreeBuilder to parse an HTML document and identify
the individual paragraphs in the text. By recursively using the
content_list method, I can locate the individual text chunks that make
up the paragraph text.

What I'd like to do is merge these chunks into a single string, run some
search/replace regexes on it, then update the individual text chunks
with the changes.

Is there a better way to do this than stopping after each change to see
what's changed and keep track of chunk borders that way?

I could probably work on individual chunks in turn, but taking care of
all the edge cases where I'd have to do lookahead/lookback in adjoining
chunks could be, well, tedious ;-)

TIA for any suggestion you might have!

Cheers,
Thomas

Peter J. Holzer · Feb 17, 2011

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

Replace fixed strings or regexps?

So far I'm using HTML::TreeBuilder to parse an HTML document and identify
the individual paragraphs in the text. By recursively using the
content_list method, I can locate the individual text chunks that make
up the paragraph text.

What I'd like to do is merge these chunks into a single string, run some
search/replace regexes on it, then update the individual text chunks
with the changes.

Is there a better way to do this than stopping after each change to see
what's changed and keep track of chunk borders that way?

You could write a custom matcher which walks your HTML tree. That's
probably a lot of work and quite slow if you need the whole power of
perl regexps, but might work if you need only some subset (fixed strings
in the extreme case).
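To make the fixed-string extreme case concrete, here is a minimal sketch of such a matcher. It assumes the text chunks have already been collected into a flat array of strings in document order (for example via content_list); the helper name find_in_chunks and its return convention are hypothetical, not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first occurrence of a fixed string in a
# list of text chunks without joining them. Returns the match position as
# (start_chunk, start_offset, end_chunk, end_offset), where the end offset
# points just past the last matched character, or an empty list if the
# string does not occur.
sub find_in_chunks {
    my ($chunks, $needle) = @_;
    my $want = length $needle;

    for my $ci (0 .. $#$chunks) {
        for my $off (0 .. length($chunks->[$ci]) - 1) {
            my ($c, $o, $matched) = ($ci, $off, 0);
            while ($matched < $want) {
                # Step over chunk borders (and empty chunks) transparently.
                while ($c <= $#$chunks && $o >= length $chunks->[$c]) {
                    $c++;
                    $o = 0;
                }
                last if $c > $#$chunks;
                last if substr($chunks->[$c], $o, 1) ne substr($needle, $matched, 1);
                $o++;
                $matched++;
            }
            return ($ci, $off, $c, $o) if $matched == $want;
        }
    }
    return;
}

my @chunks = ('Here is some ', 'italicized', ' text');
print join(',', find_in_chunks(\@chunks, ' some italicized ')), "\n";   # prints 0,7,2,1
```

The character-by-character comparison is deliberately naive. It only shows that the matcher has to carry its position across chunk borders, which is the part that gets expensive once real regexps are involved.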

Other than that, I think keeping the offsets of the start and end of
each element and readjusting them after each replacement is probably the
easiest way.
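The offset bookkeeping could look roughly like the sketch below. It assumes the chunks are available as a flat array of strings, that the replacement is a plain string (no $1 style backreferences), and it uses an arbitrary policy that is not from this thread: when a match spans a chunk border, the whole replacement goes to the chunk in which the match started. The name replace_across_chunks is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: run a global search/replace on the joined text of
# @$chunks, then cut the result back into the same number of chunks.
sub replace_across_chunks {
    my ($chunks, $re, $repl) = @_;
    my $text = join '', @$chunks;

    # $ends[$i] is the offset in $text just past chunk $i.
    my (@ends, $off);
    $off = 0;
    for my $chunk (@$chunks) {
        $off += length $chunk;
        push @ends, $off;
    }

    # Record all match spans on the unmodified text first.
    my @spans;
    while ($text =~ /$re/g) {
        push @spans, [ $-[0], $+[0] ];
    }

    # Replace from right to left so that earlier spans stay valid, and
    # readjust the chunk borders after each replacement.
    for my $span (reverse @spans) {
        my ($s, $e) = @$span;
        substr($text, $s, $e - $s, $repl);
        my $delta = length($repl) - ($e - $s);
        for my $end (@ends) {
            if    ($end >= $e) { $end += $delta }              # border after the match
            elsif ($end >  $s) { $end = $s + length $repl }    # border inside the match
        }
    }

    my ($start, @out) = (0);
    for my $end (@ends) {
        push @out, substr($text, $start, $end - $start);
        $start = $end;
    }
    return @out;
}

print join('|', replace_across_chunks(
    ['The quick ', 'brown', ' fox'], qr/quick brown/, 'slow red')), "\n";
# prints The slow red|| fox
```

Note that the middle chunk ends up empty here. Under this policy an inline element that is completely covered by a match loses all of its text, which may or may not be what you want.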

I could probably work on individual chunks in turn, but taking care of
all the edge cases where I'd have to do lookahead/lookback in adjoining
chunks could be, well, tedious ;-)

Here is one case which comes immediately to mind and for which I don't
have a good solution:

If we have the HTML fragment

Here is some <em>italicized</em> text

and you do a

s/ some italicized / a bit of emphasized /

what should be the result? The em element must start somewhere within
the replaced text but where?

hp

ccc31807 · Feb 17, 2011

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

Depending on the particular search and replace operations, it would
probably be easiest to slurp the entire file in memory and do the
search and replace just once. This is by far the best way to make
global changes to a document, provided it will fit into memory. The
format of the document (HTML, XML, TXT, CSV, etc.) does not matter.

If you had the entire document in memory, in a variable named $html,
and you wanted to change all occurrences of 'George W. Bush' to
'Barack H. Obama', you could do this:

$html =~ s/George W\. Bush/Barack H. Obama/g;

You might also want to look at 'Perl slurp mode'
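For reference, slurp mode boils down to undefining the input record separator $/ for the duration of the read. A minimal sketch follows; the helper name slurp is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into a single scalar.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # undefined record separator: read up to EOF
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

With that in place, `my $html = slurp($file);` followed by the substitution above handles the whole document in one pass. This only covers the case where the search text is not interrupted by markup.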

CC.

Peter J. Holzer · Feb 18, 2011

Depending on the particular search and replace operations, it would
probably be easiest to slurp the entire file in memory and do the
search and replace just once. This is by far the best way to make
global changes to a document, provided it will fit into memory. The
format of the document (HTML, XML, TXT, CSV, etc.) does not matter.

If you had the entire document in memory, in a variable named $html,
and you wanted to change all occurrences of 'George W. Bush' to
'Barack H. Obama', you could do this:

$html =~ s/George W\. Bush/Barack H. Obama/g;

One of us completely misunderstood what Thomas is trying to achieve.

As I understood it, he wants the substitution to succeed even if the
text in the file is

... George W. Bush ...
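A small demonstration of the difference, as a sketch only. The thread does not show which markup was meant here, so the b element below is purely an assumption for illustration, and the tag stripping is a crude stand-in for a real parser.

```perl
use strict;
use warnings;

# Hypothetical input: markup sits in the middle of the name.
my $html = '<p>... George <b>W.</b> Bush ...</p>';

# Matching against the raw markup finds nothing.
my $count_raw = () = $html =~ /George W\. Bush/g;

# Matching against the text alone finds the name. The tag stripping here is
# only good enough for this sample, not for HTML in general.
(my $text = $html) =~ s/<[^>]+>//g;
my $count_text = () = $text =~ /George W\. Bush/g;

print "raw markup: $count_raw match(es), text only: $count_text match(es)\n";
# prints raw markup: 0 match(es), text only: 1 match(es)
```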

hp

ccc31807 · Feb 18, 2011

One of us completely misunderstood what Thomas is trying to achieve.

Could be me. I'm real good at that. ;-)

As I understood it, he wants the substitution to succeed even if the
text in the file is

... George W. Bush ...

If all you are doing is searching and replacing for specific patterns,
the surrounding text doesn't matter, whether or not it's HTML markup.

CC.

sln · Feb 20, 2011

Hi,

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

So far I'm using HTML::TreeBuilder to parse an HTML document and identify
the individual paragraphs in the text. By recursively using the
content_list method, I can locate the individual text chunks that make
up the paragraph text.

What I'd like to do is merge these chunks into a single string, run some
search/replace regexes on it, then update the individual text chunks
with the changes.

Is there a better way to do this than stopping after each change to see
what's changed and keep track of chunk borders that way?

I could probably work on individual chunks in turn, but taking care of
all the edge cases where I'd have to do lookahead/lookback in adjoining
chunks could be, well, tedious ;-)

TIA for any suggestion you might have!

Cheers,
Thomas

You can't. What granularity, letters? Yes, letters. That's about it.
That means even a word is not safe, let alone a phrase.

A human put all that together in a rule-less way. That means only a
human can modify it.

It's sometimes easy for the mind to rationalize that these things can be
done. After all, a human did it. Oh, it could probably be guessed with
natural language processing, but it's just a guess.

Nice try though.

-sln
