Strategies for modifying marked-up text?

Thomas Baetzler

Hi,

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

So far I'm using HTML::TreeBuilder to parse an HTML document and identify
the individual paragraphs in the text. By recursively using the
content_list method, I can locate the individual text chunks that make
up the paragraph text.

What I'd like to do is merge these chunks into a single string, run some
search/replace regexes on it, then update the individual text chunks
with the changes.

Is there a better way to do this than stopping after each change to see
what's changed and keeping track of chunk borders that way?

I could probably work on individual chunks in turn, but taking care of
all the edge cases where I'd have to do lookahead/lookback in adjoining
chunks could be, well, tedious ;-)

TIA for any suggestion you might have!

Cheers,
Thomas
 
Peter J. Holzer

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

Replace fixed strings or regexps?

So far I'm using HTML::TreeBuilder to parse an HTML document and identify
the individual paragraphs in the text. By recursively using the
content_list method, I can locate the individual text chunks that make
up the paragraph text.

What I'd like to do is merge these chunks into a single string, run some
search/replace regexes on it, then update the individual text chunks
with the changes.

Is there a better way to do this than stopping after each change to see
what's changed and keeping track of chunk borders that way?

You could write a custom matcher which walks your HTML tree. That's
probably a lot of work and quite slow if you need the whole power of
perl regexps, but might work if you need only some subset (fixed strings
in the extreme case).

Other than that, I think keeping the offsets of the start and end of
each element and readjusting them after each replacement is probably the
easiest way.

I could probably work on individual chunks in turn, but taking care of
all the edge cases where I'd have to do lookahead/lookback in adjoining
chunks could be, well, tedious ;-)

Here is one case which comes immediately to mind and for which I don't
have a good solution:

If we have the HTML fragment

<p>Here is some <em>italicized text</em></p>

and you do a

s/ some italicized / a bit of emphasized /

what should be the result? The em element must start somewhere within
the replaced text, but where?
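
One possible answer is to pick a policy and apply it consistently. The
following is only a sketch of the offset bookkeeping described above, in
core Perl. The helper name replace_across_chunks is hypothetical (it is not
part of HTML::TreeBuilder), plain strings stand in for the text nodes of
one paragraph, and the replacement is assumed to be a literal string (no
$1 and friends). The policy chosen here is arbitrary: text that replaces a
match spanning a chunk border goes entirely to the chunk in which the
match starts.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one substitution across a list of text chunks.
# $chunks      - array ref of strings (text nodes of one paragraph, in order)
# $pattern     - compiled regex (qr//)
# $replacement - literal replacement string (assumption: no capture groups)
sub replace_across_chunks {
    my ($chunks, $pattern, $replacement) = @_;

    # Merge the chunks and remember where each one starts.
    my $merged = join '', @$chunks;
    my @start;
    my $pos = 0;
    for my $chunk (@$chunks) {
        push @start, $pos;
        $pos += length $chunk;
    }
    push @start, $pos;    # sentinel: end of the last chunk

    # Collect all match positions before changing anything.
    my @matches;
    while ($merged =~ /$pattern/g) {
        push @matches, [ $-[0], $+[0] ];
    }

    # Apply the replacements right to left so earlier offsets stay valid.
    for my $match (reverse @matches) {
        my ($from, $to) = @$match;
        my $delta = length($replacement) - ($to - $from);
        substr($merged, $from, $to - $from, $replacement);
        for my $i (1 .. $#start) {
            if ($start[$i] >= $to) {
                $start[$i] += $delta;    # border after the match: shift it
            }
            elsif ($start[$i] > $from) {
                # border inside the match: move it to the end of the new text
                $start[$i] = $from + length $replacement;
            }
        }
    }

    # Cut the merged string back into chunks along the adjusted borders.
    return [ map { substr($merged, $start[$_], $start[$_ + 1] - $start[$_]) }
             0 .. $#$chunks ];
}
```

Called with the chunks "Here is some " and "italicized text", the pattern
qr/ some italicized / and the replacement " a bit of emphasized ", this
returns "Here is a bit of emphasized " and "text", so under this policy
only the word "text" stays inside the em element. Assigning the new text
to the chunk where the match ends would be just as defensible.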

hp
 
ccc31807

I'm looking for input on how to run search/replace operations on
paragraphs of HTML text without having to worry about the surrounding
markup.

Depending on the particular search and replace operations, it would
probably be easiest to slurp the entire file in memory and do the
search and replace just once. This is by far the best way to make
global changes to a document, provided it will fit into memory. The
format of the document (HTML, XML, TXT, CSV, etc.) does not matter.

If you had the entire document in memory, in a variable named $html,
and you wanted to change all occurrences of 'George W. Bush' to
'Barack H. Obama', you could do this:

$html =~ s/George W. Bush/Barack H. Obama/g;

You might also want to look at 'Perl slurp mode'
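
For reference, a minimal sketch of slurp mode in core Perl. The helper
names slurp_file and replace_all are made up for illustration, and the
replacement is treated as a fixed string rather than a regex:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string ("slurp mode").
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;               # undefine the input record separator in this scope
    my $content = <$fh>;    # a single read now returns the entire file
    close $fh;
    return $content;
}

# Hypothetical helper: replace a fixed string everywhere in a document.
sub replace_all {
    my ($text, $old, $new) = @_;
    $text =~ s/\Q$old\E/$new/g;    # \Q...\E keeps "." and friends literal
    return $text;
}

# Usage (the file name is an assumption):
#   my $html = slurp_file('page.html');
#   $html = replace_all($html, 'George W. Bush', 'Barack H. Obama');
```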

CC.
 
Peter J. Holzer

Depending on the particular search and replace operations, it would
probably be easiest to slurp the entire file in memory and do the
search and replace just once. This is by far the best way to make
global changes to a document, provided it will fit into memory. The
format of the document (HTML, XML, TXT, CSV, etc.) does not matter.

If you had the entire document in memory, in a variable named $html,
and you wanted to change all occurrences of 'George W. Bush' to
'Barack H. Obama', you could do this:

$html =~ s/George W. Bush/Barack H. Obama/g;

One of us completely misunderstood what Thomas is trying to achieve.

As I understood it, he wants the substitution to succeed even if the
text in the file is

... George W. <span class="lastname">Bush</span> ...
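
A small check in core Perl illustrates the difference. The two sample
strings are taken from this thread; the variable names are only for
illustration, and the dot is escaped so that it matches literally:

```perl
use strict;
use warnings;

my $plain  = '... George W. Bush ...';
my $marked = '... George W. <span class="lastname">Bush</span> ...';

# s///g returns the number of substitutions made, or a false value if none.
my $n_plain  = ($plain  =~ s/George W\. Bush/Barack H. Obama/g) || 0;
my $n_marked = ($marked =~ s/George W\. Bush/Barack H. Obama/g) || 0;

print "plain: $n_plain, marked: $n_marked\n";    # prints "plain: 1, marked: 0"
```

The markup between "W." and "Bush" is part of the string, so the pattern
never matches the second sample and it is left unchanged.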

hp
 
ccc31807

One of us completely misunderstood what Thomas is trying to achieve.

Could be me. I'm real good at that. ;-)

As I understood it, he wants the substitution to succeed even if the
text in the file is

    ... George W. <span class="lastname">Bush</span> ...


If all you are doing is searching for and replacing specific patterns,
the surrounding text doesn't matter, whether or not it's HTML markup.

CC.
 
sln

You can't. What granularity, letters? Yes, letters. That's about it.
That means even a word is not safe, let alone a phrase.

A human put all that together in a rule-less way. That means only a
human can modify it.

It's sometimes easy for the mind to rationalize that these things can be
done. After all, a human did it. Oh, it could probably be guessed with
natural language processing, but it's just a guess.

Nice try though.

-sln
 
