Looking for Regexp that strips newlines inside of a tag

weston

I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word Doc
to filtered HTML, but there's one that I find particularly interesting
(and annoying): sometimes tags have newlines within them. Especially
<span> tags. For example:

<p><span lang=JA style='font-family:
&quot;MS Mincho&quot;'>(</span>

Is there a regular expression that can pull the span up onto the same
line?

So far, I've tried slurping the whole file into a single string, and
doing:

s/(<span.*?)^+([^>]*>)/$1 $2/mig;

which seems to have no effect, and this:

s/(<span.*?)\n+([^>]*>)/$1 $3/mig;

which seems to lop off everything from the first line.

It seems likely there's a way to do this, but I'm sort of stuck on what
to try next. Any ideas?
 
A. Sinan Unur

I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word
Doc to filtered HTML, but there's one that I find particularly
interesting (and annoying): sometimes tags have newlines within them.
Especially <span> tags. For example:

<p><span lang=JA style='font-family:
&quot;MS Mincho&quot;'>(</span>

Is there a regular expression that can pull the span up onto the same
line?

You should use an HTML parser to parse HTML. See

perldoc -q html

With HTML::TokeParser::Simple, this can be achieved simply, using code
similar to the following (untested):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<'HTML';
<p><span lang=JA style='font-family:
&quot;MS Mincho&quot;'>(</span></p>
HTML

my $p = HTML::TokeParser::Simple->new(\$html);

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag ) {
        # Collapse each run of whitespace (newlines included) inside
        # every attribute value to a single space.
        my $attrs = $token->get_attr;
        for my $attr ( keys %{ $attrs } ) {
            $attrs->{$attr} =~ s/\s+/ /sg;
            $token->set_attr( $attr, $attrs->{$attr} );
        }
    }
    print $token->as_is . "\n";
}


__END__

Sinan
 
weston

You should use an HTML parser to parse HTML.

You're quite correct, and I appreciate your help in pushing me this
way. I've been avoiding the actual parsers because regexps are what I'm
familiar with, and perhaps this is a good time to change.

However, as much as I'm interested in solving the problem at hand, I'm
also very curious about the potential gaps in my regexp knowledge.

And the plot, as they say, thickens. It appears that the second regular
expression only fails in Perl (5.8.5) under Cygwin. It works when I run
it under the native Windows Command Prompt on my XP system (same perl
install), and also when I try it under OpenBSD (5.8.6). It would seem
this is a platform-related issue rather than a regexp one...
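
The thread never establishes why Cygwin behaves differently. One possible explanation, offered here purely as an assumption, is line endings: if the file uses CRLF and is read without translation, every line break is "\r\n", so removing only the "\n" leaves a "\r" in the middle of the tag. A terminal usually reacts to a carriage return by moving the cursor back to the start of the line, so whatever follows is drawn over what came before. A minimal sketch that normalizes line endings first and then applies the second expression, assuming its `$3` was meant to be `$2` (the name `pull_up_span` is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
sub pull_up_span {
    my ($html) = @_;

    # Assumed cause: CRLF line endings. Turn each "\r\n" into "\n".
    $html =~ s/\r\n/\n/g;

    # The second expression from the question, with $2 in place of $3.
    # It joins the first line break inside each <span ...> tag.
    $html =~ s/(<span.*?)\n+([^>]*>)/$1 $2/gi;

    return $html;
}

my $crlf = "<p><span lang=JA style='font-family:\r\n"
         . "&quot;MS Mincho&quot;'>(</span>\r\n";

print pull_up_span($crlf);
```

Whether this matches what actually happened under Cygwin is untested; dumping the string with something like `od -c` would show whether stray carriage returns are present.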
 
William James

weston said:
I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word Doc
to filtered HTML, but there's one that I find particularly interesting
(and annoying): sometimes tags have newlines within them. Especially
<span> tags. For example:

<p><span lang=JA style='font-family:
&quot;MS Mincho&quot;'>(</span>

Is there a regular expression that can pull the span up onto the same
line?

Look-ahead lets you make sure the newline is within a tag.

s/\n(?=[^<]*>)/ /g;

transforms

<p>
<span lang=JA
style='font:
&quot;Mincho&quot;'>ÿ
</span>

into

<p>
<span lang=JA style='font: &quot;Mincho&quot;'>ÿ
</span>
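
As a rough, self-contained sketch of the look-ahead approach above: the substitution is wrapped in a helper (the name `join_tag_newlines` is invented for this example) and run on input shaped like the sample, with "(" standing in for the tag's text content.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
# Replace each newline that sits inside a tag with a single space.
# The look-ahead (?=[^<]*>) succeeds only when a '>' can be reached
# from the newline without crossing a '<', i.e. a tag is still open.
sub join_tag_newlines {
    my ($html) = @_;
    $html =~ s/\n(?=[^<]*>)/ /g;
    return $html;
}

my $html = <<'HTML';
<p>
<span lang=JA
style='font:
&quot;Mincho&quot;'>(
</span>
HTML

print join_tag_newlines($html);
```

Newlines between tags are left alone because the next angle bracket after them is a "<". The check is only a heuristic: a literal ">" in ordinary text (for example "a > b" on the line after a line break) would also satisfy the look-ahead, which is one reason a real parser is the safer route.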
 
A. Sinan Unur

[ Please do not omit attributions when quoting ]
....

And the plot, as they say, thickens. It appears that the second regular
expression only fails in Perl (5.8.5) under Cygwin.

[ Please quote an appropriate amount of context when replying.
There are no regular expressions in your post, so looking for
the second one is a futile exercise. ]

Sinan
 
