Looking for Regexp that strips newlines inside of a tag

weston · Aug 26, 2005

I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word Doc
to filtered HTML, but there's one that I find particularly interesting
(and annoying): sometimes tags have newlines within them. Especially
<span> tags. For example:

（

Is there a regular expression that can pull the span up onto the same
line?

So far, I've tried slurping the whole file into a single string, and
doing:

s/(<span.*?)^+([^>]*>)/$1 $2/mig;

which seems to have no effect, and this:

s/(<span.*?)\n+([^>]*>)/$1 $3/mig;

which seems to lop off everything from the first line.

It seems likely there's a way to do this, but I'm sort of stuck on what
to try next. Any ideas?
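
Slurping the whole file into a single string, as described above, is commonly done by localizing $/ (the input record separator). A minimal sketch, where slurp_file is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;

# Read an entire file into one string. Leaving $/ undefined makes
# <$fh> return the whole file in a single read.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # undef: no record separator
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

It would be called as my $html = slurp_file($path); after which a single s///g can see newlines that fall in the middle of a tag.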

A. Sinan Unur · Aug 26, 2005

I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word
Doc to filtered HTML, but there's one that I find particularly
interesting (and annoying): sometimes tags have newlines within them.
Especially <span> tags. For example:

（

Is there a regular expression that can pull the span up onto the same
line?

You should use an HTML parser to parse HTML. See

perldoc -q html

With HTML::TokeParser::Simple, this can be achieved simply using code
similar to the following (untested):

#! /usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<HTML
（
HTML
;

my $p = HTML::TokeParser::Simple->new(\$html);

while( my $token = $p->get_token ) {
    if( $token->is_start_tag ) {
        # Collapse each run of whitespace (including newlines) in
        # every attribute value to a single space.
        my $attrs = $token->get_attr;
        for my $attr (keys %{ $attrs }) {
            $attrs->{$attr} =~ s/\s+/ /sg;
            $token->set_attr($attr, $attrs->{$attr});
        }
    }
    print $token->as_is . "\n";
}

__END__

Sinan

weston · Aug 27, 2005

You should use an HTML parser to parse HTML.

You're quite correct, and I appreciate your help in pushing me this
way. I've been avoiding the actual parsers because regexps are what I'm
familiar with, and perhaps this is a good time to change.

However, as much as I'm interested in solving the problem at hand, I'm
also very curious about the potential gaps in my regexp knowledge.

And the plot, as they say, thickens. It appears that the second regular
expression only fails in Perl (5.8.5) under Cygwin. It works when I run
it under the native Windows Command Prompt on my XP system (same perl
install), and also when I try it under OpenBSD (5.8.6). It would seem
this is a platform-related issue rather than a regexp one...
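
One possible explanation, offered here as an assumption and not confirmed anywhere in the thread, is that the file has CRLF line endings and that the failing environment hands them to Perl untranslated, so a \r sits in front of every \n. The sketch below uses a hypothetical sample string and a variant of the second substitution (written with $2, since the pattern has only two capture groups) to show how a stray \r can end up inside the joined tag, and how matching an optional \r avoids that:

```perl
use strict;
use warnings;

# Hypothetical sample: a <span> tag split across lines, CRLF endings.
my $crlf = "<p><span\r\nclass=\"note\">text</span></p>\r\n";

# Matching only \n: the lazy .*? swallows the \r, so it stays in $1.
(my $lf_only = $crlf) =~ s/(<span.*?)\n+([^>]*>)/$1 $2/gi;

# Matching \r?\n as a unit removes the carriage return as well.
(my $both = $crlf) =~ s/(<span.*?)(?:\r?\n)+([^>]*>)/$1 $2/gi;

# A \r that is not followed by \n is a leftover from inside the tag.
print "LF only:    ", ($lf_only =~ /\r(?!\n)/ ? "stray CR left" : "clean"), "\n";
print "CRLF aware: ", ($both    =~ /\r(?!\n)/ ? "stray CR left" : "clean"), "\n";
```

When printed to a terminal, a bare \r moves the cursor back to the start of the line, which could make the first part of a line look as if it had been removed.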

William James · Aug 27, 2005

weston said:
I'm trying to streamline workflow from Word Documents to HTML. There
are numerous atrocities perpetrated in the process of saving a Word Doc
to filtered HTML, but there's one that I find particularly interesting
(and annoying): sometimes tags have newlines within them. Especially
<span> tags. For example:

（

Is there a regular expression that can pull the span up onto the same
line?

Look-ahead lets you make sure the newline is within a tag.

s/\n(?=[^<]*>)/ /g;

transforms


ÿ


into


ÿ
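
The look-ahead substitution above can be tried on a small string. This is a minimal sketch; the input is hypothetical, since the thread's original example did not survive:

```perl
use strict;
use warnings;

# Hypothetical input: one newline inside the <span> start tag and
# two newlines in ordinary text between tags.
my $html = "<p>Some text\n<span\nstyle='color:red'>word</span>\nmore</p>\n";

# Replace a newline only when the next angle bracket after it is '>'
# (no '<' comes first), which means the newline is inside a tag.
(my $joined = $html) =~ s/\n(?=[^<]*>)/ /g;

print $joined;
```

Only the newline inside the start tag is replaced; the newlines in the text between tags are kept. One limitation of this approach: a literal > in ordinary text (for example "a\nb > c") also satisfies the look-ahead, so the newline before it would be replaced too.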

A. Sinan Unur · Aug 27, 2005

[ Please do not omit attributions when quoting ]

....

And the plot, as they say, thickens. It appears that the second regular
expression only fails in Perl (5.8.5) under Cygwin.

[ Please quote an appropriate amount of context when replying.
There are no regular expressions in your post, so looking for
the second one is a futile exercise. ]

Sinan
