Remove whitespace in tags using regex

ahjiang · Apr 15, 2006

Hi all,

Need some advice on this.

I have a string, say:
$line = '<html> Hello World </html> Hello world';

$line =~ s,\s,,g;

This returns <html>HelloWorld</html>Helloworld

$line =~ s,<html>.*</html>,,g;

This returns Helloworld. The contents, including <html></html>, are
removed.

How can I achieve
<html>HelloWorld</html> Hello world

so that only whitespace within the <html> tags is removed?

robic0 · Apr 15, 2006

Want to parse html, or are you just full of shit?
Since you're trying something you should not be, maybe you should try something like this:

$RxParse =
qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
# ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|( 2 2)|( 3 3))))>)|4 4

robic0 · Apr 15, 2006

The line below is the 'definitive' regexp for xml, xhtml, etc.. Of course you don't know what in the hell it is..
Those numbered brackets in the comments forward the trapped contents to multiple handlers/sub-regexp processors.
This is only the outline, framed for performance. Several subroutines do regexp processing on captured data.
Unfortunately for you, this subject is a mile over your head (ten miles actually). Wake up to reality. I know reality.
'I am reality' (in the famous words from Platoon). There is no fuckin XML question or premise I cannot weigh in on as a
mother fuckin expert...... (much to Matt Garish's dismay)

$RxParse =
qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
# ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|( 2 2)|( 3 3))))>)|4 4

Xicheng Jia · Apr 15, 2006

=> How can i achieve
=> <html>HelloWorld</html> Hello world

You can write a subroutine that removes whitespace from an input string
and then call it from the replacement side of an 's///e' expression, like:

sub trim_spaces {
    my $str = shift;
    $str =~ s/\s//g;    # delete every whitespace character
    return $str;
}

# then:
my $string = q(<html> Hello World </html> Hello world);
# /e evaluates the replacement as Perl code; add /g to handle every match
$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e;
print "$string\n";

Xicheng
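A note on the substitution above: without /g it rewrites only the first <html>...</html> block, and without /s the dot does not match newlines. Below is a minimal sketch of a variant that handles several blocks, possibly spanning lines. It assumes the blocks are never nested and that the tags are exactly <html> and </html> with no attributes; the helper name strip_ws_in_blocks is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: delete whitespace inside every <html>...</html> block.
# Assumes blocks are not nested and carry no attributes.
sub strip_ws_in_blocks {
    my ($text) = @_;
    $text =~ s{(<html>.*?</html>)}{
        my $block = $1;       # work on a copy, because $1 is read-only
        $block =~ s/\s+//g;   # drop all whitespace in the copy
        $block;               # value of the last statement is the replacement
    }gse;
    return $text;
}

my $in = "<html> Hello World </html> Hello world <html> a\n b </html>";
print strip_ws_in_blocks($in), "\n";
# prints: <html>HelloWorld</html> Hello world <html>ab</html>
```

This is still a regex over markup, so it is only reasonable for input whose shape you control.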

robic0 · Apr 15, 2006

Amazing, why help him alter source xml/xhtml outside of a parser?
Have you a few screws loosened? Anything is possible, but
shouldn't systems be modified with the tools meant for them?
And you never asked why... I wonder 'why' you follow up in this fashion.
It's out of the norm. In-place modification of xml/xhtml is a purely risky
business. Or don't you concede that?

Tad McClellan · Apr 15, 2006

Need some advice on this.

Don't use regular expressions for this if you need it to be robust.

Use a real parser.

I have a string say,
$line = <html> Hello World </html> Hello world

How can i achieve
<html>HelloWorld</html> Hello world

Only whitespace within the <html> tags is removed

$line =~ s,(<html>.*</html>), ($a=$1) =~ tr/ //d; $a,gse;

or formatted more sensibly:

$line =~ s{( <html>.*</html> )}
          { ($a=$1) =~ tr/ //d;
            $a;
          }gsex;
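Two details of the substitution above are easy to miss: the greedy .* runs from the first <html> to the last </html> in the string, and tr/ //d deletes only space characters, not tabs or newlines. The sketch below contrasts greedy and non-greedy matching on a made-up input with two blocks. It uses the /r modifier (an assumption: Perl 5.14 or later) so that no temporary variable such as $a is needed.

```perl
use strict;
use warnings;
use 5.014;   # the /r modifier on tr/// needs Perl 5.14 or later

my $line = "<html> Hello World </html> Hello world <html> x y </html>";

# Greedy .* runs from the first <html> to the last </html>,
# so the text between the two blocks loses its spaces too.
(my $greedy = $line) =~ s{(<html>.*</html>)}{ $1 =~ tr/ //dr }gse;

# Non-greedy .*? stops at the nearest </html>, so each block is handled alone.
(my $lazy = $line) =~ s{(<html>.*?</html>)}{ $1 =~ tr/ //dr }gse;

print "$greedy\n";   # <html>HelloWorld</html>Helloworld<html>xy</html>
print "$lazy\n";     # <html>HelloWorld</html> Hello world <html>xy</html>
```

With a single block, as in the original question, both forms give the same result.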

robic0 · Apr 17, 2006

=> How can i achieve
=> <html>HelloWorld</html> Hello world

you can write a subroutine to remove whitespace from an input string
and then apply it on your 's///e' expression, like:

sub trim_spaces {
my $str = shift;
$str =~ s/\s//g;
$str;
}

# then:
my $string = q(<html> Hello World </html> Hello world);
$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e; ^ ^
print "$string\n";

See, that's the problem. You don't know what '(<tag>.*?</tag>)' is.
He wants to apply this to the whole file. Why would anyone search a
whole xml/xhtml file to remove spaces between tags.
Don't you realize that in a compliant xml file, that within this string
'<html>HelloWorld</html> Hello world', that
^^^^^^^^^^^^
is also contained within a tag???

You don't know the form of 'tag'. It has many faces in xml. This is a
tag in the simplest of forms.

Xicheng

He's looking for between the tags, not within the tags, which this also removes.
But you don't know that whitespaces are significant in xml, they act as delimiters
for parsers. Didn't know that, did you..

You can no more suggest this technique as having valid xml afterwards than you
can know the future. What's in a tag? A lot...........

robic0 · Apr 17, 2006

Don't use regular expressions for this if you need it to be robust.

Use a real parser.

$line =~ s,(<html>.*</html>), ($a=$1) =~ tr/ //d; $a,gse;

or formatted more sensibly:

$line =~ s{( <html>.*</html> )}
{ ($a=$1) =~ tr/ //d;
$a;
}gsex;

Explain '<html>'...

ahjiang · Apr 17, 2006

Now I'm trying to remove whitespace that is not in the <html> tags,

so I do

$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

It seems I'm getting it wrong.

Any advice?

A. Sinan Unur · Apr 17, 2006

(e-mail address removed) wrote in @i40g2000cwc.googlegroups.com:

[ Please do not top-post. Please do read the FAQ and the posting
guidelines. ]

now im trying to remove whitespace that is not in the <html> tags

so i do

$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

seems like im getting wrong

Use HTML::Parser or one of the modules based on that. I prefer
HTML::TokeParser.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
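On why the attempt with [^<html>.*?</html>] fails: square brackets form a character class, so [^...] matches one single character that is not among those listed (here <, h, t, m, l, >, ., *, ?, /). It cannot express "anything that is not this substring". For a quick core-Perl workaround, one option is to split the string on the blocks and strip whitespace only from the pieces in between. This is a minimal sketch, assuming non-nested <html>...</html> blocks without attributes; the helper name strip_ws_outside_blocks is made up for illustration. For real HTML the parser modules named above remain the robust choice.

```perl
use strict;
use warnings;

# Hypothetical helper: delete whitespace everywhere except inside
# <html>...</html> blocks. Assumes non-nested blocks without attributes.
sub strip_ws_outside_blocks {
    my ($text) = @_;
    # A capturing group in split's pattern returns the separators as well,
    # so the list alternates between outside text and <html> blocks.
    my @pieces = split m{(<html>.*?</html>)}s, $text;
    for my $piece (@pieces) {
        next if $piece =~ m{\A<html>};   # leave the blocks untouched
        $piece =~ s/\s+//g;              # strip whitespace from the rest
    }
    return join '', @pieces;
}

print strip_ws_outside_blocks("<html> Hello World </html> Hello world"), "\n";
# prints: <html> Hello World </html>Helloworld
```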

robic0 · Apr 17, 2006

On 16 Apr 2006 20:01:23 -0700, (e-mail address removed) wrote:

Give it up dude.......

ahjiang · Apr 17, 2006

Thanks for the advice.

However, how can I do it so that [^<html>.*?</html>] matches everything
that is not <html>.*?</html>?

robic0 · Apr 17, 2006

thanks for the advice..

however how can i do it using [^<html>.*?</html>] matches not
<html>.*?</html> ??

Why don't you put your head up your ass and ask an expert?

robic0 · Apr 17, 2006

thanks for the advice..

however how can i do it using [^<html>.*?</html>] matches not
<html>.*?</html> ??

Why don't you put your head up your ass and ask an expert?

Just trying to help. Need more help, just ask

Tad McClellan · Apr 17, 2006

robic0 said:
Don't you realize that in a compliant xml file, that within this string
'<html>HelloWorld</html> Hello world', that
^^^^^^^^^^^^
is also contained within a tag???

The "Hello world" part is NOT contained within a "tag".

See the XML FAQ:

http://xml.silmaril.ie/authors/makeup/

You don't know the form of 'tag'.

And neither do you.

Are we surprised?
