Remove whitespace in tags using regex

ahjiang

Hi all,

Need some advice on this.

I have a string, say:
$line = '<html> Hello World </html> Hello world';

$line =~ s,\s,,g;

This gives me <html>HelloWorld</html>Helloworld


$line =~ s,<html>.*</html>,,g;

This gives me Helloworld. Everything from <html> to </html>, tags included, is
removed.

How can I get
<html>HelloWorld</html> Hello world

so that only the whitespace inside the <html> tags is removed?
 
robic0

Want to parse html, or are you just full of shit?
Since you're trying something you should not be, maybe you should try something like this:

$RxParse =
qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
# ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|( 2 2)|( 3 3))))>)|4 4
 
robic0

Want to parse html, or are you just full of shit?
Since you're trying something you should not be, maybe you should try something like this:
The line below is the 'definitive' regexp for xml, xhtml, etc. Of course you don't know what in the hell it is.
Those numbered brackets in the comments forward the trapped contents to multiple handlers/sub-regexp processors.
This is only the outline, framed for performance. Several subroutines do regexp processing on captured data.
Unfortunately for you, this subject is a mile over your head (ten miles actually). Wake up to reality. I know reality.
'I am reality' (in the famous words from Platoon). There is no fuckin XML question or premise I cannot weigh in on as a
mother fuckin expert...... (much to Matt Garish's dismay)
$RxParse =
qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
# ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|( 2 2)|( 3 3))))>)|4 4
 
Xicheng Jia

=> How can i achieve
=> <html>HelloWorld</html> Hello world

you can write a subroutine to remove whitespace from an input string
and then apply it on your 's///e' expression, like:

sub trim_spaces {
    my $str = shift;
    $str =~ s/\s//g;
    $str;
}

# then:
my $string = q(<html> Hello World </html> Hello world);
$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e;
print "$string\n";


Xicheng
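
A note on the substitution above: without the /g modifier it rewrites only the first <html>...</html> block, and without /s the dot will not match a newline inside a block. The following is a minimal sketch of the same idea with both modifiers added. The two-block input string and the $result variable are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Strip every whitespace character from the given string.
sub trim_spaces {
    my ($str) = @_;
    $str =~ s/\s+//g;
    return $str;
}

# Hypothetical input with two <html>...</html> blocks.
my $string = q(<html> Hello World </html> Hello world <html> a b </html> c d);

# /g repeats the substitution for every block, .*? stops each match at the
# nearest </html>, /s lets . match newlines, /e evaluates the replacement.
(my $result = $string) =~ s{(<html>.*?</html>)}{ trim_spaces($1) }gse;

print "$result\n";
```

This still treats the markup as plain text, so it is only reasonable for input as simple as the example string.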
 
robic0

Amazing, why help him alter source xml/xhtml outside of a parser?
Have you a few screws loosened? Anything is possible, but
shouldn't systems be modified with the tools meant for them?
And you never asked why... I wonder 'why' you follow up in this fashion.
It's out of the norm. In-place modification of xml/xhtml is a purely risky
business. Or don't you concede that?
 
Tad McClellan

Need some advice on this.


Don't use regular expressions for this if you need it to be robust.

Use a real parser.

I have a string say,
$line = <html> Hello World </html> Hello world

How can i achieve
<html>HelloWorld</html> Hello world

Only whitespace within the <html> tags is removed


$line =~ s,(<html>.*</html>), ($a=$1) =~ tr/ //d; $a,gse;


or formatted more sensibly:

$line =~ s{( <html>.*</html> )}
          { ($a=$1) =~ tr/ //d;
            $a;
          }gsex;
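
Two details of the substitution above are worth spelling out: tr/ //d deletes only literal space characters (tabs and newlines stay), and the greedy .* runs from the first <html> to the last </html> in the string. A possible variant, assuming Perl 5.14 or later for the /r modifier, removes all whitespace and matches each block separately. The sample input and the $out variable are illustrative only.

```perl
use strict;
use warnings;
use 5.014;    # s///r (non-destructive substitution) needs Perl 5.14+

# Hypothetical input containing a tab and a newline inside the block.
my $line = "<html> Hello\tWorld\n</html> Hello world";

# The inner s///r returns a modified copy of $1, so no temporary
# variable such as $a is needed; .*? keeps each match to one block.
(my $out = $line) =~ s{(<html>.*?</html>)}{ $1 =~ s/\s+//gr }gse;

print "$out\n";
```

Like the original, this is a quick text-level fix, not a replacement for a real parser.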
 
robic0

=> How can i achieve
=> <html>HelloWorld</html> Hello world

you can write a subroutine to remove whitespace from an input string
and then apply it on your 's///e' expression, like:

sub trim_spaces {
my $str = shift;
$str =~ s/\s//g;
$str;
}

# then:
my $string = q(<html> Hello World </html> Hello world);
$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e; ^ ^
print "$string\n";
See, that's the problem. You don't know what '(<tag>.*?</tag>)' is.
He wants to apply this to the whole file. Why would anyone search a
whole xml/xhtml file to remove spaces between tags.
Don't you realize that in a compliant xml file, that within this string
'<html>HelloWorld</html> Hello world', that
^^^^^^^^^^^^
is also contained within a tag???

You don't know the form of 'tag'. It has many faces in xml. This is a
tag in the simplest of forms.
He's looking for between the tags, not within the tags, which this also removes.
But you don't know that whitespaces are significant in xml; they act as delimiters
for parsers. Didn't know that, did you?

You can no more suggest this technique as having valid xml afterwards than you
can know the future. What's in a tag? A lot...
 
robic0

Don't use regular expressions for this if you need it to be robust.

Use a real parser.




$line =~ s,(<html>.*</html>), ($a=$1) =~ tr/ //d; $a,gse;


or formatted more sensibly:

$line =~ s{( <html>.*</html> )}
{ ($a=$1) =~ tr/ //d;
$a;
}gsex;
Explain '<html>'...
 
ahjiang

Now I'm trying to remove whitespace that is not in the <html> tags,

so I do

$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

but it seems I'm getting it wrong.

Any advice?
 
A. Sinan Unur

(e-mail address removed) wrote in @i40g2000cwc.googlegroups.com:

[ Please do not top-post. Please do read the FAQ and the posting
guidelines. ]
now im trying to remove whitespace that is not in the <html> tags

so i do

$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

seems like im getting wrong

Use HTML::Parser or one of the modules based on that. I prefer
HTML::TokeParser.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
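
On why the attempted pattern fails: [^<html>.*?</html>] is a bracketed character class, so it matches exactly one character that is not any of <, h, t, m, l, >, ., *, ? or /. It does not mean "text that is not an <html> block". If a regex really has to be used on a string as simple as the example, one possible sketch is to split on the block with a capturing group and strip whitespace from the other pieces. The variable names here are made up for illustration, and this is no substitute for the parser modules recommended above.

```perl
use strict;
use warnings;

my $string = q(<html> Hello World </html> Hello world);

# split with a capturing group keeps the matched blocks in the list:
# even indexes hold text outside a block, odd indexes hold the blocks.
my @parts = split m{(<html>.*?</html>)}s, $string;

for my $i (0 .. $#parts) {
    next if $i % 2;            # odd index: an <html> block, leave as is
    $parts[$i] =~ s/\s+//g;    # even index: outside text, strip whitespace
}

my $result = join '', @parts;
print "$result\n";
```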
 
robic0

On 16 Apr 2006 20:01:23 -0700, (e-mail address removed) wrote:

Give it up dude.......
now im trying to remove whitespace that is not in the <html> tags

so i do

$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

seems like im getting wrong

any advice?

Xicheng said:
=> How can i achieve
=> <html>HelloWorld</html> Hello world

you can write a subroutine to remove whitespace from an input string
and then apply it on your 's///e' expression, like:

sub trim_spaces {
my $str = shift;
$str =~ s/\s//g;
$str;
}

# then:
my $string = q(<html> Hello World </html> Hello world);
$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e;
print "$string\n";


Xicheng
 
ahjiang

Thanks for the advice.

However, how can I do it with [^<html>.*?</html>], so that it matches
whatever is not <html>.*?</html>?

A. Sinan Unur said:
(e-mail address removed) wrote in @i40g2000cwc.googlegroups.com:

[ Please do not top-post. Please do read the FAQ and the posting
guidelines. ]
now im trying to remove whitespace that is not in the <html> tags

so i do

$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

seems like im getting wrong

Use HTML::Parser or one of the modules based on that. I prefer
HTML::TokeParser.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
robic0

thanks for the advice..

however how can i do it using [^<html>.*?</html>] matches not
<html>.*?</html> ??
Why don't you put your head up your ass and ask an expert?
 
