Remove whitespace in tags using regex

Discussion in 'Perl Misc' started by ahjiang@gmail.com, Apr 15, 2006.

  1. Guest

    Hi all,

    Need some advice on this.

    I have a string, say:
    $line = '<html> Hello World </html> Hello world';

    $line =~ s,\s,,g;

    This returns <html>HelloWorld</html>Helloworld


    $line =~ s,<html>.*</html>,,g;

    This returns Helloworld. The contents, including <html></html>, are
    removed.

    How can I achieve
    <html>HelloWorld</html> Hello world

    so that only the whitespace within the <html> tags is removed?
     
    , Apr 15, 2006
    #1

  2. robic0 Guest

    On 14 Apr 2006 19:04:13 -0700, wrote:

    >Hi all,
    >
    >Need some advice on this.
    >
    >I have a string say,
    >$line = <html> Hello World </html> Hello world
    >
    >$line =~ s,\s,,g;
    >
    >This would returns me <html>HelloWorld</html>Helloworld
    >
    >
    >$line =~ s,<html>.*</html>,,g;
    >
    >This would returns me Helloworld. Contents including <html></html> is
    >removed..
    >
    >How can i achieve
    ><html>HelloWorld</html> Hello world
    >
    >Only whitespace within the <html> tags is removed


    Want to parse html, or are you just full of shit?
    Since you're trying something you should not be, maybe you should try something like this:

    $RxParse =
    qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
    # ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|( 2 2)|( 3 3))))>)|4 4
     
    robic0, Apr 15, 2006
    #2

  3. robic0 Guest

    On Fri, 14 Apr 2006 19:43:24 -0700, robic0 wrote:

    >On 14 Apr 2006 19:04:13 -0700, wrote:
    >
    >>Hi all,
    >>
    >>Need some advice on this.
    >>
    >>I have a string say,
    >>$line = <html> Hello World </html> Hello world
    >>
    >>$line =~ s,\s,,g;
    >>
    >>This would returns me <html>HelloWorld</html>Helloworld
    >>
    >>
    >>$line =~ s,<html>.*</html>,,g;
    >>
    >>This would returns me Helloworld. Contents including <html></html> is
    >>removed..
    >>
    >>How can i achieve
    >><html>HelloWorld</html> Hello world
    >>
    >>Only whitespace within the <html> tags is removed

    >
    >Want to parse html, or are you just full of shit?
    >Since your trying something you should not be, maybe you should try something like this:
    >

    The line below is the 'definitive' regexp for xml, xhtml, etc. Of course you don't know what in the hell it is.
    Those numbered brackets in the comments forward the trapped contents to multiple handlers/sub-regexp processors.
    This is only the outline, framed for performance. Several subroutines do regexp processing on captured data.
    Unfortunately for you, this subject is a mile over your head (ten miles actually). Wake up to reality. I know reality.
    'I am reality' (in the famous words from Platoon). There is no fuckin XML question or premise I cannot weigh in on as a
    mother fuckin expert...... (much to Matt Garish's dismay)
    >$RxParse =
    >qr/(?:<(?:(?:(\/*)($Name)\s*(\/*))|(?:META(.*?))|(?:($Name)((?:\s+$Name\s*=\s*["'][^<]*['"])+)\s*(\/*))|(?:\?(.*?)\?)|(?:!(?:(?:DOCTYPE(.*?))|(?:\[CDATA\[(.*?)\]\])|(?:--(.*?[^-])--)|(?:ATTLIST(.*?))|(?:ENTITY(.*?)))))>)|(.+?)/s;
    ># ( <( ( 1 12 2 3 3)|( 4 4)|( 5 56( ) 6 7 7)|( 8 8 )|( !( ( 9 9)|( 0 0 )|( 1 1 )|(
    >2 2)|( 3 3))))>)|4 4
    >
     
    robic0, Apr 15, 2006
    #3
  4. Xicheng Jia Guest

    wrote:
    > Hi all,
    >
    > Need some advice on this.
    >
    > I have a string say,
    > $line = <html> Hello World </html> Hello world
    >
    > $line =~ s,\s,,g;
    >
    > This would returns me <html>HelloWorld</html>Helloworld
    >
    >
    > $line =~ s,<html>.*</html>,,g;
    >
    > This would returns me Helloworld. Contents including <html></html> is
    > removed..
    >

    => How can i achieve
    => <html>HelloWorld</html> Hello world

    you can write a subroutine to remove whitespace from an input string
    and then apply it on your 's///e' expression, like:

    sub trim_spaces {
        my $str = shift;
        $str =~ s/\s//g;
        $str;
    }

    # then:
    my $string = q(<html> Hello World </html> Hello world);
    $string =~ s#(<html>.*?</html>)# trim_spaces($1) #e;
    print "$string\n";
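
A note on the substitution above: without the /g modifier only the first <html>...</html> pair is processed, and without /s the dot does not match newlines. The following is a minimal sketch of the same idea with both modifiers added. The helper name strip_ws_in_html and the two-pair test string are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip whitespace inside
# every <html>...</html> pair and leave the rest of the string alone.
# The captured block is copied into a lexical first because $1 is
# read-only; the last statement of the /e block is the replacement.
sub strip_ws_in_html {
    my ($text) = @_;
    $text =~ s{(<html>.*?</html>)}{
        my $inner = $1;
        $inner =~ s/\s+//g;
        $inner;
    }gse;
    return $text;
}

print strip_ws_in_html('<html> Hello World </html> Hello world <html> a b </html>'), "\n";
# prints: <html>HelloWorld</html> Hello world <html>ab</html>
```

The non-greedy .*? is what keeps the text between the two pairs untouched.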


    Xicheng

    > Only whitespace within the <html> tags is removed
     
    Xicheng Jia, Apr 15, 2006
    #4
  5. robic0 Guest

    On 14 Apr 2006 21:11:41 -0700, "Xicheng Jia" <> wrote:

    > wrote:
    >> Hi all,
    >>
    >> Need some advice on this.
    >>
    >> I have a string say,
    >> $line = <html> Hello World </html> Hello world
    >>
    >> $line =~ s,\s,,g;
    >>
    >> This would returns me <html>HelloWorld</html>Helloworld
    >>
    >>
    >> $line =~ s,<html>.*</html>,,g;
    >>
    >> This would returns me Helloworld. Contents including <html></html> is
    >> removed..
    >>

    >=> How can i achieve
    >=> <html>HelloWorld</html> Hello world
    >
    >you can write a subroutine to remove whitespace from an input string
    >and then apply it on your 's///e' expression, like:
    >
    >sub trim_spaces {
    > my $str = shift;
    > $str =~ s/\s//g;
    > $str;
    >}
    >
    ># then:
    >my $string = q(<html> Hello World </html> Hello world);
    >$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e;
    >print "$string\n";
    >
    >
    >Xicheng
    >
    >> Only whitespace within the <html> tags is removed


    Amazing. Why help him alter source xml/xhtml outside of a parser?
    Have you a few screws loosened? Anything is possible, but
    shouldn't systems be modified with the tools meant for them?
    And you never asked why... I wonder 'why' you follow up in this fashion.
    It's out of the norm. In-place modification of xml/xhtml is a purely risky
    business. Or don't you concede that?
     
    robic0, Apr 15, 2006
    #5
  6. <> wrote:


    > Need some advice on this.



    Don't use regular expressions for this if you need it to be robust.

    Use a real parser.


    > I have a string say,
    > $line = <html> Hello World </html> Hello world



    > How can i achieve
    ><html>HelloWorld</html> Hello world
    >
    > Only whitespace within the <html> tags is removed



    $line =~ s,(<html>.*</html>), ($a=$1) =~ tr/ //d; $a,gse;


    or formatted more sensibly:

    $line =~ s{( <html>.*</html> )}
              { ($a=$1) =~ tr/ //d;
                $a;
              }gsex;
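
Two details of the substitution above may matter on other input. The greedy .* runs to the last </html> in the string, and tr/ //d deletes only space characters, not tabs or newlines. The sketch below compares greedy and non-greedy matching on a made-up string with two tag pairs. The variable names and the sample string are assumptions for illustration, and a lexical $t is used in place of the global $a, which Perl also uses in sort blocks.

```perl
use strict;
use warnings;

my $line = '<html> a b </html> keep this <html> c d </html>';

# Greedy: .* runs from the first <html> to the LAST </html>,
# so the text between the two pairs loses its spaces as well.
(my $greedy = $line) =~ s{(<html>.*</html>)}{ (my $t = $1) =~ tr/ //d; $t }gse;

# Non-greedy: .*? stops at the nearest </html>, one pair per match.
(my $lazy = $line) =~ s{(<html>.*?</html>)}{ (my $t = $1) =~ tr/ //d; $t }gse;

print "$greedy\n";   # <html>ab</html>keepthis<html>cd</html>
print "$lazy\n";     # <html>ab</html> keep this <html>cd</html>
```

For the single-pair string in the original question both forms give the same result.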


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 15, 2006
    #6
  7. robic0 Guest

    On 14 Apr 2006 21:11:41 -0700, "Xicheng Jia" <> wrote:

    > wrote:
    >> Hi all,
    >>
    >> Need some advice on this.
    >>
    >> I have a string say,
    >> $line = <html> Hello World </html> Hello world
    >>
    >> $line =~ s,\s,,g;
    >>
    >> This would returns me <html>HelloWorld</html>Helloworld
    >>
    >>
    >> $line =~ s,<html>.*</html>,,g;
    >>
    >> This would returns me Helloworld. Contents including <html></html> is
    >> removed..
    >>

    >=> How can i achieve
    >=> <html>HelloWorld</html> Hello world
    >
    >you can write a subroutine to remove whitespace from an input string
    >and then apply it on your 's///e' expression, like:
    >
    >sub trim_spaces {
    > my $str = shift;
    > $str =~ s/\s//g;
    > $str;
    >}
    >
    ># then:
    >my $string = q(<html> Hello World </html> Hello world);
    >$string =~ s#(<html>.*?</html>)# trim_spaces($1) #e;

    ^ ^
    >print "$string\n";
    >

    See, that's the problem. You don't know what '(<tag>.*?</tag>)' is.
    He wants to apply this to the whole file. Why would anyone search a
    whole xml/xhtml file to remove spaces between tags?
    Don't you realize that in a compliant xml file, that within this string
    '<html>HelloWorld</html> Hello world', that
    ^^^^^^^^^^^^
    is also contained within a tag???

    You don't know the form of 'tag'. It has many faces in xml. This is a
    tag in the simplest of forms.
    >
    >Xicheng
    >
    >> Only whitespace within the <html> tags is removed

    He's looking for between the tags, not within the tags, which this also removes.
    But you don't know that whitespaces are significant in xml; they act as delimiters
    for parsers. Didn't know that, did you?

    You can no more suggest this technique as having valid xml afterwards than you
    can know the future. What's in a tag? A lot...
     
    robic0, Apr 17, 2006
    #7
  8. robic0 Guest

    On Sat, 15 Apr 2006 08:03:35 -0500, Tad McClellan <> wrote:

    > <> wrote:
    >
    >
    >> Need some advice on this.

    >
    >
    >Don't use regular expressions for this if you need it to be robust.
    >
    >Use a real parser.
    >
    >
    >> I have a string say,
    >> $line = <html> Hello World </html> Hello world

    >
    >
    >> How can i achieve
    >><html>HelloWorld</html> Hello world
    >>
    >> Only whitespace within the <html> tags is removed

    >
    >
    > $line =~ s,(<html>.*</html>), ($a=$1) =~ tr/ //d; $a,gse;
    >
    >
    >or formatted more sensibly:
    >
    > $line =~ s{( <html>.*</html> )}
    > { ($a=$1) =~ tr/ //d;
    > $a;
    > }gsex;

    Explain '<html>'...
     
    robic0, Apr 17, 2006
    #8
  9. Guest

    Now I'm trying to remove the whitespace that is not inside the <html> tags.

    So I do

    $string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;

    but it seems I'm getting it wrong.

    Any advice?

     
    , Apr 17, 2006
    #9
  10. wrote in news:1145242883.016777.198100
    @i40g2000cwc.googlegroups.com:

    [ Please do not top-post. Please do read the FAQ and the posting
    guidelines. ]

    > now im trying to remove whitespace that is not in the <html> tags
    >
    > so i do
    >
    > $string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;
    >
    > seems like im getting wrong


    Use HTML::Parser or one of the modules based on it. I prefer
    HTML::TokeParser.
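
As a rough illustration of the parser-based route, here is a minimal sketch using HTML::TokeParser (a CPAN module, not part of core Perl). It walks the token stream, tracks whether it is inside an <html> element, and strips whitespace only from text outside it. The helper name and the depth counter are assumptions made for this example, not something proposed in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: remove whitespace from text that is NOT inside an
# <html> element, rebuilding the output from the original token text.
sub strip_ws_outside_html_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my ($depth, $out) = (0, '');
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {            # start tag: [S, tag, attr, attrseq, text]
            $depth++ if $token->[1] eq 'html';
            $out .= $token->[4];
        }
        elsif ($type eq 'E') {         # end tag: [E, tag, text]
            $depth-- if $token->[1] eq 'html' && $depth > 0;
            $out .= $token->[2];
        }
        elsif ($type eq 'T') {         # text: [T, text, is_data]
            my $text = $token->[1];
            $text =~ s/\s+//g if $depth == 0;
            $out .= $text;
        }
        else {                         # comment, declaration, PI: text is the last element
            $out .= $token->[-1];
        }
    }
    return $out;
}

print strip_ws_outside_html_tokens('<html> Hello World </html> Hello world'), "\n";
```

Unlike a single regex, this keeps working when the tags carry attributes or differ in case, since the parser normalizes tag names to lowercase.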

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 17, 2006
    #10
  11. robic0 Guest

    On 16 Apr 2006 20:01:23 -0700, wrote:

    Give it up dude.......

    >now im trying to remove whitespace that is not in the <html> tags
    >
    >so i do
    >
    >$string =~ s#([^<html>.*?</html>])# trim_spaces($1) #e;
    >
    >seems like im getting wrong
    >
    >any advice?
     
    robic0, Apr 17, 2006
    #11
  12. Guest

    Thanks for the advice.

    However, how can I do it with a regex, so that [^<html>.*?</html>] matches
    everything that is not <html>.*?</html>?

     
    , Apr 17, 2006
    #12
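
On the question above: square brackets form a character class, so [^<html>.*?</html>] matches one single character that is not any of <, h, t, m, l, >, ., *, ? or /. It does not mean "anything other than an <html>...</html> block". If a regex is still wanted for simple input, one common workaround is alternation: match a whole block and put it back unchanged, or match a run of whitespace and drop it. The sketch below assumes non-nested <html> pairs, and the helper name is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove whitespace everywhere EXCEPT inside
# <html>...</html>. The first alternative captures a whole block, which is
# returned unchanged; the second matches whitespace, which is replaced
# by the empty string.
sub strip_ws_outside_html {
    my ($text) = @_;
    $text =~ s{(<html>.*?</html>)|\s+}{ defined $1 ? $1 : '' }gse;
    return $text;
}

print strip_ws_outside_html('<html> Hello World </html> Hello world'), "\n";
# prints: <html> Hello World </html>Helloworld
```

Because the block alternative is tried first at each position, whitespace inside a block is consumed as part of the block and never reaches the \s+ branch.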
  13. robic0 Guest

    On 16 Apr 2006 20:28:45 -0700, wrote:

    >thanks for the advice..
    >
    >however how can i do it using [^<html>.*?</html>] matches not
    ><html>.*?</html> ??
    >

    Why don't you put your head up your ass and ask an expert?
     
    robic0, Apr 17, 2006
    #13
  14. robic0 Guest

    On Sun, 16 Apr 2006 20:35:37 -0700, robic0 wrote:

    >On 16 Apr 2006 20:28:45 -0700, wrote:
    >
    >>thanks for the advice..
    >>
    >>however how can i do it using [^<html>.*?</html>] matches not
    >><html>.*?</html> ??
    >>

    >Why don't you put your head up your ass and ask an expert?

    Just trying to help. Need more help, just ask
     
    robic0, Apr 17, 2006
    #14
  15. robic0 <> wrote:

    > Don't you realize that in a compliant xml file, that within this string
    > '<html>HelloWorld</html> Hello world', that
    > ^^^^^^^^^^^^
    > is also contained within a tag???



    The "Hello world" part is NOT contained within a "tag".

    See the XML FAQ:

    http://xml.silmaril.ie/authors/makeup/


    > You don't know the form of 'tag'.



    And neither do you.

    Are we surprised?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 17, 2006
    #15