Pattern Matching on Case

Discussion in 'Perl Misc' started by DANIEL BURCH, Feb 19, 2006.

  1. DANIEL BURCH

    DANIEL BURCH Guest

    I have a file that apparently had html tags stripped out of it, or
    something, but no space characters added to replace the tags so it ended up
    with a lot of words run together like "ExplosionThis". In almost all cases
    there is a lower case letter followed by an upper case letter. I am trying
    to figure out a substitution statement that would separate them, but I'm not
    sure what would work. Maybe something like

    s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

    but I don't have a clue if that is even close to working or if it will give
    me an "a" at the end and beginning of the words. Any help would be greatly
    appreciated.
    DANIEL BURCH, Feb 19, 2006
    #1
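For readers wondering what the attempted statement above actually does, here is a minimal sketch. It assumes the aim is a single space between a lower-case letter and the upper-case letter that follows it. The variable names are illustrative only. The pattern is compiled from a string inside `eval` because a literal pattern starting with `*` would stop the whole script from compiling.

```perl
use strict; use warnings;

# The pattern from the question, compiled at run time so the failure can be caught.
my $attempt = '*[a-z][A-Z]*';
my $re = eval { qr/$attempt/ };
print "does not compile: $@" unless defined $re;

# The replacement side is an ordinary string, not a pattern,
# so character classes written there are inserted literally.
(my $literal = 'ExplosionThis') =~ s/[a-z][A-Z]/[a-z] [A-Z]/g;
print "$literal\n";   # Explosio[a-z] [A-Z]his

# Capturing both letters and putting them back keeps the original text.
(my $fixed = 'ExplosionThis') =~ s/([a-z])([A-Z])/$1 $2/g;
print "$fixed\n";     # Explosion This
```

The leading `*` has nothing to repeat, which is why the pattern is rejected. The capture groups are what let the replacement reuse the matched letters.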

  2. "DANIEL BURCH" <> wrote in
    news:Ge3Kf.3027$GQ.2625@trnddc03:

    > I have a file that apparently had html tags stripped out of it, or
    > something, but no space characters added to replace the tags so it
    > ended up with a lot of words run together like "ExplosionThis". In
    > almost all cases there is a lower case letter followed by an upper
    > case letter. I am trying to figure out a substitution statement that
    > would separate them, but I'm not sure what would work. Maybe
    > something like
    >
    > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;


    I am curious: What do you think this does?

    Here is a quick and dirty attempt based on your vague specification, and
    nothing else. You might want to post some real code along with data
    after reading the posting guidelines for this group.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $text;
    {
        local $/;          # undefine the record separator to read everything at once
        $text = <DATA>;
    }

    $text =~ s{\.\s+}{}g;

    $text =~ s{([[:lower:]])([[:upper:]])}{$1\. $2}g;

    print "$text\n";

    __DATA__
    I have a file that apparently had html tags stripped out of it,
    or something, but no space characters added to replace the tags
    so it ended up with a lot of words run together like "ExplosionThis."
    In almost all cases there is a lower case letter followed by an
    upper case letter. I am trying to figure out a substitution
    statement that would separate them, but I'm not sure what would
    work. Maybe something like

    Notice the mess this makes of "ExplosionThis".

    Sinan


    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Feb 19, 2006
    #2

  3. DANIEL BURCH wrote:
    > I have a file that apparently had html tags stripped out of it, or
    > something, but no space characters added to replace the tags so it ended up
    > with a lot of words run together like "ExplosionThis". In almost all cases
    > there is a lower case letter followed by an upper case letter. I am trying
    > to figure out a substitution statement that would separate them, but I'm not
    > sure what would work. Maybe something like
    >
    > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
    >
    > but I don't have a clue if that is even close to working or if it will give
    > me an "a" at the end and beginning of the words. Any help would be greatly
    > appreciated.


    use strict; use warnings;

    my $string = 'Hello theRe danielBurch howAreYou?';
    $string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl 6 /x will be default
    print $string, "\n";
    it_says_BALLS_on_your_forehead, Feb 19, 2006
    #3
  4. Ala Qumsieh

    Ala Qumsieh Guest

    it_says_BALLS_on_your_forehead wrote:
    > use strict; use warnings;
    >
    > my $string = 'Hello theRe danielBurch howAreYou?';
    > $string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl
    > 6 /x will be default


    Still, you don't need to escape it. /x only affects the regexp part, and
    not the replacement part.
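A small sketch of that point, assuming Perl 5 and the sample string from the earlier post: with `/x` the spaces inside the pattern are ignored, while the space in the replacement is kept whether or not it is escaped. The variable names are made up for the illustration.

```perl
use strict; use warnings;

my $string = 'Hello theRe danielBurch howAreYou?';

# Under /x the spaces in the pattern are ignored; the replacement is a plain string.
(my $plain   = $string) =~ s/ ([a-z]) ([A-Z]) /$1 $2/gx;
(my $escaped = $string) =~ s/ ([a-z]) ([A-Z]) /$1\ $2/gx;

print "$plain\n";   # Hello the Re daniel Burch how Are You?
print $plain eq $escaped ? "same result\n" : "different result\n";
```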

    --Ala
    Ala Qumsieh, Feb 19, 2006
    #4
  5. Ala Qumsieh wrote:
    > it_says_BALLS_on_your_forehead wrote:
    > > use strict; use warnings;
    > >
    > > my $string = 'Hello theRe danielBurch howAreYou?';
    > > $string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl
    > > 6 /x will be default

    >
    > Still, you don't need to escape it. /x only affects the regexp part, and
    > not the replacement part.


    ahh, right you are! i always forget that.
    it_says_BALLS_on_your_forehead, Feb 19, 2006
    #5
  6. DANIEL BURCH wrote:
    > I have a file that apparently had html tags stripped out of it, or
    > something, but no space characters added to replace the tags so it ended up
    > with a lot of words run together like "ExplosionThis". In almost all cases
    > there is a lower case letter followed by an upper case letter. I am trying
    > to figure out a substitution statement that would separate them, but I'm not
    > sure what would work. Maybe something like
    >
    > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
    >
    > but I don't have a clue if that is even close to working or if it will give
    > me an "a" at the end and beginning of the words. Any help would be greatly
    > appreciated.


    $ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
    ThisIsATest
    This Is A Test


    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Feb 20, 2006
    #6
  7. John W. Krahn wrote:
    > DANIEL BURCH wrote:
    > > I have a file that apparently had html tags stripped out of it, or
    > > something, but no space characters added to replace the tags so it ended up
    > > with a lot of words run together like "ExplosionThis". In almost all cases
    > > there is a lower case letter followed by an upper case letter.


    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    > > I am trying
    > > to figure out a substitution statement that would separate them, but I'm not
    > > sure what would work. Maybe something like
    > >
    > > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
    > >
    > > but I don't have a clue if that is even close to working or if it will give
    > > me an "a" at the end and beginning of the words. Any help would be greatly
    > > appreciated.

    >
    > $ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
    > ThisIsATest
    > This Is A Test


    the above is pretty slick, but doesn't address what the OP asked for.
    what about cases where the data consists of a word in all caps?
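To make the all-caps concern concrete, here is a hedged comparison on an invented sample string. The second substitution is only one possible variant, not something posted in the thread: it narrows the lookbehind to a lower-case letter so that runs of capitals are left alone.

```perl
use strict; use warnings;

my $sample = 'theNASAReport wasHere';   # invented test data

# The posted one-liner: a space goes before every capital that follows any character.
(my $any_before = $sample) =~ s/(?<=.)(?=[[:upper:]])/ /g;
print "$any_before\n";     # the N A S A Report was Here

# Variant: only split where a lower-case letter precedes the capital.
(my $lower_before = $sample) =~ s/(?<=[[:lower:]])(?=[[:upper:]])/ /g;
print "$lower_before\n";   # the NASAReport was Here
```

Neither version can separate "NASA" from "Report", since nothing in the letter case marks that boundary.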
    it_says_BALLS_on_your_forehead, Feb 20, 2006
    #7
  8. Matt Garrish

    Matt Garrish Guest

    "it_says_BALLS_on_your_forehead" <> wrote in message
    news:...
    >
    > John W. Krahn wrote:
    >> DANIEL BURCH wrote:
    >> > I have a file that apparently had html tags stripped out of it, or
    >> > something, but no space characters added to replace the tags so it
    >> > ended up
    >> > with a lot of words run together like "ExplosionThis". In almost all
    >> > cases
    >> > there is a lower case letter followed by an upper case letter.

    >
    > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >> > I am trying
    >> > to figure out a substitution statement that would separate them, but
    >> > I'm not
    >> > sure what would work. Maybe something like
    >> >
    >> > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
    >> >
    >> > but I don't have a clue if that is even close to working or if it will
    >> > give
    >> > me an "a" at the end and beginning of the words. Any help would be
    >> > greatly
    >> > appreciated.

    >>
    >> $ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
    >> ThisIsATest
    >> This Is A Test

    >
    > the above is pretty slick, but doesn't address what the OP asked for.
    > what about cases where the data consists of a word in all caps?
    >


    That's why the OP will probably learn the hard way that regexes are more
    trouble than they're worth in this kind of situation, and that it's easier
    to go back to the source and start over. A spellchecker might prove more
    useful if that's not possible...

    Matt
    Matt Garrish, Feb 20, 2006
    #8
  9. Samwyse

    Samwyse Guest

    DANIEL BURCH wrote:
    > I have a file that apparently had html tags stripped out of it, or
    > something, but no space characters added to replace the tags so it ended up
    > with a lot of words run together like "ExplosionThis".


    This is a bit off-topic, and definitely not related to Perl, but your
    file didn't have HTML tags stripped from it. When stripping HTML tags,
    you aren't supposed to replace them with whitespace. For example,
    consider the following HTML, which italicizes some of the alphabet:

    a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>

    Introducing spaces for the tags would mess everything up.
    Samwyse, Feb 20, 2006
    #9
  10. Matt Garrish

    Matt Garrish Guest

    "Samwyse" <> wrote in message
    news:phcKf.34787$...
    > DANIEL BURCH wrote:
    >> I have a file that apparently had html tags stripped out of it, or
    >> something, but no space characters added to replace the tags so it ended
    >> up
    >> with a lot of words run together like "ExplosionThis".

    >
    > This is a bit off-topic, and definitely not related to Perl, but your file
    > didn't have HTML tags stripped from it. When stripping HTML tags, you
    > aren't supposed to replace them with whitespace. For example, consider
    > the following HTML, which italicizes some of the alphabet:
    >
    > a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>
    >
    > Introducing spaces for the tags would mess everything up.


    But then consider:

    <td>I like to</td><td>Format everything</td><td>Inside cells</td><td>On one
    line</td>

    Never underestimate a bad html parsing job... : )
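The two examples pull in opposite directions, and a naive stripper can treat them differently. The sketch below uses a hypothetical `strip_tags` helper with an illustrative, incomplete list of block-level tags. It turns those tags into a space and deletes everything else. Treat it as a demonstration of the idea only, since regexes are not a reliable way to parse real HTML.

```perl
use strict; use warnings;

# Hypothetical helper, not from the thread; the tag list is an assumption.
sub strip_tags {
    my ($html) = @_;
    # Block-level tags become a space so neighbouring text stays separated...
    $html =~ s{</?(?:td|tr|p|div|h[1-6]|li|br)\b[^>]*>}{ }gi;
    # ...while all remaining (inline) tags are removed outright.
    $html =~ s{<[^>]*>}{}g;
    # Collapse runs of whitespace and trim both ends.
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

print strip_tags('a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>'), "\n";
# abcdefghijklmnopqrstuvwxyz
print strip_tags('<td>I like to</td><td>Format everything</td>'), "\n";
# I like to Format everything
```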

    Matt
    Matt Garrish, Feb 20, 2006
    #10
  11. "Matt Garrish" <> writes:

    > But then consider:


    > <td>I like to</td><td>Format everything</td><td>Inside
    > cells</td><td>On one line</td>


    > Never underestimate a bad html parsing job... : )


    Yep. When I really want to get the visible text of a page without the
    html, `lynx -dump $url` comes in handy.


    --
    Aaron --
    http://360.yahoo.com/aaron_baugher
    Aaron Baugher, Feb 20, 2006
    #11
  12. DANIEL BURCH

    DANIEL BURCH Guest

    I think it was like:

    <h1>This is a header</h1>This is some text.
    "Samwyse" <> wrote in message news:phcKf.34787$...
    > DANIEL BURCH wrote:
    >> I have a file that apparently had html tags stripped out of it, or
    >> something, but no space characters added to replace the tags so it ended up
    >> with a lot of words run together like "ExplosionThis".
    >
    > This is a bit off-topic, and definitely not related to Perl, but your
    > file didn't have HTML tags stripped from it. When stripping HTML tags,
    > you aren't supposed to replace them with whitespace. For example,
    > consider the following HTML, which italicizes some of the alphabet:
    >
    > a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>
    >
    > Introducing spaces for the tags would mess everything up.
    DANIEL BURCH, Feb 20, 2006
    #12
  13. DANIEL BURCH

    DANIEL BURCH Guest


    >That's why the OP will probably learn the hard way that regexes are more
    >trouble than they're worth in this kind of situation, and that it's easier
    >to go back to the source and start over. A spellchecker might prove more
    >useful if that's not possible...


    >Matt


    Hey - It was about 9000 lines of data in text format. Kind of big to go
    through with a spell checker. What Balls sent in his first post worked just
    how I wanted it to. I had to add a few lines to it to handle more cases,
    like ".Cap" and "!Cap", but it fixed the file in about 30 seconds.
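The added lines themselves were not posted, so the following is only a guess at one way to cover the ".Cap" and "!Cap" cases: widen the first character class to include sentence punctuation. The `?` in the class and the sample text are assumptions made for this sketch.

```perl
use strict; use warnings;

# Hypothetical extension; the OP's actual added lines are not shown in the thread.
my $text = 'It explodedThis is bad.Really!Yes';
$text =~ s/([a-z.!?])([A-Z])/$1 $2/g;
print "$text\n";   # It exploded This is bad. Really! Yes
```

One trade-off is that abbreviations get split as well: "U.S.A" would become "U. S. A".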

    Thanks to the group for the posts.

    Dan
    DANIEL BURCH, Feb 20, 2006
    #13
  14. DANIEL BURCH

    DANIEL BURCH Guest

    >use strict; use warnings;

    >my $string = 'Hello theRe danielBurch howAreYou?';
    >$string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl
    >6 /x will be default
    >print $string, "\n";


    What Balls sent in his post worked just how I wanted it to. I had to add a
    few lines to it to handle more cases, like ".Cap" and "!Cap", but
    it fixed the file.
    DANIEL BURCH, Feb 20, 2006
    #14
