Pattern Matching on Case

DANIEL BURCH

I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis". In almost all cases
there is a lower case letter followed by an upper case letter. I am trying
to figure out a substitution statement that would separate them, but I'm not
sure what would work. Maybe something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

but I don't have a clue if that is even close to working or if it will give
me an "a" at the end and beginning of the words. Any help would be greatly
appreciated.
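
A note on the attempted statement above, offered as a rough sketch rather than a definitive diagnosis: a leading `*` has nothing to repeat, so Perl rejects the pattern, and the replacement half of `s///` is plain text rather than a pattern. The sample string and variable names below are made up for illustration.

```perl
use strict;
use warnings;

# The leading * has nothing to repeat, so the pattern fails to compile.
my $pattern = '*[a-z][A-Z]*';
my $ok      = eval { qr/$pattern/; 1 };
my $error   = $ok ? '' : $@;
print "compile error: $error" if $error;

# Dropping the stray * characters gives a valid pattern, but the
# replacement side is literal text, so the matched letters are lost.
(my $literal = 'ExplosionThis') =~ s/[a-z][A-Z]/[a-z] [A-Z]/g;
print "$literal\n";    # Explosio[a-z] [A-Z]his

# Capturing both letters and reusing them keeps the text intact.
(my $fixed = 'ExplosionThis') =~ s/([a-z])([A-Z])/$1 $2/g;
print "$fixed\n";      # Explosion This
```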
 
A. Sinan Unur

DANIEL said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it
ended up with a lot of words run together like "ExplosionThis". In
almost all cases there is a lower case letter followed by an upper
case letter. I am trying to figure out a substitution statement that
would separate them, but I'm not sure what would work. Maybe
something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

I am curious: What do you think this does?

Here is a quick and dirty attempt based on your vague specification, and
nothing else. You might want to post some real code along with data
after reading the posting guidelines for this group.

#!/usr/bin/perl

use strict;
use warnings;

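my $text;
{
    local $/;    # slurp mode: read all of DATA in one go
    $text = <DATA>;
}

# Mimic the damaged file: drop each period along with the whitespace after it.
$text =~ s{\.\s+}{}g;

# Put ". " back wherever a lowercase letter runs straight into an uppercase one.
$text =~ s{([[:lower:]])([[:upper:]])}{$1\. $2}g;

print "$text\n";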

__DATA__
I have a file that apparently had html tags stripped out of it,
or something, but no space characters added to replace the tags
so it ended up with a lot of words run together like "ExplosionThis."
In almost all cases there is a lower case letter followed by an
upper case letter. I am trying to figure out a substitution
statement that would separate them, but I'm not sure what would
work. Maybe something like

Notice the mess this makes of "ExplosionThis".

Sinan
 
it_says_BALLS_on_your_forehead

DANIEL said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis". In almost all cases
there is a lower case letter followed by an upper case letter. I am trying
to figure out a substitution statement that would separate them, but I'm not
sure what would work. Maybe something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

but I don't have a clue if that is even close to working or if it will give
me an "a" at the end and beginning of the words. Any help would be greatly
appreciated.

use strict; use warnings;

my $string = 'Hello theRe danielBurch howAreYou?';
# i escape the space b/c in Perl 6 /x will be default
$string =~ s/([a-z])([A-Z])/$1\ $2/g;
print $string, "\n";
 
Ala Qumsieh

it_says_BALLS_on_your_forehead said:
use strict; use warnings;

my $string = 'Hello theRe danielBurch howAreYou?';
# i escape the space b/c in Perl 6 /x will be default
$string =~ s/([a-z])([A-Z])/$1\ $2/g;

Still, you don't need to escape it. /x only affects the regexp part, and
not the replacement part.

--Ala
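
A small sketch of the point above, reusing the sample string from the earlier post. The variable names are only for illustration. Under `/x` the spaces inside the pattern are ignored, while the space in the replacement is kept with or without the backslash.

```perl
use strict;
use warnings;

my $string = 'Hello theRe danielBurch howAreYou?';

# Under /x, whitespace inside the pattern is ignored...
(my $with_x = $string) =~ s/ ([a-z]) ([A-Z]) /$1 $2/gx;

# ...but the replacement is an ordinary string, so its space is kept
# whether or not it is escaped.
(my $escaped = $string) =~ s/([a-z])([A-Z])/$1\ $2/g;

print "$with_x\n";     # Hello the Re daniel Burch how Are You?
print "$escaped\n";    # Hello the Re daniel Burch how Are You?
```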
 
it_says_BALLS_on_your_forehead

Ala said:
it_says_BALLS_on_your_forehead said:
use strict; use warnings;

my $string = 'Hello theRe danielBurch howAreYou?';
# i escape the space b/c in Perl 6 /x will be default
$string =~ s/([a-z])([A-Z])/$1\ $2/g;

Still, you don't need to escape it. /x only affects the regexp part, and
not the replacement part.

ahh, right you are! i always forget that.
 
John W. Krahn

DANIEL said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis". In almost all cases
there is a lower case letter followed by an upper case letter. I am trying
to figure out a substitution statement that would separate them, but I'm not
sure what would work. Maybe something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

but I don't have a clue if that is even close to working or if it will give
me an "a" at the end and beginning of the words. Any help would be greatly
appreciated.

$ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
ThisIsATest
This Is A Test


John
 
it_says_BALLS_on_your_forehead

John said:
DANIEL said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis". In almost all cases
there is a lower case letter followed by an upper case letter.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I am trying
to figure out a substitution statement that would separate them, but I'm not
sure what would work. Maybe something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

but I don't have a clue if that is even close to working or if it will give
me an "a" at the end and beginning of the words. Any help would be greatly
appreciated.

$ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
ThisIsATest
This Is A Test

the above is pretty slick, but doesn't address what the OP asked for.
what about cases where the data consists of a word in all caps?
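
To make the all-caps concern concrete, here is a rough comparison on a made-up input string. The second substitution is an assumed variant, not something posted in this thread: it swaps the `.` in the lookbehind for `[[:lower:]]`, which matches the "lower case followed by upper case" description.

```perl
use strict;
use warnings;

my $input = 'ExplosionThis is a NASA test';

# John's pattern: a space goes in before every capital that is not at the
# very start of the string, so words in all caps get split up too.
(my $any_char = $input) =~ s/(?<=.)(?=[[:upper:]])/ /g;

# Assumed variant: require a lowercase letter just before the capital.
(my $lower_only = $input) =~ s/(?<=[[:lower:]])(?=[[:upper:]])/ /g;

print "$any_char\n";      # Explosion This is a  N A S A test
print "$lower_only\n";    # Explosion This is a NASA test
```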
 
Matt Garrish

it_says_BALLS_on_your_forehead said:
DANIEL said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis". In almost all cases
there is a lower case letter followed by an upper case letter.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I am trying
to figure out a substitution statement that would separate them, but I'm not
sure what would work. Maybe something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

but I don't have a clue if that is even close to working or if it will give
me an "a" at the end and beginning of the words. Any help would be greatly
appreciated.

$ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
ThisIsATest
This Is A Test

the above is pretty slick, but doesn't address what the OP asked for.
what about cases where the data consists of a word in all caps?

That's why the OP will probably learn the hard way that regexes are more
trouble than they're worth in this kind of situation, and that it's easier
to go back to the source and start over. A spellchecker might prove more
useful if that's not possible...

Matt
 
Samwyse

DANIEL said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis".

This is a bit off-topic, and definitely not related to Perl, but your
file didn't have HTML tags stripped from it. When stripping HTML tags,
you aren't supposed to replace them with whitespace. For example,
consider the following HTML, which italicizes some of the alphabet:

a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>

Introducing spaces for the tags would mess everything up.
 
Matt Garrish

Samwyse said:
This is a bit off-topic, and definitely not related to Perl, but your file
didn't have HTML tags stripped from it. When stripping HTML tags, you
aren't supposed to replace them with whitespace. For example, consider
the following HTML, which italicizes some of the alphabet:

a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>

Introducing spaces for the tags would mess everything up.

But then consider:

<td>I like to</td><td>Format everything</td><td>Inside cells</td><td>On one line</td>

Never underestimate a bad html parsing job... : )

Matt
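
A quick way to see both sides of this exchange, as a sketch only: `strip_tags` below is a hypothetical helper that deletes anything between angle brackets, which is far too naive for real HTML but enough to show the effect. The inline-italics input is shortened from the example above.

```perl
use strict;
use warnings;

# Hypothetical, deliberately naive tag stripper (illustration only;
# a real job would use an HTML parser).
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]+>//g;    # delete each tag without adding whitespace
    return $html;
}

# Inline tags: dropping them with no spaces is exactly right.
my $inline = strip_tags('a<i>bcd</i>e<i>fgh</i>i');
print "$inline\n";    # abcdefghi

# Table cells: the same rule runs the cell contents together.
my $cells = strip_tags('<td>I like to</td><td>Format everything</td><td>Inside cells</td>');
print "$cells\n";     # I like toFormat everythingInside cells
```

The second result has the same "lowercase letter followed by uppercase letter" shape as the data in the original question.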
 
DANIEL BURCH

I think it was like:

<h1>This is a header</h1>This is some text.

Samwyse said:
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis".

This is a bit off-topic, and definitely not related to Perl, but your
file didn't have HTML tags stripped from it. When stripping HTML tags,
you aren't supposed to replace them with whitespace. For example,
consider the following HTML, which italicizes some of the alphabet:

a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>

Introducing spaces for the tags would mess everything up.
 
DANIEL BURCH

Matt said:
That's why the OP will probably learn the hard way that regexes are more
trouble than they're worth in this kind of situation, and that it's easier
to go back to the source and start over. A spellchecker might prove more
useful if that's not possible...

Hey - It was about 9000 lines of data in text format. Kind of big to go
through with a spell checker. What Balls sent in his first post worked just
how I wanted it to. I had to add a few lines to it with more variables like
cases of ".Cap" and "!Cap", but it fixed the file in about 30 seconds.

Thanks to the group for the posts.

Dan
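
The extra lines mentioned above were not posted, so the following is only a guess at what handling ".Cap" and "!Cap" might look like. The sample string and the second substitution are assumptions for illustration.

```perl
use strict;
use warnings;

my $string = 'It exploded.Then silence!Nobody movedUntil dawn.';

# Lowercase letter directly followed by a capital, as in the earlier post.
$string =~ s/([a-z])([A-Z])/$1 $2/g;

# Assumed extra rule: sentence punctuation directly followed by a capital.
$string =~ s/([.!?])([A-Z])/$1 $2/g;

print "$string\n";    # It exploded. Then silence! Nobody moved Until dawn.
```

One caveat with the punctuation rule: it would also split abbreviations such as "U.S.A" into "U. S. A", so it is worth checking a sample of the output by eye.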
 
DANIEL BURCH

it_says_BALLS_on_your_forehead said:
use strict; use warnings;
my $string = 'Hello theRe danielBurch howAreYou?';
# i escape the space b/c in Perl 6 /x will be default
$string =~ s/([a-z])([A-Z])/$1\ $2/g;
print $string, "\n";

What Balls sent in his post worked just how I wanted it to. I had to add a
few lines to it with more variables like cases of ".Cap" and "!Cap", but
it fixed the file.
 
