Perl RegExp question

Keith

All:
I am having a problem with Perl's regular expressions.

I am trying to change this --
King of the Forest Rangers S.O.S. Ranger Chapter 9.AVI

To this --
King of the Forest Rangers Ch 9 S.O.S. Ranger.avi

I tried to use this regexp but it didn't work as expected --
s/^(.*)(\s*)(.*)(\s*)(Chapter)(\s*)(\d*).AVI/$1 Ch $7 $3.avi/g

All I got was this --
King of the Forest Rangers S.O.S. Ranger Ch 9.avi

How do I make a Perl regexp that works with any number of strings at the beginning (the title of the file),
any number of strings in the middle (the title of that file's episode) and the Chapter number?
 
ccc31807

K> I am trying to change this --
K> King of the Forest Rangers S.O.S. Ranger Chapter 9.AVI

K> To this --
K> King of the Forest Rangers Ch 9 S.O.S. Ranger.avi

The Perl mavens will tar and feather me for saying this, but I'll say
it anyway. And BTW, this is true not only for Perl but also for
anything that uses REs (like vi, for example).

Build up your RE piece by piece. This is somewhat difficult to do
without a top loop (like Lisp) but you can still do it. Match your
original term by term, using the $1, $2, etc., variables to see what
you get. The prematch and postmatch variables ($`, $', and kin) are
eye-opening as well. Look at perlvar.

My strategy, which isn't the best by any means, is to start from the
front and match token by token, until I can capture everything I want
in the numbered variables, and then the job is as good as done. I find
it much harder to compose a RE in one stroke -- it's better to have
one that does ten percent of what it's supposed to do than to have one
that does 100 percent of nothing.

CC.
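
To illustrate the piece-by-piece approach, here is a minimal sketch. It first returns what each numbered group of the pattern from the original question captured, then matches only the part of the name that is certain (the chapter number and the extension) and looks at what lies in front of it. The sub names original_captures and before_chapter are made up for this illustration, and ${^PREMATCH} with the /p modifier assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Return the numbered captures of the pattern from the original question,
# so that each group can be inspected on its own.
sub original_captures {
    my ($name) = @_;
    my @c = $name =~ /^(.*)(\s*)(.*)(\s*)(Chapter)(\s*)(\d*).AVI/;
    return @c;
}

# Match only the certain part (chapter number and extension) and report
# what precedes it. With /p, ${^PREMATCH} holds the text before the match.
sub before_chapter {
    my ($name) = @_;
    return unless $name =~ /\s+Chapter\s+(\d+)\.AVI$/ip;
    return { chapter => $1, before => ${^PREMATCH} };
}

my $name = 'King of the Forest Rangers S.O.S. Ranger Chapter 9.AVI';

my @c = original_captures($name);
printf "\$%d = [%s]\n", $_ + 1, $c[$_] for 0 .. $#c;

my $info = before_chapter($name);
print "chapter [$info->{chapter}] before [$info->{before}]\n";
```

Printing the groups shows that the first greedy (.*) swallows everything up to "Chapter", leaving $3 empty, which is why the episode title never moved in the original substitution.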
 
Keith

CC:
But how does one match a title that in one example starts off with "King of the Royal Mounties" and
in another has as its title "King's Row"? Is there a global RE that takes care of such titles, or is each
controlled by the number of words in the file's title?


Keith
 
ccc31807

K> But how does one match a title that in one example starts off with "King of the Royal Mounties" and
K> in another has as its title "King's Row"? Is there a global RE that takes care of such titles, or is each
K> controlled by the number of words in the file's title?

That's not what you have. You have a sequence of tokens separated by
white space, most composed of alphabetical characters, some of numeric
characters, some both, and some with non-alphanumeric characters. It's
possible to have a non-alphanumeric character in a title, like 'King's
Row'.

You can't match what you don't have, and you don't have a book title.
If you want to consider everything before the literal string 'Chapter'
as the title, you can do that, but that's a characteristic you impose
on the data, not something that's inherent in the data.

Of course, in a case like this one, where you are reading in a mass of
character data, you should normalize the data in some way -- something
that Perl excels at. You should also assume that you will get error
lines, and write those out to an error file.

It might be helpful if you could post several dozen lines of your
input data.

CC.
 
Keith

CC:
OK, I understand now. Here are some more examples of input and wanted output.

Title of show -- "King of Royal Mounted" or "King's Row"
Title of episode -- "Murderer's Row" or "Saps at Sea"
Number of episode -- 1 through 13

Input Examples:
King of Royal Mounted Murderer's Row Chapter 2.AVI

King's Row Saps at Sea Chapter 11.AVI


Again, how do I make some generic regexp in Perl in order to change the above to the following output?

Output Examples:
King of Royal Mounted Ch 2 Murderer's Row.avi

King's Row Ch 11 Saps at Sea.avi
 
Ted Zlatanov

K> OK, I understand now. Here are some more examples of input and wanted output.

K> Title of show -- "King of Royal Mounted" or "King's Row"
K> Title of episode -- "Murderer's Row" or "Saps at Sea"
K> Number of episode -- 1 through 13

K> Input Examples:
K> King of Royal Mounted Murderer's Row Chapter 2.AVI

K> King's Row Saps at Sea Chapter 11.AVI


K> Again, how do I make some generic regexp in Perl in order to change the above to the following output?

K> Output Examples:
K> King of Royal Mounted Ch 2 Murderer's Row.avi

K> King's Row Ch 11 Saps at Sea.avi

Build a list of possible show names. Untested:

my @shows = ('King of Royal Mounted', "King's Row");
foreach my $show (@shows)
{
    # \Q...\E quotes any regex metacharacters in the show name;
    # note the i modifier to match AVI, avi, etc.
    $name =~ s/^\Q$show\E\s+(.*)\s+Chapter\s+(\d+)\.avi$/$show Ch $2 $1.avi/i;
}

Ted
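
For anyone who wants to try the show-list idea against the two sample names, here is a self-contained sketch. The sub name rename_episode is hypothetical, and trying longer show names first is an extra precaution assumed here (so that one title being a prefix of another cannot shadow it), not something from the post above.

```perl
use strict;
use warnings;

my @shows = ('King of Royal Mounted', "King's Row");

# Return the rearranged file name, or nothing if no known show matches.
sub rename_episode {
    my ($name, @known) = @_;
    for my $show (sort { length($b) <=> length($a) } @known) {
        # \Q...\E treats the show name as literal text, not as a pattern.
        if ($name =~ /^\Q$show\E\s+(.+)\s+Chapter\s+(\d+)\.avi$/i) {
            return "$show Ch $2 $1.avi";
        }
    }
    return;
}

for my $file ("King of Royal Mounted Murderer's Row Chapter 2.AVI",
              "King's Row Saps at Sea Chapter 11.AVI") {
    my $new = rename_episode($file, @shows);
    print defined $new ? "$file => $new\n" : "no known show in: $file\n";
}
```

A name that matches no show in the list is left alone and can be reported, rather than being renamed by guesswork.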
 
Keith

Ted:
What if you don't know what the title is exactly in an .avi file? That is, you know that it's the first word(s)
of the file name but nothing more?


Keith Lee
 
azrazer

K> Ted:
K> What if you don't know what the title is exactly in an .avi file? That is, you know that it's the first word(s)
K> of the file name but nothing more?
K> Keith Lee

hulo,
you may try to build a big database (an array in this case will be
sufficient) containing all the show names (extracted from the internet)....
Otherwise, you cannot expect a computer to be "intelligent" and know
what is a show and what is not... this part of the program must come
from the programmer :)

best,
azra
 
Keith

Azra:
Yes, I have learned about the limitations of Perl RegExp or RegExp in general. Thank you.

Keith
 
ccc31807

K> Azra:
K> Yes, I have learned about the limitations of Perl RegExp or RegExp in general. Thank you.

K> Keith

This isn't a limitation of regular expressions. A regular expression
is a pattern, and the regular expression language is a pattern-matching
language, like Prolog, for example. In order to use it, you must have
patterns.

To illustrate, HTML is also a 'pattern matching language' in a sense.
Look at examples (a) and (b):
(a)
<html>
<head>
<title>My Home Page</title>
</head>
<body>
<h1>Keith's Home Page</h1>
<p>How do you like it?</p>
</body>
</html>
(b)
My Home Page Keith's Home Page How do you like it?

Now, suppose you had (b)? Would you say that it's valid HTML? Of
course not! The same is true for your data, YOU DON'T HAVE PATTERNS TO
MATCH.

At the risk of being a little insulting (you may have earned it)
you've been slow on the uptake. GIGO. Your input is garbage, so your
output will be garbage.

I'm not saying that your data is invalid necessarily -- I can't
determine that and make no judgment on that point. What I'm saying is
that you do not have valid input to feed to a regular expression to
generate the kind of output that you want.

If you want my advice, I would input the data with some kind of
delimited file format, and then use split() or related to break it
apart.

my $input = q(King of Royal Mounted:Murderer's Row:Chapter 2.AVI);
my ($show, $episode, $avi) = split(/:/, $input);
# escape the dot and anchor the match so only a trailing ".AVI" is stripped
my ($chapter) = $avi =~ /^(.+)\.AVI$/;
my $vid = "$episode.AVI";
print qq(show: $show\nepisode: $episode\navi: $avi\nchapter: $chapter\nvid: $vid\n);

outputs this:
show: King of Royal Mounted
episode: Murderer's Row
avi: Chapter 2.AVI
chapter: Chapter 2
vid: Murderer's Row.AVI

CC.
 
Ted Zlatanov

K> What if you don't know what the title is exactly in an .avi file?
K> That is, you know that it's the first word(s) of the file name but
K> nothing more?

Assuming you're talking about TV shows, try
http://thetvdb.com/?tab=advancedsearch (it's completely open and has a
developer API).

Another approach is, if you can't match any known shows, ask for a name
and add it to the show list, then save the list. So the list grows as
you encounter more shows.

Ted
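
The growing show list could be kept in a plain text file, one name per line. This is only a sketch: the subs load_shows and add_show and the one-name-per-line file layout are assumptions made for illustration, and the interactive prompt itself is left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read one show name per line; a missing file simply means an empty list.
sub load_shows {
    my ($path) = @_;
    return () unless -e $path;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    chomp(my @shows = <$fh>);
    close $fh;
    return grep { /\S/ } @shows;
}

# Append a newly learned show name unless it is already listed
# (compared case-insensitively). Returns 1 if added, 0 otherwise.
sub add_show {
    my ($path, $show) = @_;
    return 0 if grep { lc($_) eq lc($show) } load_shows($path);
    open my $fh, '>>', $path or die "Cannot append to $path: $!";
    print {$fh} "$show\n";
    close $fh;
    return 1;
}

# Demo against a temporary file so nothing is left behind.
my ($demo_fh, $demo_path) = tempfile(UNLINK => 1);
close $demo_fh;
add_show($demo_path, $_) for 'King of Royal Mounted', "King's Row", "king's row";
print "$_\n" for load_shows($demo_path);
```

In a real script the name passed to add_show would come from the user (for example, read from STDIN) whenever none of the known shows matches a file name.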
 
Ted Zlatanov

c> This isn't a limitation of regular expressions. A regular expression
c> is a pattern, and the regular expression language is a pattern-matching
c> language, like Prolog, for example. In order to use it, you must have
c> patterns.
....
c> The same is true for your data, YOU DON'T HAVE PATTERNS TO MATCH.

Sure he does. It's not as if he's looking for DNA patterns that need to
be statistically determined. There are only so many shows he can have
in his database.

Perl regular expressions are definitely not just about matching
patterns. Especially if you consider that you can embed Perl code right
in them. Many regexp limitations just don't apply.

Ted
 
ccc31807

c> The same is true for your data, YOU DON'T HAVE PATTERNS TO MATCH.

T> Sure he does. It's not as if he's looking for DNA patterns that need to
T> be statistically determined. There are only so many shows he can have
T> in his database.

No he doesn't. He wants to match a title, an episode, and a file
name, and all he has are space-delimited tokens.

If he wanted to match the tokens, I'd agree with you. 'Pattern' is a
concept we impose on the data, not something inherent in the data.
What is the title of "King of Royal Mounted Murderer's Row"? 'King of
Royal'? 'King of Royal Mounted'? 'King of Royal Mounted Murderer's'?
'King of Royal Mounted Murderer's Row'?

My point was that he wants to impose order on an essentially unordered
collection of tokens and does not have anything by which to collect
the words into groups. The fact that a pattern language recognizes
patterns isn't a limitation of the language, it's a description of the
language. He's trying to use the wrong tool for the job, and blames
the tool when it fails to do the job.

Besides all of that, he's basically doing data munging, and REs aren't
particularly suited to data munging.

CC.
 
Ted Zlatanov

c> No he doesn't. He wants to match a title, an episode, and a file
c> name, and all he has are space-delimited tokens.

c> If he wanted to match the tokens, I'd agree with you. 'Pattern' is a
c> concept we impose on the data, not something inherent in the data.
c> What is the title of "King of Royal Mounted Murderer's Row"? 'King of
c> Royal'? 'King of Royal Mounted'? 'King of Royal Mounted Murderer's'?
c> 'King of Royal Mounted Murderer's Row'?

There are patterns inherent in most data and context helps establish
them. Take English, for example. Sentences are terribly ambiguous
without context. The programming distinction is between lexing ("where
are the words?") and parsing ("what are the words saying?").

Another good example is Perl code itself. Does this:

map { "no $_" } qw/1 2 3/, qw/4 5 6/;

mean to map across 1-6 or 1-3, then leave 4-6 alone? At a glance it's
confusing, so that's where the parser comes in and determines how the
expressions will be grouped.
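
For the record, a quick check of how Perl groups that expression (a small sketch; the variable names are made up): map takes everything to its right as its list, so all six values are mapped unless the map call is wrapped in its own parentheses.

```perl
use strict;
use warnings;

# map takes everything to its right as its list, so both qw// lists are mapped.
my @all = map { "no $_" } qw/1 2 3/, qw/4 5 6/;

# Parentheses around the map call limit it to the first list; the outer
# parentheses keep the whole list on the right-hand side of the assignment.
my @first_only = ((map { "no $_" } qw/1 2 3/), qw/4 5 6/);

print scalar(@all), " elements: @all\n";
print scalar(@first_only), " elements: @first_only\n";
```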

Incidentally this is one of the things I like about Lisp: there is very
little parsing ambiguity in the language, and in fact writing a Lisp
parser is a famously easy task. You can say that generally the more
"natural" a language is (approximately meaning "the syntax is looser"),
the harder it is to parse. Obviously Perl is pretty "natural" by design.

If you're interested in more on this topic, read up on lexing and
parsing. Perl 6 has *very* extensive support for those, way beyond what
regular expressions can provide (more like the Perl 5 Parse::RecDescent
module). Whether that's a good or a bad thing depends on who you ask.

Anyhow, the context here is "names of TV show episodes." The file name
is not just space-delimited tokens, it's the name of a TV show followed
by the episode name and then the chapter number. So to parse out the TV
show name, you can either use feedback training (where the user teaches
the parser the names of the TV shows) or a knowledge database (the list
of all TV shows).

c> Besides all of that, he's basically doing data munging, and REs aren't
c> particularly suited to data munging.

I disagree. Well, either this is false or I've been using regular
expressions wrong for the last 15 years or so. Could be the latter.

Ted
 
ccc31807

T> Incidentally this is one of the things I like about Lisp: there is very
T> little parsing ambiguity in the language, and in fact writing a Lisp
T> parser is a famously easy task.

An S-expression is by itself an abstract syntax tree. You can take
(* 3 (+ 4 (- 5 6)) (/ 7 8)) and write that as an AST directly. IMO,
this is what makes Lisp both very powerful and very difficult.

T> Anyhow, the context here is "names of TV show episodes." The file name
T> is not just space-delimited tokens, it's the name of a TV show followed
T> by the episode name and then the chapter number.

I very rarely watch TV, and had no context within which to understand
the question. Obviously, if you have particular strings you want to
match, you can do that several ways. Still, to me the data appears to
be scrambled, even though to someone familiar with the names of TV
programs it might be pretty clear.

c> Besides all of that, he's basically doing data munging, and REs aren't
c> particularly suited to data munging.

T> I disagree. Well, either this is false or I've been using regular
T> expressions wrong for the last 15 years or so. Could be the latter.

This may be a point where individual experience colors our opinions.
For the past six years or so, I've earned my living as a data munger
(my job title is Database Manager but my responsibilities are mostly
creating reports based on the results of queries).

I see data as discrete values arranged in rows and columns, which
often results in multi-dimensional structures, e.g., reporting on
students by college, zip code, academic level, program, and credit
hours completed. To me, 'data munging' means reading input files,
rearranging the data, and writing output files --- and regular
expressions don't help at all with this kind of job. Instead, I use
hashes of hash refs a great deal.

I use regular expressions a lot, but mostly in connection with
operations on individual datums, NOT what I would consider data
munging.

To me, the OP's problem is a typical data munging task, and the OP is
using the wrong tool to do it but criticizing that tool for the
inability to get the job done. It's kind of like using a wrench to
drive nails, and faulting the wrench for not being able to drive nails
very well. You can do it, yes, but a hammer is much better for the
job.

CC.
 
Ted Zlatanov

c> I see data as discrete values arranged in rows and columns, which
c> often results in multi-dimensional structures, e.g., reporting on
c> students by college, zip code, academic level, program, and credit
c> hours completed. To me, 'data munging' means reading input files,
c> rearranging the data, and writing output files --- and regular
c> expressions don't help at all with this kind of job. Instead, I use
c> hashes of hash refs a great deal.

Oh boy, you're missing out on half the fun then. Regular expressions
are very good for manipulating and rearranging data, especially
line-based data. When each piece of data spans multiple lines it can
get a little harder to manipulate it.

For any data manipulation task, I try to do it in this sequence:

- for any line, ALWAYS produce the same output. Else,

- for any line, produce some output based on the input so far. Else,

- once all the data is consumed, produce some aggregate output.

Your usage seems to be mostly the third case, but the first two are very
useful as well. The first one (stateless data manipulation) is
especially useful because it can be done on any chunk of the data (which
in turn makes it easiest to parallelize and possibly map-reduce).
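
The three cases can be sketched side by side as follows. The "name number" input lines and the sub names swap_fields, make_running_total and totals_by_name are invented for this illustration only.

```perl
use strict;
use warnings;

# 1. Stateless: the output for a line depends on that line alone.
sub swap_fields {
    my ($line) = @_;
    (my $out = $line) =~ s/^(\w+)\s+(\d+)$/$2,$1/;
    return $out;
}

# 2. Stateful: the output depends on the input seen so far (a running total).
sub make_running_total {
    my $total = 0;
    return sub {
        my ($line) = @_;
        $total += $1 if $line =~ /(\d+)$/;
        return "$line (running total $total)";
    };
}

# 3. Aggregate: nothing is reported until all input has been consumed.
sub totals_by_name {
    my %sum;
    for my $line (@_) {
        $sum{$1} += $2 if $line =~ /^(\w+)\s+(\d+)$/;
    }
    return \%sum;
}

my @lines   = ('alice 3', 'bob 5', 'alice 4');
my $running = make_running_total();
print join(' | ', swap_fields($_), $running->($_)), "\n" for @lines;

my $sum = totals_by_name(@lines);
print "$_: $sum->{$_}\n" for sort keys %$sum;
```

Only the first sub can be applied to an arbitrary chunk of the input in isolation; the other two need either the preceding lines or the complete input.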

Ted
 
ccc31807

T> Regular expressions
T> are very good for manipulating and rearranging data, especially
T> line-based data.

What about delimited files that are 80 columns wide and 25000 rows
deep?

T> For any data manipulation task, I try to do it in this sequence:

I typically do the following:

my %hash;
while (<DATA>)
{
    next unless /\w/;    # skip blank lines
    chomp;
    # name these variables appropriately for your data
    my ($var1, $var2, $var3, $var4, $var5) = some_split_function($_);
    next if $var2 =~ /unwanted value/;
    $hash{$var1}{$var2}{$var3}{$var4} = $var5;
}

I then have all my desired data in a hash that I can sort and
manipulate, and print, resulting in this pattern:

foreach my $k1 (sort keys %hash)
{
    foreach my $k2 (sort keys %{$hash{$k1}})
    {
        # ... as many levels as I need
    }
}

T> - for any line, ALWAYS produce the same output. Else,

T> - for any line, produce some output based on the input so far. Else,

T> - once all the data is consumed, produce some aggregate output.

T> Your usage seems to be mostly the third case, but the first two are very
T> useful as well.

I usually have to aggregate data, so I sum or count datums on each
row. Don't get me wrong -- I use REs all the time, but for the task of
reading input, building data structures, and writing output, I don't
find them particularly useful.

CC.
 
Jim Gibson

ccc31807 said:
c> I usually have to aggregate data, so I sum or count datums on each
c> row. Don't get me wrong -- I use REs all the time, but for the task of
c> reading input, building data structures, and writing output, I don't
c> find them particularly useful.

Well, I do. Perhaps in your case the reason you don't need regular
expressions to parse your data is because your data is contained within
a well-structured database.

I typically use Perl to extract information from program output and log
files. These files are semi-structured, containing some fixed bits and
some variable bits. I am usually interested in the variable bits and
use regular expressions and the fixed bits to extract the data I want.

The format of the log files is sometimes under my control and sometimes
not. For these cases, regular expressions are invaluable.
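
As a sketch of what "fixed bits and variable bits" can look like in practice: the log format below is hypothetical (it does not come from any particular program), and parse_log_line is a made-up name. Named captures assume Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical log format: "<date> <LEVEL> job=<name> elapsed=<seconds>s"
sub parse_log_line {
    my ($line) = @_;
    return unless $line =~ /^
        (?<date>\d{4}-\d{2}-\d{2}) \s+    # variable: date
        (?<level>[A-Z]+)           \s+    # variable: severity
        job=(?<job>\w+)            \s+    # fixed "job=", variable name
        elapsed=(?<secs>\d+(?:\.\d+)?)s   # fixed "elapsed=" and "s", variable number
    \z/x;
    return +{ %+ };    # copy the named captures into a plain hash
}

my $rec = parse_log_line('2024-01-15 INFO job=backup elapsed=12.5s');
print "$rec->{job} took $rec->{secs}s\n" if $rec;
```

Lines that do not fit the pattern return nothing, so they can be counted or written to an error file instead of being silently mangled.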
 
Ted Zlatanov

c> What about delimited files that are 80 columns wide and 25000 rows
c> deep?

Sure, depending on the manipulation needed of course. 2MB or so of data
is hardly large, though.

c> I typically do the following:

c> while (<DATA>)
c> {
c> next unless /\w/;
c> chomp;
c> my ($var1, $var2, $var3 ... as appropriately named ) =
c> some_split_function($_);
c> next if $var2 =~ /unwanted value/;
c> $hash{$var1}{$var2}{$var3}{$var4} = $var5;
c> }

some_split_function() almost definitely uses regular expressions :)

c> I then have all my desired data in a hash that I can sort and
c> manipulate, and print, resulting in this pattern:

c> foreach my $k1 (sort keys %hash)
c> {
c> foreach my $k2 (sort keys %{$hash{$k1}})
c> {
c> ... as many levels as I need
c> }
c> }

Sure. This works great until %hash gets big. It also produces output
only after all the input is consumed, as opposed to line-based
processing which tends to be much more responsive. So you choose the
approach depending on the task you need to do and the size of your
input, although all other things being equal, go with stateless
line-by-line processing if you can. But I'm repeating myself...

c> I usually have to aggregate data, so I sum or count datums on each
c> row. Don't get me wrong -- I use REs all the time, but for the task of
c> reading input, building data structures, and writing output, I don't
c> find them particularly useful.

OK. Many of us do, so I think it's simply that you haven't had the
opportunity and need to try it, rather than a fundamental shortcoming of
regular expressions as a data processing and munging tool.

Ted
 
ccc31807

T> some_split_function() almost definitely uses regular expressions :)

Yes, it does, but (mostly) I use one of the built-in modules, so I
don't have to write it by hand.

T> Sure. This works great until %hash gets big. It also produces output
T> only after all the input is consumed, as opposed to line-based
T> processing which tends to be much more responsive.

This does process line by line. It also collects particular datums as
needed. How can you calculate a sum of a datum, or a count, without
collecting the data values? You don't necessarily need all the items
in a line, but for many cases you need to see all the lines before you
can generate your report.

T> OK. Many of us do, so I think it's simply that you haven't had the
T> opportunity and need to try it, rather than a fundamental shortcoming of
T> regular expressions as a data processing and munging tool.

As I said, I use REs regularly, and find them quite useful. I also
find them extremely useful in, for example, vi, where I rely heavily
on REs for various things. I also recognize that REs lie under the
hood of something like, for example, Text::ParseWords or Text::CSV_XS.

I have developed a habit of using the old C-like tools, like substr,
index, etc., when I can, because they seem to lie closer at hand than
an RE. I'm pretty proficient at what I do, so something must work.
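
Applied to the file names from the start of this thread, the substr/index style might look like the sketch below. The sub name split_at_chapter is made up, and it assumes the name contains the literal text " Chapter " followed by the number and a final extension.

```perl
use strict;
use warnings;

# Split "<front> Chapter <n>.<ext>" without a regular expression.
# Returns (front, chapter number) or an empty list if the layout differs.
sub split_at_chapter {
    my ($name) = @_;
    my $marker = ' Chapter ';
    my $pos = rindex($name, $marker);    # last occurrence of the marker
    my $dot = rindex($name, '.');        # dot in front of the extension
    return if $pos < 0 || $dot < $pos;
    my $start   = $pos + length($marker);
    my $front   = substr($name, 0, $pos);
    my $chapter = substr($name, $start, $dot - $start);
    return ($front, $chapter);
}

my ($front, $chapter) = split_at_chapter("King's Row Saps at Sea Chapter 11.AVI");
print "front [$front] chapter [$chapter]\n";
```

Like the regular expression versions, this only finds the chapter number; telling the show title from the episode title still needs a list of known shows.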

CC.
 
