regex guru needed

Jeff

Hi,

I've got a text file whose columns are delimited by multiple spaces
('\s{2,}'). The columns in this file may contain spaces; for example,
one column is comprised of cities, which may have names of multiple
words, e.g., 'San Jose'.

Here's a sample from the file:

31-Jan-2006 11:43:50 PM 649504 1.189 Public Website
Frankfurt DTAG Deutsche Telekom Frankfurt 1
http://www.joedog.org/
31-Jan-2006 11:42:57 PM 649504 .5 Public Website
Dallas UUNET UUNET Dallas
1 http://www.joedog.org/
31-Jan-2006 11:42:08 PM 649504 .652 Public Website
Houston UUNET UUNET Houston
1 http://www.joedog.org/
31-Jan-2006 11:39:46 PM 649504 .435 Public Website
San Jose XO XO
San Jose 1
http://www.joedog.org/
31-Jan-2006 11:37:46 PM 649504 6.573 Public Website
Sydney Optus Optus Sydney
1 http://www.joedog.org/
31-Jan-2006 11:26:43 PM 649504 .666 Public Website
New York UUNET UUNET New York
1 http://www.joedog.org/
31-Jan-2006 11:25:49 PM 649504 1.241 Public Website
Stockholm Telia Telia Stockholm 1
http://www.joedog.org/
31-Jan-2006 11:22:44 PM 649504 .722 Public Website
Boston Sprint Sprint
Boston 1 http://www.joedog.org/

And here is my best match effort to date:

open(FILE, "<haha.dat") or die "can't open file";
while($line = <FILE>){
if($line =~
m/^(.+[AM|PM]+)\s{2,}([0-9]+)\s{2,}([0-9]*\.*[0-9]*)\s{2,}([a-zA-Z\s]+)\s{2,}([a-zA-Z\s]+)\s{2,}/){
print "1: |".$1."|\n";
print "2: |".$2."|\n";
print "3: |".$3."|\n";
print "4: |".$4."|\n";
print "5: |".$5."|\n";
}
}


That effort is pretty crappy. Here are the results:
1: |31-Jan-2006 11:43:50 PM|
2: |649504|
3: |1.189|
4: |Public Website Frankfurt DTAG |
5: | |
1: |31-Jan-2006 11:42:57 PM|
2: |649504|
3: |.5|
4: |Public Website Dallas UUNET UUNET
Dallas|
5: | |
1: |31-Jan-2006 11:42:08 PM|
2: |649504|
3: |.652|
4: |Public Website Houston UUNET UUNET
|
5: |Houston |
1: |31-Jan-2006 11:39:46 PM|
2: |649504|
3: |.435|
4: |Public Website San Jose XO XO
|
5: |San Jose|
1: |31-Jan-2006 11:37:46 PM|
2: |649504|
3: |6.573|
4: |Public Website Sydney Optus Optus
Sydney|
5: | |
1: |31-Jan-2006 11:26:43 PM|
2: |649504|
3: |.666|
4: |Public Website New York UUNET UUNET
|
5: |New York |
1: |31-Jan-2006 11:25:49 PM|
2: |649504|
3: |1.241|
4: |Public Website Stockholm Telia Telia
Stockholm |
5: | |
1: |31-Jan-2006 11:22:44 PM|
2: |649504|
3: |.722|
4: |Public Website Boston Sprint Sprint
Boston |
5: | |

Any thoughts?

Jeff
 
Brian Wakem

Your data looks screwed to me. If it really is multispace delimited then
your records do not have an equal number of fields.

Drop the needlessly complex and long-winded regex for a simple split with
a simple regex:

my @array = split/\s{2,}/;
print "$_\n" foreach @array;

You'll see that only one record has 5 fields. One of them has as many as 9.
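
For illustration, here is a minimal sketch of the split approach on a single record. The sample line is hypothetical: it is written with explicit double spaces between columns, which is an assumption about the real file, since the whitespace in the posted sample did not survive. Single spaces inside a column (as in 'San Jose') are left alone because the pattern only matches runs of two or more whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical record: columns separated by two spaces, with single
# spaces allowed inside a column. The real file's layout may differ.
my $line = "31-Jan-2006 11:39:46 PM  649504  .435  Public Website  San Jose\n";

chomp $line;                          # drop the trailing newline first
my @fields = split /\s{2,}/, $line;   # split only on runs of 2+ whitespace

printf "%d fields\n", scalar @fields;
print "$_: |$fields[$_]|\n" for 0 .. $#fields;
```

Printing the field count per record is a quick way to check whether every line really has the same number of columns before trusting any fixed set of captures.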
 
Xicheng

It looks to me like you are handling fixed-width column data. I
think the best way is to use unpack() instead of a regex. Do something
like this:

while(<DATA>) {
my($date,$col2,$col3) = unpack("A24 A6 .......",$_);
print "|$date|$col2|.......";
}

Xicheng
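
As a rough sketch of the unpack() idea, assuming the file really is fixed-width: the column widths below (24, 8, 8, 16, then the rest of the line) and the sample record are made up for illustration and would have to be measured against the real data. The record is built with sprintf so that the widths are guaranteed to match the template.

```perl
use strict;
use warnings;

# Hypothetical column widths; "A" fields are space-padded text and
# unpack strips their trailing spaces. "A*" takes the rest of the line.
my $template = "A24 A8 A8 A16 A*";

# Build a sample fixed-width record that matches the template above.
my $line = sprintf "%-24s%-8s%-8s%-16s%s",
    "31-Jan-2006 11:39:46 PM", "649504", ".435", "Public Website", "San Jose";

my ($date, $col2, $col3, $col4, $col5) = unpack $template, $line;
print "|$date|$col2|$col3|$col4|$col5|\n";
# prints: |31-Jan-2006 11:39:46 PM|649504|.435|Public Website|San Jose|
```

This only works if each column starts at the same character offset on every line. If the offsets vary from record to record, splitting on runs of spaces is the safer choice.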