regex guru needed

Discussion in 'Perl Misc' started by Jeff, Feb 4, 2006.

  1. Jeff

    Jeff Guest

    Hi,

    I've got a text file that is delimited by multiple spaces ('\s{2,}').
    The columns in this file may contain single spaces; for example, one
    column is comprised of cities, which may have names of multiple words,
    e.g., 'San Jose'.

    Here's a sample from the file:

    31-Jan-2006 11:43:50 PM 649504 1.189 Public Website
    Frankfurt DTAG Deutsche Telekom Frankfurt 1
    http://www.joedog.org/
    31-Jan-2006 11:42:57 PM 649504 .5 Public Website
    Dallas UUNET UUNET Dallas
    1 http://www.joedog.org/
    31-Jan-2006 11:42:08 PM 649504 .652 Public Website
    Houston UUNET UUNET Houston
    1 http://www.joedog.org/
    31-Jan-2006 11:39:46 PM 649504 .435 Public Website
    San Jose XO XO
    San Jose 1
    http://www.joedog.org/
    31-Jan-2006 11:37:46 PM 649504 6.573 Public Website
    Sydney Optus Optus Sydney
    1 http://www.joedog.org/
    31-Jan-2006 11:26:43 PM 649504 .666 Public Website
    New York UUNET UUNET New York
    1 http://www.joedog.org/
    31-Jan-2006 11:25:49 PM 649504 1.241 Public Website
    Stockholm Telia Telia Stockholm 1
    http://www.joedog.org/
    31-Jan-2006 11:22:44 PM 649504 .722 Public Website
    Boston Sprint Sprint
    Boston 1 http://www.joedog.org/

    And here is my best match effort to date:

    open(FILE, "<haha.dat") or die "can't open file";
    while($line = <FILE>){
        if($line =~ m/^(.+[AM|PM]+)\s{2,}([0-9]+)\s{2,}([0-9]*\.*[0-9]*)\s{2,}([a-zA-Z\s]+)\s{2,}([a-zA-Z\s]+)\s{2,}/){
            print "1: |".$1."|\n";
            print "2: |".$2."|\n";
            print "3: |".$3."|\n";
            print "4: |".$4."|\n";
            print "5: |".$5."|\n";
        }
    }


    That effort is pretty crappy; here are the results:
    1: |31-Jan-2006 11:43:50 PM|
    2: |649504|
    3: |1.189|
    4: |Public Website Frankfurt DTAG |
    5: | |
    1: |31-Jan-2006 11:42:57 PM|
    2: |649504|
    3: |.5|
    4: |Public Website Dallas UUNET UUNET
    Dallas|
    5: | |
    1: |31-Jan-2006 11:42:08 PM|
    2: |649504|
    3: |.652|
    4: |Public Website Houston UUNET UUNET
    |
    5: |Houston |
    1: |31-Jan-2006 11:39:46 PM|
    2: |649504|
    3: |.435|
    4: |Public Website San Jose XO XO
    |
    5: |San Jose|
    1: |31-Jan-2006 11:37:46 PM|
    2: |649504|
    3: |6.573|
    4: |Public Website Sydney Optus Optus
    Sydney|
    5: | |
    1: |31-Jan-2006 11:26:43 PM|
    2: |649504|
    3: |.666|
    4: |Public Website New York UUNET UUNET
    |
    5: |New York |
    1: |31-Jan-2006 11:25:49 PM|
    2: |649504|
    3: |1.241|
    4: |Public Website Stockholm Telia Telia
    Stockholm |
    5: | |
    1: |31-Jan-2006 11:22:44 PM|
    2: |649504|
    3: |.722|
    4: |Public Website Boston Sprint Sprint
    Boston |
    5: | |

    Any thoughts?

    Jeff
    Jeff, Feb 4, 2006
    #1

  2. Jeff

    Brian Wakem Guest

    Your data looks broken to me. If it really is delimited by multiple
    spaces, then your records do not all have the same number of fields.

    Drop the needlessly complex and long-winded regex in favour of a simple
    split with a simple regex:

    my @array = split /\s{2,}/;    # splits $_ on runs of 2+ whitespace characters
    print "$_\n" foreach @array;   # one field per line

    You'll see that only one record has 5 fields. One of them has as many as 9.
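
    A minimal runnable sketch of the split approach above, for anyone who
    wants to check the field counts themselves. The two sample rows are an
    assumed reconstruction (the original column spacing did not survive in
    the post), and split_record is a hypothetical helper name, not something
    from the thread:

```perl
use strict;
use warnings;

# Hypothetical sample rows: the real spacing of the file is not known, so
# these assume two spaces between columns and single spaces inside a column.
my @lines = (
    "31-Jan-2006 11:39:46 PM  649504  .435  Public Website  San Jose  XO  XO San Jose  1  http://www.joedog.org/\n",
    "31-Jan-2006 11:26:43 PM  649504  .666  Public Website  New York  UUNET  UUNET New York  1  http://www.joedog.org/\n",
);

# Split one record on runs of two or more whitespace characters.
# Single spaces (as in "San Jose") stay inside their field.
sub split_record {
    my ($line) = @_;
    chomp $line;                      # drop the trailing newline first
    return split /\s{2,}/, $line;
}

for my $line (@lines) {
    my @fields = split_record($line);
    # Print the field count first so ragged records stand out.
    printf "%d fields: %s\n", scalar(@fields), join('|', @fields);
}
```

    If the counts differ from record to record, the file is probably not
    purely multi-space delimited and a fixed-width approach may fit better.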



    --
    Brian Wakem
    Email: http://homepage.ntlworld.com/b.wakem/myemail.png
    Brian Wakem, Feb 4, 2006
    #2

  3. Jeff

    Xicheng Guest

    It looks to me like you are handling fixed-width column data. I think
    the best way is to use unpack() instead of a regex. Do something like
    this:

    while(<DATA>) {
        # "A24" reads 24 characters and strips trailing spaces, and so on
        my($date,$col2,$col3) = unpack("A24 A6 .......",$_);
        print "|$date|$col2|.......";
    }
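
    A self-contained sketch of the unpack() idea, under loud assumptions:
    the column widths in $template and the variable names ($id, $seconds,
    $site, $city) are made up for illustration, since the real layout of the
    file is not shown in the thread. The sample record is built with pack()
    from the same template so the example is consistent with itself:

```perl
use strict;
use warnings;

# Hypothetical column widths; measure the real file before relying on them.
my $template = 'A24 A8 A8 A16 A12';

# Build a fixed-width record for the demo. When packing, "A" pads each
# value with spaces up to the given width.
my $record = pack($template,
    '31-Jan-2006 11:39:46 PM', '649504', '.435', 'Public Website', 'San Jose');

# When unpacking, "A" strips trailing spaces, so the fields come back
# trimmed even though each one occupies a fixed number of characters.
my ($date, $id, $seconds, $site, $city) = unpack($template, $record);

print "|$date|$id|$seconds|$site|$city|\n";
```

    Embedded single spaces ("San Jose", "Public Website") are not a problem
    here because unpack() cuts by position, not by delimiter.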

    Xicheng
    Xicheng, Feb 4, 2006
    #3
