Parsing Large Files

Discussion in 'Perl Misc' started by Jose Yimpho, Nov 4, 2003.

  1. Jose Yimpho

    Jose Yimpho Guest

    Perl newbie here.. I'm experienced with other languages, but this is
    my first grapple with Perl + Regular Expressions, and I could use some
    help or a starting point on this problem.

    I have a text file that contains lines like what's at the bottom of
    this message. I would like to create a new file of comma-separated
    values containing the info from this file. Possible
    entries are company name, street address, city, state, zip, phone,
    fax, email, url, rep, membership type, business type, and major
    products.

    Thanks for your help,
    Joe Laughlin




    ----------------------------------------
    A Street Games
    489 Park Ave
    Idaho Idaho Falls ID 83402
    Phone: 208-542-2824 Fax: 208-542-2824

    Business Representative: Mike Antonson
    Membership Type: C - Ret
    Business type: Accessories, Board games, Collectable card games,
    Family
    games, Magazines, Miniatures, Retailer, Roleplaying games, Video
    games,
    Wargames, Comic Books
    Major products: Role-Playing Games, Games Workshop Products, CCGs

    2 Big Guyz
    15901 Indian Head Hwy
    Accokeek MD 20607
    Phone: 240-210-0302

    www.2bigguyz.com
    Business Representative: Andrew Turlington
    Membership Type: C - Ret
    Business type: Accessories, Board games, Books, Collectable card
    games,
    Magazines, Miniatures, Retailer, Wargames, Comic Books

    21st Century Comics
    1531 S Harbor Blvd
    Fullerton CA 92832
    Phone: 714-992-6649 Fax: 714-992-6604

    www.21stcenturycomics.com
    Business Representative: Barry Short
    Membership Type: C - Ret
    Business type: Accessories, Books, Collectable card games, Other card
    games,
    Miniatures, Retailer, Roleplaying games, Wargames
    Major products: Wizards of the Coast Products; Wizkids Products
    -------------------------------------
     
    Jose Yimpho, Nov 4, 2003
    #1

  2. Jose Yimpho <> wrote:

    > Subject: Parsing Large Files



    I see nothing relating to large files in your post, so why
    did you say that there would be something relating to large
    files in your Subject?


    > Perl newbie here.. I'm experienced with other languages, but this is
    > my first grapple with Perl + Regular Expressions, and I could use some
    > help or a starting point on this problem.



    You haven't told us enough to be of much help...


    > I have a text file that contains lines like what's at the bottom of
    > this message.



    To parse a file we need to know the rules that the file will follow.

    What rules will the file follow?


    > Possible



    Which ones are optional?

    Which ones are required?


    > entries are company name,



    Is that always the 1st line?


    > street address,



    Is that always the 2nd line?


    > phone,



    Does that one always start with "Phone:" ?


    > email,



    Is that always the 5th line?


    > url,



    (you know those aren't really URLs, right?)


    > rep, membership type, business type, and major
    > products.



    Do those ones always have the something-ending-with-colon headings?


    > Business type: Accessories, Board games, Collectable card games,
    > Family
    > games, Magazines, Miniatures, Retailer, Roleplaying games, Video
    > games,
    > Wargames, Comic Books



    Even worse than the sample-with-no-spec approach to getting help
    is letting your newsreader break the data for you.

    Is that all on one line in your Real Data?


    Maybe this will get you started:

    ---------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;

    { local $/ = ''; # enable paragraph mode
      while ( <DATA> ) {
          my($name, $street, $addr, $phone, $email) = /(.*)\n/g;
          my($city, $state, $zip) = $addr =~ /(.*?)\s+([A-Z][A-Z])\s+(\d+)$/;
          my($rep) = /^Business Representative:\s+(.*)/m;

          print "$name\n$street\n$city - $state - $zip\n$rep\n";
          print "-----\n";
      }
    }

    __DATA__
    # your data here
    ---------------------------
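
The same paragraph-mode idea can be pushed one step further toward the CSV output the original question asks for. What follows is only a minimal sketch under stated assumptions: each record is a single paragraph with no blank line inside it (the posted sample shows a blank line after the phone line, which would split a record in paragraph mode if it is really in the data), the first three lines are name, street, and "city ST zip", and phone and fax carry their `Phone:` and `Fax:` labels. The helper names `csv_field` and `record_to_csv` are hypothetical, not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quote one CSV field: wrap it in double quotes and double any embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Turn one record (one paragraph) into a CSV line.
# Assumption: lines 1-3 are name, street, and "city ST zip".
sub record_to_csv {
    my ($record) = @_;
    my ($name, $street, $addr) = split /\n/, $record;
    $addr = '' unless defined $addr;
    my ($city, $state, $zip) =
        $addr =~ /^(.*?)\s+([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*$/;
    my ($phone) = $record =~ /\bPhone:\s*([\d-]+)/;
    my ($fax)   = $record =~ /\bFax:\s*([\d-]+)/;
    my ($rep)   = $record =~ /^Business Representative:\s*(.*)/m;
    return join ',', map { csv_field($_) }
        $name, $street, $city, $state, $zip, $phone, $fax, $rep;
}

# Two records from the posted sample, with the internal blank line dropped.
my $sample = <<'END';
A Street Games
489 Park Ave
Idaho Idaho Falls ID 83402
Phone: 208-542-2824 Fax: 208-542-2824
Business Representative: Mike Antonson

2 Big Guyz
15901 Indian Head Hwy
Accokeek MD 20607
Phone: 240-210-0302
www.2bigguyz.com
Business Representative: Andrew Turlington
END

open my $in, '<', \$sample or die "Cannot open sample: $!";
{
    local $/ = '';    # paragraph mode: one record per read
    while ( defined( my $record = <$in> ) ) {
        print record_to_csv($record), "\n";
    }
}
close $in;
```

Missing fields (a record without a fax number, for example) come out as empty quoted strings, so every line keeps the same number of columns. For real data a CSV module from CPAN would likely be a safer choice than hand-rolled quoting.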


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 4, 2003
    #2

  3. Jose Yimpho

    Jose Yimpho Guest

    Tad McClellan wrote:

    > Jose Yimpho <> wrote:
    >
    >> Subject: Parsing Large Files

    >
    >
    > I see nothing relating to large files in your post, so why
    > did you say that there would be something relating to large
    > files in your Subject?
    >


    There's about 20,000 lines in the file. I thought that was large?

    >
    >> Perl newbie here.. I'm experienced with other languages, but this is
    >> my first grapple with Perl + Regular Expressions, and I could use some
    >> help or a starting point on this problem.

    >
    >
    > You haven't told us enough to be of much help...


    Sorry...

    >
    >
    >> I have a text file that contains lines like what's at the bottom of
    >> this message.

    >
    >
    > To parse a file we need to know the rules that the file will follow.
    >
    > What rules will the file follow?
    >
    >
    >> Possible

    >
    >
    > Which ones are optional?
    >
    > Which ones are required?
    >
    >
    >> entries are company name,

    >
    >
    > Is that always the 1st line?


    Yes

    >
    >
    >> street address,

    >
    >
    > Is that always the 2nd line?


    Yes, the city, state, and zip are always the third line.

    >
    >
    >> phone,

    >
    >
    > Does that one always start with "Phone:" ?


    Yes, and the Fax number has Fax: in front of it.

    >
    >
    >> email,

    >
    >
    > Is that always the 5th line?


    No, it's sometimes there.

    >
    >
    >> url,

    >
    >
    > (you know those aren't really URLs, right?)


    Forgive me.

    >
    >
    >> rep, membership type, business type, and major
    >> products.

    >
    >
    > Do those ones always have the something-ending-with-colon headings?


    Yes

    >
    >
    >> Business type: Accessories, Board games, Collectable card games,
    >> Family
    >> games, Magazines, Miniatures, Retailer, Roleplaying games, Video
    >> games,
    >> Wargames, Comic Books

    >
    >
    > Even worse than the sample-with-no-spec approach to getting help
    > is letting your newsreader break the data for you.
    >
    > Is that all on one line in your Real Data?


    No, not all on one line. I don't think the newsreader broke any data (the
    data is on multiple lines for each entity, with a blank line in between
    each entity).

    Also, something like the following is legal (the linebreaks are
    intentional):

    Business type: Accessories, Board Games, Books,
    Other card games, Family
    Games, Magazines, Minatures
    Major products: Wizkids Products; Wizards of the Coast
    Products; Reaper Minatures
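
Wrapped values like these can be folded back into their fields by treating any line that does not start with a known label as a continuation of the previous field. The block below is a rough sketch of that idea, assuming the only labels are the four that appear in the sample data and that continuation lines never start with one of them. `labeled_fields` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Labels taken from the sample data in this thread (an assumption:
# the real file may contain others).
my $LABEL = qr/^(Business Representative|Membership Type|Business type|Major products):\s*(.*)$/i;

# Collect labeled fields from one record, folding wrapped lines into the
# field that precedes them. Lines before the first label are ignored.
sub labeled_fields {
    my ($record) = @_;
    my ( %field, $current );
    for my $line ( split /\n/, $record ) {
        if ( $line =~ $LABEL ) {
            $current = lc $1;          # start a new field
            $field{$current} = $2;
        }
        elsif ( defined $current && $line =~ /\S/ ) {
            $field{$current} .= " $line";    # continuation of the current field
        }
    }
    s/\s+/ /g for values %field;       # collapse runs of whitespace
    return \%field;
}

my $record = <<'END';
Business type: Accessories, Board Games, Books,
Other card games, Family
Games, Magazines, Minatures
Major products: Wizkids Products; Wizards of the Coast
Products; Reaper Minatures
END

my $fields = labeled_fields($record);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
# business type => Accessories, Board Games, Books, Other card games, Family Games, Magazines, Minatures
# major products => Wizkids Products; Wizards of the Coast Products; Reaper Minatures
```

Joining with a single space restores items that were split mid-name, such as "Family" and "Games" above.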





    >
    >
    > Maybe this will get you started:
    >
    > ---------------------------
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > { local $/ = ''; # enable paragraph mode
    > while ( <DATA> ) {
    > my($name, $street, $addr, $phone, $email) = /(.*)\n/g;
    > my($city, $state, $zip) = $addr =~ /(.*?)\s+([A-Z][A-Z])\s+(\d+)$/;
    > my($rep) = /^Business Representative:\s+(.*)/m;
    >
    > print "$name\n$street\n$city - $state - $zip\n$rep\n";
    > print "-----\n";
    > }
    > }
    >
    > __DATA__
    > # your data here
    > ---------------------------
    >
    >


    Thanks, that will get me started. Would appreciate any other help you could
    give. If there's anything I can answer, let me know.

    With regards to the paragraph grouping, I tried something like this last
    night:

    $/ = '';
    while <FILE>
    {
    print;
    $count++;
    }
    print "\nNumber of paragraphs: $count\n";

    It printed the file contents, and then: 'Number of paragraphs: 1', which
    didn't seem right to me, as I was trying to count the number of paragraphs
    (or blank lines) in the file. Setting the $/ sets the 'splitter' to split
    on all blank lines, right? and each iteration of the while loop reads in
    one section of the input (split by blank lines), right? Not sure why it
    was printing out a 1.

    Joe Laughlin
     
    Jose Yimpho, Nov 4, 2003
    #3
  4. Ben Morrow

    Ben Morrow Guest

    Jose Yimpho <> wrote:
    > With regards to the paragraph grouping, I tried something like this last
    > night:
    >
    > $/ = '';
    > while <FILE>
    > {
    > print;
    > $count++;
    > }
    > print "\nNumber of paragraphs: $count\n";
    >
    > It printed the file contents, and then: 'Number of paragraphs: 1', which
    > didn't seem right to me, as I was trying to count the number of paragraphs
    > (or blank lines) in the file.


    Are the lines between your paragraphs truly blank? If they contain any
    whitespace (in the case of Win32 files opened in binary mode this
    includes the \r at the end of each line), then they will not be
    counted as paragraph breaks by Perl.

    Try

    $/ = $\ = "";
    while (<FILE>) {
        print "Line $.: |$_|";
    }

    to see what Perl considers each paragraph to contain. If your file
    does have 'blank' lines with spaces in, and you want to get rid of
    them, use

    perl -pi~ -e 's/^[ \t\r]+$//' file

    .
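
The effect of whitespace-only lines on paragraph mode can be checked without touching the real file. This is a small sketch using an in-memory filehandle and made-up sample text. `count_paragraphs` and `whitespace_only_lines` are hypothetical helper names.

```perl
use strict;
use warnings;

# Count records as Perl's paragraph mode sees them.
sub count_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '';    # paragraph mode
    my $count = 0;
    while ( defined( my $paragraph = <$fh> ) ) {
        $count++;
    }
    close $fh;
    return $count;
}

# Line numbers of lines that look blank but still hold spaces, tabs or \r.
sub whitespace_only_lines {
    my ($text) = @_;
    my @lines = split /\n/, $text, -1;
    return grep { $lines[ $_ - 1 ] =~ /^[ \t\r]+$/ } 1 .. @lines;
}

# Line 2 holds a space and line 4 a tab, so neither is truly empty.
my $text = "Hello this\n \nis a\n\t\ngreat file\n";

print "before: ", count_paragraphs($text), " paragraph(s)\n";    # before: 1 paragraph(s)
print "whitespace-only lines: @{[ whitespace_only_lines($text) ]}\n";    # 2 4

# Empty those lines but keep their newlines, so they become real paragraph breaks.
( my $clean = $text ) =~ s/^[ \t\r]+$//mg;
print "after: ", count_paragraphs($clean), " paragraph(s)\n";    # after: 3 paragraph(s)
```

The character class deliberately leaves out `\n`, so the line break itself survives and only the stray whitespace is removed.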

    Ben

    --
    $.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
    $x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
    {$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t #
    $J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
     
    Ben Morrow, Nov 4, 2003
    #4
  5. Jose Yimpho

    Jose Yimpho Guest

    Ben Morrow wrote:

    >
    > Jose Yimpho <> wrote:
    >> With regards to the paragraph grouping, I tried something like this last
    >> night:
    >>
    >> $/ = '';
    >> while <FILE>
    >> {
    >> print;
    >> $count++;
    >> }
    >> print "\nNumber of paragraphs: $count\n";
    >>
    >> It printed the file contents, and then: 'Number of paragraphs: 1', which
    >> didn't seem right to me, as I was trying to count the number of
    >> paragraphs (or blank lines) in the file.

    >
    > Are the lines between your paragraphs truly blank? If they contain any
    > whitespace (in the case of Win32 files opened in binary mode this
    > includes the \r at the end of each line), then they will not be
    > counted as paragraph breaks by Perl.
    >
    > Try
    >
    > $/ = $\ = "";
    > while (<FILE>) {
    > print "Line $.: |$_|";
    > }
    >
    > to see what Perl considers each paragraph to contain. If your file
    > does have 'blank' lines with spaces in, and you want to get rid of
    > them, use
    >
    > perl -pi~ -e 's/^[ \t\r]+$//' file
    >
    > .
    >
    > Ben
    >


    Yeah, I thought that too.

    In vi (in Redhat 9), I created a file similar to:

    =============
    Hello this

    is a

    great file

    and I am proud of it.
    ============

    But I still got a paragraph count of one.
     
    Jose Yimpho, Nov 4, 2003
    #5
  6. Jose Yimpho <> wrote:
    > With regards to the paragraph grouping, I tried something like this last
    > night:
    >
    > $/ = '';
    > while <FILE>


    syntax error: should be: while (<FILE>)

    > {
    > print;
    > $count++;
    > }
    > print "\nNumber of paragraphs: $count\n";
    >
    > It printed the file contents, and then: 'Number of paragraphs: 1', which
    > didn't seem right to me, as I was trying to count the number of paragraphs
    > (or blank lines) in the file. Setting the $/ sets the 'splitter' to split
    > on all blank lines, right? and each iteration of the while loop reads in
    > one section of the input (split by blank lines), right? Not sure why it
    > was printing out a 1.


    Are your blank lines truly empty, or do they have whitespace in them?
    For instance, if each line ends with "\r\n", and you're processing the
    file on a unixy OS where "\n" is the end of line character, you don't
    have any empty lines in the file. Test this theory with: $/="\r\n\r\n";
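
The theory is easy to try on a small string before touching the real file. The following is a sketch only, assuming a unixy perl where "\n" is a bare line feed and no CRLF translation layer is active. `read_records` is a hypothetical helper and the sample text is made up.

```perl
use strict;
use warnings;

# Split text into records using the given input record separator.
sub read_records {
    my ( $text, $separator ) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;
    my @records = <$fh>;    # list context: read every record at once
    close $fh;
    return @records;
}

# DOS-style text: every line ends in "\r\n", so the "blank" lines hold a \r.
my $dos_text = "Hello this\r\n\r\nis a\r\n\r\ngreat file\r\n";

my @as_paragraphs = read_records( $dos_text, '' );            # paragraph mode
my @as_crlf_pairs = read_records( $dos_text, "\r\n\r\n" );    # literal separator

# Alternative: normalize the line endings, then use paragraph mode.
( my $unix_text = $dos_text ) =~ s/\r\n/\n/g;
my @normalized = read_records( $unix_text, '' );

printf "paragraph mode on CRLF data:  %d\n", scalar @as_paragraphs;    # 1
printf "separator \\r\\n\\r\\n:           %d\n", scalar @as_crlf_pairs;    # 3
printf "paragraph mode after cleanup: %d\n", scalar @normalized;       # 3
```

A literal separator matches exactly one blank line, whereas paragraph mode treats any run of empty lines as a single break, so normalizing the line endings first is probably the more forgiving option.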

    --
    Glenn Jackman
    NCF Sysadmin
     
    Glenn Jackman, Nov 4, 2003
    #6
  7. Jose Yimpho <> wrote:
    > In vi (in Redhat 9), I created a file similar to:

    [...]
    > But I still got a paragraph count of one.


    In vi, is your file format 'dos'?
    :set fileformat
    If so, set it to 'unix' before you save.
    :set ff=unix
    :wq

    --
    Glenn Jackman
    NCF Sysadmin
     
    Glenn Jackman, Nov 4, 2003
    #7
  8. Jose Yimpho <> wrote:

    > I tried something like this last

    ^^^^^^^^^^^^^^
    > night:
    >
    > $/ = '';
    > while <FILE>
    > {



    Please post *real* code.

    Have you seen the Posting Guidelines that are posted here frequently?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 4, 2003
    #8