Parsing Large Files

Jose Yimpho

Perl newbie here. I'm experienced with other languages, but this is
my first grapple with Perl + Regular Expressions, and I could use some
help or a starting point on this problem.

I have a text file that contains lines like what's at the bottom of
this message. I would like to create a new file of comma-separated
values containing the info from that file. Possible
entries are company name, street address, city, state, zip, phone,
fax, email, url, rep, membership type, business type, and major
products.

Thanks for your help,
Joe Laughlin




----------------------------------------
A Street Games
489 Park Ave
Idaho Idaho Falls ID 83402
Phone: 208-542-2824 Fax: 208-542-2824
(e-mail address removed)
Business Representative: Mike Antonson
Membership Type: C - Ret
Business type: Accessories, Board games, Collectable card games,
Family
games, Magazines, Miniatures, Retailer, Roleplaying games, Video
games,
Wargames, Comic Books
Major products: Role-Playing Games, Games Workshop Products, CCGs

2 Big Guyz
15901 Indian Head Hwy
Accokeek MD 20607
Phone: 240-210-0302
(e-mail address removed)
www.2bigguyz.com
Business Representative: Andrew Turlington
Membership Type: C - Ret
Business type: Accessories, Board games, Books, Collectable card
games,
Magazines, Miniatures, Retailer, Wargames, Comic Books

21st Century Comics
1531 S Harbor Blvd
Fullerton CA 92832
Phone: 714-992-6649 Fax: 714-992-6604
(e-mail address removed)
www.21stcenturycomics.com
Business Representative: Barry Short
Membership Type: C - Ret
Business type: Accessories, Books, Collectable card games, Other card
games,
Miniatures, Retailer, Roleplaying games, Wargames
Major products: Wizards of the Coast Products; Wizkids Products
-------------------------------------
 
Tad McClellan

Jose Yimpho said:
Subject: Parsing Large Files


I see nothing relating to large files in your post, so why
did you say that there would be something relating to large
files in your Subject?

Perl newbie here.. I'm experienced with other languages, but this is
my first grapple with Perl + Regular Expressions, and I could use some
help or a starting point on this problem.


You haven't told us enough to be of much help...

I have a text file that contains lines like what's at the bottom of
this message.


To parse a file we need to know the rules that the file will follow.

What rules will the file follow?



Which ones are optional?

Which ones are required?

entries are company name,


Is that always the 1st line?

street address,


Is that always the 2nd line?



Does that one always start with "Phone:" ?



Is that always the 5th line?



(you know those aren't really URLs, right?)

rep, membership type, business type, and major
products.


Do those ones always have the something-ending-with-colon headings?

Business type: Accessories, Board games, Collectable card games,
Family
games, Magazines, Miniatures, Retailer, Roleplaying games, Video
games,
Wargames, Comic Books


Even worse than the sample-with-no-spec approach to getting help
is letting your newsreader break the data for you.

Is that all on one line in your Real Data?


Maybe this will get you started:

---------------------------
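#!/usr/bin/perl
use strict;
use warnings;

{
    local $/ = '';    # enable paragraph mode: each read returns one record
    while ( <DATA> ) {
        # the first five lines, in order (assumes the layout of the sample)
        my($name, $street, $addr, $phone, $email) = /(.*)\n/g;
        # split "City ST 12345" into its three parts
        my($city, $state, $zip) = $addr =~ /(.*?)\s+([A-Z][A-Z])\s+(\d+)$/;
        # labelled fields can be matched anywhere in the record
        my($rep) = /^Business Representative:\s+(.*)/m;

        print "$name\n$street\n$city - $state - $zip\n$rep\n";
        print "-----\n";
    }
}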

__DATA__
# your data here
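
Building on that starting point, here is a minimal sketch of how the records might be turned into comma-separated values. It is only one possible approach and rests on assumptions: the first three lines are always name, street, and "city ST zip"; labelled fields may wrap onto following lines; and the URL and e-mail lines, when present, come before the labelled fields. The names parse_record, csv_line, records_to_csv and the column keys are invented for this illustration. The CSV quoting is hand-rolled to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Output column order; these key names are assumptions made for this sketch.
my @COLUMNS = qw(name street city state zip phone fax email url
                 rep membership business_type products);

# Map the headings seen in the sample data to column keys.
my %LABEL = (
    'business representative' => 'rep',
    'membership type'         => 'membership',
    'business type'           => 'business_type',
    'major products'          => 'products',
);

# Hypothetical helper: turn one paragraph (record) into a hash of fields.
sub parse_record {
    my ($rec) = @_;
    $rec =~ s/\r//g;                              # tolerate DOS line endings
    my @lines = grep { /\S/ } split /\n/, $rec;
    s/^\s+|\s+$//g for @lines;                    # trim each line

    my %f;
    # Lines 1 to 3 are assumed fixed: name, street, "city ST zip".
    $f{name}   = shift @lines;
    $f{street} = shift @lines;
    @f{qw(city state zip)} =
        (shift(@lines) // '') =~ /^(.*?)\s+([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/;

    my $open;    # key of the labelled field that a wrapped line continues
    for my $line (@lines) {
        if ($line =~ /^([A-Za-z ]+):\s*(.*)$/ and $LABEL{ lc $1 }) {
            $open = $LABEL{ lc $1 };
            $f{$open} = $2;
        }
        elsif ($line =~ /^Phone:/) {
            ($f{phone}) = $line =~ /Phone:\s*(\S+)/;
            ($f{fax})   = $line =~ /Fax:\s*(\S+)/;
        }
        elsif (defined $open)      { $f{$open} .= " $line" }   # wrapped line
        elsif ($line =~ /^www\./i) { $f{url}   = $line }
        elsif ($line =~ /\@/)      { $f{email} = $line }
        # anything else (e.g. a removed e-mail placeholder) is skipped
    }
    return \%f;
}

# Quote every cell and double any embedded quotes (basic CSV quoting).
sub csv_line {
    my @cells = map { my $v = $_ // ''; $v =~ s/"/""/g; qq("$v") } @_;
    return join ',', @cells;
}

# Read records in paragraph mode from any filehandle; return CSV lines.
sub records_to_csv {
    my ($fh) = @_;
    local $/ = '';                                # paragraph mode
    my @out;
    while (my $rec = <$fh>) {
        my $f = parse_record($rec);
        push @out, csv_line(@{$f}{@COLUMNS});
    }
    return @out;
}

# One record from the sample data, read through an in-memory filehandle.
my $sample = <<'END';
2 Big Guyz
15901 Indian Head Hwy
Accokeek MD 20607
Phone: 240-210-0302
(e-mail address removed)
www.2bigguyz.com
Business Representative: Andrew Turlington
Membership Type: C - Ret
Business type: Accessories, Board games, Books, Collectable card
games,
Magazines, Miniatures, Retailer, Wargames, Comic Books
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n" for records_to_csv($fh);
close $fh;
```

To run it against a real file, open that file instead of the in-memory sample and pass the handle to records_to_csv. Fields that a record lacks come out as empty quoted cells, so every row has the same number of columns.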
 
Jose Yimpho

Tad said:
I see nothing relating to large files in your post, so why
did you say that there would be something relating to large
files in your Subject?

There are about 20,000 lines in the file. I thought that was large?

You haven't told us enough to be of much help...

Sorry...

To parse a file we need to know the rules that the file will follow.

What rules will the file follow?

Which ones are optional?

Which ones are required?

Is that always the 1st line?

Yes

Is that always the 2nd line?

Yes, the city, state, and zip are always the third line.

Does that one always start with "Phone:" ?

Yes, and the Fax number has Fax: in front of it.

Is that always the 5th line?

No, it's sometimes there.

(you know those aren't really URLs, right?)

Forgive me.

Do those ones always have the something-ending-with-colon headings?

Yes

Even worse than the sample-with-no-spec approach to getting help
is letting your newsreader break the data for you.

Is that all on one line in your Real Data?

No, not all on one line. I don't think the newsreader broke any data (the
data is on multiple lines for each entity, with a blank line in between
each entity).

Also, something like the following is legal (the linebreaks are
intentional):

Business type: Accessories, Board Games, Books,
Other card games, Family
Games, Magazines, Minatures
Major products: Wizkids Products; Wizards of the Coast
Products; Reaper Minatures




Maybe this will get you started:

---------------------------
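#!/usr/bin/perl
use strict;
use warnings;

{
    local $/ = '';    # enable paragraph mode: each read returns one record
    while ( <DATA> ) {
        # the first five lines, in order (assumes the layout of the sample)
        my($name, $street, $addr, $phone, $email) = /(.*)\n/g;
        # split "City ST 12345" into its three parts
        my($city, $state, $zip) = $addr =~ /(.*?)\s+([A-Z][A-Z])\s+(\d+)$/;
        # labelled fields can be matched anywhere in the record
        my($rep) = /^Business Representative:\s+(.*)/m;

        print "$name\n$street\n$city - $state - $zip\n$rep\n";
        print "-----\n";
    }
}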

__DATA__
# your data here

Thanks, that will get me started. Would appreciate any other help you could
give. If there's anything I can answer, let me know.

With regards to the paragraph grouping, I tried something like this last
night:

$/ = '';
while <FILE>
{
print;
$count++;
}
print "\nNumber of paragraphs: $count\n";

It printed the file contents, and then: 'Number of paragraphs: 1', which
didn't seem right to me, as I was trying to count the number of paragraphs
(or blank lines) in the file. Setting the $/ sets the 'splitter' to split
on all blank lines, right? and each iteration of the while loop reads in
one section of the input (split by blank lines), right? Not sure why it
was printing out a 1.

Joe Laughlin
 
Ben Morrow

Jose Yimpho said:
With regards to the paragraph grouping, I tried something like this last
night:

$/ = '';
while <FILE>
{
print;
$count++;
}
print "\nNumber of paragraphs: $count\n";

It printed the file contents, and then: 'Number of paragraphs: 1', which
didn't seem right to me, as I was trying to count the number of paragraphs
(or blank lines) in the file.

Are the lines between your paragraphs truly blank? If they contain any
whitespace (in the case of Win32 files opened in binary mode this
includes the \r at the end of each line), then they will not be
counted as paragraph breaks by Perl.

Try

$/ = $\ = "";
while (<FILE>) {
    print "Line $.: |$_|";
}

to see what Perl considers each paragraph to contain. If your file
does have 'blank' lines with spaces in, and you want to get rid of
the spaces, use

perl -pi~ -e 's/^\s+$/\n/' file

Ben
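
To make the point about whitespace-only lines concrete, here is a small sketch that counts the paragraphs Perl sees in a string, before and after the separator lines are emptied. The helper names count_paragraphs and blank_out are made up for this example, and the text is read through an in-memory filehandle rather than a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the paragraphs Perl sees in a string.
sub count_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory data: $!";
    local $/ = '';    # paragraph mode: records end at truly empty lines
    my $count = 0;
    while (defined(my $para = <$fh>)) {
        $count++;
    }
    close $fh;
    return $count;
}

# Hypothetical helper: empty out lines that hold only spaces, tabs or \r.
sub blank_out {
    my ($text) = @_;
    $text =~ s/^[^\S\n]+$//mg;    # whitespace other than the newline itself
    return $text;
}

my $unix   = "Hello this\n\nis a\n\ngreat file\n";
my $spaces = "Hello this\n \nis a\n \ngreat file\n";
my $dos    = "Hello this\r\n\r\nis a\r\n\r\ngreat file\r\n";

print "unix:           ", count_paragraphs($unix), "\n";
print "spaces:         ", count_paragraphs($spaces), "\n";
print "dos:            ", count_paragraphs($dos), "\n";
print "dos, blanked:   ", count_paragraphs(blank_out($dos)), "\n";
```

Only the first and last inputs contain truly empty lines, so those are the only ones that split into three paragraphs; the other two are read as a single record.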
 
Jose Yimpho

Ben said:
Are the lines between your paragraphs truly blank? If they contain any
whitespace (in the case of Win32 files opened in binary mode this
includes the \r at the end of each line), then they will not be
counted as paragraph breaks by Perl.

Try

$/ = $\ = "";
while (<FILE>) {
    print "Line $.: |$_|";
}

to see what Perl considers each paragraph to contain. If your file
does have 'blank' lines with spaces in, and you want to get rid of
the spaces, use

perl -pi~ -e 's/^\s+$/\n/' file

Ben

Yeah, I thought that too.

In vi (in Redhat 9), I created a file similar to:

=============
Hello this

is a

great file

and I am proud of it.
============

But I still got a paragraph count of one.
 
Glenn Jackman

Jose Yimpho said:
With regards to the paragraph grouping, I tried something like this last
night:

$/ = '';
while <FILE>

syntax error: should be: while (<FILE>)
{
print;
$count++;
}
print "\nNumber of paragraphs: $count\n";

It printed the file contents, and then: 'Number of paragraphs: 1', which
didn't seem right to me, as I was trying to count the number of paragraphs
(or blank lines) in the file. Setting the $/ sets the 'splitter' to split
on all blank lines, right? and each iteration of the while loop reads in
one section of the input (split by blank lines), right? Not sure why it
was printing out a 1.

Are your blank lines truly empty, or do they have whitespace in them?
For instance, if each line ends with "\r\n", and you're processing the
file on a unixy OS where "\n" is the end of line character, you don't
have any empty lines in the file. Test this theory with: $/="\r\n\r\n";
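
A hedged sketch of that test, with the loop syntax corrected: it writes a small DOS-format file to a temporary location and counts records twice, once in paragraph mode and once with $/ set to "\r\n\r\n". The helper name count_records and the demo file contents are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count the records in a file for a given value of $/.
sub count_records {
    my ($path, $separator) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                  # keep any \r bytes exactly as stored
    local $/ = $separator;
    local $_;                     # do not clobber the caller's $_
    my $count = 0;
    while (<$fh>) {               # the parentheses are required
        $count++;
    }
    close $fh;
    return $count;
}

# Demo file with DOS line endings, so no line in it is truly empty.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "Hello this\r\n\r\nis a\r\n\r\ngreat file\r\n";
close $tmp;

print "paragraph mode:     ", count_records($path, ''), "\n";
print "\\r\\n\\r\\n separator: ", count_records($path, "\r\n\r\n"), "\n";
```

If the second count is the one that matches the number of entries you expect, the file has DOS line endings and needs converting (or a matching $/) before paragraph mode will work.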
 
Glenn Jackman

Jose Yimpho said:
In vi (in Redhat 9), I created a file similar to: [...]
But I still got a paragraph count of one.

In vi, is your file format 'dos'?
:set fileformat
If so, set it to 'unix' before you save.
:set ff=unix
:wq
 
Tad McClellan

Jose Yimpho said:
I tried something like this last
        ^^^^^^^^^^^^^^
night:

$/ = '';
while <FILE>
{


Please post *real* code.

Have you seen the Posting Guidelines that are posted here frequently?
 
