suggestions on intelligent processing of data sets in a file

alt.testing

Hi all,
I am writing a script to parse files and insert the data into MySQL.
The task is simple enough for files containing "standard" fields.
However, many of the files are not like that.
Some of them even vary in the number of fields from line to line.

Example: (fields are email, name, postcode, phone)
(e-mail address removed), Firstname Lastname
(e-mail address removed), Firstname Lastname, 2004, 0412 321 512
(e-mail address removed), Firstname Lastname, 0412 321 512


Now, other than the obvious and easy solution of breaking up the files
into chunks that are "known" and internally consistent in terms of
data fields, I want to build a mechanism that can:

1. Autodetect the number of fields line by line and build the data
structure as it goes.
2. Verify (or guess) the "type" of each field using regex.

I don't mind using modules, but would prefer ones shipped as
standard. Otherwise I'd build my own, as I really want to start doing a
bit of "OO", and this could be a good place to begin.

I have a feeling that creating a class, and building some methods
that create objects (one for each different set) which
reference/manipulate the actual data structures (or something similar),
might be a good approach. That way operations could actually be built on
the fly? Mind you, I've not yet created a module, so this would be my
first time. Is that the best approach, or something else, perhaps?
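
A minimal sketch of what such a class might look like, using only core Perl. The package name FieldGuesser, its method names, and the patterns are all hypothetical; in particular, the 4-digit postcode and 8-or-more-character phone rules are guesses based on the three sample lines, not validated formats.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package FieldGuesser;

# Constructor. 'rules' is an ordered list of [type => regex] pairs and the
# first match wins. The default patterns are illustrative assumptions only.
sub new {
    my ( $class, %args ) = @_;
    my $self = {
        rules => $args{rules} || [
            [ email    => qr/@/ ],
            [ postcode => qr/^\d{4}$/ ],
            [ phone    => qr/^[\d\s]{8,}$/ ],
        ],
        default => $args{default} || 'name',
    };
    return bless $self, $class;
}

# Guess the type of a single field; fall back to the default type.
sub guess_type {
    my ( $self, $field ) = @_;
    for my $rule ( @{ $self->{rules} } ) {
        my ( $type, $regex ) = @$rule;
        return $type if $field =~ $regex;
    }
    return $self->{default};
}

# Turn one line into a hash reference keyed by guessed type.
sub parse_line {
    my ( $self, $line ) = @_;
    chomp $line;
    my %record;
    for my $field ( split /,/, $line ) {
        $field =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $field;
        $record{ $self->guess_type($field) } = $field;
    }
    return \%record;
}

package main;

my $guesser = FieldGuesser->new;
my $record  = $guesser->parse_line('someone@example.com, Firstname Lastname, 0412 321 512');
print join( ', ', map { "$_=$record->{$_}" } sort keys %$record ), "\n";
```

Keeping the rules as data means a differently shaped file only needs a different rule list passed to new(), not a new subclass.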

Could anyone suggest some things that I might try?

tia


Full Context (some rough ideas as a starting point)
===============================================================================
#!/usr/bin/perl

use strict;
use warnings;

use DBI;

my $email_index;
my $name_index;
my $location_index;
my $mobile_index;


my $input_file = $ARGV[0];
my @working_data_array;
my $email;
my $mobile;
my $name;
my $location;
my $counter;

my $email_regex = qr/^ *[a-zA-Z0-9_.-]*@[a-zA-Z0-9_.-]*\.[a-zA-Z0-9_.-]*/;
my $mobile_regex = qr/^ *[04][0-9 ]{8,12}/;
my $name_regex = qr/^ *[a-z -]*/;
my $location_regex = qr/^ *[a-zA-Z0-9 ]*/;

&set_indexes;

# Three-argument open keeps odd characters in the file name from being
# interpreted as an open mode.
open( IN_FILE, '<', $input_file ) or die "Cannot open $input_file: $!";

while ( <IN_FILE> ) {
    next unless ( /@/ );    # skip lines that contain no e-mail address
    chomp;
    @working_data_array = split( /,/ );

    $email    = $working_data_array[$email_index];
    $name     = $working_data_array[$name_index];
    $location = $working_data_array[$location_index];
    $mobile   = $working_data_array[$mobile_index];

    print "$email";
    print "$name";
    print "$location";
    print "$mobile\n";

}

close IN_FILE;

exit;

# $ARGV[0] is the input file, so field names start at $ARGV[1];
# a name at position N on the command line maps to column N-1.
sub set_indexes {
    for $counter ( 1 .. $#ARGV ) {
        $email_index    = $counter-1 if ( $ARGV[$counter] =~ /email/ );
        $name_index     = $counter-1 if ( $ARGV[$counter] =~ /name/ );
        $location_index = $counter-1 if ( $ARGV[$counter] =~ /location/ );
        $mobile_index   = $counter-1 if ( $ARGV[$counter] =~ /mobile/ );
    }
}
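
The four index variables in the script above could also live in a single hash built with a hash slice. The following is only a sketch of that idea; the names index_fields and extract_fields are made up for illustration, and the field list is passed in directly rather than read from @ARGV.

```perl
use strict;
use warnings;

# Map field names, given in column order, to their positions.
sub index_fields {
    my @names = @_;
    my %index_of;
    @index_of{@names} = 0 .. $#names;    # hash slice: name => position
    return \%index_of;
}

# Pull the named fields out of one comma-separated line.
sub extract_fields {
    my ( $index_of, $line ) = @_;
    chomp $line;
    my @parts = split /\s*,\s*/, $line;
    my %row;
    for my $name ( keys %$index_of ) {
        my $value = $parts[ $index_of->{$name} ];
        $row{$name} = $value if defined $value;    # skip columns the line lacks
    }
    return \%row;
}

my $index_of = index_fields(qw(email name location mobile));
my $row      = extract_fields( $index_of, 'someone@example.com, Firstname Lastname' );
print join( ', ', map { "$_=$row->{$_}" } sort keys %$row ), "\n";
```

This still assumes every line uses the same column order, so it does not by itself handle lines where a middle field is missing.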
 
Tad McClellan

Some of the files even vary in the number of fields therein.

Example: (fields are email, name, postcode, phone)
(e-mail address removed), Firstname Lastname
(e-mail address removed), Firstname Lastname, 2004, 0412 321 512
(e-mail address removed), Firstname Lastname, 0412 321 512

I want to build a mechanism that can:

1. Autodetect the number of fields and "line-by-line" respectively
build the data structure as it goes.
2. Verify (or guess the "type" of field using regex)


------------------------
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while ( <DATA> ) {
    chomp;
    my %record;
    foreach my $part ( split /,\s*/ ) {
        if ( $part =~ /^\d+$/ )             # all digits
            { $record{postcode} = $part }
        elsif ( $part =~ /^[\d\s]+$/ )      # digits with spaces
            { $record{phone} = $part }
        elsif ( $part =~ /@/ )              # contains at-sign
            { $record{email} = $part }
        else
            { $record{name} = $part }
    }
    print Dumper \%record;
}

__DATA__
(e-mail address removed), Firstname Lastname
(e-mail address removed), Firstname Lastname, 2004, 0412 321 512
(e-mail address removed), Firstname Lastname, 0412 321 512
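
Since the original goal was to load the data into MySQL, here is one hedged sketch of how a %record like the ones above might be turned into a parameterised INSERT. The table name contacts, the column list, and the helper build_insert are assumptions for illustration; the DBI calls are left commented out because they need a live database, and the DSN and credentials shown are placeholders.

```perl
use strict;
use warnings;

# Hypothetical table layout: columns named after the guessed field types.
my @columns = qw(email name postcode phone);

# Build an INSERT statement with placeholders, plus the values to bind.
# Fields missing from the record come back as undef, which DBI binds as NULL.
sub build_insert {
    my ( $table, $record ) = @_;
    my $sql = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table,
        join( ', ', @columns ),
        join( ', ', ('?') x @columns );
    my @bind = map { $record->{$_} } @columns;
    return ( $sql, @bind );
}

my %record = (
    email => 'someone@example.com',
    name  => 'Firstname Lastname',
    phone => '0412 321 512',
);
my ( $sql, @bind ) = build_insert( 'contacts', \%record );
print "$sql\n";

# With a live connection (DSN, user and password are placeholders):
# use DBI;
# my $dbh = DBI->connect( 'dbi:mysql:database=test', $user, $pass, { RaiseError => 1 } );
# $dbh->prepare($sql)->execute(@bind);
```

Using placeholders rather than interpolating values into the SQL string avoids quoting problems with names such as O'Brien.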
 
alt.testing


thanks Tad
 
