Regex to extract CSV file

V

Vito Corleone

Hi,

I have CSV file that looks like this:
1,this is the title,2004/03/05,this is details
2,another title,2004/05/05,another details

And I extract it using split like this:
@row = split(",", $line);

The problem is, if there is coma in text, it will turn to this:
1,"title , with coma",2004/03/05,and the details
2,title without coma,2004/05/09,"but details, has coma"

Is there any efficient way to extract this?
 
A

Anno Siegel

Vito Corleone said:
Hi,

I have CSV file that looks like this:
1,this is the title,2004/03/05,this is details
2,another title,2004/05/05,another details

And I extract it using split like this:
@row = split(",", $line);

The problem is, if there is coma in text, it will turn to this:
1,"title , with coma",2004/03/05,and the details
2,title without coma,2004/05/09,"but details, has coma"

That is a FAQ: "How can I split a [character] delimited string except
when inside [character]?"
Is there any efficient way to extract this?

A search on CPAN for "CSV" would have shown you a handful of modules
for that purpose.

Anno
 
J

Josef Moellers

Vito said:
Hi,

I have CSV file that looks like this:
1,this is the title,2004/03/05,this is details
2,another title,2004/05/05,another details

And I extract it using split like this:
@row = split(",", $line);

The problem is, if there is coma in text, it will turn to this:
1,"title , with coma",2004/03/05,and the details
2,title without coma,2004/05/09,"but details, has coma"

Is there any efficient way to extract this?

Use Text::CSV from CPAN.

#! /usr/bin/perl -w

use Text::CSV;

$line = '1,"title , with coma",2004/03/05,and the details';
$csv = Text::CSV->new();
$status = $csv->parse($line);
@columns = $csv->fields();
print $columns[1], "\n";

exit 0;

prints

title , with coma
 
J

John Bokma

Josef said:
From the perl manpage on my system:

"Did we mention that you should definitely consider using the -w switch?"

" The "warnings" pragma is a replacement for the command line flag "-w",

perldoc warnings
perldoc perllexwarn
 
P

Peter J. Acklam

Josef Moellers said:
From the perl manpage on my system:

"Did we mention that you should definitely consider using the -w
switch?"

It is, in a way, cleaner to "use warnings", but since it was
introduced very recently it can't be used in scripts that are to
be processed by older versions of perl.

I only use "-w" in one-liners, otherwise I use

#!/usr/bin/env perl
...
BEGIN { $^W = 1 } # equivalent to "-w" option

Peter
 
J

Joe Smith

Josef said:
From the perl manpage on my system:
"Did we mention that you should definitely consider using the -w switch?"

How old is your version of perl? If your version of perl understands
use warnings;
then that is recommended in place of the -w switch.
 
J

Josef Moellers

Joe said:
How old is your version of perl? If your version of perl understands
use warnings;
then that is recommended in place of the -w switch.

It's perl v5.8.1. Maybe the version of my brain is quite old, so go figure.
 
A

Alf Timms

Vito Corleone said:
Hi,

I have CSV file that looks like this:
1,this is the title,2004/03/05,this is details
2,another title,2004/05/05,another details

And I extract it using split like this:
@row = split(",", $line);

The problem is, if there is coma in text, it will turn to this:
1,"title , with coma",2004/03/05,and the details
2,title without coma,2004/05/09,"but details, has coma"

That is a FAQ: "How can I split a [character] delimited string except
when inside [character]?"
Is there any efficient way to extract this?

A search on CPAN for "CSV" would have shown you a handful of modules
for that purpose.

Anno

vito,

these things are best done quickly with a regular expression. no need for cpan here:

@row = $line =~ /([^",]+|"[^"]+")/g;

still one problem. double-quoted elements are still double-quoted. easily fixed:

foreach( @row ) { s/^"(.*)"$/$1/ }

alf
 
U

Uri Guttman

AT> these things are best done quickly with a regular expression. no need for cpan here:

AT> @row = $line =~ /([^",]+|"[^"]+")/g;

and what about embedded "'s in a field?

AT> still one problem. double-quoted elements are still double-quoted. easily fixed:

AT> foreach( @row ) { s/^"(.*)"$/$1/ }

more than one problem left.

these things are best done quickly and CORRECTLY by a module.

uri
 
D

Dale Henderson

VC> Hi, I have CSV file that looks like this: 1,this is the
VC> title,2004/03/05,this is details 2,another
VC> title,2004/05/05,another details

VC> And I extract it using split like this: @row = split(",",
VC> $line);

VC> The problem is, if there is coma in text, it will turn to
VC> this: 1,"title , with coma",2004/03/05,and the details 2,title
VC> without coma,2004/05/09,"but details, has coma"

VC> Is there any efficient way to extract this?


This is ironic. Just the other night I was flipping through
"Mastering Regular Expressions" 1st ed and found this example

@fields=();
while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g){
push (@fields, defined($1)?$1:$3);
}
push (@fields,undef) if $text =~ m/,$/;


Of course the text doesn't say anything about its efficiency. The
most efficient solution is probably to use Text::CSV as others
have pointed out.

If you're interested in a super overkill solution, you can use
the DBI module with the csv DBD (can't remember what its called)
and access your csv file like a database. :)
 
E

Eric Bohlman

If you're interested in a super overkill solution, you can use
the DBI module with the csv DBD (can't remember what its called)

Would you believe DBD::CSV?
 
C

Clyde Ingram

Vito,
VC> The problem is, if there is coma in text, it will turn to
VC> this: 1,"title , with coma",2004/03/05,and the details 2,title
VC> without coma,2004/05/09,"but details, has coma"

VC> Is there any efficient way to extract this?

perldoc Text::parseWords

Perhaps:
@words = quotewords( ',', 0, $line );

Regards,
Clyde
 
C

Clyde Ingram

Eric,

Eric Bohlman said:
Would you believe DBD::CSV?

Interesting ... Isn't DBD::CSV trustworthy?

It may be a bit heavy handed, of course.
But for fun, I once knocked up this (on Windoze XPee):

#!e:/bin/perl.exe -w

use strict;
use Data::Dumper;
local $Data::Dumper::Terse = 0;
local $Data::Dumper::Indent = 1;

use DBI;

my $CSV_DIR="D:/Clyde/perldev/Trial";

my $dbh = DBI->connect("DBI:CSV:f_dir=$CSV_DIR")
or die "Cannot connect: " . $DBI::errstr;
$dbh->{'csv_tables'}->{'fractions'} = { 'file' => 'fractions.csv'};

$dbh->{'RaiseError'} = 1;
$@ = '';
eval {
my $sth = $dbh->prepare("SELECT * FROM fractions")
or die "Cannot prepare: " . $dbh->errstr();
$sth->execute() or die "Cannot execute: " . $sth->errstr();

while (my $row = $sth->fetchrow_hashref) {
print("Found result row:\n" . Data::Dumper->Dump( [$row] ) . "\n");
}
$sth->finish();
$dbh->disconnect();

};
if ($@) { die "SQL database error: $@"; }

The data file "fractions.csv" starts:

Numerator,Denominator,Decimal,Percentage,Total
5,6,0.833333333,83.33333333,84.16666667
8,3,2.666666667,266.6666667,269.3333333


And the output starts:
Found result row:
$VAR1 = {
'Total' => '84.16666667',
'Numerator' => '5',
'Denominator' => '6',
'Percentage' => '83.33333333',
'Decimal' => '0.833333333'
};

Found result row:
$VAR1 = {
'Total' => '269.3333333',
'Numerator' => '8',
'Denominator' => '3',
'Percentage' => '266.6666667',
'Decimal' => '2.666666667'
};

(I don't recall having to register the database with XP)

Regards,
Clyde
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,014
Latest member
BiancaFix3

Latest Threads

Top