Assistance parsing text file using Text::CSV_XS

  • Thread starter Domenico Discepola

Domenico Discepola

Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
The input file is structured as follows. "Fields" are separated with a
"\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting. How can I use Text::CSV_XS to solve my
problem? My code below only outputs the first line in the input file.
Thanks in advance.


#!perl
use strict;
use warnings;
use diagnostics;
use Text::CSV_XS;

our $g_file_input = shift @ARGV;
die "Usage: $0 filename\n" unless $g_file_input;

######
my ( @arr01 );

#Record separator - I tried using this and commenting this out
# local $/ = "\x0c";

my $csv = Text::CSV_XS->new( {'sep_char' => "\x0d\x0a", 'binary' => 1,
'always_quote' => 1 } );

open(TFILE, "< ${g_file_input}") || die "$!";
while (<TFILE>) {

my $line = $_;
my $status = $csv->parse($line) || print "Cannot parse\n";
my @arr_temp = $csv->fields();
push ( @arr01, [@arr_temp]);
print join('|', $_), "\n" for @arr_temp;

#exiting here for debugging only
exit;
}
close (TFILE) || die "$!\n";
 

Scott W Gifford

Domenico Discepola said:
Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
The input file is structured as follows. "Fields" are separated with a
"\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting. How can I use Text::CSV_XS to solve my
problem? My code below only outputs the first line in the input file.
Thanks in advance.

Text::CSV_XS assumes that it's handed a full record at a time, and
expects you to independently figure out where one record ends and the
next one begins.

So you have three choices.

The easiest is to use Text::xSV instead of Text::CSV_XS. This handles
embedded newlines as you'd expect, and in general works quite well.
Unfortunately I've found it's about 6 times slower than Text::CSV_XS.
If you can't afford that kind of slowdown, read on.

The next easiest thing to do is find record boundaries on your own.
In one application I wrote, I found this worked well; the file I had
always had lines ending in a quote followed by a newline, so I just
kept appending lines to a buffer until I found a quote at the end of a
line that wasn't preceded by an escape character, then passed it on to
Text::CSV_XS. This won't work with all data files, so it might not be
for you.

The third option is to take each line, ask Text::CSV_XS to parse it,
and if it fails, append the next line and try again. This should work
with properly formed CSV files, but will behave poorly in the face of
an error; if there's some corruption on the first line, you may not
read anything, since it will keep appending and finding the same
error.
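A variation on the second approach can be sketched in core Perl. Instead of looking for a quote at the end of a line, this sketch keeps appending lines until the buffer contains an even number of double quotes. It assumes quotes inside fields are escaped by doubling them (""); the helper name read_logical_records and the in-memory sample data are made up for illustration and are not part of Text::CSV_XS.

```perl
use strict;
use warnings;

# Hypothetical helper: group physical lines into logical records by
# appending lines until the buffer holds an even number of double quotes.
# Assumes quotes inside fields are escaped by doubling them ("").
sub read_logical_records {
    my ($fh) = @_;
    my @records;
    my $buffer = '';
    while ( my $line = <$fh> ) {
        $buffer .= $line;
        my $quotes = () = $buffer =~ /"/g;    # count the quote characters
        next if $quotes % 2;                  # odd: still inside a quoted field
        push @records, $buffer;
        $buffer = '';
    }
    push @records, $buffer if length $buffer;    # unterminated trailing data
    return @records;
}

# In-memory file handle standing in for a real data file.
my $data = qq{"a","b1\nb2","c"\n"d","e"\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_logical_records($fh);
close $fh;

print scalar(@records), " logical records\n";
```

Each element of @records could then be handed to a CSV parser as one complete record. As with the approaches above, a stray unbalanced quote in the data will cause everything after it to be swallowed into one record.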

Good luck!

----ScottG.
 

Tad McClellan

Domenico Discepola said:
I'm trying to parse a text file


We need the data as well as the code if we are to be able
to test the code...

"Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting.

our $g_file_input = shift @ARGV;


You should always prefer lexical (my) variables over package (our)
variables, except when you can't.


And you can, so make that:

my $g_file_input = shift @ARGV;

#Record separator - I tried using this and commenting this out
# local $/ = "\x0c";


If you leave it commented out, then you are reading 1 line at
a time rather than 1 record at a time.

I don't see how it would not be working if uncommented...

.... if I had data to run it against I could try it and see.

But I don't, so I can't. (hint)

open(TFILE, "< ${g_file_input}") || die "$!";


Why the unnecessary curly braces?

while (<TFILE>) {
my $line = $_;


If you want it in $line then put it there rather than putting
it somewhere else only to copy it to where you really want
it to be.

Calling it a "line" when it is not a line is asking for trouble.

my @arr_temp = $csv->fields();
push ( @arr01, [@arr_temp]);


No need to copy all that data, just take a reference directly:

push ( @arr01, \@arr_temp);
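
A minimal sketch of why taking the reference directly is safe here: an array declared with my inside the loop body is a fresh variable on every iteration, so each stored reference points at a different array. The variable names and the comma-split sample records are illustrative only.

```perl
use strict;
use warnings;

my @rows;
for my $record ( 'a,b', 'c,d' ) {
    my @fields = split /,/, $record;    # a new lexical array each iteration
    push @rows, \@fields;               # store a reference, no copy made
}

# Each row still holds its own data.
print join( '|', @$_ ), "\n" for @rows;
```

If the array were declared once outside the loop instead, every pushed reference would point at the same array, and the copy ([@arr_temp]) would be needed after all.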
 

Anno Siegel

Scott W Gifford said:
Text::CSV_XS assumes that it's handed a full record at a time, and
expects you to independently figure out where one record ends and the
next one begins.

Well, *record* separation is easily done in this case. Just set

local $/ = "\x0c";

and use <>, chomp() and whatever as usual to get one record each time.
If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
part.

OP only mentions embedded record separators, not field separators, so
this should work.
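
A minimal sketch of that record-reading step, assuming form-feed-terminated records as the original post describes. Note the backslash in "\x0c": without it the separator would be the literal three-character string x0c. The sample data is invented, and an in-memory file handle stands in for the real file.

```perl
use strict;
use warnings;

# Invented sample: CRLF between fields, FF ("\x0c") after each record.
my $data = qq{"f1"\r\n"f2a\r\nf2b"\r\n\x0c"f3"\r\n\x0c};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\x0c";    # input record separator is now a form feed
    while ( my $record = <$fh> ) {
        chomp $record;    # chomp removes the trailing FF, since $/ is FF
        push @records, $record;
    }
}
close $fh;

printf "%d records read\n", scalar @records;
```

The embedded CRLFs stay inside each record untouched, so whatever parses the fields afterwards sees one whole record at a time.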

Anno
 

Brad Baxter

Well, *record* separation is easily done in this case. Just set

local $/ = "\x0c";

and use <>, chomp() and whatever as usual to get one record each time.
If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
part.

It isn't upset if you specify 'binary' => 1 in the new() call.

OP only mentions embedded record separators, not field separators, so
this should work.

I see a reference to an 'eol' character in CSV_XS, but it's apparently
only for output--not reading.

Regards,

Brad
 

Domenico Discepola

I see a reference to an 'eol' character in CSV_XS, but it's apparently
only for output--not reading.
Yes, the 'eol' attribute is what led me to think I could use this
module.
 

Domenico Discepola

Tad McClellan said:
We need the data as well as the code if we are to be able
to test the code...
... if I had data to run it against I could try it and see.

But I don't, so I can't. (hint)

I will reproduce the data here, but because it contains embedded binary
characters, I can only "simulate" them:

begin sample data file

"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c
"field 4: value 4"\n"field 5: value5"\n\x0c

end sample data file

This data was exported from a Lotus Notes database using the structured text
format. Note that each "record" can contain different "fields" (as is shown
in the sample data).
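
For illustration, here is a sketch in core Perl of how the simulated sample above could be turned into a 2-d array. It is a hand-rolled stand-in, not Text::CSV_XS: records are read with $/ set to the form feed, and a regex pulls out each double-quoted field. It assumes every field is quoted and that a quote inside a field would be escaped by doubling it; the variable names are made up.

```perl
use strict;
use warnings;

# Simulated sample data: "\n" separates fields, "\x0c" (FF) ends a record.
my $data = join '',
    qq{"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c},
    qq{"field 4: value 4"\n"field 5: value5"\n\x0c};

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @table;    # 2-d array: one array reference per record
{
    local $/ = "\x0c";    # read one FF-terminated record at a time
    while ( my $record = <$fh> ) {
        chomp $record;
        # Capture the contents of every double-quoted field; a doubled
        # quote ("") inside a field is treated as an escaped quote.
        my @fields = $record =~ /"((?:[^"]|"")*)"/g;
        next unless @fields;    # skip records with no quoted fields
        s/""/"/g for @fields;   # unescape doubled quotes
        push @table, \@fields;
    }
}
close $fh;

printf "record %d has %d fields\n", $_ + 1, scalar @{ $table[$_] }
    for 0 .. $#table;
```

Because each record keeps its own array, records with different numbers of fields are not a problem, and the embedded newline in "field 2" survives inside its field.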
 
