Assistance parsing text file using Text::CSV_XS

Domenico Discepola · Sep 1, 2004

Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
The input file is structured as follows. "Fields" are separated with a
"\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting. How can I use Text::CSV_XS to solve my
problem? My code below only outputs the first line in the input file.
Thanks in advance.

#!perl
use strict;
use warnings;
use diagnostics;
use Text::CSV_XS;

our $g_file_input = shift @ARGV;
die "Usage: $0 filename\n" unless $g_file_input;

######
my ( @arr01 );

#Record separator - I tried using this and commenting this out
# local $/ = "\x0c";

my $csv = Text::CSV_XS->new( {'sep_char' => "\x0d\x0a", 'binary' => 1,
'always_quote' => 1 } );

open(TFILE, "< ${g_file_input}") || die "$!";
while (<TFILE>) {

    my $line = $_;
    my $status = $csv->parse($line) || print "Cannot parse\n";
    my @arr_temp = $csv->fields();
    push ( @arr01, [@arr_temp]);
    print join('|', $_), "\n" for @arr_temp;

    #exiting here for debugging only
    exit;
}
close (TFILE) || die "$!\n";

Scott W Gifford · Sep 1, 2004

Domenico Discepola said:
Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
The input file is structured as follows. "Fields" are separated with a
"\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting. How can I use Text::CSV_XS to solve my
problem? My code below only outputs the first line in the input file.
Thanks in advance.

Text::CSV_XS assumes that it's handed a full record at a time, and
expects you to independently figure out where one record ends and the
next one begins.

So you have three choices.

The easiest is to use Text::xSV instead of Text::CSV_XS. This handles
embedded newlines as you'd expect, and in general works quite well.
Unfortunately I've found it's about 6 times slower than Text::CSV_XS.
If you can't afford that kind of slowdown, read on.

The next easiest thing to do is find record boundaries on your own.
In one application I wrote, I found this worked well; the file I had
always had lines ending in a quote followed by a newline, so I just
kept appending lines to a buffer until I found a quote at the end of a
line that wasn't preceded by an escape character, then passed it on to
Text::CSV_XS. This won't work with all data files, so it might not be
for you.

The third option is to take each line, ask Text::CSV_XS to parse it,
and if it fails, append the next line and try again. This should work
with properly formed CSV files, but will behave poorly in the face of
an error; if there's some corruption on the first line, you may not
read anything, since it will keep appending and finding the same
error.
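
The append-and-retry idea can be sketched roughly as follows. This is only an illustration using core Perl: a quote-parity check (the hypothetical quotes_balanced helper) stands in for the call to Text::CSV_XS's parse(), and the $max_lines limit is an assumed safeguard against the corrupt-line problem described above, not something the module provides.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: true when $text holds an even number of double
# quotes, i.e. no quoted field is left open. A doubled quote ("") used
# as an escape adds two, so it does not disturb the parity.
sub quotes_balanced {
    my ($text) = @_;
    my $count = () = $text =~ /"/g;
    return $count % 2 == 0;
}

# Group physical lines into logical records. $max_lines is an assumed
# safety limit so that one corrupt line cannot swallow the whole input.
sub group_records {
    my ($lines, $max_lines) = @_;
    my (@records, $buffer);
    my $pending = 0;
    for my $line (@$lines) {
        $buffer = defined $buffer ? $buffer . $line : $line;
        $pending++;
        if (quotes_balanced($buffer)) {
            # The buffer is a complete record: hand it on and start over.
            push @records, $buffer;
            ($buffer, $pending) = (undef, 0);
        }
        elsif ($pending >= $max_lines) {
            warn "Giving up on a record after $pending lines\n";
            ($buffer, $pending) = (undef, 0);
        }
    }
    warn "Unterminated record at end of input\n" if defined $buffer;
    return \@records;
}

# Three physical lines; the first two form one record with an embedded newline.
my @lines   = ("\"a\",\"b1\n", "b2\"\n", "\"c\",\"d\"\n");
my $records = group_records(\@lines, 10);
print scalar(@$records), " records\n";
```

Each completed buffer would then be passed to the real parser. The line limit trades completeness for robustness: a damaged record is dropped with a warning instead of absorbing everything after it.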

Good luck!

----ScottG.

Tad McClellan · Sep 1, 2004

Domenico Discepola said:
I'm trying to parse a text file

We need the data as well as the code if we are to be able
to test the code...

"Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting.

our $g_file_input = shift @ARGV;

You should always prefer lexical (my) variables over package (our)
variables, except when you can't.

And you can, so make that:

my $g_file_input = shift @ARGV;

#Record separator - I tried using this and commenting this out
# local $/ = "\x0c";

If you leave it commented out, then you are reading 1 line at
a time rather than 1 record at a time.

I don't see how it would not be working if uncommented...

... if I had data to run it against I could try it and see.

But I don't, so I can't. (hint)

open(TFILE, "< ${g_file_input}") || die "$!";

Why the unnecessary curly braces?

while (<TFILE>) {
my $line = $_;

If you want it in $line then put it there rather than putting
it somewhere else only to copy it to where you really want
it to be.

Calling it a "line" when it is not a line is asking for trouble.

my @arr_temp = $csv->fields();
push ( @arr01, [@arr_temp]);

No need to copy all that data, just take a reference directly:

push ( @arr01, \@arr_temp);
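
A small sketch of why taking the reference directly is safe here, assuming the array is declared with my inside the loop body as in the posted code. The names collect_rows and @temp are made up for illustration.

```perl
use strict;
use warnings;

# Because @temp is declared with "my" inside the loop, every iteration
# gets a fresh array, so the stored references do not alias each other.
sub collect_rows {
    my @rows;
    for my $n (1 .. 3) {
        my @temp = ($n, $n * 2);
        push @rows, \@temp;    # no copy, just a reference
    }
    return \@rows;
}

my $rows = collect_rows();
print join('|', @$_), "\n" for @$rows;
```

If the array were declared once outside the loop instead, every stored reference would point at the same array, and the copying form [@arr_temp] would be the one to use.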

Anno Siegel · Sep 1, 2004

Scott W Gifford said:
Text::CSV_XS assumes that it's handed a full record at a time, and
expects you to independently figure out where one record ends and the
next one begins.

Well, *record* separation is easily done in this case. Just set

local $/ = "\x0c";

and use <>, chomp() and whatever as usual to get one record each time.
If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
part.

OP only mentions embedded record separators, not field separators, so
this should work.
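
A minimal sketch of that record-reading step, assuming form-feed-terminated records as the OP described. It reads from an in-memory string so it can be run without a data file; read_records and the sample bytes are invented for the example.

```perl
use strict;
use warnings;

# Simulated input: two records, each terminated by a form feed (\x0c),
# with CRLF between and after the quoted fields.
my $data = "\"a\"\r\n\"b\"\r\n\x0c\"c\"\r\n\x0c";

sub read_records {
    my ($text) = @_;
    # An in-memory filehandle stands in for the real input file.
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    binmode $fh;
    local $/ = "\x0c";    # the input record separator is now FF
    my @records;
    while (my $record = <$fh>) {
        chomp $record;    # chomp removes the trailing $/, i.e. the FF
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records($data);
print scalar(@records), " records read\n";
```

Each element of @records still contains the CRLF-separated quoted fields, so splitting a record into fields remains a separate step.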

Anno

Brad Baxter · Sep 2, 2004

Anno Siegel said:
Well, *record* separation is easily done in this case. Just set

local $/ = "\x0c";

and use <>, chomp() and whatever as usual to get one record each time.
If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
part.

It isn't upset if you specify 'binary' => 1 in the new() call.

OP only mentions embedded record separators, not field separators, so
this should work.

I see a reference to an 'eol' character in CSV_XS, but it's apparently
only for output--not reading.

Regards,

Brad

Domenico Discepola · Sep 2, 2004

Brad Baxter said:
I see a reference to an 'eol' character in CSV_XS, but it's apparently
only for output--not reading.

Yes, the 'eol' attribute is what confused me into thinking I could use this
module.

Domenico Discepola · Sep 2, 2004

Tad McClellan said:
We need the data as well as the code if we are to be able
to test the code...

... if I had data to run it against I could try it and see.

But I don't, so I can't. (hint)

I will reproduce the data here, but because it contains embedded binary
characters, I can only "simulate" them:

begin sample data file

"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c
"field 4: value 4"\n"field 5: value5"\n\x0c

end sample data file

This data was exported from a Lotus Notes database using the structured text
format. Note that each "record" can contain different "fields" (as is shown
in the sample data).
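
For reference, a rough core-Perl sketch of turning data shaped like this sample into a 2-d array. It is not a Text::CSV_XS solution: it assumes every field is double-quoted, that a literal quote inside a field would be doubled (""), and that no field contains a form feed. The function name parse_records is invented for the example.

```perl
use strict;
use warnings;

# The sample above, with the simulated \n and \x0c written as real bytes.
my $sample = qq{"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c}
           . qq{"field 4: value 4"\n"field 5: value5"\n\x0c};

sub parse_records {
    my ($text) = @_;
    my @rows;
    # Records end with a form feed; this split assumes no field contains one.
    for my $record (split /\x0c/, $text) {
        # Capture the contents of each double-quoted field. Newlines inside
        # the quotes are kept; "" is treated as an escaped quote (assumption).
        my @fields = $record =~ /"((?:[^"]|"")*)"/g;
        s/""/"/g for @fields;
        push @rows, \@fields if @fields;
    }
    return \@rows;
}

my $rows = parse_records($sample);
for my $row (@$rows) {
    print scalar(@$row), " fields: ", join('|', @$row), "\n";
}
```

Rows may have different lengths, which matches the note that each record can contain different fields.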
