Assistance parsing text file using Text::CSV_XS

Discussion in 'Perl Misc' started by Domenico Discepola, Sep 1, 2004.

  1. Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
    The input file is structured as follows. "Fields" are separated with a
    "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
    separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
    the need for double-quoting. How can I use Text::CSV_XS to solve my
    problem? My code below only outputs the first line in the input file.
    Thanks in advance.


    #!perl
    use strict;
    use warnings;
    use diagnostics;
    use Text::CSV_XS;

    our $g_file_input = shift @ARGV;
    die "Usage: $0 filename\n" unless $g_file_input;

    ######
    my ( @arr01 );

    #Record separator - I tried using this and commenting this out
    # local $/ = "\x0c";

    my $csv = Text::CSV_XS->new( {'sep_char' => "\x0d\x0a", 'binary' => 1,
    'always_quote' => 1 } );

    open(TFILE, "< ${g_file_input}") || die "$!";
    while (<TFILE>) {

    my $line = $_;
    my $status = $csv->parse($line) || print "Cannot parse\n";
    my @arr_temp = $csv->fields();
    push ( @arr01, [@arr_temp]);
    print join('|', $_), "\n" for @arr_temp;

    #exiting here for debugging only
    exit;
    }
    close (TFILE) || die "$!\n";
    Domenico Discepola, Sep 1, 2004
    #1

  2. "Domenico Discepola" <> writes:

    > Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
    > The input file is structured as follows. "Fields" are separated with a
    > "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
    > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
    > the need for double-quoting. How can I use Text::CSV_XS to solve my
    > problem? My code below only outputs the first line in the input file.
    > Thanks in advance.


    Text::CSV_XS assumes that it's handed a full record at a time, and
    expects you to independently figure out where one record ends and the
    next one begins.

    So you have three choices.

    The easiest is to use Text::xSV instead of Text::CSV_XS. This handles
    embedded newlines as you'd expect, and in general works quite well.
    Unfortunately I've found it's about 6 times slower than Text::CSV_XS.
    If you can't afford that kind of slowdown, read on.

    The next easiest thing to do is find record boundaries on your own.
    In one application I wrote, I found this worked well; the file I had
    always had lines ending in a quote followed by a newline, so I just
    kept appending lines to a buffer until I found a quote at the end of a
    line that wasn't preceded by an escape character, then passed it on to
    Text::CSV_XS. This won't work with all data files, so it might not be
    for you.

    The third option is to take each line, ask Text::CSV_XS to parse it,
    and if it fails, append the next line and try again. This should work
    with properly formed CSV files, but will behave poorly in the face of
    an error; if there's some corruption on the first line, you may not
    read anything, since it will keep appending and finding the same
    error.
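
    A minimal sketch of that third option, for illustration only. It assumes
    ordinary comma-separated input (not the CRLF/FF layout from the question),
    made-up sample data, and an in-memory filehandle in place of a real file.
    A failed parse() is treated as "record not finished yet", which is exactly
    why a corrupt line would make the buffer grow without end.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 allows newlines inside quoted fields.
my $csv = Text::CSV_XS->new( { binary => 1 } );

# Hypothetical input: two records, the first with a newline
# embedded in its second (quoted) field.
my $data = qq{"a","b1\nb2","c"\n"d","e"\n};
open( my $fh, '<', \$data ) or die "Cannot open in-memory file: $!";

my @rows;
my $buffer = '';
while ( my $line = <$fh> ) {
    $buffer .= $line;

    # Assume a failed parse means the record continues on the next line.
    next unless $csv->parse($buffer);

    push @rows, [ $csv->fields() ];
    $buffer = '';
}
close($fh) or die "$!";

# Anything left over never parsed, e.g. because of a stray quote.
warn "Unparsed trailing data\n" if length $buffer;

print scalar(@rows), " records read\n";
```

    A real program would probably cap the buffer size so that one bad line
    cannot swallow the rest of the file.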

    Good luck!

    ----ScottG.
    Scott W Gifford, Sep 1, 2004
    #2

  3. Domenico Discepola <> wrote:

    > I'm trying to parse a text file



    We need the data as well as the code if we are to be able
    to test the code...


    > "Records" are
    > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
    > the need for double-quoting.



    > our $g_file_input = shift @ARGV;



    You should always prefer lexical (my) variables over package (our)
    variables, except when you can't.


    And you can, so make that:

    my $g_file_input = shift @ARGV;


    > #Record separator - I tried using this and commenting this out
    > # local $/ = "\x0c";



    If you leave it commented out, then you are reading 1 line at
    a time rather than 1 record at a time.

    I don't see how it would not be working if uncommented...

    .... if I had data to run it against I could try it and see.

    But I don't, so I can't. (hint)


    > open(TFILE, "< ${g_file_input}") || die "$!";



    Why the unnecessary curly braces?


    > while (<TFILE>) {
    > my $line = $_;



    If you want it in $line then put it there rather than putting
    it somewhere else only to copy it to where you really want
    it to be.

    Calling it a "line" when it is not a line is asking for trouble.

    while ( my $record = <TFILE> ) { # $record instead of $line


    > my @arr_temp = $csv->fields();
    > push ( @arr01, [@arr_temp]);



    No need to copy all that data, just take a reference directly:

    push ( @arr01, \@arr_temp);
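
    A small sketch of why taking the reference is safe here, using made-up
    pipe-separated strings in place of parsed records. Because @arr_temp is
    declared with my inside the loop, each iteration gets a fresh array, so
    the stored references do not end up pointing at the same data.

```perl
use strict;
use warnings;

my @arr01;
for my $record ( 'a|b', 'c|d' ) {    # hypothetical stand-ins for parsed records
    # A new lexical array is created on every pass through the loop.
    my @arr_temp = split /\|/, $record;

    # Store a reference to it; no copy of the elements is made.
    push @arr01, \@arr_temp;
}

# Each row keeps its own contents.
print join( ',', @$_ ), "\n" for @arr01;
```

    If @arr_temp were declared once outside the loop instead, every element
    of @arr01 would refer to the same array and the copy would be needed.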


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 1, 2004
    #3
  4. Anno Siegel (guest) wrote:

    Scott W Gifford <> wrote in comp.lang.perl.misc:
    > "Domenico Discepola" <> writes:
    >
    > > Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
    > > The input file is structured as follows. "Fields" are separated with a
    > > "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
    > > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
    > > the need for double-quoting. How can I use Text::CSV_XS to solve my
    > > problem? My code below only outputs the first line in the input file.
    > > Thanks in advance.

    >
    > Text::CSV_XS assumes that it's handed a full record at a time, and
    > expects you to independently figure out where one record ends and the
    > next one begins.


    Well, *record* separation is easily done in this case. Just set

    local $/ = "\x0c";

    and use <>, chomp() and whatever as usual to get one record each time.
    If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
    part.

    OP only mentions embedded record separators, not field separators, so
    this should work.
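
    A minimal sketch of reading one record per iteration this way. The sample
    data and the in-memory filehandle are assumptions made for illustration;
    only the FF record separator comes from the question.

```perl
use strict;
use warnings;

# Made-up data: two records, each ended by FF (\x0c), with CRLF between fields.
my $data = qq{"f1"\x0d\x0a"f2a\x0d\x0af2b"\x0d\x0a\x0c"f3"\x0d\x0a\x0c};
open( my $fh, '<', \$data ) or die "Cannot open in-memory file: $!";

my @records;
{
    # "\x0c" is the form feed character; the backslash is required.
    local $/ = "\x0c";
    while ( my $record = <$fh> ) {
        chomp $record;    # strips the trailing $/ (the FF), not the CRLFs
        push @records, $record;
    }
}
close($fh) or die "$!";

print scalar(@records), " records read\n";
```

    Each element of @records still contains the CRLFs between and inside the
    quoted fields; splitting those into fields is the remaining step.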

    Anno
    Anno Siegel, Sep 1, 2004
    #4
  5. Brad Baxter (guest) wrote:

    On Wed, 1 Sep 2004, Anno Siegel wrote:

    > Scott W Gifford <> wrote in comp.lang.perl.misc:
    > > "Domenico Discepola" <> writes:
    > >
    > > > Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
    > > > The input file is structured as follows. "Fields" are separated with a
    > > > "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
    > > > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
    > > > the need for double-quoting. How can I use Text::CSV_XS to solve my
    > > > problem? My code below only outputs the first line in the input file.
    > > > Thanks in advance.

    > >
    > > Text::CSV_XS assumes that it's handed a full record at a time, and
    > > expects you to independently figure out where one record ends and the
    > > next one begins.

    >
    > Well, *record* separation is easily done in this case. Just set
    >
    > local $/ = "\x0c";
    >
    > and use <>, chomp() and whatever as usual to get one record each time.
    > If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
    > part.


    It isn't upset if you specify 'binary' => 1 in the new() call.


    > OP only mentions embedded record separators, not field separators, so
    > this should work.


    I see a reference to an 'eol' character in CSV_XS, but it's apparently
    only for output--not reading.

    Regards,

    Brad
    Brad Baxter, Sep 2, 2004
    #5

  6. > > OP only mentions embedded record separators, not field separators, so
    > > this should work.

    >
    > I see a reference to an 'eol' character in CSV_XS, but it's apparently
    > only for output--not reading.
    >

    Yes, the 'eol' attribute is what confused me into thinking I can use this
    module.
    Domenico Discepola, Sep 2, 2004
    #6
  7. "Tad McClellan" <> wrote in message
    news:...
    > Domenico Discepola <> wrote:
    >
    > We need the data as well as the code if we are to be able
    > to test the code...
    >
    > > "Records" are
    > > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
    > > the need for double-quoting.


    > ... if I had data to run it against I could try it and see.
    >
    > But I don't, so I can't. (hint)


    I will reproduce the data here but because there exists embedded binary
    characters, I can only "simulate" them:

    begin sample data file

    "field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c
    "field 4: value 4"\n"field 5: value5"\n\x0c

    end sample data file

    This data was exported from a Lotus Notes database using the structured text
    format. Note that each "record" can contain different "fields" (as is shown
    in the sample data).
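
    One possible way to turn data shaped like the sample into a 2-d array,
    sketched in plain Perl rather than with Text::CSV_XS. It assumes the
    simulated \n and \x0c stand for the real characters, that every field is
    double-quoted, and that a literal quote inside a field would be doubled
    (""); that last point is a guess, not something the export format is
    known to do.

```perl
use strict;
use warnings;

# The sample data from the post, with the simulated characters made real.
my $data =
    qq{"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c}
  . qq{"field 4: value 4"\n"field 5: value5"\n\x0c};

my @rows;
for my $record ( split /\x0c/, $data ) {
    # Capture the contents of each double-quoted field; newlines inside
    # the quotes are kept, newlines between fields are skipped.
    my @fields = $record =~ /"((?:[^"]|"")*)"/g;

    # Assumed escape convention: "" inside a field means one quote.
    s/""/"/g for @fields;

    push @rows, \@fields if @fields;
}

printf "record %d has %d fields\n", $_ + 1, scalar @{ $rows[$_] } for 0 .. $#rows;
```

    Records may have different numbers of fields, so @rows is a ragged array
    and each row should be measured separately.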
    Domenico Discepola, Sep 2, 2004
    #7
