Split and RegEx Help

Vaughn Sargent

Hi,

I have some flat file data coming in from a client that is ~
delimited. I have no control over the incoming data so I'm stuck with
the ~. What I need to do is split the data into fields (which should
be easy using split) however, some text fields may contain a ~ but
it's not the delimiter, it's just part of the text field. All text
fields are enclosed with double quotes. Number and date fields are
not. So I may receive data such as the following 5-field line:

"Field - One"~234.00~"Field ~ 3"~20040830~"Field 5"

What I want returned is:

Field 1: "Field - One"
Field 2: 234.00
Field 3: "Field ~ 3"
Field 4: 20040830
Field 5: "Field 5"

But using split, Perl returns:

Field 1: "Field - One"
Field 2: 234.00
Field 3: "Field
Field 4: 3"
Field 5: 20040830
Field 6: "Field 5"

It would also be possible for any text field to contain more than one
~ as a non-delimiter which may or may not be next to each other.

I guess what I would like to tell perl is "Hey, Perl, split this line
for me using ~ as the delimiter but ~ isn't a delimiter if there are
double quotes around it."

I thought it might be possible to use split and tack on a regular
expression. I'm a newbie when it comes to regular expressions.

If anyone can help me out I'd be very grateful.

Vaughn
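
For readers who want the "split plus a regular expression" idea from the question spelled out, here is a minimal sketch, not something posted in the thread. It assumes every double quote in a line is part of a balanced pair and that text fields never contain an embedded double quote. The helper name split_outside_quotes is made up for illustration. The lookahead accepts a ~ as a delimiter only when an even number of double quotes follows it, which means the ~ sits outside any quoted field.

```perl
use strict;
use warnings;

# Hypothetical helper: split on ~ only when it is outside double quotes.
# Assumes quotes are balanced and never escaped inside a field.
sub split_outside_quotes {
    my ($line) = @_;
    # A ~ is a delimiter when the rest of the line holds an even
    # number of double quotes. The -1 limit keeps trailing empty fields.
    return split /~(?=(?:[^"]*"[^"]*")*[^"]*\z)/, $line, -1;
}

my $line = '"Field - One"~234.00~"Field ~ 3"~20040830~"Field 5"';
my @fields = split_outside_quotes($line);
print "Field $_: $fields[$_ - 1]\n" for 1 .. @fields;
```

The enclosing quotes stay on the text fields, as in the wanted output above. The lookahead rescans the rest of the line for every ~, so on very long lines a real parser module is likely the better choice.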
 
Jürgen Exner

Vaughn said:
I have some flat file data coming in from a client that is ~
delimited. I have no control over the incoming data so I'm stuck with
the ~. What I need to do is split the data into fields (which should
be easy using split) however, some text fields may contain a ~ but
it's not the delimiter, it's just part of the text field. All text
fields are enclosed with double quotes.

You may want to have a look at Text::CSV.
Although it uses a comma as the separator for the data fields it should be
trivial to copy the source code and modify it to use the ~ instead.

jue
 
Uri Guttman

JE> You may want to have a look at Text::CSV. Although it uses a
JE> comma as the separator for the data fields it should be trivial to
JE> copy the source code and modify it to use the ~ instead.

without even looking, i wager it has an option to set the separator. it
is too easy and such a commonly needed feature to believe it doesn't
support that.

uri
 
Jürgen Exner

Vaughn said:
I'm stuck with the ~. What I need to do is split the data into fields
(which should be easy using split) however, some text fields may
contain a ~ but it's not the delimiter, it's just part of the
text field. All text fields are enclosed with double quotes.

Uri said:
without even looking, i wager it has an option to set the separator.
it is too easy and such a commonly needed feature to believe it
doesn't support that.

For a second you really scared me because I didn't check, either.

However according to the module doc on CPAN the standard Text::CSV does not
support changing the separator character (I win).
For that you need to use Text::CSV_XS (you win):

new(\%attr)

sep_char


The char used for separating fields, by default a comma. (,)

Now, what do we do with the prizes?

jue
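
To make the sep_char suggestion concrete, here is a minimal sketch of how it might be used, assuming Text::CSV_XS is installed from CPAN. sep_char and quote_char are attributes from the module documentation. The binary => 1 setting and the helper name parse_tilde_line are assumptions added for illustration and do not come from the thread.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# sep_char switches the separator from the default comma to ~.
# binary => 1 is an assumption here, so non-ASCII text does not fail.
my $csv = Text::CSV_XS->new({ sep_char => '~', quote_char => '"', binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Hypothetical helper: parse one line and return its fields.
sub parse_tilde_line {
    my ($line) = @_;
    $csv->parse($line) or die "Parse failed: " . $csv->error_diag;
    return $csv->fields;
}

my @fields = parse_tilde_line('"Field - One"~234.00~"Field ~ 3"~20040830~"Field 5"');
print "Field $_: $fields[$_ - 1]\n" for 1 .. @fields;
```

Note that the parser returns the text fields without their enclosing double quotes, so a ~ inside a quoted field comes back as ordinary field content.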
 
Uri Guttman

JE> For a second you really scared me because I didn't check, either.

JE> However according to the module doc on CPAN the standard Text::CSV does not
JE> support changing the separator character (I win).
JE> For that you need to use Text::CSV_XS (you win):

JE> new(\%attr)

JE> sep_char


JE> The char used for separating fields, by default a comma. (,)

JE> Now, what do we do with the prizes?

well, i say it is a push (pun intended!).

odd how the xs version which usually requires more work has the option.

uri
 
Tore Aursand

I have some flat file data coming in from a client that is ~ delimited.
I have no control over the incoming data so I'm stuck with the ~. What
I need to do is split the data into fields (which should be easy using
split) however, some text fields may contain a ~ but it's not the
delimiter, it's just part of the text field. All text fields are
enclosed with double quotes.

You could try the Text::ParseWords module, which I think comes with Perl
these days:

#!/usr/bin/perl
#
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

while ( <DATA> ) {
    chomp;
    # Split on ~ outside quotes; the 0 drops the enclosing quotes
    my @fields = quotewords( '~', 0, $_ );
    print Dumper( \@fields );
}

__DATA__
"Field - One"~234.00~"Field ~ 3"~20040830~"Field 5"
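
The second argument to quotewords is the keep flag. Since the original question shows the text fields with their quotes still attached, the sketch below compares both settings. It uses only the quotewords function from the example above; the wrapper name split_tilde is made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords;

# Hypothetical wrapper: a true $keep retains the enclosing quotes,
# a false $keep strips them.
sub split_tilde {
    my ($line, $keep) = @_;
    return quotewords('~', $keep, $line);
}

my $line = '"Field - One"~234.00~"Field ~ 3"~20040830~"Field 5"';
print join(' | ', split_tilde($line, 1)), "\n";   # quotes kept
print join(' | ', split_tilde($line, 0)), "\n";   # quotes stripped
```

One caveat: Text::ParseWords also treats single quotes as quoting characters and backslashes as escapes, so data containing either may need a stricter parser.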
 
Vaughn Sargent

Thank you very much. I installed the Text::CSV_XS module and it works
just as I needed it to. Now a ~ inside a text field is no longer seen
as a delimiter, because text fields are double quoted. Also, I have
the option to change the delimiter to ~ instead of the default comma.

Thanks again!
Vaughn
 
