Need help with pattern matching/substitution.

H

Hemant Shah

Folks,

I have a script that reads data from a file. Each line in a file is comma
seperated list of values.

Example:

DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

I want to check if any value contains a quote and add another quote to it.

Example:

DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN''T DELETE '


How to I formulate a pattern for substitution?

Thanks.

--
Hemant Shah /"\ ASCII ribbon campaign
E-mail: (e-mail address removed) \ / ---------------------
X against HTML mail
TO REPLY, REMOVE NoJunkMail / \ and postings
FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind, Above opinions are mine only.
it's backed up on tape somewhere. Others can have their own.
 
P

Peter Scott

I have a script that reads data from a file. Each line in a file is comma
seperated list of values.

Example:

DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

That isn't valid in any CSV format I know. To keep us from guessing,
please post the specification for the syntax of the lines you are parsing.
 
H

Hemant Shah

While said:
That isn't valid in any CSV format I know. To keep us from guessing,
please post the specification for the syntax of the lines you are parsing.

Yes, this is not a valid CSV format. The format is:

TableName|CSV format for data values.


--
Hemant Shah /"\ ASCII ribbon campaign
E-mail: (e-mail address removed) \ / ---------------------
X against HTML mail
TO REPLY, REMOVE NoJunkMail / \ and postings
FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind, Above opinions are mine only.
it's backed up on tape somewhere. Others can have their own.
 
D

Dr.Ruud

Hemant Shah schreef:
I have a script that reads data from a file. Each line in a file is
comma seperated list of values.

Example:

DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

I want to check if any value contains a quote and add another quote
to it.

That is only possible if you can make several assumptions, from the
structure of each record.
Are they fixed length?

An error prone approach:

#!/usr/bin/perl
use warnings ;
use strict ;

while ( <DATA> )
{
chomp ;

# replace '...'<comma> by "..."<newline>
s/'(.*?)',/"$1"\n/g ;

# special treatment of the last field
s/(?:\n|,)'(.*)'$/\n"$1"/ ;

# double any quotes
s/'/''/g ;

# change all <newline>s back to commas
s/\n/,/g ;

print "$_\n" ;
}

__DATA__
DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006',10,'CAN'T DELETE '

This prints:
DWH_TRRM|"001","A000855747",100,"02-20-2006","CAN DELETE "
DWH_TRRM|"002","A000855737",146,"02-20-2006","CAN''T DELETE "
DWH_TRRM|"002","A000855737",146,"02-20-2006",10,"CAN''T DELETE "
 
P

Peter Scott

Yes, this is not a valid CSV format. The format is:

TableName|CSV format for data values.

It still isn't valid. The fields are single quoted yet the last field in
the second record has an unescaped single quote in it. That doesn't
correspond to any CSV format I know of and is ambiguous. Without more
information limiting the syntax, Abigail is right: your problem is
unsolvable.
 
T

Ted Zlatanov

On 28 Sep 2006, (e-mail address removed) wrote:

It still isn't valid. The fields are single quoted yet the last field in
the second record has an unescaped single quote in it. That doesn't
correspond to any CSV format I know of and is ambiguous. Without more
information limiting the syntax, Abigail is right: your problem is
unsolvable.

I think with the (maybe reasonable, maybe not) assumption that
internal quotes will never have a comma next to them, it's solvable.
I agree that as it stands, the format is not parseable.

The OP may want to look at the producers of that data and see if they
can be fixed to produce real CSV at the source. It will be a lot
easier than fixing it after the damage to the data has been done.

Ted
 
H

Hemant Shah

While said:
Hemant Shah schreef:


That is only possible if you can make several assumptions, from the
structure of each record.
Are they fixed length?

No, each line may be of different length, depending on number of columns in
the table, also the string with quote can also be in any position in the
list.

The problem is that this file is generated from a COBOL program that reads
EBCDIC data and generated ASCII file. My perl script has to read the file
and dump data into DB2 tables. There is not much string manipulation they can
do in COBOL so I have to do it in perl.

It was decided that the strings will never have a '?' so the COBOL program
now uses '?' instead of a quote in the data and my perl script replaces all
'?' with two quotes.

An error prone approach:

#!/usr/bin/perl
use warnings ;
use strict ;

while ( <DATA> )
{
chomp ;

# replace '...'<comma> by "..."<newline>
s/'(.*?)',/"$1"\n/g ;

# special treatment of the last field
s/(?:\n|,)'(.*)'$/\n"$1"/ ;

# double any quotes
s/'/''/g ;

# change all <newline>s back to commas
s/\n/,/g ;

print "$_\n" ;
}

__DATA__
DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
DWH_TRRM|'002','A000855737',146,'02-20-2006',10,'CAN'T DELETE '

This prints:
DWH_TRRM|"001","A000855747",100,"02-20-2006","CAN DELETE "
DWH_TRRM|"002","A000855737",146,"02-20-2006","CAN''T DELETE "
DWH_TRRM|"002","A000855737",146,"02-20-2006",10,"CAN''T DELETE "

--
Hemant Shah /"\ ASCII ribbon campaign
E-mail: (e-mail address removed) \ / ---------------------
X against HTML mail
TO REPLY, REMOVE NoJunkMail / \ and postings
FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind, Above opinions are mine only.
it's backed up on tape somewhere. Others can have their own.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top