Need help with pattern matching/substitution.

Discussion in 'Perl Misc' started by Hemant Shah, Sep 26, 2006.

  1. Hemant Shah

    Hemant Shah Guest

    Folks,

    I have a script that reads data from a file. Each line in a file is comma
    seperated list of values.

    Example:

    DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

    I want to check if any value contains a quote and add another quote to it.

    Example:

    DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN''T DELETE '


    How to I formulate a pattern for substitution?

    Thanks.

    --
    Hemant Shah /"\ ASCII ribbon campaign
    E-mail: \ / ---------------------
    X against HTML mail
    TO REPLY, REMOVE NoJunkMail / \ and postings
    FROM MY E-MAIL ADDRESS.
    -----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
    I haven't lost my mind, Above opinions are mine only.
    it's backed up on tape somewhere. Others can have their own.
     
    Hemant Shah, Sep 26, 2006
    #1
    1. Advertising

  2. Hemant Shah

    Peter Scott Guest

    On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
    > I have a script that reads data from a file. Each line in a file is comma
    > seperated list of values.
    >
    > Example:
    >
    > DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    > DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '


    That isn't valid in any CSV format I know. To keep us from guessing,
    please post the specification for the syntax of the lines you are parsing.

    --
    Peter Scott
    http://www.perlmedic.com/
    http://www.perldebugged.com/
     
    Peter Scott, Sep 27, 2006
    #2
    1. Advertising

  3. Hemant Shah

    Hemant Shah Guest

    While stranded on information super highway Peter Scott wrote:
    > On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
    >> I have a script that reads data from a file. Each line in a file is comma
    >> seperated list of values.
    >>
    >> Example:
    >>
    >> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    >> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

    >
    > That isn't valid in any CSV format I know. To keep us from guessing,
    > please post the specification for the syntax of the lines you are parsing.


    Yes, this is not a valid CSV format. The format is:

    TableName|CSV format for data values.


    >
    > --
    > Peter Scott
    > http://www.perlmedic.com/
    > http://www.perldebugged.com/
    >


    --
    Hemant Shah /"\ ASCII ribbon campaign
    E-mail: \ / ---------------------
    X against HTML mail
    TO REPLY, REMOVE NoJunkMail / \ and postings
    FROM MY E-MAIL ADDRESS.
    -----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
    I haven't lost my mind, Above opinions are mine only.
    it's backed up on tape somewhere. Others can have their own.
     
    Hemant Shah, Sep 27, 2006
    #3
  4. Hemant Shah

    Dr.Ruud Guest

    Hemant Shah schreef:

    > I have a script that reads data from a file. Each line in a file is
    > comma seperated list of values.
    >
    > Example:
    >
    > DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    > DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
    >
    > I want to check if any value contains a quote and add another quote
    > to it.


    That is only possible if you can make several assumptions, from the
    structure of each record.
    Are they fixed length?

    An error prone approach:

    #!/usr/bin/perl
    use warnings ;
    use strict ;

    while ( <DATA> )
    {
    chomp ;

    # replace '...'<comma> by "..."<newline>
    s/'(.*?)',/"$1"\n/g ;

    # special treatment of the last field
    s/(?:\n|,)'(.*)'$/\n"$1"/ ;

    # double any quotes
    s/'/''/g ;

    # change all <newline>s back to commas
    s/\n/,/g ;

    print "$_\n" ;
    }

    __DATA__
    DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
    DWH_TRRM|'002','A000855737',146,'02-20-2006',10,'CAN'T DELETE '

    This prints:
    DWH_TRRM|"001","A000855747",100,"02-20-2006","CAN DELETE "
    DWH_TRRM|"002","A000855737",146,"02-20-2006","CAN''T DELETE "
    DWH_TRRM|"002","A000855737",146,"02-20-2006",10,"CAN''T DELETE "

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 28, 2006
    #4
  5. Hemant Shah

    Peter Scott Guest

    On Wed, 27 Sep 2006 18:06:56 +0000, Hemant Shah wrote:
    > While stranded on information super highway Peter Scott wrote:
    >> On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
    >>> I have a script that reads data from a file. Each line in a file is comma
    >>> seperated list of values.
    >>>
    >>> Example:
    >>>
    >>> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    >>> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '

    >>
    >> That isn't valid in any CSV format I know. To keep us from guessing,
    >> please post the specification for the syntax of the lines you are parsing.

    >
    > Yes, this is not a valid CSV format. The format is:
    >
    > TableName|CSV format for data values.


    It still isn't valid. The fields are single quoted yet the last field in
    the second record has an unescaped single quote in it. That doesn't
    correspond to any CSV format I know of and is ambiguous. Without more
    information limiting the syntax, Abigail is right: your problem is
    unsolvable.

    --
    Peter Scott
    http://www.perlmedic.com/
    http://www.perldebugged.com/
     
    Peter Scott, Sep 28, 2006
    #5
  6. Hemant Shah

    Ted Zlatanov Guest

    On 28 Sep 2006, wrote:

    On Wed, 27 Sep 2006 18:06:56 +0000, Hemant Shah wrote:
    > While stranded on information super highway Peter Scott wrote:
    >> On Tue, 26 Sep 2006 22:31:57 +0000, Hemant Shah wrote:
    >>>> I have a script that reads data from a file. Each line in a file is comma
    >>>> seperated list of values.
    >>>>
    >>>> Example:
    >>>>
    >>>> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    >>>> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
    >>>
    >>> That isn't valid in any CSV format I know. To keep us from guessing,
    >>> please post the specification for the syntax of the lines you are parsing.

    >>
    >> Yes, this is not a valid CSV format. The format is:
    >>
    >> TableName|CSV format for data values.

    >
    > It still isn't valid. The fields are single quoted yet the last field in
    > the second record has an unescaped single quote in it. That doesn't
    > correspond to any CSV format I know of and is ambiguous. Without more
    > information limiting the syntax, Abigail is right: your problem is
    > unsolvable.


    I think with the (maybe reasonable, maybe not) assumption that
    internal quotes will never have a comma next to them, it's solvable.
    I agree that as it stands, the format is not parseable.

    The OP may want to look at the producers of that data and see if they
    can be fixed to produce real CSV at the source. It will be a lot
    easier than fixing it after the damage to the data has been done.

    Ted
     
    Ted Zlatanov, Sep 28, 2006
    #6
  7. Hemant Shah

    Hemant Shah Guest

    While stranded on information super highway Dr.Ruud wrote:
    > Hemant Shah schreef:
    >
    >> I have a script that reads data from a file. Each line in a file is
    >> comma seperated list of values.
    >>
    >> Example:
    >>
    >> DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    >> DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
    >>
    >> I want to check if any value contains a quote and add another quote
    >> to it.

    >
    > That is only possible if you can make several assumptions, from the
    > structure of each record.
    > Are they fixed length?


    No, each line may be of different length, depending on number of columns in
    the table, also the string with quote can also be in any position in the
    list.

    The problem is that this file is generated from a COBOL program that reads
    EBCDIC data and generated ASCII file. My perl script has to read the file
    and dump data into DB2 tables. There is not much string manipulation they can
    do in COBOL so I have to do it in perl.

    It was decided that the strings will never have a '?' so the COBOL program
    now uses '?' instead of a quote in the data and my perl script replaces all
    '?' with two quotes.


    >
    > An error prone approach:
    >
    > #!/usr/bin/perl
    > use warnings ;
    > use strict ;
    >
    > while ( <DATA> )
    > {
    > chomp ;
    >
    > # replace '...'<comma> by "..."<newline>
    > s/'(.*?)',/"$1"\n/g ;
    >
    > # special treatment of the last field
    > s/(?:\n|,)'(.*)'$/\n"$1"/ ;
    >
    > # double any quotes
    > s/'/''/g ;
    >
    > # change all <newline>s back to commas
    > s/\n/,/g ;
    >
    > print "$_\n" ;
    > }
    >
    > __DATA__
    > DWH_TRRM|'001','A000855747',100,'02-20-2006','CAN DELETE '
    > DWH_TRRM|'002','A000855737',146,'02-20-2006','CAN'T DELETE '
    > DWH_TRRM|'002','A000855737',146,'02-20-2006',10,'CAN'T DELETE '
    >
    > This prints:
    > DWH_TRRM|"001","A000855747",100,"02-20-2006","CAN DELETE "
    > DWH_TRRM|"002","A000855737",146,"02-20-2006","CAN''T DELETE "
    > DWH_TRRM|"002","A000855737",146,"02-20-2006",10,"CAN''T DELETE "
    >
    > --
    > Affijn, Ruud
    >
    > "Gewoon is een tijger."
    >
    >


    --
    Hemant Shah /"\ ASCII ribbon campaign
    E-mail: \ / ---------------------
    X against HTML mail
    TO REPLY, REMOVE NoJunkMail / \ and postings
    FROM MY E-MAIL ADDRESS.
    -----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
    I haven't lost my mind, Above opinions are mine only.
    it's backed up on tape somewhere. Others can have their own.
     
    Hemant Shah, Sep 28, 2006
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Vibha Tripathi

    Regular Expression for pattern substitution

    Vibha Tripathi, Jul 1, 2005, in forum: Python
    Replies:
    3
    Views:
    323
    Devan L
    Jul 1, 2005
  2. Marc Bissonnette

    Pattern matching : not matching problem

    Marc Bissonnette, Jan 8, 2004, in forum: Perl Misc
    Replies:
    9
    Views:
    251
    Marc Bissonnette
    Jan 13, 2004
  3. Arturi
    Replies:
    8
    Views:
    131
    Anno Siegel
    Apr 5, 2005
  4. Bobby Chamness
    Replies:
    2
    Views:
    246
    Xicheng Jia
    May 3, 2007
  5. pbd22

    Need REGEX pattern matching help

    pbd22, Feb 20, 2008, in forum: Javascript
    Replies:
    2
    Views:
    111
    Dr J R Stockton
    Feb 21, 2008
Loading...

Share This Page