Substituting in a group

Discussion in 'Perl Misc' started by aquadoll, Jun 20, 2008.

  1. aquadoll

    aquadoll Guest

    (Duplicate copy - not sure if the previous msg got posted !!)

    Hello,
    I am having the following kind of lines:

    ABC XXX,2231,"Math, Physics",0.45,2
    PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
    ABC PQR,1213,Physics,0.5,1

    I want to detect when there are groups of subjects in the 3rd column,
    remove the quotes in those cases and replace the comma by # inside the
    groups. So, the above lines would be transformed to:

    ABC XXX,2231,Math# Physics,0.45,2
    PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    ABC PQR,1213,Physics,0.5,1

    I could not think of any one-liner, so I tried the following:
    (Assuming I am reading each line in a variable called $Entry)

    if($Entry =~ /"[A-Za-z\s]*(,[A-Za-z\s]*)+"/)
    {
    my $TempEntry=$Entry;
    $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;
    # Change comma to # in this phrase
    $TempEntry =~ s/,/#/g;
    print "TempEntry=$TempEntry\n";
    # Now replace the original phrase with this phrase in the original entry
    $Entry =~ s/"[A-Za-z\s]*(,[A-Za-z\s]*)+"/$TempEntry/;
    print "New Entry=$Entry\n";
    }


    The above does not work - for some reason all commas get transformed
    into # for the first two lines. Where is the problem?

    Also, is there a not-so-cryptic one-liner for this one?

    Thanks.
     
    aquadoll, Jun 20, 2008
    #1

  2. aquadoll

    patrick Guest

    On Jun 20, 8:51 am, aquadoll <> wrote:
    > Also, is there a not-so-cryptic one-liner for this one?
    >
    > Thanks.


    You might try
    perl -F'"' -lane '$F[0] =~ s/"//; $F[1] =~ s/"//;$F[1] =~ s/,/#/;print
    @F' in.txt > out.txt

    Patrick
     
    patrick, Jun 20, 2008
    #2

  3. aquadoll

    Paul Lalli Guest

    On Jun 20, 11:51 am, aquadoll <> wrote:
    > (Duplicate copy - not sure if the previous msg got posted !!)
    >
    > Hello,
    > I am having the following kind of lines:
    >
    > ABC XXX,2231,"Math, Physics",0.45,2
    > PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
    > ABC PQR,1213,Physics,0.5,1
    >
    > I want to detect when there are groups of subjects in the 3rd column,
    > remove the quotes in those cases and replace the comma by # inside the
    > groups. So, the above lines would be transformed to:
    >
    > ABC XXX,2231,Math# Physics,0.45,2
    > PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    > ABC PQR,1213,Physics,0.5,1



    > I could not think of any one-liner, so I tried the following:
    > (Assuming I am reading each line in a variable called $Entry)
    >
    > if($Entry =~ /"[A-Za-z\s]*(,[A-Za-z\s]*)+"/)
    > {
    >    my $TempEntry=$Entry;
    >    $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;


    This gets rid of all the quotes in the TempEntry.

    >    # Change comma to # in this phrase
    >    $TempEntry =~ s/,/#/g;


    This changes ALL commas in the entire entry, not just the commas that
    were originally part of the quoted material.

    >    print "TempEntry=$TempEntry\n";
    >    # Now replace the original phrase with this phrase in the original entry
    >    $Entry =~ s/"[A-Za-z\s]*(,[A-Za-z\s]*)+"/$TempEntry/;
    >    print "New Entry=$Entry\n";
    >
    > }
    >
    > The above does not work - for some reason all commas get transformed
    > into # for the first two lines. Where is the problem?


    $TempEntry is the whole line, not just the part of $Entry you cared
    about.

    # First obtain the grouped-items substring (quotes included)
    my ($group) = ($Entry =~ /("[^"]+?")/);
    # Create a copy of the group string to modify
    my $mod_group = $group;
    # Replace every comma in the group with '#'
    $mod_group =~ tr/,/#/;
    # Remove the surrounding quotes from the group
    $mod_group =~ s/^"|"$//g;
    # Replace the original group with the modified group in the original entry
    $Entry =~ s/\Q$group\E/$mod_group/;
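
    A minimal, self-contained sketch of the extract, modify, replace approach described above. The helper name fix_group is hypothetical (it does not appear in the thread), and the sketch assumes at most one quoted group per line:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): rewrite the first quoted
# group in a line, turning its commas into '#' and dropping the quotes.
sub fix_group {
    my ($entry) = @_;

    # Capture the first quoted group, quotes included.
    # Lines without a quoted group are returned unchanged.
    my ($group) = $entry =~ /("[^"]+")/ or return $entry;

    my $mod_group = $group;
    $mod_group =~ tr/,/#/;       # commas become '#'
    $mod_group =~ s/^"|"$//g;    # strip the surrounding quotes

    # \Q...\E makes sure the group is matched literally, not as a regex.
    $entry =~ s/\Q$group\E/$mod_group/;
    return $entry;
}

print fix_group($_), "\n" for
    'ABC XXX,2231,"Math, Physics",0.45,2',
    'ABC PQR,1213,Physics,0.5,1';
```

    Only the copy of the quoted group is modified, so the commas that separate the columns are left alone.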


    Hope that helps,
    Paul Lalli
     
    Paul Lalli, Jun 20, 2008
    #3
  4. aquadoll

    Dave B Guest

    aquadoll wrote:

    > ABC XXX,2231,"Math, Physics",0.45,2
    > PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
    > ABC PQR,1213,Physics,0.5,1
    >
    > I want to detect when there are groups of subjects in the 3rd column,
    > remove the quotes in those cases and replace the comma by # inside the
    > groups. So, the above lines would be transformed to:
    >
    > ABC XXX,2231,Math# Physics,0.45,2
    > PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    > ABC PQR,1213,Physics,0.5,1
    >[snip]
    > Also, is there a not-so-cryptic one-liner for this one?


    I'm a beginner in Perl, so please forgive any naivety. This one-liner seems
    to work:

    $ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}' file
    ABC XXX,2231,Math# Physics,0.45,2
    PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    ABC PQR,1213,Physics,0.5,1

    This assumes that the text between double quotes (the part that is matched
    in the first place) does not appear elsewhere before the double quotes, and
    assumes that it's the only text in double quotes in the line.

    --
    D.
     
    Dave B, Jun 20, 2008
    #4
  5. aquadoll

    aquadoll Guest

    Hello,
    Thanks for all the replies. I was actually trying to get the part of
    $Entry I am interested in, in $TempEntry.
    I used the following 2 lines (as shown in the OP):
    $TempEntry=$Entry;
    $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;

    Why did the above not put "the part of $Entry I am interested in"
    into $TempEntry? What did I do wrong?
    Thanks.
     
    aquadoll, Jun 20, 2008
    #5
  6. John W. Krahn

    patrick wrote:
    > On Jun 20, 8:51 am, aquadoll <> wrote:
    >>
    >> I am having the following kind of lines:
    >>
    >> ABC XXX,2231,"Math, Physics",0.45,2
    >> PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
    >> ABC PQR,1213,Physics,0.5,1
    >>
    >> I want to detect when there are groups of subjects in the 3rd column,
    >> remove the quotes in those cases and replace the comma by # inside the
    >> groups. So, the above lines would be transformed to:
    >>
    >> ABC XXX,2231,Math# Physics,0.45,2
    >> PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    >> ABC PQR,1213,Physics,0.5,1

    >
    > You might try
    > perl -F'"' -lane '$F[0] =~ s/"//; $F[1] =~ s/"//;$F[1] =~ s/,/#/;print
    > @F' in.txt > out.txt


    split() *removes* the expression you are splitting on, so there are no
    '"' characters left in @F to remove. That could be simplified to:

    perl -F'"' -lane '$F[1] =~ s/,/#/;print @F' in.txt > out.txt

    But that only changes the first ',' to a '#' and not all of them, so you
    probably want this instead:

    perl -F'"' -lane '$F[1] =~ s/,/#/g;print @F' in.txt > out.txt

    Or:

    perl -F'"' -lane '$F[1] =~ tr/,/#/;print @F' in.txt > out.txt
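
    The point about split() can be checked with a short script. This is only a sketch; the helper name unquote_by_split is hypothetical, and it assumes a line has at most one quoted field:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): mimic what -F'"' -a does.
sub unquote_by_split {
    my ($line) = @_;

    # Splitting on '"' discards the quote characters themselves, so for
    # a line with one quoted field, $F[1] is the text between the quotes.
    my @F = split /"/, $line;

    # Only touch the field that was inside quotes, if there is one.
    $F[1] =~ tr/,/#/ if @F > 1;

    # Joining with the empty string is what 'print @F' effectively does.
    return join '', @F;
}

print unquote_by_split('ABC XXX,2231,"Math, Physics",0.45,2'), "\n";
```

    Lines without quotes produce a single-element @F and pass through unchanged.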



    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
     
    John W. Krahn, Jun 20, 2008
    #6
  7. aquadoll

    Willem Guest

    aquadoll wrote:
    ) ABC XXX,2231,"Math, Physics",0.45,2
    ) PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
    ) ABC PQR,1213,Physics,0.5,1
    )
    ) I want to detect when there are groups of subjects in the 3rd column,
    ) remove the quotes in those cases and replace the comma by # inside the
    ) groups. So, the above lines would be transformed to:
    )
    ) ABC XXX,2231,Math# Physics,0.45,2
    ) PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    ) ABC PQR,1213,Physics,0.5,1
    )
    ) I could not think of any one-liner, so I tried the following:
    ) (Assuming I am reading each line in a variable called $Entry)

    How about:

    while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

    Which should do what you want, even for multiple quoted strings.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Jun 20, 2008
    #7
  8. aquadoll

    Willem Guest

    aquadoll wrote:
    ) Hello,
    ) Thanks for all the replies. I was actually trying to get the part of
    ) $Entry I am interested in, in $TempEntry.
    ) I used the following 2 lines (as shown in the OP):
    ) $TempEntry=$Entry;
    ) $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;
    )
    ) Why did the above not put "the part of $Entry I am interested in"
    ) into $TempEntry? What did I do wrong?

    It's a substitution. You substitute the quoted part with the part
    between quotes. The rest remains intact.

    To get just the part between quotes, use this:

    my ($TempEntry) = $Entry =~ /"(.*?)"/;

    Why the complicated match string by the way ?
    Do you only want to match quoted strings that contain a comma ?
    It seems needlessly complex.
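
    The difference between the two operations can be seen side by side. A small sketch, using one of the sample lines from the original post; the names $copy and $inner are illustrative only:

```perl
use strict;
use warnings;

my $Entry = 'ABC XXX,2231,"Math, Physics",0.45,2';

# Substitution: edits the string in place. The copy is still the
# whole line, just without the quotes.
my $copy = $Entry;
$copy =~ s/"(.*?)"/$1/;

# Match in list context: returns the captured text, so $inner holds
# only the part that was between the quotes.
my ($inner) = $Entry =~ /"(.*?)"/;

print "copy  = $copy\n";
print "inner = $inner\n";
```

    The parentheses around $inner matter: without them the match is evaluated in scalar context and returns only a true or false value.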


    SaSW, Willem
     
    Willem, Jun 20, 2008
    #8
  9. aquadoll

    Willem Guest

    Willem wrote:
    ) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

    Of course,
    while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
    is slightly easier.


    SaSW, Willem
     
    Willem, Jun 20, 2008
    #9
  10. aquadoll

    aquadoll Guest

    On Jun 20, 12:01 pm, Willem <> wrote:
    > Willem wrote:
    >
    > ) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }
    >
    > Of course,
    >   while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
    > is slightly easier.


    Thanks for the great discussion - I learnt a few things.
    One last question: what do $ and [1] stand for in "$-[1]-1" in the
    above post? Where can I find more about that in perldoc?
     
    aquadoll, Jun 20, 2008
    #10
  11. aquadoll

    Ben Morrow Guest

    Quoth aquadoll <>:
    >
    > One last question: what do $ and [1] stand for in "$-[1]-1" in the
    > above post? Where can I find more about that in perldoc?


    $-[1] is the second element of the array @-, which is documented in
    perlvar. The syntax is exactly the same as for $a[1] or any other array.
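
    As a small illustration of what those arrays hold: after a successful match, @- contains the start offsets and @+ the end offsets, with index 0 for the whole match and index 1 for the first capture group. The helper name group_offsets below is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): report the offsets of the
# whole match and of capture group 1 for the first quoted string.
sub group_offsets {
    my ($s) = @_;
    return unless $s =~ /"(.*?)"/;
    # $-[0], $+[0]: start and end of the whole match (quotes included)
    # $-[1], $+[1]: start and end of group 1 (text between the quotes)
    return ($-[0], $+[0], $-[1], $+[1]);
}

my @o = group_offsets('ab"cd,ef"gh');
print "@o\n";    # prints "2 9 3 8"
```

    In the substitution loop quoted earlier, the offsets still describe the string as it was before the quotes were removed, which is why 1 is subtracted from $-[1] to find where the unquoted text now starts.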

    Ben

    --
    I must not fear. Fear is the mind-killer. I will face my fear and
    I will let it pass through me. When the fear is gone there will be
    nothing. Only I will remain.
    Frank Herbert, 'Dune'
     
    Ben Morrow, Jun 20, 2008
    #11
  12. aquadoll

    aquadoll Guest


    Found it here: http://perldoc.perl.org/perlvar.html#@-
     
    aquadoll, Jun 20, 2008
    #12
  13. John W. Krahn

    Willem wrote:
    > Willem wrote:
    > ) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }
    >
    > Of course,
    > while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
    > is slightly easier.


    Or use a statement modifier:

    substr( $_, $-[1] - 1, $+[1] - $-[1] ) =~ s/,/#/g while s/"(.*?)"/$1/;

    Or maybe:

    /".*?"/ and substr( $_, $-[0], $+[0] - $-[0] ) =~ tr/,"/#/d;

    And avoid using the substitution operator or capturing parentheses.


    John
     
    John W. Krahn, Jun 20, 2008
    #13
  14. aquadoll

    Brad Baxter Guest

    On Jun 20, 11:51 am, aquadoll <> wrote:
    > Also, is there a not-so-cryptic one-liner for this one?


    I have to wonder if you're saying that you're not familiar
    with tools for parsing CSV-like data.

    perl -MText::ParseWords=parse_line -ne'@a=parse_line(",",0,$_);
    $a[2]=~tr/,/#/;print join",",@a' infile
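
    The same idea written out as a small script, for readers who find the one-liner dense. This is a sketch only; the helper name fix_csv_line is hypothetical, and it assumes the group is always in the third column:

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper (not from the thread): split a CSV-like line with
# parse_line, rewrite the third column, and join the fields again.
sub fix_csv_line {
    my ($line) = @_;
    chomp $line;

    # Second argument 0 ("keep" off) strips the quotes, while commas
    # inside a quoted field do not split it.
    my @fields = parse_line(',', 0, $line);

    # Commas that survive inside the third column become '#'.
    $fields[2] =~ tr/,/#/ if defined $fields[2];

    return join ',', @fields;
}

print fix_csv_line('PQR ERR,2217,"Physics, Chemistry, Math",0.21,5'), "\n";
```

    One caveat: parse_line also treats single quotes and backslashes specially, so data containing apostrophes may need a dedicated CSV module instead.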

    --
    Brad
     
    Brad Baxter, Jun 21, 2008
    #14
  15. Martijn Lievaart

    On Fri, 20 Jun 2008 19:20:15 +0200, Dave B wrote:

    > aquadoll wrote:
    >
    >> ABC XXX,2231,"Math, Physics",0.45,2
    >> PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
    >> ABC PQR,1213,Physics,0.5,1
    >>
    >> I want to detect when there are groups of subjects in the 3rd column,
    >> remove the quotes in those cases and replace the comma by # inside the
    >> groups. So, the above lines would be transformed to:
    >>
    >> ABC XXX,2231,Math# Physics,0.45,2
    >> PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    >> ABC PQR,1213,Physics,0.5,1
    >>[snip]
    >> Also, is there a not-so-cryptic one-liner for this one?

    >
    > I'm a beginner in perl, so please forgive any naivety. This oneliner
    > seems to work:
    >
    > $ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}' file
    > ABC XXX,2231,Math# Physics,0.45,2
    > PQR ERR,2217,Physics# Chemistry# Math,0.21,5
    > ABC PQR,1213,Physics,0.5,1
    >
    > This assumes that the text between double quotes (the part that is
    > matched in the first place) does not appear elsewhere before the double
    > quotes, and assumes that it's the only text in double quotes in the
    > line.


    Very good! Now change it slightly (untested):

    perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'

    The first substitution is guaranteed to match the same string as the
    match, so the assumption is not relevant anymore.

    M4
     
    Martijn Lievaart, Jun 21, 2008
    #15
  16. aquadoll

    Dave B Guest

    Martijn Lievaart wrote:

    >> $ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}'
    >>[snip]

    >
    > Very good! Now change it slightly (untested):
    >
    > perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'
    >
    > The first substitution is guaranteed to match the same string as the
    > match, so the assumption is not relevant anymore.


    This avoids the use of an extra variable, but on the other hand does not
    remove the double quotes if there are no commas in the matched string (which
    should not happen anyway, according to the OP).
    Thanks!

    --
    D.
     
    Dave B, Jun 21, 2008
    #16
  17. aquadoll

    Dave B Guest

    Dave B wrote:
    > Martijn Lievaart wrote:
    >
    >>> $ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}'
    >>> [snip]

    >> Very good! Now change it slightly (untested):
    >>
    >> perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'
    >>
    >> The first substitution is guaranteed to match the same string as the
    >> match, so the assumption is not relevant anymore.

    >
    > This avoids the use of an extra variable, but on the other hand does not
    > remove the double quotes if there are no commas in the matched string (which
    > should not happen anyway, according to the OP).


    Maybe this is the safest:

    perl -pe 'if (/"([^"]*)"/) {$n=$1; $n=~s/,/#/g; s/"([^"]*)"/$n/;}'
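
    Another possibility, not proposed in the thread, is to do the whole job in a single substitution by using the /e modifier, which evaluates the replacement as Perl code. A minimal sketch, with the hypothetical helper name hash_groups:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): rewrite every quoted group
# in a line in one pass.
sub hash_groups {
    my ($line) = @_;

    # For each quoted group: copy the captured text, turn its commas
    # into '#', and use the copy (without quotes) as the replacement.
    # /g handles several groups per line, /e runs the replacement as code.
    $line =~ s{"([^"]*)"}{ (my $g = $1) =~ tr/,/#/; $g }ge;

    return $line;
}

print hash_groups($_), "\n" for
    'ABC XXX,2231,"Math, Physics",0.45,2',
    'a,"x, y",b,"p, q, r"';
```

    Because the match and the replacement happen in the same operation, there is no second pattern that has to find the same text again, and the quotes are removed whether or not the group contains a comma. The equivalent one-liner would be along the lines of perl -pe 's/"([^"]*)"/(my $g = $1) =~ tr{,}{#}; $g/ge' file.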

    --
    D.
     
    Dave B, Jun 21, 2008
    #17
