Substituting in a group

aquadoll

(Duplicate copy - not sure if the previous msg got posted !!)

Hello,
I have lines of the following kind:

ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

I could not think of any one-liner, so I tried the following:
(Assuming I am reading each line in a variable called $Entry)

if($Entry =~ /"[A-Za-z\s]*(,[A-Za-z\s]*)+"/)
{
   my $TempEntry=$Entry;
   $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;
   # Change comma to # in this phrase
   $TempEntry =~ s/,/#/g;
   print "TempEntry=$TempEntry\n";
   # Now replace the original phrase with this phrase in the original entry
   $Entry =~ s/"[A-Za-z\s]*(,[A-Za-z\s]*)+"/$TempEntry/;
   print "New Entry=$Entry\n";
}


The above does not work - for some reason all commas get transformed
into # for the first two lines. Where is the problem?

Also, is there a not-so-cryptic one-liner for this one?

Thanks.
 
patrick

You might try
perl -F'"' -lane '$F[0] =~ s/"//; $F[1] =~ s/"//;$F[1] =~ s/,/#/;print
@F' in.txt > out.txt

Patrick
 
Paul Lalli

aquadoll said:
[snip]
if($Entry =~ /"[A-Za-z\s]*(,[A-Za-z\s]*)+"/)
{
   my $TempEntry=$Entry;
   $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;

This gets rid of all the quotes in the TempEntry.

aquadoll said:
   # Change comma to # in this phrase
   $TempEntry =~ s/,/#/g;

This changes ALL commas in the entire entry, not just the commas that
were originally part of the quoted material.

aquadoll said:
   print "TempEntry=$TempEntry\n";
   # Now replace the original phrase with this phrase in the original entry
   $Entry =~ s/"[A-Za-z\s]*(,[A-Za-z\s]*)+"/$TempEntry/;
   print "New Entry=$Entry\n";
}

The above does not work - for some reason all commas get transformed
into # for the first two lines. Where is the problem?

$TempEntry is the whole line, not just the part of $Entry you cared
about.

# First obtain the grouped items substring (quotes included)
my ($group) = ($TempEntry =~ /("[^"]+?")/);
# Create a copy of the group string to modify:
my $mod_group = $group;
# Replace all commas in the group with #
$mod_group =~ tr/,/#/;
# Remove the quotes from the group:
$mod_group =~ s/^"|"$//g;
# Replace the original group with the modified group in the original entry
# (\Q...\E makes the group text match literally)
$TempEntry =~ s/\Q$group\E/$mod_group/;


Hope that helps,
Paul Lalli
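
For comparison, here is a minimal sketch of a different technique from the step-by-step code above: a single substitution with the /e modifier, so the replacement is computed from the captured group. The sub name hash_groups is made up for illustration, and the sketch assumes fields never contain escaped or nested quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every double-quoted group in one line.
sub hash_groups {
    my ($line) = @_;
    # /e evaluates the replacement as Perl code; /g repeats for each group.
    # Inside it, copy the captured text, turn its commas into #, and
    # return the copy (without the surrounding quotes).
    $line =~ s{"([^"]*)"}{ (my $group = $1) =~ tr/,/#/; $group }ge;
    return $line;
}

for my $entry (
    'ABC XXX,2231,"Math, Physics",0.45,2',
    'PQR ERR,2217,"Physics, Chemistry, Math",0.21,5',
    'ABC PQR,1213,Physics,0.5,1',
) {
    print hash_groups($entry), "\n";
}
```

Only text between a pair of quotes is touched, so commas that separate the fields stay as they are.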
 
Dave B

aquadoll said:
ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1
[snip]
Also, is there a not-so-cryptic one-liner for this one?

I'm a beginner in perl, so please forgive any naivety. This oneliner seems
to work:

$ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}' file
ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

This assumes that the text between double quotes (the part that is matched
in the first place) does not appear elsewhere before the double quotes, and
assumes that it's the only text in double quotes in the line.
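
A further assumption worth spelling out: the matched text is reused as a pattern, so any regex metacharacters in it would be read as regex syntax. A small sketch of the same idea with \Q...\E to force a literal match; the sample line containing "C++" is invented for illustration and is not from the original data.

```perl
use strict;
use warnings;

# Hypothetical sample: "C++" contains regex metacharacters.
my $line = 'ABC XXX,2231,"C++, Physics",0.45,2';

if ( $line =~ s/"([^"]*)"/$1/ ) {
    my $old = $1;                     # text that was between the quotes
    ( my $new = $old ) =~ tr/,/#/;    # copy with commas turned into #
    # \Q...\E quotes $old so it is matched literally, not as a regex.
    $line =~ s/\Q$old\E/$new/;
}

print "$line\n";
```

This still relies on the quoted text not appearing earlier in the line, as noted above.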
 
aquadoll

Hello,
Thanks for all the replies. I was actually trying to get the part of
$Entry I am interested in, in $TempEntry.
I used the following 2 lines (as shown in the OP):
$TempEntry=$Entry
$TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;

Why did the above not put "the part of $Entry I am interested in"
into $TempEntry? What did I do wrong?
Thanks.
 
John W. Krahn

patrick said:
I am having the following kind of lines:

ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

You might try
perl -F'"' -lane '$F[0] =~ s/"//; $F[1] =~ s/"//;$F[1] =~ s/,/#/;print
@F' in.txt > out.txt

split() *removes* the expression you are splitting on so there are no
'"' characters in @F to remove so that could be simplified to:

perl -F'"' -lane '$F[1] =~ s/,/#/;print @F' in.txt > out.txt

But that only changes the first ',' to a '#' and not all of them so you
probably want this instead:

perl -F'"' -lane '$F[1] =~ s/,/#/g;print @F' in.txt > out.txt

Or:

perl -F'"' -lane '$F[1] =~ tr/,/#/;print @F' in.txt > out.txt



John
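
The points above can be checked in a short script. This is only a sketch: it calls split directly instead of relying on -a and -F, and the variable names other than @F are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'PQR ERR,2217,"Physics, Chemistry, Math",0.21,5';

# Roughly what -a with -F'"' does: split the line on the quote character.
my @F = split /"/, $line;
print scalar(@F), " fields, none containing a quote\n";

# Three ways to edit the middle field, each applied to a fresh copy.
( my $first_only = $F[1] ) =~ s/,/#/;     # s/// without /g: first comma only
( my $all_commas = $F[1] ) =~ s/,/#/g;    # /g: every comma
( my $with_tr    = $F[1] ) =~ tr/,/#/;    # tr: every comma, no regex involved

print "$first_only\n$all_commas\n$with_tr\n";

# Joining with the empty string mimics "print @F" in the one-liner.
my $result = join '', $F[0], $all_commas, $F[2];
print "$result\n";
```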
 
Willem

aquadoll wrote:
) ABC XXX,2231,"Math, Physics",0.45,2
) PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
) ABC PQR,1213,Physics,0.5,1
)
) I want to detect when there are groups of subjects in the 3rd column,
) remove the quotes in those cases and replace the comma by # inside the
) groups. So, the above lines would be transformed to:
)
) ABC XXX,2231,Math# Physics,0.45,2
) PQR ERR,2217,Physics# Chemistry# Math,0.21,5
) ABC PQR,1213,Physics,0.5,1
)
) I could not think of any one-liner, so I tried the following:
) (Assuming I am reading each line in a variable called $Entry)

How about:

while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Which should do what you want, even for multiple quoted strings.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Willem

aquadoll wrote:
) Hello,
) Thanks for all the replies. I was actually trying to get the part of
) $Entry I am interested in, in $TempEntry.
) I used the following 2 lines (as shown in the OP):
) $TempEntry=$Entry
) $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;
)
) Why did the above not put "the part of $Entry I am interested in"
) into $TempEntry? What did I do wrong?

It's a substitution. You substitute the quoted part with the part
between quotes. The rest remains intact.

To get just the part between quotes, use this:

my ($TempEntry) = $Entry =~ /"(.*?)"/;

Why the complicated match string by the way ?
Do you only want to match quoted strings that contain a comma ?
It seems needlessly complex.
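
To make the difference concrete, here is a small sketch that runs both forms on one of the sample lines. The variable names $copy and $group are chosen for illustration.

```perl
use strict;
use warnings;

my $Entry = 'ABC XXX,2231,"Math, Physics",0.45,2';

# Substitution on a copy: only the quotes go away, the rest of the line stays.
( my $copy = $Entry ) =~ s/"(.*?)"/$1/;

# Match in list context: returns just the captured text.
my ($group) = $Entry =~ /"(.*?)"/;

print "after s///: $copy\n";
print "from match: $group\n";
```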


SaSW, Willem
 
Willem

Willem wrote:
) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Of course,
while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
is slightly easier.


SaSW, Willem
 
aquadoll

Willem wrote:

) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Of course,
  while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
is slightly easier.

Thanks for the great discussion - I learnt a few things.
One last question: what do $- and [1] stand for in the above post in
"$-[1]-1"? Where can I find more about that in perldoc?
 
Ben Morrow

Quoth aquadoll said:
One last question: what do $- and [1] stand for in the above post in
"$-[1]-1"? Where can I find more about that in perldoc?

$-[1] is the second element of the array @-, which is documented in
perlvar. The syntax is exactly the same as for $a[1] or any other array.

Ben
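
A short sketch of what @- and @+ hold after a successful match, using one of the sample lines from this thread. The variable names are made up for illustration; element 0 describes the whole match and element 1 the first capture group.

```perl
use strict;
use warnings;

my $line = 'ABC XXX,2231,"Math, Physics",0.45,2';
my ( $match_start, $match_end, $group_start, $group_end, $group_text );

if ( $line =~ /"(.*?)"/ ) {
    # Copy the offsets right away; the next successful match overwrites them.
    ( $match_start, $match_end ) = ( $-[0], $+[0] );    # whole match
    ( $group_start, $group_end ) = ( $-[1], $+[1] );    # first capture group

    # The same text as $1, recovered from the offsets alone.
    $group_text = substr( $line, $group_start, $group_end - $group_start );

    print "whole match: $match_start..$match_end\n";
    print "group 1:     $group_start..$group_end ($group_text)\n";
}
```

The start offset in @- is the position of the first matched character, and the end offset in @+ is one past the last matched character, which is why the difference of the two gives the length.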
 
aquadoll

Willem wrote:

) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Of course,
  while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
is slightly easier.

Found it here: http://perldoc.perl.org/perlvar.html#@-
 
John W. Krahn

Willem said:
Willem wrote:
) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Of course,
while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
is slightly easier.

Or use a statement modifier:

substr( $_, $-[1] - 1, $+[1] - $-[1] ) =~ s/,/#/g while s/"(.*?)"/$1/;

Or maybe:

/".*?"/ and substr( $_, $-[0], $+[0] - $-[0] ) =~ tr/,"/#/d;

And avoid using the substitution operator or capturing parentheses.


John
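
The last form above can be wrapped in a loop so that it also handles several quoted groups per line. A minimal sketch, assuming quotes are always paired; the sub name hash_all_groups and the second test line are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper built around the match-then-tr idea.
sub hash_all_groups {
    my ($line) = @_;
    # Each pass finds the first remaining quoted group. Within that span,
    # tr turns commas into # and, because of /d, deletes the quotes.
    while ( $line =~ /".*?"/ ) {
        substr( $line, $-[0], $+[0] - $-[0] ) =~ tr/,"/#/d;
    }
    return $line;
}

print hash_all_groups('ABC XXX,2231,"Math, Physics",0.45,2'), "\n";
print hash_all_groups('a,"b,c",d,"e,f"'), "\n";
```

The loop ends once no pair of quotes is left, since every pass removes the pair it matched.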
 
Brad Baxter

Also, is there a not-so-cryptic one-liner for this one?

I have to wonder if you're saying that you're not familiar
with tools for parsing CSV-like data.

perl -MText::ParseWords=parse_line -ne'@a=parse_line(",",0,$_);
$a[2]=~tr/,/#/;print join",",@a' infile
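
The same idea written out as a script, as a sketch only. Text::ParseWords ships with Perl; the sub name hash_third_field is made up, and the code assumes the subject group is always the third field. Note that parse_line treats single quotes as quoting characters too, which may matter for data containing apostrophes.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: parse one CSV-like line and rewrite its third field.
sub hash_third_field {
    my ($line) = @_;
    chomp $line;
    # Second argument 0 means "do not keep" the quotes in the fields.
    my @fields = parse_line( ',', 0, $line );
    $fields[2] =~ tr/,/#/ if defined $fields[2];
    return join ',', @fields;
}

while ( my $line = <DATA> ) {
    print hash_third_field($line), "\n";
}

__DATA__
ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1
```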
 
Martijn Lievaart

aquadoll said:
ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1
[snip]
Also, is there a not-so-cryptic one-liner for this one?

Dave B said:
I'm a beginner in perl, so please forgive any naivety. This oneliner
seems to work:

$ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}' file
ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

This assumes that the text between double quotes (the part that is
matched in the first place) does not appear elsewhere before the double
quotes, and assumes that it's the only text in double quotes in the
line.

Very good! Now change it slightly (untested):

perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'

The first substitution is guaranteed to match the same string as the
match, so the assumption is not relevant anymore.

M4
 
Dave B

Martijn said:
$ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}'
[snip]

Very good! Now change it slightly (untested):

perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'

The first substitution is guaranteed to match the same string as the
match, so the assumption is not relevant anymore.

This avoids the use of an extra variable, but on the other hand does not
remove the double quotes if there are no commas in the matched string (which
should not happen anyway, according to the OP).
Thanks!
 
Dave B

Maybe this is the safest:

perl -pe 'if (/"([^"]*)"/) {$n=$1; $n=~s/,/#/g; s/"([^"]*)"/$n/;}'
 
