Substituting in a group

aquadoll · Jun 20, 2008

(Duplicate copy - not sure if the previous msg got posted !!)

Hello,
I have lines like the following:

ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

I could not think of any one-liner, so I tried the following:
(Assuming I am reading each line in a variable called $Entry)

if ($Entry =~ /"[A-Za-z\s]*(,[A-Za-z\s]*)+"/)
{
    my $TempEntry = $Entry;
    $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;
    # Change comma to # in this phrase
    $TempEntry =~ s/,/#/g;
    print "TempEntry=$TempEntry\n";
    # Now replace the original phrase with this phrase in the original entry
    $Entry =~ s/"[A-Za-z\s]*(,[A-Za-z\s]*)+"/$TempEntry/;
    print "New Entry=$Entry\n";
}

The above does not work - for some reason all commas get transformed
into # for the first two lines. Where is the problem?

Also, is there a not-so-cryptic one-liner for this one?

Thanks.

patrick · Jun 20, 2008

You might try

perl -F'"' -lane '$F[0] =~ s/"//; $F[1] =~ s/"//; $F[1] =~ s/,/#/; print @F' in.txt > out.txt

Patrick

Paul Lalli · Jun 20, 2008

I could not think of any one-liner, so I tried the following:
(Assuming I am reading each line in a variable called $Entry)

if($Entry =~ /"[A-Za-z\s]*(,[A-Za-z\s]*)+"/)
{
my $TempEntry=$Entry;
$TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;

This gets rid of all the quotes in the TempEntry.

# Change comma to # in this phrase
$TempEntry =~ s/,/#/g;

This changes ALL commas in the entire entry, not just the commas that
were originally part of the quoted material.

print "TempEntry=$TempEntry\n";
# Now replace the original phrase with this phrase in the original entry
$Entry =~ s/"[A-Za-z\s]*(,[A-Za-z\s]*)+"/$TempEntry/;
print "New Entry=$Entry\n";

}

The above does not work - for some reason all commas get transformed
into # for the first two lines. Where is the problem?

$TempEntry is the whole line, not just the part of $Entry you cared
about.

#First obtain the grouped items substring
my ($group) = ($TempEntry =~ /("[^"]+?")/);
#Create a copy of the group string to modify:
my $mod_group = $group;
#Replace all commas in the group with #
$mod_group =~ tr/,/#/;
#Remove the quotes from the group:
$mod_group =~ s/^"|"$//g;
#Replace the original group with the modified group in the original Entry
#(\Q...\E makes the group match literally rather than as a pattern)
$TempEntry =~ s/\Q$group\E/$mod_group/;
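For anyone who wants to run this end to end, here is a minimal sketch of the same steps applied to the three sample lines. The helper name fix_entry is made up for illustration, and it assumes at most one quoted group per line.

```perl
use strict;
use warnings;

# Sample lines from the original question.
my @lines = (
    'ABC XXX,2231,"Math, Physics",0.45,2',
    'PQR ERR,2217,"Physics, Chemistry, Math",0.21,5',
    'ABC PQR,1213,Physics,0.5,1',
);

# fix_entry is a hypothetical helper name, not something from the thread.
sub fix_entry {
    my ($entry) = @_;
    # Grab the first quoted group, quotes included; leave other lines alone.
    my ($group) = $entry =~ /("[^"]+")/ or return $entry;
    (my $mod_group = $group) =~ tr/,/#/;   # commas become # inside the group only
    $mod_group =~ s/^"|"$//g;              # strip the surrounding quotes
    $entry =~ s/\Q$group\E/$mod_group/;    # \Q...\E: match the group literally
    return $entry;
}

my @fixed = map { fix_entry($_) } @lines;
print "$_\n" for @fixed;
```

Only the commas inside the quoted group are touched, because the tr runs on the copy of the group and never on the whole line.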

Hope that helps,
Paul Lalli

Dave B · Jun 20, 2008

aquadoll said:
ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1
[snip]
Also, is there a not-so-cryptic one-liner for this one?

I'm a beginner in perl, so please forgive any naivety. This one-liner seems
to work:

$ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}' file
ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

This assumes that the text between double quotes (the part that is matched
in the first place) does not appear elsewhere before the double quotes, and
assumes that it's the only text in double quotes in the line.
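To make the first assumption concrete, here is a small sketch with a made-up input line in which the quoted text also occurs earlier, unquoted. It spells out the same steps as the one-liner; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: the quoted text also appears earlier in the line.
my $line = 'Math, Physics,2231,"Math, Physics",0.45';

# Same steps as the one-liner, written out.
my $naive = $line;
if ($naive =~ s/"([^"]*)"/$1/) {
    my $m = $1;          # text that was between the quotes
    my $n = $m;
    $n =~ s/,/#/g;       # commas become # in the copy
    $naive =~ s/$m/$n/;  # replaces the FIRST occurrence of $m in the line
}
print "$naive\n";
```

Here the final substitution hits the unquoted copy at the start of the line, so the group that was originally quoted keeps its comma.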

aquadoll · Jun 20, 2008

Hello,
Thanks for all the replies. I was actually trying to get the part of
$Entry I am interested in into $TempEntry.
I used the following 2 lines (as shown in the OP):
$TempEntry=$Entry;
$TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;

Why did the above not get "the part of $Entry I am interested in"
into $TempEntry? What did I do wrong?
Thanks.

John W. Krahn · Jun 20, 2008

patrick said:
You might try

perl -F'"' -lane '$F[0] =~ s/"//; $F[1] =~ s/"//; $F[1] =~ s/,/#/; print @F' in.txt > out.txt

split() *removes* the expression you are splitting on, so there are no
'"' characters left in @F to remove. That could be simplified to:

perl -F'"' -lane '$F[1] =~ s/,/#/;print @F' in.txt > out.txt

But that only changes the first ',' to a '#' and not all of them, so you
probably want this instead:

perl -F'"' -lane '$F[1] =~ s/,/#/g;print @F' in.txt > out.txt

Or:

perl -F'"' -lane '$F[1] =~ tr/,/#/;print @F' in.txt > out.txt
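As a rough illustration of what -F'"' with -a does to one of the sample lines, here is the split written out in a script. This is a sketch that assumes a single quoted group per line; the guard on @F is an addition so lines without quotes are left alone.

```perl
use strict;
use warnings;

my $line = 'PQR ERR,2217,"Physics, Chemistry, Math",0.21,5';

# Roughly what -F'"' -a does: split the line on the double quote.
my @F = split /"/, $line;
# @F now holds the text before, inside, and after the quotes;
# the quote characters themselves are gone.

$F[1] =~ tr/,/#/ if @F > 1;   # only touch the quoted piece when there is one
my $out = join '', @F;
print "$out\n";
```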

John

Willem · Jun 20, 2008

aquadoll wrote:
) ABC XXX,2231,"Math, Physics",0.45,2
) PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
) ABC PQR,1213,Physics,0.5,1
)
) I want to detect when there are groups of subjects in the 3rd column,
) remove the quotes in those cases and replace the comma by # inside the
) groups. So, the above lines would be transformed to:
)
) ABC XXX,2231,Math# Physics,0.45,2
) PQR ERR,2217,Physics# Chemistry# Math,0.21,5
) ABC PQR,1213,Physics,0.5,1
)
) I could not think of any one-liner, so I tried the following:
) (Assuming I am reading each line in a variable called $Entry)

How about:

while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Which should do what you want, even for multiple quoted strings.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Willem · Jun 20, 2008

aquadoll wrote:
) Hello,
) Thanks for all the replies. I was actually trying to get the part of
) $Entry I am interested in, in $TempEntry.
) I used the following 2 lines (as shown in the OP):
) $TempEntry=$Entry
) $TempEntry =~ s/"([A-Za-z\s]*([,][A-Za-z\s]*)+)"/$1/;
)
) Why did the above did not get "the part of $Entry I am interested in"
) in $TempEntry? What did I do wrong?

It's a substitution. You substitute the quoted part with the part
between quotes. The rest remains intact.

To get just the part between quotes, use this:

my ($TempEntry) = $Entry =~ /"(.*?)"/;
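A short sketch of the difference, using one of the sample lines: the substitution edits a copy of the whole line, while the match in list context returns only the captured text.

```perl
use strict;
use warnings;

my $Entry = 'ABC XXX,2231,"Math, Physics",0.45,2';

# Substitution on a copy: only the quotes go away, the rest of the line stays.
my $copy = $Entry;
$copy =~ s/"(.*?)"/$1/;

# Match in list context: returns just the captured text.
my ($TempEntry) = $Entry =~ /"(.*?)"/;

print "copy:      $copy\n";
print "TempEntry: $TempEntry\n";
```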

Why the complicated match string, by the way?
Do you only want to match quoted strings that contain a comma?
It seems needlessly complex.

SaSW, Willem

Willem · Jun 20, 2008

Willem wrote:
) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Of course,
while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
is slightly easier.
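Here is the loop as a small script, run on a made-up line with two quoted groups, as a sketch of why the offsets are shifted by one.

```perl
use strict;
use warnings;

# Hypothetical line with two quoted groups.
local $_ = 'A,"x, y",B,"p, q, r",C';

while (s/"(.*?)"/$1/) {
    # @- and @+ still describe the match in the string as it was BEFORE the
    # substitution. Removing the opening quote moves the group one place to
    # the left, hence the "- 1" on the start offset.
    substr($_, $-[1] - 1, $+[1] - $-[1]) =~ s/,/#/g;
}
my $result = $_;
print "$result\n";
```

Each pass removes one pair of quotes and rewrites only that group, so the loop stops once no quoted string is left.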

SaSW, Willem

aquadoll · Jun 20, 2008

Thanks for the great discussion - I learnt a few things.
One last question: what do $- and [1] stand for in "$-[1]-1" in the
above post? Where can I find more about that in perldoc?

Ben Morrow · Jun 20, 2008

aquadoll said:
One last question: what does $ and [1] stands for in the above post in
"$-[1]-1"? Where can I find more about that in perldoc?

$-[1] is the second element of the array @-, which is documented in
perlvar. The syntax is exactly the same as for $a[1] or any other array.
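A minimal sketch of what @- and @+ hold after a successful match, using one of the sample lines. Element 0 describes the whole match and element 1 the first capture group; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $s = 'ABC XXX,2231,"Math, Physics",0.45,2';

my ($start, $end, $text);
if ($s =~ /"(.*?)"/) {
    # $-[1] is the offset where capture group 1 starts,
    # $+[1] is the offset just past where it ends.
    ($start, $end) = ($-[1], $+[1]);
    $text = substr($s, $start, $end - $start);   # same text as $1
    print "group 1 spans offsets $start to $end: $text\n";
}
```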

Ben

aquadoll · Jun 20, 2008

Found it here: http://perldoc.perl.org/perlvar.html#@-

John W. Krahn · Jun 20, 2008

Willem said:
Willem wrote:
) while (s/(")(.*?)"/$2/) { substr($_,$+[1]-1,$+[2]-$+[1]) =~ s/,/#/g }

Of course,
while (s/"(.*?)"/$1/) { substr($_,$-[1]-1,$+[1]-$-[1]) =~ s/,/#/g }
is slightly easier.

Or use a statement modifier:

substr( $_, $-[1] - 1, $+[1] - $-[1] ) =~ s/,/#/g while s/"(.*?)"/$1/;

Or maybe:

/".*?"/ and substr( $_, $-[0], $+[0] - $-[0] ) =~ tr/,"/#/d;

That avoids both the substitution operator and capturing parentheses.
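The tr in that last version does two jobs at once. A small sketch on a quoted group taken by itself, with a made-up variable name:

```perl
use strict;
use warnings;

my $group = '"Physics, Chemistry, Math"';

# Search list is , and " while the replacement list is only #.
# With /d, the comma maps to # and the quote, which has no counterpart
# in the replacement list, is deleted.
(my $cleaned = $group) =~ tr/,"/#/d;
print "$cleaned\n";
```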

John

Brad Baxter · Jun 21, 2008

Also, is there a not-so-cryptic one-liner for this one?

I have to wonder if you're saying that you're not familiar
with tools for parsing CSV-like data.

perl -MText::ParseWords=parse_line -ne'@a=parse_line(",",0,$_); $a[2]=~tr/,/#/;print join",",@a' infile
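The same idea as a short script, in case the one-liner is hard to read. This is a sketch that assumes the subjects are always in the third field.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = 'PQR ERR,2217,"Physics, Chemistry, Math",0.21,5';

# Second argument 0 means the quotes around a field are not kept.
my @fields = parse_line(',', 0, $line);

$fields[2] =~ tr/,/#/;          # third column holds the subjects
my $out = join ',', @fields;
print "$out\n";
```

Because the parser understands quoted fields, the commas inside the quotes never get mistaken for field separators.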

Martijn Lievaart · Jun 21, 2008

Dave B said:
aquadoll said:

ABC XXX,2231,"Math, Physics",0.45,2
PQR ERR,2217,"Physics, Chemistry, Math",0.21,5
ABC PQR,1213,Physics,0.5,1

I want to detect when there are groups of subjects in the 3rd column,
remove the quotes in those cases and replace the comma by # inside the
groups. So, the above lines would be transformed to:

ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1
[snip]
Also, is there a not-so-cryptic one-liner for this one?

I'm a beginner in perl, so please forgive any naivety. This one-liner
seems to work:

$ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}' file
ABC XXX,2231,Math# Physics,0.45,2
PQR ERR,2217,Physics# Chemistry# Math,0.21,5
ABC PQR,1213,Physics,0.5,1

This assumes that the text between double quotes (the part that is
matched in the first place) does not appear elsewhere before the double
quotes, and assumes that it's the only text in double quotes in the
line.

Very good! Now change it slightly (untested):

perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'

The first substitution is guaranteed to match the same string as the
match, so the assumption is not relevant anymore.

M4

Dave B · Jun 21, 2008

Martijn said:
$ perl -pe 'if (s/"([^"]*)"/$1/) {$m=$n=$1; $n=~s/,/#/g; s/$m/$n/;}'
[snip]

Very good! Now change it slightly (untested):

perl -pe 'if (/"([^"]*)"/) {$n=$1; s/"([^"]*)"/$n/ if ($n=~s/,/#/g); }'

The first substitution is guaranteed to match the same string as the
match, so the assumption is not relevant anymore.

This avoids the use of an extra variable, but on the other hand does not
remove the double quotes if there are no commas in the matched string (which
should not happen anyway, according to the OP).
Thanks!

Dave B · Jun 21, 2008

Maybe this is the safest:

perl -pe 'if (/"([^"]*)"/) {$n=$1; $n=~s/,/#/g; s/"([^"]*)"/$n/;}'
