use binary operator on ascii text string

Sean.Dewis

Hi everyone

I'm pretty crap at Perl, so I'd appreciate some help from you guys.

I have a string value held in $body variable.

What I need to do is manipulate each individual character value in the
string with OR - "|" and then replace that character with the
character's new value.

I'm using chr(ord($c) | 64) to get the new value, but I'm stuck on two
things: -

1) How to go through the string byte by byte and perform the OR 64 on
it
2) How to get the character equivalent back into the string in the
right place

For example the string is "abcdefg", by (I know it's not true)
performing OR 64 on each char I want "fghijkl" out.

Any idea's? Code examples would be appreciated.

TIA

Sean
 
David Squire

Hi everyone

I'm pretty crap at Perl, so I'd appreciate some help from you guys.

I have a string value held in $body variable.

What I need to do is manipulate each individual character value in the
string with OR - "|" and then replace that character with the
character's new value.

I'm using chr(ord($c) | 64) to get the new value, but I'm stuck on two
things: -

1) How to go through the string byte by byte and perform the OR 64 on
it
2) How to get the character equivalent back into the string in the
right place

For example the string is "abcdefg", by (I know it's not true)
performing OR 64 on each char I want "fghijkl" out.

??? The characters in "abcdefg" already have bit 7 set on - as they must
since ord('a') is 97 > 64 (at least in ASCII, and many derived encodings).

Any idea's?

Learn how to use apostrophes correctly, for a start :)

Code examples would be appreciated.

Here's some code that does what I think you want, but as I have
described above, that is not actually that clear. I bet that there are
nicer ways to do this too, which others will most likely soon point out :)

----

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    my @line_array = split //, $line;
    my @new_line_array = map {$_ | chr(64)} @line_array;
    my $new_line = join '', @new_line_array;
    print "$new_line\n";
}


__DATA__
abcdefg
1234567687568
%^&*^*()&^)&^

----

Output:

abcdefg
qrstuvwvxwuvx
e^fj^jhif^if^


DS
 
David Squire

David said:
Hi everyone

I'm pretty crap at Perl, so I'd appreciate some help from you guys.

I have a string value held in $body variable.

What I need to do is manipulate each individual character value in the
string with OR - "|" and then replace that character with the
character's new value.

I'm using chr(ord($c) | 64) to get the new value, but I'm stuck on two
things: -

1) How to go through the string byte by byte and perform the OR 64 on
it
2) How to get the character equivalent back into the string in the
right place
[snip]

Here's some code that does what I think you want, but as I have
described above, that is not actually that clear. I bet that there are
nicer ways to do this too, which others will most likely soon point out :)

[snip]

... such as this, which explicitly deals with bytes, rather than hoping
that that is what characters are in the default encoding:

----

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    my @line_array = unpack 'C*', $line;
    my @new_line_array = map {$_ | 64} @line_array;
    my $new_line = pack 'C*', @new_line_array;
    print "$new_line\n";
}
 
Sherm Pendley

David Squire said:
??? The characters in "abcdefg" already have bit 7 set on - as they
must since ord('a') is 97 > 64

??? The value of a bit is 2^position, starting at position 0.

2^7 = 128.
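
To make the numbering concrete, here is a small sketch that lists which bits are set in ord('a'), using zero-based positions. The helper name set_bits is made up for illustration, and the values assume an ASCII-compatible encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: zero-based positions of the set bits in an 8-bit value.
sub set_bits {
    my ($n) = @_;
    return grep { $n & (1 << $_) } 0 .. 7;
}

# ord('a') is 97 in ASCII, i.e. 64 + 32 + 1.
for my $pos (set_bits(ord 'a')) {
    print "bit $pos has value ", 2 ** $pos, "\n";
}
```

With zero-based numbering the bit worth 64 is bit 6; counting from one makes it the 7th bit.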

sherm--
 
David Squire

Sherm said:
??? The value of a bit is 2^position, starting at position 0.

2^7 = 128.

I started counting at 1. The OP stated that he was doing | 64, so the
bit referred to was clear in any case.

DS
 
Sherm Pendley

David Squire said:
I started counting at 1.

Yes, obviously - that's why I posted the correction. Beginning at one is
incorrect in any base-n notation, not just binary. For any value of n, the
value of position x is n^x. That only works when the positions are numbered
starting with zero.

It's not a matter of personal preference or opinion, it's part of the
mathematical definition of base-n notation.

sherm--
 
David Squire

Sherm said:
Yes, obviously - that's why I posted the correction. Beginning at one is
incorrect in any base-n notation, not just binary. For any value of n, the
value of position x is n^x. That only works when the positions are numbered
starting with zero.

It's not a matter of personal preference or opinion, it's part of the
mathematical definition of base-n notation.

And entirely unrelated to helping with the OP's question. I can just as
easily say that the value at the nth position is x^(n-1), and then count
1st, 2nd, 3rd, etc.

You have again snipped context that made it clear that there was no
ambiguity in what I posted.

Choosing to start at 0 is indeed arbitrary - though of course you are
right about the most common convention.


DS
 
Sherm Pendley

David Squire said:
And entirely unrelated to helping with the OP's question.

Sorry. I guess I didn't realize I was getting paid for working at this
help desk and therefore obligated to answer questions.

I can just as easily say that the value at the nth position is x^(n-1),
and then count 1st, 2nd, 3rd, etc.

The difference is that I'm talking about an established rule that's been
widely agreed upon for decades - and that's just within the realm of
computer science. You, on the other hand, are just making stuff up to
rationalize your mistakes.

You have again snipped context that made it clear that there was no
ambiguity in what I posted.

You're right - It was unambiguously wrong.

Choosing to start at 0 is indeed arbitrary

arbitrary, adj:
1. Determined by chance, whim, or impulse, and not by necessity, reason,
or principle: stopped at the first motel we passed, an arbitrary
choice.
2. Based on or subject to individual judgment or preference: The diet
imposes overall calorie limits, but daily menus are arbitrary.
3. Established by a court or judge rather than by a specific law or
statute: an arbitrary penalty.
4. Not limited by law; despotic: the arbitrary rule of a dictator.

The original decision to start at zero was indeed arbitrary. But that was a
long time ago. One could just as easily argue that the use of the Arabic
numerals 1 and 0 is arbitrary.

Now it's an established convention, and following it is not subject to
individual judgment or preference, assuming of course that you expect to
be understood.

sherm--
 
Mumia W.

Hi everyone

I'm pretty crap at Perl, so I'd appreciate some help from you guys.

I have a string value held in $body variable.

What I need to do is manipulate each individual character value in the
string with OR - "|" and then replace that character with the
character's new value.

I'm using chr(ord($c) | 64) to get the new value
[...]

Then you're pretty much there. Just use the substitution operator to
replace each character with the result of the code you have above, and
you're almost set.

You'll also have to change $c to the match variable $&, and the
substitution operator will need the 'g' option (global--go through the
entire string) and the 'e' option (execute code).
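
Put together, the suggestion might look like the sketch below. The wrapper or_each_char and the sample string are made up for illustration; the /s option is an extra assumption so that "." also matches a newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution described above.
sub or_each_char {
    my ($body, $mask) = @_;
    # /g: replace every match; /e: evaluate the replacement as code;
    # /s: let "." match "\n" as well. $& holds the text just matched.
    $body =~ s/./chr(ord($&) | $mask)/gse;
    return $body;
}

my $body = "1234567";
print or_each_char($body, 64), "\n";    # prints "qrstuvw"
```

On older perls, mentioning $& anywhere in a program can slow down every regex match, so a capture group with $1 is a common alternative.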
 
David Squire

David said:
David said:
Hi everyone

I'm pretty crap at Perl, so I'd appreciate some help from you guys.

I have a string value held in $body variable.

What I need to do is manipulate each individual character value in the
string with OR - "|" and then replace that character with the
character's new value.

I'm using chr(ord($c) | 64) to get the new value, but I'm stuck on two
things: -

1) How to go through the string byte by byte and perform the OR 64 on
it
2) How to get the character equivalent back into the string in the
right place
[snip]

Here's some code that does what I think you want, but as I have
described above, that is not actually that clear. I bet that there are
nicer ways to do this too, which others will most likely soon point
out :)

[snip]

... such as this, which explicitly deals with bytes, rather than hoping
that that is what characters are in the default encoding:

----

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    my @line_array = unpack 'C*', $line;
    my @new_line_array = map {$_ | 64} @line_array;
    my $new_line = pack 'C*', @new_line_array;
    print "$new_line\n";
}

----

Well, I might as well give the last (?) in the series, following Mumia's
suggestion:

----

#!/usr/bin/perl
use strict;
use warnings;

my $mask = 64;
while (<DATA>) {
    s/(.)/chr(ord($1) | $mask)/eg;
    print;
}


__DATA__
abcdefg
1234567687568
%^&*^*()&^)&^

----

Output:

abcdefg
qrstuvwvxwuvx
e^fj^jhif^if^


... though I still prefer the explicit byte-wise one above.


Cheers,

DS
 
Ben Morrow

Quoth David Squire said:
Well, I might as well give the last (?) in the series, following Mumia's
suggestion:

while (<DATA>) {
    print $_ | (chr(64) x length);
}

:)

Ben
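
One thing to watch with the whole-string OR: the line read from DATA still ends in "\n", and "\n" | "@" is "J", so the newline gets transformed as well. A minimal sketch that chomps first is shown below. The helper or_whole_string is made up for illustration, and it assumes neither operand has been used as a number, since | only works string-wise when both operands are strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: OR every character of $line with chr($mask) in one go.
sub or_whole_string {
    my ($line, $mask) = @_;
    # Both operands are strings, so | works character by character.
    return $line | (chr($mask) x length($line));
}

my $line = "%^&*\n";
chomp $line;                               # keep "\n" away from the OR
print or_whole_string($line, 64), "\n";    # prints "e^fj"
```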
 
John W. Krahn

I'm pretty crap at Perl, so I'd appreciate some help from you guys.

I have a string value held in $body variable.

What I need to do is manipulate each individual character value in the
string with OR - "|" and then replace that character with the
character's new value.

I'm using chr(ord($c) | 64) to get the new value, but I'm stuck on two
things: -

1) How to go through the string byte by byte and perform the OR 64 on
it
2) How to get the character equivalent back into the string in the
right place

For example the string is "abcdefg", by (I know it's not true)
performing OR 64 on each char I want "fghijkl" out.

Any idea's? Code examples would be appreciated.

$body =~ s/(.)/ $1 | "\x40" /seg;



John
 
Peter J. Holzer

Sherm said:
Yes, obviously - that's why I posted the correction. Beginning at one is
incorrect in any base-n notation, not just binary. For any value of n, the
value of position x is n^x. That only works when the positions are numbered
starting with zero.

It's not a matter of personal preference or opinion, it's part of the
mathematical definition of base-n notation.

But base-n notation is not the only notation in use. For example, the
RFCs describing the IP protocol (RFC 791 etc.) count bits from the MSB
to the LSB. They also start at zero, so if that convention is used on
bytes, bit 0 has the value 128, bit 1 has the value 64, etc. So David
could have said that the characters already have bit 1 set on and
confused the hell out of everyone :).
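
As a rough illustration of the two conventions for an 8-bit byte, the sketch below prints the value of each position both ways. The helper names lsb0_value and msb0_value are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: value of a bit position in an 8-bit byte
# under the two numbering conventions.
sub lsb0_value { my ($pos) = @_; return 1 << $pos; }        # bit 0 = least significant
sub msb0_value { my ($pos) = @_; return 1 << (7 - $pos); }  # bit 0 = most significant

printf "position %d: LSB-first %3d, MSB-first %3d\n",
    $_, lsb0_value($_), msb0_value($_)
    for 0 .. 7;
```

The bit worth 64 comes out as position 6 when counting from the LSB and position 1 when counting from the MSB.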

I have seen numbering from 1..n (from either direction) instead of
0..n-1, too, but I'm too lazy to look for a widely known example. (But
if you read mathematical papers you will notice that many prefer to use
indexes starting at 1, even if it makes the formulas (formulae?) more
complicated because they have to write (i-1) instead of i all the time.)

I don't care much as long as it is consistent. What really annoys me are
people who start counting at zero but claim that "zeroth" is not an
English word, so they use "the seventh bit" and "bit 6" interchangeably.

hp
 
Jürgen Exner

Peter said:
I don't care much as long as it is consistent. What really annoys me
are people who start counting at zero but claim that "zeroth" is not
an English word, so they use "the seventh bit" and "bit 6"
interchangeably.

Sure as hell confusing.
But the first element in a Perl array happens to be the element with the
index 0.

Guess it's just something you have to get used to.

jue
 
Peter J. Holzer

David said:
... such as this, which explicitly deals with bytes, rather than hoping
that that is what characters are in the default encoding:

----

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    my @line_array = unpack 'C*', $line;
    my @new_line_array = map {$_ | 64} @line_array;
    my $new_line = pack 'C*', @new_line_array;
    print "$new_line\n";
}

I don't think this is a good idea, as it depends on whether $line is
stored as bytes or as UTF-8 internally, which shouldn't make any
semantic difference.

hp
 
David Squire

Peter said:
I don't think this is a good idea, as it depends on whether $line is
stored as bytes or as UTF-8 internally, which shouldn't make any
semantic difference.

It was not clear to me from the OP what the actual application was. I
guess I suspect that bit masking is more likely to be applied to bytes
of data than characters...

... now, had he been masking with 32, I could imagine that this was a
hacky way to convert things to lowercase.
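
For the curious, a sketch of that hack follows. The helper or_32 is made up for illustration, and ASCII is assumed. It works for A-Z but also changes some non-letters, which is why lc is the better tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: OR every character with chr(32), i.e. a space.
sub or_32 {
    my ($s) = @_;
    return $s | (' ' x length($s));
}

print or_32('HELLO'), "\n";   # prints "hello"
print or_32('A@Z['), "\n";    # prints "a`z{", whereas lc gives "a@z["
```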


DS
 
Peter J. Holzer

David said:
It was not clear to me from the OP what the actual application was. I
guess I suspect that bit masking is more likely to be applied to bytes
of data than characters...

Yes, but I would still argue that the "bytes" in $line are what you get
by splitting it into "characters", not by using unpack 'C*'.

(In fact, I'm not sure if the behaviour of unpack 'C*' is correct - the
docs aren't clear and it does violate the principle of least
astonishment).

Consider this script:

#!/usr/bin/perl
use warnings;
use strict;

my $x = "\x{FC}";
utf8::upgrade($x);
my $y = "\x{FC}";

print "\$x and \$y are", ($x eq $y ? "" : " not"), " equal\n";

my @x = unpack 'C*', $x;

print "\$x is_utf8: ", utf8::is_utf8($x), "\n";
for (@x) { print "$_\n" }

my @y = unpack 'C*', $y;

print "\$y is_utf8: ", utf8::is_utf8($y), "\n";
for (@y) { print "$_\n" }
__END__

With perl, v5.8.4 built for i386-linux-thread-multi, it prints:

$x and $y are equal
$x is_utf8: 1
195
188
$y is_utf8:
252

So while perl thinks that $x and $y are equal, unpacking them with C*
yields different results. I don't think this should be the case, as it
can introduce hard-to-find bugs if a string of (0..255) is for some
reason stored as UTF-8.
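
One way to sidestep the question is to go through characters with split and ord, which should not depend on how the string is stored internally. A minimal sketch, with the helper name char_values made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one number per character, regardless of storage.
sub char_values {
    my ($s) = @_;
    return map { ord($_) } split //, $s;
}

my $x = "\x{FC}";
utf8::upgrade($x);          # same string, stored as UTF-8 internally
my $y = "\x{FC}";           # same string, stored as a single byte

print join(',', char_values($x)), "\n";   # prints "252"
print join(',', char_values($y)), "\n";   # prints "252"
```

The unpack output shown above is from 5.8.4; other perl versions may behave differently, so it is worth testing on the version you actually run.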

hp
 
David Squire

Peter said:
Yes, but I would still argue that the "bytes" in $line are what you get
by splitting it into "characters", not by using unpack 'C*'.

Well, to me a byte is a byte is a byte: 8 bits. I agree that the OP
used a line of text as the example, so using unpack 'C*' is not
a good idea.

(In fact, I'm not sure if the behaviour of unpack 'C*' is correct - the
docs aren't clear and it does violate the principle of least
astonishment).

I don't think the docs are that unclear. In perlfunc#pack it says:

"C An unsigned char value. Only does bytes. See U for Unicode."

I agree that calling this a char, and using the mnemonic 'C' is
potentially confusing in today's world of multiple multi-byte character
sets.

So, if I want bytes, that's what I would use. Mind you, I would only be
doing this for something like a bit-based set representation, not when I
was playing with characters intended to represent text (which may or may
not be stored as bytes).


Regards,

DS
 
