Filter out "strange" text in Perl? íµ▓½τ┤░Φâ₧

Jack

Hi

I am parsing a text file and see what looks like a NULL (nothing) in
the data file between my delimiters, but Perl recognizes a value there.
When I print it to the screen it shows as: íµ▓½τ┤░Φâ₧

This is screwing up my program and I want to get rid of it! Does
anyone know how to automatically match / detect this so I can remove
it or deal with it?

Thanks Jack..
 
anno4000

Jack said:
Hi

I am parsing a text file and see what looks like a NULL (nothing) in
the data file between my delimiters, but Perl recognizes a value there.
When I print it to the screen it shows as: íµ▓½τ┤░Φâ₧

Posting "strange characters" to Usenet is useless. Every news reader
will show something else. In fact, my reader shows your string
differently in the subject and the body of the message.

To communicate the data unambiguously you could print the numeric
value of each character:

printf "%d ", ord $_ for split //, $string;
print "\n";

Be sure to post not only the output but also the proglet that
generated it.
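
As a sketch of what such a proglet might look like, the following prints each character as a hex code rather than relying on how a terminal or news reader renders it. The helper name `dump_chars` and the sample field value are made up for illustration; they are not taken from Jack's data.

```perl
use strict;
use warnings;

# Hypothetical helper: render every character of a string as a hex code,
# so invisible or mis-rendered characters become unambiguous.
sub dump_chars {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $string;
}

# Invented sample: a field that may look empty or garbled on screen.
my $field = "\x00\xEF\xBB\xBF";
print dump_chars($field), "\n";   # prints "00 EF BB BF"
```

Posting output in this form, together with the code that produced it, lets readers see the exact values regardless of their display settings.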

Jack said:
This is screwing up my program and I want to get rid of it! Does
anyone know how to automatically match / detect this so I can remove
it or deal with it?

What exactly do you mean by "this"? Simply deleting the exact sequence
of bytes wherever it appears would be a horrible solution, and probably
not a solution at all if similar but not identical strings appear
elsewhere.

You should find out why the disruptive strings are there in the first
place. Then there may be a realistic chance to get rid of them.

Anno
 
Jack

Ok then - does anyone know what the syntax is to detect:
1- ASCII
2- double byte characters
3- UTF-8

Thank you,

Jack
 
Ben Bacarisse

Ok then - does anyone know what the syntax is to detect:

That's not a syntax question. Code untested:

For a single character:

ord $char < 128

For use in a regex:

[[:ascii:]]
2- double byte characters

ord $char >= 256

[^\0-\xff]
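
The tests above could be combined roughly as follows. This is only a sketch: the function name `classify` and its labels are invented here. Note that `ord $char >= 256` detects a character whose code point is above 255 in a Perl character string; it says nothing about double-byte encodings in raw, undecoded bytes.

```perl
use strict;
use warnings;

# Hypothetical classifier built from the tests quoted above.
sub classify {
    my ($string) = @_;
    return 'ascii' if $string !~ /[^[:ascii:]]/;   # every char is below 128
    return 'wide'  if $string =~ /[^\0-\xff]/;     # some char is 256 or above
    return 'high-bit';                             # chars in 128..255 only
}

print classify("plain"), "\n";      # prints "ascii"
print classify("caf\xE9"), "\n";    # prints "high-bit"
print classify("\x{263A}"), "\n";   # prints "wide"
```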

UTF-8 and ASCII overlap.

Yes, but in a very useful way. If you have an octet stream that might
be ASCII or UTF-8 or even "both mixed up" you can tell them apart. I
put that in quotes because, as you say, the encodings overlap in that
ASCII octets are valid UTF-8 encodings and they encode the same
character.

When you see an octet with the top bit set, you can tell whether it is
the first octet of a UTF-8 encoding *and* how many of the following
octets are part of the character. The following octets are all of the
form 10xxxxxx, so even if you find yourself in the middle of a
multi-octet character you can skip to the next one (it will start with
11xxxxxx or 0xxxxxxx).
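
A minimal sketch of that bit-pattern logic, assuming the data is held as a raw byte string. The names `utf8_seq_len` and `next_char_start` are made up for illustration, and the code only inspects the leading bits; it does not reject overlong or otherwise invalid sequences.

```perl
use strict;
use warnings;

# Length of the UTF-8 sequence announced by a lead octet (an integer 0..255).
# Returns 0 for a continuation octet or a pattern UTF-8 never uses.
sub utf8_seq_len {
    my ($octet) = @_;
    return 1 if $octet < 0x80;   # 0xxxxxxx: ASCII
    return 0 if $octet < 0xC0;   # 10xxxxxx: continuation octet
    return 2 if $octet < 0xE0;   # 110xxxxx
    return 3 if $octet < 0xF0;   # 1110xxxx
    return 4 if $octet < 0xF8;   # 11110xxx
    return 0;                    # 11111xxx: not used by UTF-8
}

# From any offset, skip continuation octets to reach the next character start.
sub next_char_start {
    my ($bytes, $pos) = @_;
    my $len = length $bytes;
    $pos++ while $pos < $len
        && (ord(substr($bytes, $pos, 1)) & 0xC0) == 0x80;
    return $pos;
}

my $bytes = "\xE2\x82\xACA";             # three-octet euro sign, then "A"
print utf8_seq_len(0xE2), "\n";          # prints 3
print next_char_start($bytes, 1), "\n";  # prints 3, the offset of "A"
```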
 
Joe Smith

Jack said:
Ok then - does anyone know what the syntax is to detect:
1- ASCII
2- double byte characters
3- UTF-8

Your list excludes single-byte characters with the high-order
bit set, such as ISO 8859-1 (Latin-1 alphabet), since ASCII is
defined as 7-bit codes.
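
One way to account for that third case is to test for pure ASCII first, then attempt a strict UTF-8 decode, and treat anything that fails as some single-byte encoding. The sketch below assumes the core Encode module; the name `guess_encoding` and its labels are invented for illustration. It is a heuristic only: short Latin-1 strings can happen to be valid UTF-8, and it cannot tell one single-byte encoding from another.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical heuristic for a raw byte string.
sub guess_encoding {
    my ($bytes) = @_;
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;   # no octet has the high bit set

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes intact.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 'utf-8' : 'single-byte';
}

print guess_encoding("cafe"), "\n";           # prints "ascii"
print guess_encoding("caf\xC3\xA9"), "\n";    # prints "utf-8"
print guess_encoding("caf\xE9"), "\n";        # prints "single-byte"
```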

-Joe
 
