filter out "strange" text in perl ? ÃÂµâ–“Â½Ï„â”¤â–‘Î¦Ã¢â‚§

Jack · Nov 11, 2006

Hi

I am parsing a text file, and the data file appears to contain a NULL
(nothing) between my delimiters, but Perl is recognizing a value when
I print to the screen as: ÃÂµâ–“Â½Ï„â”¤â–‘Î¦Ã¢â‚§

This is screwing up my program and I want to get rid of it ! Does
anyone know how to auto match / detect this so I can remove it / deal
with it ?!!

Thanks Jack..

anno4000 · Nov 11, 2006

Jack said:
Hi

I am parsing a text file, and the data file appears to contain a NULL
(nothing) between my delimiters, but Perl is recognizing a value when
I print to the screen as: ÃÂµâ–“Â½Ï„â”¤â–‘Î¦Ã¢â‚§

Posting "strange characters" to Usenet is useless. Every news reader
will show something else. In fact, my reader shows your string
differently in the subject and the body of the message.

To communicate the data unambiguously you could print the numeric
value of each character:

printf "%d ", ord $_ for split //, $string;
print "\n";

Be sure to post not only the output but also the proglet that
generated it.
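
As a slightly fuller sketch of the same idea, the numeric values can be collected by a small helper and printed in hex, which is often easier to compare against encoding tables. The helper name dump_chars and the sample field are hypothetical, chosen only for illustration; hex instead of decimal is a presentation choice, not something the one-liner above requires.

```perl
use strict;
use warnings;

# Return the numeric value of every character in a string, in hex,
# so the data can be posted without depending on any display encoding.
# dump_chars is a hypothetical helper name.
sub dump_chars {
    my ($string) = @_;
    return join ' ', map { sprintf('%02X', ord $_) } split //, $string;
}

# Hypothetical sample: a field holding three high-bit bytes.
my $field = "\xC3\xB5\xB2";
print dump_chars($field), "\n";    # prints: C3 B5 B2
```

Posting this output, together with the code that read the field, shows exactly which values Perl sees regardless of how any reader renders them.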

This is screwing up my program and I want to get rid of it ! Does
anyone know how to auto match / detect this so I can remove it / deal
with it ?!!

What exactly do you mean by "this"? Simply deleting the exact sequence
of bytes wherever it appears would be a horrible solution, and probably
not a solution at all if similar but not identical strings appear
elsewhere.

You should find out why the disruptive strings are there in the first
place. Then there may be a realistic chance to get rid of them.

Anno

Jack · Nov 11, 2006

Ok then - does anyone know what the syntax is to detect:
1- ASCII
2- double byte characters
3- UTF-8

Thank you,

Jack

anno4000 · Nov 11, 2006

Jack said:
Ok then - does anyone know what the syntax is to detect:

That's not a syntax question. Code untested:

1- ASCII

For a single character:

ord $char < 128

For use in a regex:

[[:ascii:]]

2- double byte characters

ord $char >= 256

[^\0-\xff]

3- UTF-8

UTF-8 and ASCII overlap.
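
The checks above can be combined into one routine. This is a minimal sketch, assuming the string holds characters as Perl sees them (it does not decode anything); classify is a hypothetical helper name and the returned labels are made up for illustration.

```perl
use strict;
use warnings;

# Classify a string by the largest kind of character it contains.
# classify is a hypothetical helper, not part of any module.
sub classify {
    my ($string) = @_;
    return 'empty'    if $string eq '';
    return 'wide'     if $string =~ /[^\0-\xff]/;      # some ord >= 256
    return 'high-bit' if $string =~ /[[:^ascii:]]/;    # some ord in 128..255
    return 'ascii';                                    # every ord < 128
}

print classify("abc"), "\n";         # prints: ascii
print classify("caf\xe9"), "\n";     # prints: high-bit
print classify("\x{263A}"), "\n";    # prints: wide
```

The 'high-bit' case is the one that neither of the two tests above singles out: characters from 128 to 255 are not ASCII, yet they still fit in one byte.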

Anno

Ben Bacarisse · Nov 12, 2006

anno4000 said:
UTF-8 and ASCII overlap.

Yes, but in a very useful way. If you have an octet stream that might
be ASCII or UTF-8 or even "both mixed up" you can tell them apart. I
put that in quotes because, as you say, the encodings overlap in that
ASCII octets are valid UTF-8 encodings and they encode the same
character.

When you see an octet with the top bit set, you can tell whether it is
the first octet of a UTF-8 encoding *and* how many of the following
octets are part of the character. The following octets are all of the
form 10xxxxxx, so even if you find yourself in the middle of a
multi-octet character you can skip to the next one (it will start
11xxxxxx or 0xxxxxxx).
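
The bit patterns described above can be sketched in Perl as follows. Both helper names are hypothetical, and the code only inspects the leading bits: it does not reject overlong or out-of-range sequences, so it is not a full UTF-8 validator.

```perl
use strict;
use warnings;

# Given one octet (0..255), report how many octets a UTF-8 sequence
# starting with it would occupy, or 0 if it cannot start a sequence.
# Only the leading bit pattern is examined.
sub utf8_sequence_length {
    my ($octet) = @_;
    return 1 if $octet < 0x80;    # 0xxxxxxx: same as ASCII
    return 0 if $octet < 0xC0;    # 10xxxxxx: continuation octet
    return 2 if $octet < 0xE0;    # 110xxxxx
    return 3 if $octet < 0xF0;    # 1110xxxx
    return 4 if $octet < 0xF8;    # 11110xxx
    return 0;                     # 11111xxx never appears in UTF-8
}

# From byte offset $pos, skip continuation octets (10xxxxxx) and
# return the offset where the next character starts.
sub resync {
    my ($bytes, $pos) = @_;
    $pos++ while $pos < length($bytes)
              && (ord(substr($bytes, $pos, 1)) & 0xC0) == 0x80;
    return $pos;
}

my $bytes = "\xE2\x98\xBA" . "A";    # a three-octet character, then "A"
print utf8_sequence_length(ord($bytes)), "\n";    # prints: 3
print resync($bytes, 1), "\n";                    # prints: 3
```

Starting in the middle of the three-octet character (offset 1), resync moves forward to offset 3, where the plain ASCII "A" begins.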

Joe Smith · Nov 12, 2006

Jack said:
Ok then - does anyone know what the syntax is to detect:
1- ASCII
2- double byte characters
3- UTF-8

Your list excludes single-byte characters with the high-order
bit set, such as ISO 8859-1 (Latin-1 alphabet), since ASCII is
defined as 7-bit codes.
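
One way to separate that third case from UTF-8 is to attempt a strict decode of the raw octets and treat failure as "some other 8-bit encoding". This is a sketch under that assumption, using the core Encode module; the helper names and labels are hypothetical, and a failed decode cannot say which single-byte encoding was actually used.

```perl
use strict;
use warnings;
use Encode ();

# True if the string contains any octet with the high-order bit set.
sub has_high_bit {
    my ($octets) = @_;
    return $octets =~ /[\x80-\xFF]/ ? 1 : 0;
}

# True if the octets decode cleanly as strict UTF-8.
# decode() may modify its input when a check flag is given,
# so it works on the local copy in $octets.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# Hypothetical labels for the three cases discussed in the thread.
sub guess_octets {
    my ($octets) = @_;
    return 'ascii' unless has_high_bit($octets);
    return is_valid_utf8($octets) ? 'utf-8' : 'other 8-bit';
}

print guess_octets("cafe"), "\n";           # prints: ascii
print guess_octets("caf\xC3\xA9"), "\n";    # prints: utf-8
print guess_octets("caf\xE9"), "\n";        # prints: other 8-bit
```

A lone 0xE9 is a valid ISO 8859-1 character but not a complete UTF-8 sequence, which is why the last example falls into the third bucket.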

-Joe
