Skip non-English character values

aaron80v · Jan 11, 2007

Hi,

Occasionally the Excel file I am dealing with contains non-English
characters in certain fields (delimited by comma), such as Chinese,
Japanese and Korean. How do I check for and skip those so that my Perl
script won't break?

eg.

ABC, ??????, CDF

Right after processing ABC, I would want to jump to CDF.

Aaron

Jürgen Exner · Jan 11, 2007

Occasionally the Excel file I am dealing with contains non-English
characters in certain fields (delimited by comma), such as Chinese,
Japanese and Korean. How do I check for and skip those so that my Perl
script won't break?

Perl is fully Unicode-capable and can handle non-English characters just
fine. Whether your script can handle them is a different question, of course.

I would use tr/// with the proper options (complement of English characters;
delete) to transliterate the unwanted characters into oblivion.
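
A minimal sketch of the tr/// approach described above, assuming that "English characters" means the 7-bit ASCII range and that the data has already been decoded into Perl characters. The helper name ascii_only and the sample field are hypothetical:

```perl
use strict;
use warnings;

# Delete every character outside the ASCII range.
# /c complements the search list, /d deletes the matched characters.
sub ascii_only {
    my ($text) = @_;
    $text =~ tr/\x00-\x7F//cd;
    return $text;
}

# Hypothetical field: "ABC", two CJK ideographs, "DEF"
my $field = "ABC\x{4E2D}\x{6587}DEF";
print ascii_only($field), "\n";    # prints ABCDEF
```

Note that this also strips accented Latin letters such as "ü", which is the side effect the next paragraph complains about.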

Having said that, I think the whole idea is nuts, and I at least would be
pretty upset if you bastardized my name.

jue

Paul Lalli · Jan 11, 2007

aaron80v said:
Occasionally the Excel file I am dealing with contains non-English
characters in certain fields (delimited by comma), such as Chinese,
Japanese and Korean. How do I check for and skip those so that my Perl
script won't break?

If your perl script "breaks" when encountering non-English characters,
your script is broken and should be fixed. How exactly does it
"break"? Please post a short-but-complete script that demonstrates
what you're doing wrong.

How you "skip" over the non-English characters depends entirely on how
you are processing the data. Line by line, character by character,
field by field, other? Again, please post a short-but-complete script
that demonstrates what you're doing.

Have you read the Posting Guidelines that are posted here twice a week?

Paul Lalli

aaron80v · Jan 11, 2007

Hi,

Thanks. All I am trying to do is to read the content of the 4th
delimited value and remove \n from it. I don't see why it should break
for non-English.

while (<STUFF>) {

next if /^(\s)*$/;

@str1 = split(/,/);
if ($str1[3] =~ /\n/) {
$i++;
$_=~ s/\n/ /eg;
#$_=~ s/\s+/ /g;
}
foreach $name (@str1) {
chomp($name);
}
print OUT "$_";
}

What posting guide? Don't most groups have about the same posting
guide?

Aaron

Paul Lalli · Jan 11, 2007

aaron80v said:
Thanks. All I am trying to do is to read the content of the 4th
delimited value and remove \n from it. I don't see why it should break
for non-English.

You have still not said HOW it breaks. What does "break" even mean?
Does your program crash? Infinite loop? Incorrect output? No
output? WHAT HAPPENS?

This is now the second time I've asked this question. I should not
have to ask it at all. I will not ask it again.

while (<STUFF>) {

next if /^(\s)*$/;

What do you think the parentheses are doing in that statement?

@str1 = split(/,/);
if ($str1[3] =~ /\n/) {

You have a severe logic problem. You're reading a file line-by-line,
but are searching one of the internal fields for a newline. This can't
happen, unless there actually are only four fields in the file. And if
there are, you really just need to chomp() the line beforehand.

$i++;
$_=~ s/\n/ /eg;

What do you think the e is doing in that statement?

#$_=~ s/\s+/ /g;
}
foreach $name (@str1) {
chomp($name);
}

Again. Logic problem. Only the very last field can POSSIBLY have a
newline character, so it makes no sense of any kind to chomp each one.
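
A short sketch of the point being made here, assuming the default input record separator ("\n") and a hypothetical comma-delimited line. The helper names are made up for illustration:

```perl
use strict;
use warnings;

# Return the indexes of the fields that still contain a newline
# after splitting one line exactly as it comes back from <FH>.
sub fields_with_newline {
    my ($line) = @_;
    my @raw = split /,/, $line;
    return grep { $raw[$_] =~ /\n/ } 0 .. $#raw;
}

# chomp the whole line once, then split: no field needs its own chomp.
sub clean_fields {
    my ($line) = @_;
    chomp $line;
    return split /,/, $line;
}

my $line = "a,b,c,d,e\n";    # hypothetical input line
print "fields containing a newline: ", join(',', fields_with_newline($line)), "\n";
print "last field after chomp: '", (clean_fields($line))[-1], "'\n";
```

With this sample only index 4, the last field, is reported, so a single chomp before the split is enough.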

print OUT "$_";

What do you think the quotes are doing in that statement? Please read:
perldoc -q quoting

}

What posting guide?

Like I said, the Posting Guidelines that are posted here twice a week.
They have the words "Posting Guidelines" in their subject. They are
not difficult to find.

Don't most groups have about the same posting guide?

If you had read the Posting Guidelines for this group, you would have
been able to avoid SEVERAL things you've done in this posting that have
made people decide to skip over your post, and likely killfile you.
Those things include:
not using strict and warnings
using inconsistent indentation
not posting sample input
not posting desired output
not posting actual output
not quoting the material you're replying to.
not posting a short-but-COMPLETE script

The posting guidelines are there to give you these tips, so that you
get the best chances of someone who knows what your problem might be
actually reading and responding to your post. Please do not reply
again until you read them.

Paul Lalli

aaron80v · Jan 13, 2007

Thanks Paul.

I will try to comply with the guide as much as possible. Perhaps it would
be best to explain what I am trying to accomplish.

1. Multiple Excel files with different fields, which I need to clean and
keep delimited (^) before importing into a database.
2. Any field can contain \n, and can contain it more than once.
3. The job is to remove all \n except the actual \n at the end of the
last field.
4. If the script encounters non-English characters such as Japanese,
Korean, or Chinese, report the line where they occur before replacing
them with phrases such as "Japanese Characters", "Korean Characters",
"Chinese Characters" etc.

Eg input file:

AAA^ BBB^ CCC^ DDDaa

DDDbb
DDDcc

DDDdd DDDee
DDDff

DDDgg^EEE^FFF^??????^GGG^HHH

Eg output file: (one line without \n except the one after HHH)

AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff
DDDgg^EEE^FFF^Chinese Characters^GGG^HHH

Here is the code, which isn't sufficient for what I am trying to
accomplish. I will worry about the language part later. Right now, I
have a problem distinguishing the last \n from any \n that occurs before
it.

use strict;
use warnings;

my $stuff = "d:\\PerlWork\\myfile.txt";
open STUFF, $stuff or die "Cannot open file $stuff for read :$!";

my $out = "d:\\PerlWork\\FileMani.txt";
open OUT, ">$out" or die "Cannot open file $out for write :$!";

while (<STUFF>) {

    # skip reading the blank lines
    next if /^(\s)*$/;

    # tokenize it with the delimited ^ included.
    my @str1 = split(/(\^)/);

    # remove any \n that may appear anywhere
    foreach my $name (@str1) {
        chomp($name);
        print OUT $name;
    }
}
close (STUFF);
close (OUT);

Aaron

Paul Lalli · Jan 13, 2007

aaron80v said:
Thanks Paul.

I will try to comply with the guide as much as possible.

You've already failed that, as you've already *again* refused to quote
the post you're replying to. I wish you the best of luck with your
program. Good bye.

Paul Lalli

Jürgen Exner · Jan 13, 2007

4. If the script encounters non-English characters such as Japanese,
Korean, or Chinese, report the line where they occur before replacing
them with phrases such as "Japanese Characters", "Korean Characters",
"Chinese Characters" etc.

That is impossible. Simpler example that I can actually type:
It is like asking if the character "ö" is a German or a Swedish character.
The answer is yes --- to both of them.
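
What a script can detect is the Unicode script a character belongs to, not the language. The sketch below is only an illustration under the assumption that the input has already been decoded into Perl characters. The helper name scripts_in and the sample strings are hypothetical; the \p{...} script properties are standard in Perl regular expressions:

```perl
use strict;
use warnings;

# List the CJK-related Unicode scripts that occur in a string.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'      if $text =~ /\p{Han}/;
    push @found, 'Hiragana' if $text =~ /\p{Hiragana}/;
    push @found, 'Katakana' if $text =~ /\p{Katakana}/;
    push @found, 'Hangul'   if $text =~ /\p{Hangul}/;
    return @found;
}

# Hypothetical samples, written as code points to avoid encoding issues.
my @samples = (
    [ 'Chinese word "zhongwen"'    => "\x{4E2D}\x{6587}" ],
    [ 'Japanese word "nihongo"'    => "\x{65E5}\x{672C}\x{8A9E}" ],
    [ 'Japanese word "hiragana"'   => "\x{3072}\x{3089}\x{304C}\x{306A}" ],
    [ 'Korean word "hangeul"'      => "\x{D55C}\x{AE00}" ],
    [ 'plain ASCII'                => "ABC" ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    my @scripts = scripts_in($text);
    print "$label: ", (@scripts ? join(', ', @scripts) : 'none'), "\n";
}
```

The first two samples both report only Han, so a Japanese word written entirely in kanji cannot be told apart from Chinese this way, which is the same ambiguity as the "ö" example.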

jue

aaron80v · Jan 13, 2007

Thanks Paul again.

So Jue, thanks for pointing it out. I guess there is just no way to
figure the language out.

Aaron.

Dr.Ruud · Jan 13, 2007

Jürgen Exner wrote:

That is impossible. Simpler example that I can actually type:
It is like asking if the character "ö" is a German or a Swedish
character. The answer is yes --- to both of them.

English even: coöperation, noöne (with a diaeresis, not an umlaut)
http://en.wikipedia.org/wiki/Diaeresis_(diacritic)

Joe Smith · Jan 14, 2007

1. Multiple Excel files with different fields which I need to clean and
keep them delimited (^) before importing to a database.

If your data is delimited by '^', then you should tell perl to use '^'
as the input record separator.

2. Any fields can have \n and can have it more than once.
3. The job is to remove all \n except the actual \n at the end of the
last field.

You could eliminate them all, then add back the one that should be there.

4. If the script encounters non-English characters such as Japanese,
Korean, or Chinese, report the line where they occur before replacing
them with phrases such as "Japanese Characters", "Korean Characters",
"Chinese Characters" etc.

Here's an example of how to reject (or to mark) characters that are
not alphanumeric or underscore, not blanks, and not '^'.

Cygwin% cat test.pl
#!/usr/bin/perl
use strict; use warnings;

$/ = '^'; # Use caret as record terminator on input
while (<DATA>) {
    s/\s+/ /gs; # Convert newline and other spacing to single space
    s/([^\w\s^])/sprintf "(%02x)",ord $1/eg; # Mark unexpected characters
    print;
}
print "\n";

__DATA__
AAA^ BBB^ CCC^ DDDaa

DDDbb
DDDcc

DDDdd DDDee
DDDff

DDDgg^EEE^FFF^??????^GGG^HHH
Cygwin% perl test.pl
AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff DDDgg^EEE^FFF^(3f)(3f)(3f)(3f)(3f)(3f)^GGG^HHH
Cygwin%

-Joe

aaron80v · Jan 16, 2007

Michele said:
You didn't try hard, did you? You missed the very first step, i.e. you
failed to properly quote the post you're replying to...

Michele

Hi Michele,

Those things include:
1. not using strict and warnings
2. using inconsistent indentation
3. not posting sample input
4. not posting desired output
5. not posting actual output
6. not quoting the material you're replying to.
7. not posting a short-but-COMPLETE script

Sure.. Got your point...

aaron80v · Jan 16, 2007

Joe said:
1. Multiple Excel files with different fields which I need to clean and
keep them delimited (^) before importing to a database.

If your data is delimited by '^', then you should tell perl to use '^'
as the input record separator.

2. Any fields can have \n and can have it more than once.
3. The job is to remove all \n except the actual \n at the end of the
last field.

You could eliminate them all, then add back the one that should be there.

4. If the script encounters non-English characters such as Japanese,
Korean, or Chinese, report the line where they occur before replacing
them with phrases such as "Japanese Characters", "Korean Characters",
"Chinese Characters" etc.

Here's an example of how to reject (or to mark) characters that are
not alphanumeric or underscore, not blanks, and not '^'.

Cygwin% cat test.pl
#!/usr/bin/perl
use strict; use warnings;

$/ = '^'; # Use caret as record terminator on input
while (<DATA>) {
    s/\s+/ /gs; # Convert newline and other spacing to single space
    s/([^\w\s^])/sprintf "(%02x)",ord $1/eg; # Mark unexpected characters
    print;
}
print "\n";

__DATA__
AAA^ BBB^ CCC^ DDDaa

DDDbb
DDDcc

DDDdd DDDee
DDDff

DDDgg^EEE^FFF^??????^GGG^HHH
Cygwin% perl test.pl
AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff DDDgg^EEE^FFF^(3f)(3f)(3f)(3f)(3f)(3f)^GGG^HHH
Cygwin%

-Joe

Thanks Joe, your code is good, but it doesn't distinguish the \n at
the end of the record (in this case at the end of HHH) from the others,
and therefore removes it too.

It's good for me to draw it out...

Col 1 || Col 2           || Col 3 || Col 4
=====================================================
111^     AAA BBB CCC\n      333^     ZZZ\n (end of record)
         DDD\n
         EEE FFF GGG^

The intention is to remove only the \n after CCC and DDD.

Aaron
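
One way to keep only the record-ending \n, sketched under two assumptions that the thread does not confirm: every record has a known, fixed number of fields (4 in the drawing above), and the last field never contains an embedded \n. The sub name join_records and the sample data are hypothetical:

```perl
use strict;
use warnings;

# Join physical lines into logical records. A record is considered
# complete once it holds ($num_fields - 1) '^' delimiters.
sub join_records {
    my ($fh, $num_fields) = @_;
    my @records;
    my $buffer = '';
    while (my $line = <$fh>) {
        chomp $line;
        # Skip blank lines that fall between records.
        next if $buffer eq '' && $line =~ /^\s*$/;
        # An embedded newline becomes a single space.
        $buffer .= ($buffer eq '' ? '' : ' ') . $line;
        my $delimiters = ($buffer =~ tr/^//);    # count carets
        if ($delimiters >= $num_fields - 1) {
            $buffer =~ s/\s+/ /g;                # collapse runs of whitespace
            push @records, $buffer;
            $buffer = '';
        }
    }
    push @records, $buffer if $buffer ne '';     # incomplete trailing record
    return @records;
}

# Hypothetical input modelled on the drawing above, plus a second record.
my $input = "111^ AAA BBB CCC\nDDD\nEEE FFF GGG^ 333^ ZZZ\n"
          . "222^ one^ two^ three\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
print "$_\n" for join_records($fh, 4);
close $fh;
```

For this input the sketch prints two lines, "111^ AAA BBB CCC DDD EEE FFF GGG^ 333^ ZZZ" and "222^ one^ two^ three". If the last field can also contain \n, counting delimiters is no longer enough and the export would need a distinct record terminator.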
