Skip non-English character values

aaron80v

Hi,

Occasionally the Excel file I am dealing with contains non-English
characters, such as Chinese, Japanese and Korean, in certain
comma-delimited fields. How do I check for and skip those so that my
Perl script won't break?

For example:

ABC, ??????, CDF

Right after processing ABC, I want to jump to CDF.

Aaron
 
Jürgen Exner

> Occasionally the Excel file I am dealing with contains non-English
> characters, such as Chinese, Japanese and Korean, in certain
> comma-delimited fields. How do I check for and skip those so that my
> Perl script won't break?

Perl is fully Unicode-capable and can handle non-English characters just
fine. Whether your script can handle them is a different question, of course.

I would use tr/// with the proper options (complement of English characters;
delete) to transliterate the unwanted characters into oblivion.
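
A minimal sketch of that tr/// approach, assuming "English characters" means the 7-bit ASCII range and that the input has already been decoded into Perl characters. The helper name strip_non_ascii is hypothetical, not something from this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character outside 7-bit ASCII.
# Assumption: "English" is taken to mean code points 0x00-0x7F.
sub strip_non_ascii {
    my ($text) = @_;
    # /c complements the search list, /d deletes whatever matches it
    (my $clean = $text) =~ tr/\x00-\x7F//cd;
    return $clean;
}

# \x{4E2D}\x{6587} are two CJK ideographs, written as escapes so the
# script itself stays pure ASCII.
print strip_non_ascii("ABC, \x{4E2D}\x{6587}, CDF"), "\n";   # prints "ABC, , CDF"
```

Note that this silently discards data, which is exactly the objection raised in the next paragraph.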

Having said that, I think the whole idea is nuts; I, for one, would be
pretty upset if you bastardized my name.

jue
 
Paul Lalli

> Occasionally the Excel file I am dealing with contains non-English
> characters, such as Chinese, Japanese and Korean, in certain
> comma-delimited fields. How do I check for and skip those so that my
> Perl script won't break?

If your perl script "breaks" when encountering non-English characters,
your script is broken and should be fixed. How exactly does it
"break"? Please post a short-but-complete script that demonstrates
what you're doing wrong.

How you "skip" over the non-English characters depends entirely on how
you are processing the data. Line by line, character by character,
field by field, other? Again, please post a short-but-complete script
that demonstrates what you're doing.

Have you read the Posting Guidelines that are posted here twice a week?

Paul Lalli
 
aaron80v

Hi,

Thanks. All I am trying to do is to read the contents of the fourth
delimited field and remove \n from it. I don't see why it should break
for non-English.

while (<STUFF>) {

next if /^(\s)*$/;

@str1 = split(/,/);
if ($str1[3] =~ /\n/) {
$i++;
$_=~ s/\n/ /eg;
#$_=~ s/\s+/ /g;
}
foreach $name (@str1) {
chomp($name);
}
print OUT "$_";
}

What posting guide? Don't most groups have about the same posting
guide?

Aaron
 
Paul Lalli

> Thanks. All I am trying to do is to read the contents of the fourth
> delimited field and remove \n from it. I don't see why it should break
> for non-English.

You have still not said HOW it breaks. What does "break" even mean?
Does your program crash? Infinite loop? Incorrect output? No
output? WHAT HAPPENS?

This is now the second time I've asked this question. I should not
have to ask it at all. I will not ask it again.

> while (<STUFF>) {
>
> next if /^(\s)*$/;

What do you think the parentheses are doing in that statement?

> @str1 = split(/,/);
> if ($str1[3] =~ /\n/) {

You have a severe logic problem. You're reading a file line-by-line,
but are searching one of the internal fields for a newline. This can't
happen, unless there actually are only four fields in the file. And if
there are, you really just need to chomp() the line beforehand.
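
To illustrate the point about line-by-line reading: a line read with the readline operator can contain a newline only at its very end, so one chomp() before split() is enough. A small sketch, using an in-memory filehandle and made-up sample data in place of the poster's file (both are assumptions for illustration):

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real file (an assumption).
my $data = "a,b,c,d\ne,f,g,h\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;
while (my $line = <$fh>) {
    chomp $line;                      # the only newline is at the end of the line
    my @fields = split /,/, $line;
    push @rows, \@fields;
}
close $fh;

# No field contains a newline once the line has been chomped.
my $with_newline = grep { /\n/ } map { @$_ } @rows;
print "fields containing a newline: $with_newline\n";   # prints 0
```
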
> $i++;
> $_=~ s/\n/ /eg;

What do you think the e is doing in that statement?

> #$_=~ s/\s+/ /g;
> }
> foreach $name (@str1) {
> chomp($name);
> }

Again. Logic problem. Only the very last field can POSSIBLY have a
newline character, so it makes no sense of any kind to chomp each one.

> print OUT "$_";

What do you think the quotes are doing in that statement? Please read:
perldoc -q quoting

> }
>
> What posting guide?

Like I said, the Posting Guidelines that are posted here twice a week.
They have the words "Posting Guidelines" in their subject. They are
not difficult to find.

> Don't most groups have about the same posting guide?

If you had read the Posting Guidelines for this group, you would have
been able to avoid SEVERAL things you've done in this posting that have
made people decide to skip over your post, and likely killfile you.
Those things include:
not using strict and warnings
using inconsistent indentation
not posting sample input
not posting desired output
not posting actual output
not quoting the material you're replying to
not posting a short-but-COMPLETE script

The posting guidelines are there to give you these tips, so that you
get the best chances of someone who knows what your problem might be
actually reading and responding to your post. Please do not reply
again until you read them.

Paul Lalli
 
aaron80v

Thanks Paul.

I will try to comply with the guidelines as much as possible. Perhaps it
would be best to explain what I am trying to accomplish.

1. Multiple Excel files with different fields, which I need to clean and
keep delimited (^) before importing into a database.
2. Any field can contain \n, and can contain it more than once.
3. The job is to remove all \n except the actual \n at the end of the
last field.
4. If other non-English characters such as Japanese, Korean or
Chinese are encountered, report the line where they occur before
replacing them with phrases such as "Japanese Characters", "Korean
Characters", "Chinese Characters" etc.

Eg input file:

AAA^ BBB^ CCC^ DDDaa

DDDbb
DDDcc

DDDdd DDDee
DDDff

DDDgg^EEE^FFF^??????^GGG^HHH




Eg output file: (one line without \n except the one after HHH)

AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff
DDDgg^EEE^FFF^Chinese Characters^GGG^HHH


Here is the code, which isn't sufficient for what I am trying to
accomplish. I will worry about the language part later. Right now, I
have a problem differentiating the last \n from any \n that occurs
before it.

use strict;
use warnings;

my $stuff = "d:\\PerlWork\\myfile.txt";
open STUFF, $stuff or die "Cannot open file $stuff for read :$!";

my $out = "d:\\PerlWork\\FileMani.txt";
open OUT, ">$out" or die "Cannot open file $out for write :$!";

while (<STUFF>) {

    # skip reading the blank lines
    next if /^(\s)*$/;

    # tokenize it with the delimited ^ included.
    my @str1 = split(/(\^)/);

    # remove any \n that may appear anywhere
    foreach my $name (@str1) {
        chomp($name);
        print OUT $name;
    }
}
close (STUFF);
close (OUT);

Aaron
 
Paul Lalli

> Thanks Paul.
>
> I will try to comply with the guidelines as much as possible.

You've already failed that, as you've already *again* refused to quote
the post you're replying to. I wish you the best of luck with your
program. Good bye.

Paul Lalli
 
Jürgen Exner

> 4. If other non-English characters such as Japanese, Korean or
> Chinese are encountered, report the line where they occur before
> replacing them with phrases such as "Japanese Characters", "Korean
> Characters", "Chinese Characters" etc.

That is impossible. Here is a simpler example that I can actually type:
it is like asking whether the character "ö" is a German or a Swedish
character. The answer is yes, to both of them.
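
What Perl can report is the Unicode script a character belongs to, which is not the same as its language: Han ideographs, for instance, are used in both Chinese and Japanese text. A hedged sketch using Perl's built-in script properties (the helper name scripts_in is hypothetical, and the input is assumed to be decoded character data):

```perl
use strict;
use warnings;

# Hypothetical helper: list the East Asian scripts found in a string.
# This identifies scripts, not languages.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'    if $text =~ /\p{Han}/;
    push @found, 'Kana'   if $text =~ /[\p{Hiragana}\p{Katakana}]/;
    push @found, 'Hangul' if $text =~ /\p{Hangul}/;
    return @found;
}

# U+4E2D U+6587 are Han ideographs.
print join(',', scripts_in("FFF^\x{4E2D}\x{6587}^GGG")), "\n";   # prints "Han"
# U+65E5 U+672C are Han, U+306E is Hiragana.
print join(',', scripts_in("\x{65E5}\x{672C}\x{306E}")), "\n";    # prints "Han,Kana"
```

A field that contains only Han characters could be Chinese or Japanese, so the most a script can honestly report is something like "Han characters on line N".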

jue
 
aaron80v

Thanks Paul again.

So Jue, thanks for pointing it out. I guess there is just no way to
figure the language out.

Aaron.
 
Joe Smith

> 1. Multiple Excel files with different fields, which I need to clean and
> keep delimited (^) before importing into a database.

If your data is delimited by '^', then you should tell perl to use '^'
as the input record separator.

> 2. Any field can contain \n, and can contain it more than once.
> 3. The job is to remove all \n except the actual \n at the end of the
> last field.

You could eliminate them all, then add back the one that should be there.

> 4. If other non-English characters such as Japanese, Korean or
> Chinese are encountered, report the line where they occur before
> replacing them with phrases such as "Japanese Characters", "Korean
> Characters", "Chinese Characters" etc.

Here's an example of how to reject (or to mark) characters that are
not alphanumeric or underscore, not blanks, and not '^'.

Cygwin% cat test.pl
#!/usr/bin/perl
use strict; use warnings;

$/ = '^'; # Use caret as record terminator on input
while (<DATA>) {
    s/\s+/ /gs; # Convert newline and other spacing to single space
    s/([^\w\s^])/sprintf "(%02x)",ord $1/eg; # Mark unexpected characters
    print;
}
print "\n";

__DATA__
AAA^ BBB^ CCC^ DDDaa

DDDbb
DDDcc

DDDdd DDDee
DDDff

DDDgg^EEE^FFF^??????^GGG^HHH
Cygwin% perl test.pl
AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff DDDgg^EEE^FFF^(3f)(3f)(3f)(3f)(3f)(3f)^GGG^HHH
Cygwin%


-Joe
 
aaron80v

Michele said:
> You didn't try hard, did you? You missed the very first step, i.e. you
> failed to properly quote the post you're replying to...
>
> Michele

Hi Michele,

Those things include:
1. not using strict and warnings
2. using inconsistent indentation
3. not posting sample input
4. not posting desired output
5. not posting actual output
6. not quoting the material you're replying to.
7. not posting a short-but-COMPLETE script

Sure.. Got your point...
 
aaron80v

Thanks Joe, your code is good, but it doesn't distinguish the \n at
the end of the record (in this case after HHH) from the others, and
therefore removes it too.

It may help if I draw it out:

Col 1 || Col 2            || Col 3 || Col 4
=====================================================
111^  || AAA BBB CCC\n    || 333^  || ZZZ\n (end of record)
      || DDD\n            ||       ||
      || EEE FFF GGG^     ||       ||

The intention is to remove only \n after CCC and DDD.
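
One possible way to tell the record-ending \n from the embedded ones is to count delimiters: keep appending physical lines until a record holds the expected number of fields. The sketch below is only an illustration under stated assumptions, namely that every record has a known, fixed number of caret-delimited columns and that the last field never contains an embedded newline. The helper name join_records and its $columns parameter are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: merge physical lines into logical records.
# Assumptions: every record has exactly $columns caret-delimited fields,
# and the last field never contains an embedded newline.
sub join_records {
    my ($columns, @lines) = @_;
    my @records;
    my $buffer = '';
    for my $line (@lines) {
        next if $buffer eq '' && $line =~ /^\s*$/;   # blank line between records
        $buffer .= $line;
        my $carets = ($buffer =~ tr/^//);            # count delimiters seen so far
        next if $carets < $columns - 1;              # record is not complete yet
        chomp $buffer;                               # drop the record-ending newline
        $buffer =~ s/\s*\n\s*/ /g;                   # embedded newlines become one space
        push @records, "$buffer\n";                  # put the final newline back
        $buffer = '';
    }
    return @records;                                 # an incomplete trailing record is dropped
}

my @lines = ("111^ AAA BBB CCC\n", "DDD\n", "EEE FFF GGG^ 333^ ZZZ\n");
print join_records(4, @lines);
# prints: 111^ AAA BBB CCC DDD EEE FFF GGG^ 333^ ZZZ
```

If the last field can itself span several lines, counting carets is not enough and some other end-of-record marker would be needed.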

Aaron
 
