Skip non-English character values

Discussion in 'Perl Misc' started by aaron80v@yahoo.com.au, Jan 11, 2007.

  1. Guest

    Hi,

    Occasionally the Excel file I am dealing with contains non-English
    characters in certain fields (delimited by comma) such as Chinese,
    Japanese and Korean. How do I check and skip those so that my Perl
    script won't break?

    eg.

    ABC, ??????, CDF

    Right after processing ABC, I would want to jump to CDF.

    Aaron
     
    , Jan 11, 2007
    #1

  2. Jürgen Exner

    wrote:
    > Occasionally the Excel file I am dealing with contains non-English
    > characters in certain fields (delimited by comma) such as Chinese,
    > Japanese and Korean. How do I check and skip those so that my Perl
    > script won't break?


    Perl is fully Unicode-capable and can handle non-English characters just
    fine. Whether your script can handle them is a different question, of course.

    I would use tr/// with the proper options (complement of English characters;
    delete) to transliterate the unwanted characters into oblivion.
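
    The tr/// suggestion above can be sketched roughly as follows. This is
    only an illustration: it treats "English" as plain ASCII (an assumption,
    not something stated in the thread), and the sample string and variable
    names are made up. It also assumes the data has already been decoded
    into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical sample line; the \x{...} escapes stand in for decoded
# CJK characters so the source file itself stays pure ASCII.
my $line = "ABC, \x{65E5}\x{672C}\x{8A9E}, CDF";

# c = complement the search list (everything that is NOT ASCII),
# d = delete the matched characters instead of translating them.
(my $ascii_only = $line) =~ tr/\x00-\x7F//cd;

print "$ascii_only\n";    # prints "ABC, , CDF"
```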

    Having said that, I think the whole idea is nuts, and I for one would be
    pretty upset if you bastardized my name.

    jue
     
    Jürgen Exner, Jan 11, 2007
    #2

  3. Paul Lalli Guest

    wrote:
    > Occasionally the Excel file I am dealing with contains non-English
    > characters in certain fields (delimited by comma) such as Chinese,
    > Japanese and Korean. How do I check and skip those so that my Perl
    > script won't break?


    If your perl script "breaks" when encountering non-English characters,
    your script is broken and should be fixed. How exactly does it
    "break"? Please post a short-but-complete script that demonstrates
    what you're doing wrong.

    How you "skip" over the non-english characters depends entirely on how
    you are processing the data. Line by line, character by character,
    field by field, other? Again, please post a short-but-complete script
    that demonstrates what you're doing.

    Have you read the Posting Guidelines that are posted here twice a week?

    Paul Lalli
     
    Paul Lalli, Jan 11, 2007
    #3
  4. Guest

    Hi,

    Thanks. All I am trying to do is to read the content of the 4th
    delimiter value and remove \n from it. I don't see why it should break
    for non-English.

    while (<STUFF>) {

    next if /^(\s)*$/;

    @str1 = split(/,/);
    if ($str1[3] =~ /\n/) {
    $i++;
    $_=~ s/\n/ /eg;
    #$_=~ s/\s+/ /g;
    }
    foreach $name (@str1) {
    chomp($name);
    }
    print OUT "$_";
    }

    What posting guide? Don't most groups have about the same posting
    guide?

    Aaron
     
    , Jan 11, 2007
    #4
  5. Paul Lalli Guest

    wrote:
    > Thanks. All I am trying to do is to read the content of the 4th
    > delimiter value and remove \n from it. I don't see why it should break
    > for non-English.


    You have still not said HOW it breaks. What does "break" even mean?
    Does your program crash? Infinite loop? Incorrect output? No
    output? WHAT HAPPENS?

    This is now the second time I've asked this question. I should not
    have to ask it at all. I will not ask it again.

    > while (<STUFF>) {
    >
    > next if /^(\s)*$/;


    What do you think the parentheses are doing in that statement?

    > @str1 = split(/,/);
    > if ($str1[3] =~ /\n/) {


    You have a severe logic problem. You're reading a file line-by-line,
    but are searching one of the internal fields for a newline. This can't
    happen, unless there actually are only four fields in the file. And if
    there are, you really just need to chomp() the line beforehand.

    > $i++;
    > $_=~ s/\n/ /eg;


    What do you think the e is doing in that statement?

    > #$_=~ s/\s+/ /g;
    > }
    > foreach $name (@str1) {
    > chomp($name);
    > }


    Again. Logic problem. Only the very last field can POSSIBLY have a
    newline character, so it makes no sense of any kind to chomp each one.

    > print OUT "$_";


    What do you think the quotes are doing in that statement? Please read:
    perldoc -q quoting

    > }
    >
    > What posting guide?


    Like I said, the Posting Guidelines that are posted here twice a week.
    They have the words "Posting Guidelines" in their subject. They are
    not difficult to find.

    > Don't most groups have about the same posting guide?


    If you had read the Posting Guidelines for this group, you would have
    been able to avoid SEVERAL things you've done in this posting that have
    made people decide to skip over your post, and likely killfile you.
    Those things include:
    not using strict and warnings
    using inconsistent indentation
    not posting sample input
    not posting desired output
    not posting actual output
    not quoting the material you're replying to
    not posting a short-but-COMPLETE script

    The posting guidelines are there to give you these tips, so that you
    get the best chances of someone who knows what your problem might be
    actually reading and responding to your post. Please do not reply
    again until you read them.

    Paul Lalli
     
    Paul Lalli, Jan 11, 2007
    #5
  6. Guest

    Thanks Paul.

    I will try to comply with the guidelines as much as possible. Perhaps it
    would be best to explain what I am trying to accomplish.

    1. Multiple Excel files with different fields which I need to clean and
    keep them delimited (^) before importing to a database.
    2. Any fields can have \n and can have it more than once.
    3. The job is to remove all \n except the actual \n at the end of the
    last field.
    4. If I encounter other non-English characters such as Japanese, Korean,
    Chinese, report the line where they occur before replacing them with
    phrases such as "Japanese Characters", "Korean Characters", "Chinese
    Characters" etc.

    Eg input file:

    AAA^ BBB^ CCC^ DDDaa

    DDDbb
    DDDcc

    DDDdd DDDee
    DDDff

    DDDgg^EEE^FFF^??????^GGG^HHH




    Eg output file: (one line without \n except the one after HHH)

    AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff
    DDDgg^EEE^FFF^Chinese Characters^GGG^HHH


    Here is the code which isn't sufficient for what I am trying to
    accomplish. I will worry about the language part later. Right now, I
    have a problem differentiating the last \n from any \n that occurs before
    it.

    use strict;
    use warnings;

    my $stuff = "d:\\PerlWork\\myfile.txt";
    open STUFF, $stuff or die "Cannot open file $stuff for read :$!";

    my $out = "d:\\PerlWork\\FileMani.txt";
    open OUT, ">$out" or die "Cannot open file $out for write :$!";

    while (<STUFF>) {

        # skip reading the blank lines
        next if /^(\s)*$/;

        # tokenize it with the delimited ^ included.
        my @str1 = split(/(\^)/);

        # remove any \n that may appear anywhere
        foreach my $name (@str1) {
            chomp($name);
            print OUT $name;
        }
    }
    close (STUFF);
    close (OUT);

    Aaron
     
    , Jan 13, 2007
    #6
  7. Paul Lalli Guest

    wrote:
    > Thanks Paul.
    >
    > I will try to comply with the guidelines as much as possible.


    You've already failed that, as you've already *again* refused to quote
    the post you're replying to. I wish you the best of luck with your
    program. Good bye.

    Paul Lalli
     
    Paul Lalli, Jan 13, 2007
    #7
  8. Jürgen Exner

    wrote:
    > 4. If I encounter other non-English characters such as Japanese, Korean,
    > Chinese, report the line where they occur before replacing them with
    > phrases such as "Japanese Characters", "Korean Characters", "Chinese
    > Characters" etc.


    That is impossible. Simpler example that I can actually type:
    It is like asking if the character "ö" is a German or a Swedish character.
    The answer is yes --- to both of them.
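
    What Perl can report is the script a character belongs to, not the
    language. A minimal sketch using Perl's Unicode script properties
    follows; the helper name scripts_in and the sample strings are made up
    for illustration. Han characters are shared by Chinese, Japanese and
    Korean text, so a Han match alone says nothing about the language.

```perl
use strict;
use warnings;

# Hypothetical helper: list which CJK scripts occur in a decoded string.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'      if $text =~ /\p{Han}/;
    push @found, 'Hiragana' if $text =~ /\p{Hiragana}/;
    push @found, 'Katakana' if $text =~ /\p{Katakana}/;
    push @found, 'Hangul'   if $text =~ /\p{Hangul}/;
    return @found;
}

# U+6F22 U+5B57 is written the same way in Chinese and Japanese text.
print join(',', scripts_in("\x{6F22}\x{5B57}")), "\n";                  # Han
print join(',', scripts_in("\x{3072}\x{3089}\x{304C}\x{306A}")), "\n";  # Hiragana
print join(',', scripts_in("\x{D55C}\x{AE00}")), "\n";                  # Hangul
```

    Hiragana, Katakana or Hangul are strong hints for Japanese or Korean,
    but a field containing only Han characters stays ambiguous.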

    jue
     
    Jürgen Exner, Jan 13, 2007
    #8
  9. Guest

    Thanks Paul again.

    So Jue, thanks for pointing it out. I guess there is just no way to
    figure the language out.

    Aaron.
     
    , Jan 13, 2007
    #9
  10. Dr.Ruud Guest

    Jürgen Exner schreef:
    > wrote:


    >> 4. If I encounter other non-English characters such as Japanese, Korean,
    >> Chinese, report the line where they occur before replacing them with
    >> phrases such as "Japanese Characters", "Korean Characters", "Chinese
    >> Characters" etc.

    >
    > That is impossible. Simpler example that I can actually type:
    > It is like asking if the character "ö" is a German or a Swedish
    > character. The answer is yes --- to both of them.


    English even: coöperation, noöne (with a diaeresis, not an umlaut)
    http://en.wikipedia.org/wiki/Diaeresis_(diacritic)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jan 13, 2007
    #10
  11. Joe Smith Guest

    wrote:

    > 1. Multiple Excel files with different fields which I need to clean and
    > keep them delimited (^) before importing to a database.


    If your data is delimited by '^', then you should tell perl to use '^'
    as the input record separator.

    > 2. Any fields can have \n and can have it more than once.
    > 3. The job is to remove all \n except the actual \n at the end of the
    > last field.


    You could eliminate them all, then add back the one that should be there.

    > 4. If I encounter other non-English characters such as Japanese, Korean,
    > Chinese, report the line where they occur before replacing them with
    > phrases such as "Japanese Characters", "Korean Characters", "Chinese
    > Characters" etc.


    Here's an example on how to reject (or to mark) characters that are
    not alphanumunderscore, not blanks, not '^'.

    Cygwin% cat test.pl
    #!/usr/bin/perl
    use strict; use warnings;

    $/ = '^'; # Use caret as record terminator on input
    while (<DATA>) {
        s/\s+/ /gs; # Convert newline and other spacing to single space
        s/([^\w\s^])/sprintf "(%02x)",ord $1/eg; # Mark unexpected characters
        print;
    }
    print "\n";

    __DATA__
    AAA^ BBB^ CCC^ DDDaa

    DDDbb
    DDDcc

    DDDdd DDDee
    DDDff

    DDDgg^EEE^FFF^??????^GGG^HHH
    Cygwin% perl test.pl
    AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff DDDgg^EEE^FFF^(3f)(3f)(3f)(3f)(3f)(3f)^GGG^HHH
    Cygwin%


    -Joe
     
    Joe Smith, Jan 14, 2007
    #11
  12. Guest

    Michele Dondi wrote:
    > On 13 Jan 2007 00:48:40 -0800, wrote:
    >
    > >I will try to comply with the guidelines as much as possible. Perhaps it would

    >
    > You didn't try hard, did you? You missed the very first step, i.e. you
    > failed to properly quote the post you're replying to...
    >
    >
    > Michele



    Hi Michele,

    Those things include:
    1. not using strict and warnings
    2. using inconsistent indentation
    3. not posting sample input
    4. not posting desired output
    5. not posting actual output
    6. not quoting the material you're replying to.
    7. not posting a short-but-COMPLETE script

    Sure.. Got your point...
     
    , Jan 16, 2007
    #12
  13. Guest

    Joe Smith wrote:
    > wrote:
    >
    > > 1. Multiple Excel files with different fields which I need to clean and
    > > keep them delimited (^) before importing to a database.

    >
    > If your data is delimited by '^', then you should tell perl to use '^'
    > as the input record separator.
    >
    > > 2. Any fields can have \n and can have it more than once.
    > > 3. The job is to remove all \n except the actual \n at the end of the
    > > last field.

    >
    > You could eliminate them all, then add back the one that should be there.
    >
    > > 4. If I encounter other non-English characters such as Japanese, Korean,
    > > Chinese, report the line where they occur before replacing them with
    > > phrases such as "Japanese Characters", "Korean Characters", "Chinese
    > > Characters" etc.

    >
    > Here's an example on how to reject (or to mark) characters that are
    > not alphanumunderscore, not blanks, not '^'.
    >
    > Cygwin% cat test.pl
    > #!/usr/bin/perl
    > use strict; use warnings;
    >
    > $/ = '^'; # Use caret as record terminator on input
    > while (<DATA>) {
    > s/\s+/ /gs; # Convert newline and other spacing to single space
    > s/([^\w\s^])/sprintf "(%02x)",ord $1/eg; # Mark unexpected characters
    > print;
    > }
    > print "\n";
    >
    > __DATA__
    > AAA^ BBB^ CCC^ DDDaa
    >
    > DDDbb
    > DDDcc
    >
    > DDDdd DDDee
    > DDDff
    >
    > DDDgg^EEE^FFF^??????^GGG^HHH
    > Cygwin% perl test.pl
    > AAA^ BBB^ CCC^ DDDaa DDDbb DDDcc DDDdd DDDee DDDff DDDgg^EEE^FFF^(3f)(3f)(3f)(3f)(3f)(3f)^GGG^HHH
    > Cygwin%
    >
    >
    > -Joe


    Thanks Joe, your code is good but it doesn't differentiate the \n at
    the end of the record (in this case at the end of HHH) and therefore
    removes it.

    It's good for me to draw it out...

    Col1 || Col 2 || Col 3 || Col 4
    =====================================================
    111^ AAA BBB CCC\n 333^ ZZZ\n (end of record)
    DDD\n
    EEE FFF GGG^

    The intention is to remove only \n after CCC and DDD.
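
    One possible way to tell the two kinds of \n apart is sketched below.
    It assumes every record has the same, known number of fields (4 in the
    drawing above), which the thread does not confirm. The helper name
    join_records is made up. Physical lines are glued together until the
    expected number of ^ delimiters has been seen; only then is the record
    considered finished.

```perl
use strict;
use warnings;

# Hypothetical helper: merge physical lines into logical records.
# Assumption: each record has exactly $fields caret-delimited fields.
sub join_records {
    my ($fields, @lines) = @_;
    my @records;
    my $buffer = '';
    for my $line (@lines) {
        (my $text = $line) =~ s/\s+\z//;    # drop the newline and trailing spaces
        next if $text eq '';                # skip blank physical lines
        $buffer .= ($buffer eq '' ? '' : ' ') . $text;
        my $carets = ($buffer =~ tr/^//);   # count delimiters seen so far
        if ($carets >= $fields - 1) {       # all fields present: record complete
            push @records, $buffer;
            $buffer = '';
        }
    }
    push @records, $buffer if $buffer ne '';    # keep any incomplete tail
    return @records;
}

my @records = join_records(4,
    "111^ AAA BBB CCC\n", "DDD\n", "EEE FFF GGG^ 333^ ZZZ\n");
print "$_\n" for @records;    # one \n per record, after ZZZ
```

    A limitation of this sketch: a \n inside the last field cannot be
    distinguished from the end of the record this way.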

    Aaron
     
    , Jan 16, 2007
    #13