Internationalisation support and dictionaries

Discussion in 'Perl Misc' started by Broke, Apr 3, 2007.

  1. Broke

    Broke Guest

    Hello,

    I am a beginner so please be indulgent.
    I wanted to build a kind of dictionary (a sorted word list) from a French text file,
    so I wrote the following script.
    Everything works, but the sorted list comes out in US-ASCII encoding.
    How can I make it work for accented letters?
    Any help will be appreciated.
    Here is my humble script.
    =======
    #!/usr/bin/perl -w
    use warnings;
    local $/;
    use locale;
    use utf8;
    $file = '/Users/Broke/Desktop/data.txt';
    open (IN, $file) or die "$file not found\n : $!\n";
    @data = ();
    %seen = ();
    while (<IN>) {
        foreach $word (m/(\b.+?\b)/gi) {
            unless ($seen{$word}) {
                $seen{$word} = 1;
                push(@data, $word);
            }
        }
    }
    close (IN) or die "Can't close $file : $!\n";
    @data = sort(@data);
    @data = map $_ . "\n", @data;
    open (OUT, ">/Users/Broke/Desktop/out.txt") or die "Can't create\n : $!\n";
    select (OUT);
    print @data;
    close (OUT);
    ========
    B.
    Broke, Apr 3, 2007
    #1

  2. Broke

    -berlin.de Guest

    Broke <> wrote in comp.lang.perl.misc:
    > Hello,
    >
    > I am a beginner so please be indulgent.
    > I wanted to make a sort of dictionary given a text french file.
    > So I wrote the following script.
    > Everything is OK but the ordered list comes in US ASCII encoding.
    > How to make it work for accented letters?


    Use the locale pragma. See perldoc perllocale for a general description
    and perldoc locale for specifics.

    Anno
    -berlin.de, Apr 3, 2007
    #2
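For readers following along, here is a minimal sketch of what using the locale pragma for sorting might look like. It assumes a locale named fr_FR.UTF-8 is installed (locale names vary between systems), and `locale_sort` is a hypothetical helper, not something from the thread. How well accented letters collate this way depends on the Perl version and on the platform's locale data, so treat it as a starting point only.

```perl
use strict;
use warnings;
use utf8;                              # the literals below contain accented letters
use POSIX qw(setlocale LC_COLLATE);

# Assumption: this locale exists on the machine; adjust the name if it does not.
my $wanted = 'fr_FR.UTF-8';
my $active = setlocale(LC_COLLATE, $wanted);
warn "Locale $wanted not available; keeping the current collation locale\n"
    unless $active;

# Hypothetical helper: sort a list with the collation rules of LC_COLLATE.
sub locale_sort {
    use locale;    # lexically scoped: sort's default comparison now honours LC_COLLATE
    return sort @_;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for locale_sort(qw(zèbre école abri));
```

If the locale cannot be set, the script only warns and falls back to whatever collation is currently active, so the order of the accented words may differ from one machine to another.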

  3. Broke

    Broke Guest

    Michele Dondi <> wrote:

    Many thanks to you, Michele, for your help.
    Thank you also for pointing out that the
    dot will also capture whitespace. That's true;
    it is better written with \w+ or
    [[:alnum:]]+
    instead of the dot.
    Thank you also for the other hints.

    Please don't forget that my problem is that
    I want to extract French words with
    diacritics, and that I get only
    words without diacritics among the other
    possible words that I would like to extract.

    As Anno points out, this is a
    locale pragma issue.

    I will reinstall the operating system.
    It seems I forgot that I chose
    US English as my default language when
    installing the operating system.
    Very fortunately, with "Apple" I am not
    forced to reformat.

    Many thanks again and have a nice day!
    --
    B.


    > On Tue, 3 Apr 2007 09:16:21 +0200, (Broke) wrote:
    >
    > >#!/usr/bin/perl -w
    > >use warnings;

    >
    > either
    >
    > -w
    >
    > or
    >
    > use warnings; # and the latter is better!
    >
    > and
    >
    > use strict; # as well!
    >
    > >local $/;
    > >use locale;
    > >use utf8;
    > >$file = '/Users/Broke/Desktop/data.txt';

    >
    > my $file = ... # the same for all other variables.
    >
    > >open (IN, $file) or die "$file not found\n : $!\n";

    >
    > open my $in, '<', $file or die "$file not found\n : $!\n";
    >
    > >@data = ();
    > >%seen = ();

    >
    > my (@data, %seen);
    >
    > (If not under strict.pm, you don't need that at all.)
    >
    > >while (<IN>) {
    > >foreach $word (m/(\b.+?\b)/gi) {

    >
    > Are you aware that this will capture whitespace too?
    >
    > >unless ($seen{$word}) {
    > >$seen{$word} = 1;
    > > push(@data, $word);

    >
    > push @data, $word unless $seen{$word}++;
    >
    > Or even
    >
    > !$seen{$_}++ and push @data, $_ for /\b.+?\b/g;
    >
    > >}
    > >}
    > >}
    > >close (IN) or die "Can't close $file : $!\n";
    > >@data = sort(@data);
    > >@data = map $_ . "\n", @data;

    >
    > You can use $, and $\;
    >
    >
    > Michele
    Broke, Apr 3, 2007
    #3
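Michele's hints (strict and warnings, `\w+` instead of the dot, the `$seen{...}++` idiom, and `$,` with `$\` for output) can be put together in a short sketch. The helper name `unique_words` and the sample sentence are made up for illustration, and the sample deliberately uses unaccented words so that it only demonstrates the idioms, not the encoding question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct words of $text in first-seen order.
sub unique_words {
    my ($text) = @_;
    my %seen;
    # \w+ matches runs of word characters only, so no whitespace is captured.
    return grep { !$seen{$_}++ } $text =~ /\w+/g;
}

# Store the result first: "sort unique_words(...)" would be parsed as
# "sort SUBNAME LIST" and treat unique_words as the comparison routine.
my @unique = unique_words("le chat et le chien et le chat");
my @words  = sort @unique;

{
    # $, is printed between list items, $\ after each print statement.
    local $, = "\n";
    local $\ = "\n";
    print @words;
}
```

Setting `$,` and `$\` with `local` inside a block keeps the change from leaking into the rest of the program, which replaces the `map $_ . "\n", @data` step of the original script.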
  4. Broke

    Broke Guest

    Many thanks to you Anno.
    You are right.
    I will investigate this problem.
    Thanks again!
    -
    B.
    <-berlin.de> wrote:

    > Broke <> wrote in comp.lang.perl.misc:
    > > Hello,
    > >
    > > I am a beginner so please be indulgent.
    > > I wanted to make a sort of dictionary given a text french file.
    > > So I wrote the following script.
    > > Everything is OK but the ordered list comes in US ASCII encoding.
    > > How to make it work for accented letters?

    >
    > Use the locale pragma. See perldoc perllocale for a general description
    > and perldoc locale for specifics.
    >
    > Anno
    Broke, Apr 3, 2007
    #4
  5. Broke

    Mumia W. Guest

    On 04/03/2007 02:16 AM, Broke wrote:
    > Hello,
    >
    > I am a beginner so please be indulgent.
    > I wanted to make a sort of dictionary given a text french file.
    > So I wrote the following script.
    > Everything is OK but the ordered list comes in US ASCII encoding.
    > How to make it work for accented letters?
    > Any help will be appreciated.
    > Here is my humble script.
    > =======
    > #!/usr/bin/perl -w
    > use warnings;
    > local $/;
    > use locale;
    > use utf8;
    > $file = '/Users/Broke/Desktop/data.txt';
    > open (IN, $file) or die "$file not found\n : $!\n";
    > [...]


    You can set an encoding for the 'open' command:

    open (IN, '<:utf8', $file) or die (...

    Read about the 'open' command and Perl:

    Start->Run->"perldoc -f open"
    Start->Run->"perldoc perl"
    Mumia W., Apr 3, 2007
    #5
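A self-contained sketch of the idea above, under a few assumptions: the input file is UTF-8, `:encoding(UTF-8)` is used as the stricter, validating spelling of the `:utf8` layer, and the core module Unicode::Collate is swapped in for the plain `sort` so that accented letters are ordered next to their base letters. The collation step is an addition that was not suggested in the thread, and `sorted_words` plus the throwaway temp file are hypothetical names used only for the demonstration.

```perl
use strict;
use warnings;
use utf8;                        # the literals below contain accented letters
use File::Temp qw(tempfile);
use Unicode::Collate;

# Hypothetical helper: read a UTF-8 text file and return its distinct words,
# ordered by the default Unicode collation rules.
sub sorted_words {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Can't open $path: $!\n";
    my (%seen, @words);
    while (my $line = <$in>) {
        # Once the input is decoded, \w also matches accented letters.
        push @words, grep { !$seen{$_}++ } $line =~ /\w+/g;
    }
    close $in or die "Can't close $path: $!\n";
    return Unicode::Collate->new->sort(@words);
}

# Demonstration on a throwaway file instead of a real path.
my ($fh, $tmp) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} "zèbre école abri\nété école\n";
close $fh or die "Can't close $tmp: $!\n";

binmode STDOUT, ':encoding(UTF-8)';   # encode the output as well
print "$_\n" for sorted_words($tmp);
```

Without the input layer the accented letters arrive as separate bytes and `\w+` can split words at them; without the collator, a plain `sort` compares code points and places "école" after "zèbre".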
  6. Broke

    Broke Guest

    Mumia W. <> wrote:

    Hello Mumia,

    That's a wonderful idea.
    I will do it.

    Thanks to you !
    B.
    > You can set an encoding for the 'open' command:
    >
    > open (IN, '<:utf8', $file) or die (...
    >
    > Read about the 'open' command and Perl:
    >
    > Start->Run->"perldoc -f open"
    > Start->Run->"perldoc perl"
    Broke, Apr 3, 2007
    #6
  7. Broke

    Mumia W. Guest

    On 04/03/2007 05:48 PM, Broke wrote:
    > Mumia W. <> wrote:
    >> You can set an encoding for the 'open' command:
    >>
    >> open (IN, '<:utf8', $file) or die (...
    >>

    >
    > That's a wonderful idea.
    > I will do it.
    >
    > Thanks to you !
    > B.


    Sure, you're very welcome.
    Mumia W., Apr 4, 2007
    #7
  8. Broke

    Broke Guest

    Mumia W. <> wrote:
    SUPER !!!!
    It is exactly what I needed !!
    My friend I am extremely glad !
    It works !!!
    All my problems are solved thanks to YOU !!!

    :^-)

    Many Many Many Many Many Many thanks to you !!
    --
    B.
    > You can set an encoding for the 'open' command:
    >
    > open (IN, '<:utf8', $file) or die (...
    >
    > Read about the 'open' command and Perl:
    >
    > Start->Run->"perldoc -f open"
    > Start->Run->"perldoc perl"
    Broke, Apr 8, 2007
    #8