best way to do this?

Discussion in 'Perl Misc' started by MJL, Jul 26, 2004.

  1. MJL

    MJL Guest

    I'm sure this is not the most efficient way to accomplish my goal of
    taking a file of text and converting it into a list of individual
    words and punctuation symbols. It works, but I am curious about how
    to do it differently. Thanks!

    #!/usr/bin/perl
    open INF, "./testfile1.txt";
    while (<INF>)
    {
    @words = split;
    push @list, @words;
    }

    foreach(@list)
    {
    /\S+\w+/;
    if ($& ne "") {push @list2, "$&\n";}
    if ($' ne "") {push @list2, "$'\n";}
    }


    open OUTF, ">./testfile2.txt";
    print OUTF @list2;
    close INF;
    close OUTF;
     
    MJL, Jul 26, 2004
    #1

  2. MJL wrote:
    > I'm sure this is not the most efficient way to accomplish my goal
    > of taking a file of text and converting it into a list of
    > individual words and punctuation symbols. It works, but I am
    > curious about how to do it differently. Thanks!
    >
    > #!/usr/bin/perl
    > open INF, "./testfile1.txt";
    > while (<INF>)
    > {
    > @words = split;
    > push @list, @words;
    > }
    >
    > foreach(@list)
    > {
    > /\S+\w+/;
    > if ($& ne "") {push @list2, "$&\n";}
    > if ($' ne "") {push @list2, "$'\n";}
    > }
    >
    >
    > open OUTF, ">./testfile2.txt";
    > print OUTF @list2;
    > close INF;
    > close OUTF;


    Well, I think this accomplishes the same thing, but without the @arrays:

    #!/usr/bin/perl
    use strict;
    use warnings;
    open INF, './testfile1.txt' or die $!;
    open OUTF, '> ./testfile2.txt' or die $!;
    while (<INF>) {
        while ( /(\S+\w+)(\S+)?/g ) {
            print OUTF "$1\n";
            print OUTF "$2\n" if $2;
        }
    }
    close INF;
    close OUTF;
    __END__

    Another thing is whether it actually does what you want...
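
    To see what the pattern actually captures, here is a small sketch that
    runs the same regex over a single string instead of a file. The sub
    name paired_tokens and the sample sentence are made up for
    illustration; they are not part of the posted code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the pattern from the loop above to one line and collect
# what would have been printed to the output file.
sub paired_tokens {
    my ($line) = @_;
    my @out;
    while ( $line =~ /(\S+\w+)(\S+)?/g ) {
        push @out, $1;
        push @out, $2 if defined $2;
    }
    return @out;
}

print "$_\n" for paired_tokens("I like (easy to use, a) tool");
```

    On this input the sketch prints like, (easy, to, use, a comma, and
    tool. The one-letter words "I" and "a" are dropped because \S+\w+
    needs at least two non-space characters, and "(easy" keeps its
    leading parenthesis because \S+ accepts punctuation too.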

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jul 26, 2004
    #2

  3. MJL

    Anno Siegel Guest

    MJL <> wrote in comp.lang.perl.misc:
    > I'm sure this is not the most efficient way to accomplish my goal of
    > taking a file of text and converting it into a list of individual
    > words and punctuation symbols. It works, but I am curious about how
    > to do it differently. Thanks!
    >
    > #!/usr/bin/perl
    > open INF, "./testfile1.txt";
    > while (<INF>)
    > {
    > @words = split;
    > push @list, @words;
    > }
    >
    > foreach(@list)
    > {
    > /\S+\w+/;
    > if ($& ne "") {push @list2, "$&\n";}
    > if ($' ne "") {push @list2, "$'\n";}
    > }
    >
    >
    > open OUTF, ">./testfile2.txt";
    > print OUTF @list2;
    > close INF;
    > close OUTF;


    You can get more out of the first split if you split not only on
    whitespace but on word boundaries too. That way, the string neatly
    separates into consecutive pieces of word characters and punctuation,
    with blanks removed.

    There is also no good reason to collect the parts first. You might
    as well separate them right in the loop. So:

    my ( @words, @punct );
    while ( <DATA> ) {
        for ( split /\s+|\b/ ) {
            if ( /\w/ ) {
                push @words, $_;
            } else {
                push @punct, $_;
            }
        }
    }

    or, in more compact form

    while ( <DATA> ) {
        push @{ /\w/ ? \ @words : \ @punct }, $_ for split /\s+|\b/;
    }
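
    A self-contained sketch of the same split, run on two fixed strings
    rather than <DATA>. The sub name tokenize and the sample lines are
    assumptions for illustration. The grep { length } filter is an
    addition: split produces an empty leading field when a line starts
    with whitespace, and without the filter that empty string would land
    in @punct.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line on whitespace runs and word boundaries.
# grep { length } drops the empty leading field that appears
# when the line begins with whitespace.
sub tokenize {
    my ($line) = @_;
    return grep { length } split /\s+|\b/, $line;
}

my ( @words, @punct );
for my $line ( "The language is practical (easy to use,\n",
               "  efficient, complete).\n" ) {
    for my $token ( tokenize($line) ) {
        if ( $token =~ /\w/ ) { push @words, $token }
        else                  { push @punct, $token }
    }
}

print "words: @words\n";
print "punct: @punct\n";
```

    Note that adjacent punctuation such as ")." stays together as one
    token, since there is neither whitespace nor a word boundary between
    the two characters.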

    Anno
     
    Anno Siegel, Jul 26, 2004
    #3
  4. MJL

    MJL Guest

    Thanks to all for great alternatives! I am having a great time
    running and dissecting all of these suggestions.

    I should clarify my goal: I want to write a program that takes a text
    file or a text string and turns it into an HTML file/string. Each
    individual word is to become a link to a definition of that word.
    Punctuation is to be excluded, of course, and each word is to be defined
    only once. I wrote a version that works as a CGI program. It still
    needs a little work. I apologize for any poor or inefficient use of
    the language. This is not a homework assignment or anything. I'm
    just playing around, trying to learn a little Perl. Thanks again!

    #!/usr/bin/perl

    # process a string and turn it into a webpage with internal links to
    # definitions...

    use CGI qw(:standard);

    $_ = param("mytext");
    @list = split;
    foreach (@list)
    {
        /\S+\w+/;
        if ($& ne "")
        {
            push @list2, "<a href=\"#defn_$&\">$&</a> \n";
            $ins = "<a name=defn_$&>definition of $&:</a>\n\n\n<p>\n\n\n</p>\n<hr>\n\n";
            $chk = 0;
            foreach (@list4)
            {
                # 'last' leaves the loop; Perl has no 'break' for loops
                if ($_ eq $ins) {$chk = 1; last;}
            }
            if ($chk == 0)
            {
                push @list4, $ins;
            }
        }
        if ($' ne "") {push @list2, "$'\n";}
    }

    print header(), start_html("definitions"), h1("Definitions");
    foreach (@list2) {print;}
    print h1("definitions");
    foreach (@list4) {print;}
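
    One possible way to get "each word defined only once" without
    scanning the definitions list every time is a %seen hash. The sketch
    below is only an illustration under stated assumptions: the sub names
    link_words and escape_html are hypothetical, the CGI parameter
    handling and page header are left out, and the tokenizing uses the
    split /\s+|\b/ idea from earlier in the thread instead of $& and $'
    (which keep their values from the last successful match when a match
    fails).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal HTML escaping for punctuation that is passed through as text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Returns two array refs: the linked text and the definition stubs.
sub link_words {
    my ($text) = @_;
    my ( @links, @definitions, %seen );
    for my $token ( grep { length } split /\s+|\b/, $text ) {
        if ( $token =~ /\w/ ) {
            push @links, qq{<a href="#defn_${token}">${token}</a>};
            # The hash lookup replaces the inner loop over the list.
            push @definitions,
                qq{<a name="defn_${token}">definition of ${token}:</a>}
                unless $seen{$token}++;
        }
        else {
            push @links, escape_html($token);
        }
    }
    return ( \@links, \@definitions );
}

my ( $links, $definitions ) = link_words("the cat, the dog & the bird");
print "$_\n" for @$links;
print "<hr>\n";
print "$_\n" for @$definitions;
```

    In this sketch "The" and "the" count as different words. Whether to
    fold case (for example with lc on the hash key) depends on how the
    definitions are meant to be looked up.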
     
    MJL, Jul 28, 2004
    #4
  5. Bernard El-Hagin <> wrote in
    message <news:Xns953484D707FF1elhber1lidotechnet@62.89.127.66>:

    > bowsayge <> wrote:
    >
    >> Abigail said to us:
    >>
    >> [ Splitting a file into words and symbols question ]


    [snip]

    >>> Or you could do:
    >>>
    >>> while (<INF>) {
    >>> s/\s+//g;

    >>
    >> The above line folds all consecutive words together.

    >
    >
    > Yes, now that Bowsayge removed the map() which prevented this.


    What map()?

    >> Change to: s/\s+/ /g;

    >
    >
    > No, don't. Just leave the correct answer Abigail gave alone.


    See below...

    >>> push @list => map {"$_\n"} split /(\w+)/;
    >>> }


    How is that correct? If I change INF to DATA to make it
    self-contained:

    <code>
    my @list;
    while (<DATA>) {
        s/\s+//g;
        push @list => map {"$_\n"} split /(\w+)/;
    }
    print @list;

    __DATA__
    The language is intended to be practical (easy to use,
    efficient, complete) rather than beautiful (tiny,
    elegant, minimal).
    </code>


    ...then the above code produces this output:


    <output>

    Thelanguageisintendedtobepractical
    (
    easytouse
    ,

    efficient
    ,
    complete
    )
    ratherthanbeautiful
    (
    tiny
    ,

    elegant
    ,
    minimal
    ).
    </output>


    That doesn't look correct, and I was careful to cut-and-paste the
    code from Abigail's post (not the followup), making only the change
    mentioned. (INF to DATA)
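
    The folding in that output comes from s/\s+//g deleting the
    whitespace before split ever sees the line. As a hedged alternative
    (this is not the code from Abigail's post, and the sub name tokens
    is made up here), a global match can collect runs of word characters
    or runs of non-space punctuation directly, so no whitespace needs to
    be stripped first:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect tokens with a global match in list context. Whitespace is
# never part of a match, so words on the same line stay separate.
sub tokens {
    my ($line) = @_;
    return $line =~ /\w+|[^\w\s]+/g;
}

my @list = map { "$_\n" } tokens("practical (easy to use,\n");
print @list;
```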
     
    David K. Wall, Jul 30, 2004
    #5
  6. Bernard El-Hagin <> wrote in
    message <news:Xns95394FABD10BDelhber1lidotechnet@62.89.127.66>:

    > "David K. Wall" <> wrote:
    >
    >> Bernard El-Hagin <> wrote
    >> in message
    >> <news:Xns953484D707FF1elhber1lidotechnet@62.89.127.66>:
    >>
    >>> bowsayge <> wrote:
    >>>
    >>>> Abigail said to us:
    >>>>
    >>>> [ Splitting a file into words and symbols question ]

    >>
    >> [snip]
    >>
    >>>>> Or you could do:
    >>>>>
    >>>>> while (<INF>) {
    >>>>> s/\s+//g;
    >>>>
    >>>> The above line folds all consecutive words together.
    >>>
    >>>
    >>> Yes, now that Bowsayge removed the map() which prevented this.

    >>
    >> What map()?

    >
    >
    > The map() which he removed from Abigail's first example (which
    > works correctly).


    Ah, OK. I thought you meant the second example instead of the
    first. Never mind. :)
     
    David K. Wall, Aug 2, 2004
    #6
