Removing lines containing same first string boundaries?

Discussion in 'Perl Misc' started by Tuxedo, Mar 17, 2014.

  1. Tuxedo

    Tuxedo Guest

    I have a plain text file with each line in the format:

    Start of line followed immediately by a string of character(s), a
    whitespace, another string, a newline.

    -------- file.txt -------



    I would like to output each line that contains a string of a first
    character sequence but not repeat any line(s) with the same string as a
    first character sequence that appear further down the file. The output of
    running a perl procedure against the above file would then be:


    In other words, no repetition should occur of any first word boundary on
    each line in case the sequence happens to reappear on other line(s) as a
    first character boundary before each line's first whitespace.

    Alternatively, if given a parameter such as '^SOMESTRING' the output
    against the file would be narrowed down to:


    The second character string boundary (XXX) after the whitespace is
    arbitrary and should not affect the result but can be included in the
    output even if it happens to match SOMESTRING. So the output becomes the
    first occurence of ^SOMESTRING plus the remaining characters on the same
    line up until newline.

    Or if for example '^SOME' is passed as a parameter, the result would be:


    In which ways can this be done efficiently in Perl?

    Many thanks for any ideas.

    Tuxedo, Mar 17, 2014
    1. Advertisements

  2. Tuxedo

    Tuxedo Guest

    Jürgen Exner wrote:

    Nothing so far, as I didn't try anything yet.
    Thanks for the above pointers. I will look into these.

    Tuxedo, Mar 17, 2014
    1. Advertisements

  3. Tuxedo

    John Black Guest

    Just use a regex to match each line to your input parameter.

    if ($line =~ /($param\w*)\s+.*/) { # If the first string matches input parameter
    $string1 = $1; # Extract string1 from the current line

    if (grep {$string1 ne $_} @matched_array { # If string1 has not been seen before
    print $line; # Print the line
    push (@matched_array, $string1); # Put the string in the matched array

    I have not tested this for errors but that is what I would start with. Not shown is the loop
    to grab each line of the file into $line.

    John Black
    John Black, Mar 17, 2014
  4. [...]

    my (%seen, $filter, $tag);

    $filter = $ARGV[0] // '.';

    while (<STDIN>) {
    ($tag) = /^(\S+)/ or next;

    next if $seen{$tag} || !/$filter/o;

    $seen{$tag} = 1;
    Rainer Weikusat, Mar 17, 2014
  5. This is atrociously inefficient/ unscalable as the running time of the
    algorithm is proportional to the square of the number of lines which are
    Rainer Weikusat, Mar 17, 2014
  6. HL> Strong smell of homework assignment here ...

    Indeed. Highly detailed and precise specification combined with "I
    haven't tried anything yet...."

    Tuxedo, go and try. Come back when you have a specific, concrete
    question that isn't "Solve my problem for me."

    Charlton Wilbur, Mar 17, 2014
  7. It expects data to process on stdin and a keyword, if any, as first
    argument, eg, assuming that the script text is in a file named and
    the test text you posted in a file named text,

    [rw@sable]/tmp#perl SOME <text

    NB: while (<>) doesn't localize $_ implicitly. If this is supposed to be
    part of a larger program, a

    local $_;

    might need to be added in a suitable scope to avoid overwriting someone
    else's $_.
    Rainer Weikusat, Mar 17, 2014
  8. Tuxedo

    John Black Guest

    Are you saying that a lookup for a key in a hash array does not have to search the hash for a
    key match in a similar way to how grep is searching an array?

    John Black
    John Black, Mar 17, 2014
  9. Yes. The idea of a hash is that it can transform (aka hash) the key
    (an O(1) operation) into something like an array index, which allows
    an index-based lookup (another O(1) operation).

    The downside of using a hash is memory overhead in setting up the
    hash table.
    Randy Westlund, Mar 17, 2014
  10. Of course not. A 'hash' is not an array. It's an associative array
    implemented as hash table with separate chaining. This means it is an
    array of pointers to linked lists containing the actual entries. When
    searching for a key, the key is transformed to a number by using it as
    input for the so-called 'hash function' and 'truncating' the resulting
    number down to the current size of the list pointer array via
    mod. Assuming that H(key) is the hash function, an array index is
    calculated as

    ndx = H(key) % @hash_array

    Nowadays, using hash arrays whose sizes are powers of two is common
    because the modulo-operation can then be performed with a binary
    and. The list $hash_array[ndx] points to is then searched for the actual
    key. In ideal conditions (when the table is large enough), all list
    sizes will be 1 which means the lookup requires only a single string
    comparison. Even if the list has a number of entries (because of
    so-called 'hash collisions'), it will still be much shorter than a list
    containing all key/value-pairs in the hash if a 'reasonable' hash
    function is used (it is supposed to produce a 'uniform' distribution of
    Rainer Weikusat, Mar 17, 2014
  11. Tuxedo

    Tuxedo Guest

    Rainer Weikusat wrote:

    Many thanks for the detailed how-to! I will test.

    Tuxedo, Mar 17, 2014
  12. [...]
    There are a couple of more possible gotchas with this, in particular,

    - the /o flag means the regex will be compiled exactly once per program
    run. This might become an issue in conditions where the value of
    $filter can change, ie, if thas was put into a subroutine.

    $filter = qr/$filter/

    could be used to pre-compile the current $filter in such cases.

    - the filter match is performed on the whole input line, not only on the
    tag, meaning, it will match in the 2nd and subsequent fields if not
    anchored to the beginning of the line.

    $tag !~ /$filter/

    could be used instead.
    Rainer Weikusat, Mar 17, 2014
  13. This is a little too simplistic. The the amount of work necessarily to
    locate a random key which is stored in the hash is proportional to the
    average length of each used hash chain divided by 2 and the time to
    determine that some key is not in the hash is proportional to the
    average length of a hash chain. Assuming a 'good' hash function is being
    used (which also covers the case of 'degenerated' key sets better known
    as 'algorithmic complexity attack), the average length of a hash chain
    should be the number of keys in the hash divided by the number of entries
    in the array holding the chain pointers. That still proportional to the
    number of entries in the hash, aka O(n), except that it is possible to
    make the table sufficiently large to ensure that the average length of a
    hash chain will be small.

    As an example:

    [rw@sable]~#perl -MDevel::peek -e '%a = 'aaaa' .. 'zzzz'; Dump(\%a)'
    SV = RV(0x817b824) at 0x817b818
    REFCNT = 1
    RV = 0x8195d28
    SV = PVHV(0x8180984) at 0x8195d28
    REFCNT = 2
    ARRAY = 0xb6807008 (0:110767, 1:94524, 2:40904, 3:12427, 4:2864, 5:545, 6:99, 7:12, 8:2)
    hash quality = 98.6%
    KEYS = 228488
    FILL = 151377
    MAX = 262143
    RITER = -1
    EITER = 0x0
    Elt "utei" HASH = 0x8ccc0001
    SV = PV(0xaffa700) at 0xb70b0bf8
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0xb000720 "utej"\0
    CUR = 4
    LEN = 8
    Elt "vqmw" HASH = 0xe2100002
    SV = PV(0xb07d430) at 0xb70cf8c8
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0xb0843d0 "vqmx"\0
    CUR = 4
    LEN = 8
    Elt "yehi" HASH = 0x8d680003
    SV = PV(0xb22c828) at 0xb230360
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0xb233fc0 "yehj"\0
    CUR = 4
    LEN = 8

    There are 228,488 keys in the hash and the chain pointer table has a
    size of 262,143 which means the average length of a chain should be
    about 0.87. However, only 151,377 table entries are in use so
    'average chain length' for the 'locate key which exists' case is about
    1.5 and about 42% of the available table slots are unused (the 'ARRAY ='
    line shows the actual distribution of chain lengths).

    Assuming the key set is small, there's also the problem that calculating
    the hash value is usually more expensive than comparing two keys and
    locating an item on a linked-list is more expensive than locating one in
    an array.

    The general idea behind 'using a hash' is that determining
    exactly when this become better than doing linear searches in an array
    isn't worth the effort for a small number of entries and that hash
    lookups will very likely perform better than most linear searches as the
    key set gets larger (at the expense of wasting a possibly 'large' amount
    of memory).
    Rainer Weikusat, Mar 18, 2014
  14. This is similar to the rotating hash, but it actually mixes the
    internal state. It takes 9n+9 instructions and produces a full
    4-byte result. Preliminary analysis suggests there are no

    This hash was not in the original Dr. Dobb's article. I
    implemented it to fill a set of requirements posed by Colin
    Plumb. Colin ended up using an even simpler (and weaker) hash
    that was sufficient for his purpose.

    Rainer Weikusat, Mar 18, 2014
  15. Tuxedo

    Kaz Kylheku Guest

    This can be implemented using a very simple, clear on-liner in awk, right from
    your shell prompt.

    The lines marked <- are my tty input; the others are awk output:

    $ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'
    A B <-
    A B
    A C <-
    A D <-
    D 1 <-
    D 1
    D 2 <-
    D 3 <-
    A Z <-
    Kaz Kylheku, Mar 18, 2014
  16. perl -ane 'print, $seen{$F[0]} = 1 unless $seen{$F[0]}'
    Rainer Weikusat, Mar 18, 2014
  17. Tuxedo

    Tim McDaniel Guest

    perl -ane 'print if !$seen{$F[0]}++'

    perl -ane '$seen{$F[0]}++ || print'
    Tim McDaniel, Mar 18, 2014
  18. Well, there are quite a few more more-or-less bizarre variants, eg,

    perl -ape '$_=$seen{$F[0]}++?"":$_'


    perl -ape '$seen{$F[0]}++&&undef$_'
    Rainer Weikusat, Mar 18, 2014
  19. Tuxedo

    Kaz Kylheku Guest

    gawk '{if(!seen[$1]++)print}'

    (GNU Awk has bignums, which takes care of the overflow.)
    Kaz Kylheku, Mar 18, 2014
  20. Coming to think of that, no self-respecting quibbler would ever use a
    hash named %seen. And this auto-split thing is also much too
    straight-forward. So what about

    perl -pe '$£{(/(\S+)/)[0]}++&&undef$_'

    Rainer Weikusat, Mar 18, 2014
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.