How to speed up this slow part of my program

Discussion in 'Perl Misc' started by Justin C, Mar 28, 2012.

  1. Justin C

    Justin C Guest

    We have a database of thousands of clothing items. Some of the items are
    almost identical apart from their size. Consequently we use the same
    image in our web-shop to advertise items of the same style, design and
    colour.

    In a program I use to fetch new images from the art guy's computer I
    end up grepping the entire list of items $#(list-of-items) times;
    there must be a better way. The file names are exactly the same as
    the style codes, apart from the size suffix being dropped. I'm using
    File::Find.

    Here's some code:

    find(\&set_flag, (keys %{ $stock_groups->{text2code} }));

    sub set_flag {
        return unless (-f $_);

        # turn the file name back into the (partial) stock code it came from
        (my $item_code_part = $_) =~ s/\.jpg//;
        $item_code_part = uc($item_code_part);
        $item_code_part =~ s|_|/|g;

        # flag every stock item whose key matches this partial code
        my @matches = grep(/$item_code_part/, keys %{ $stock_items });
        foreach my $i (@matches) {
            $stock_items->{$i}{got_image} = 1;
        }
    }

    The 'find' iterates through two top-level directories, 36 next-level
    directories in each of those, and about 20k files across all 72
    level-2 directories. It then passes the filename to the sub, which
    compares the filename (only a partial stock code, so it may have
    several matches) with the hash of stock_items, pulling out matches.
    Those matches are then iterated over, and the items with an available
    image get a hash element added and set to 1.

    I can probably do the grep and the iteration over the matches with
    map {...}, grep {...}, keys %{ $stock_items }; but I don't think
    that'll save much time, and I'm not certain how to do it. I can see
    how to create a new hash, but I'm not sure whether changing the hash
    while grep iterates over its keys is a good idea.
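    For reference, what I have in mind is roughly this (untested; a
    postfix for instead of a void-context map, and still a linear scan
    over every key for each image):

        # flag every stock item whose key matches this image's partial code
        $stock_items->{$_}{got_image} = 1
            for grep { /$item_code_part/ } keys %{ $stock_items };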

    The bottle-neck, as I see it, is running grep 20k times, once for each
    image found. Can anyone suggest a better way?

    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Mar 28, 2012
    #1

  2. Justin C <> writes:

    [...]

    > find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
    >
    > sub set_flag {
    > return unless (-f $_ );
    >
    > (my $item_code_part = $_) =~ s/\.jpg//;
    > $item_code_part = uc($item_code_part);
    > $item_code_part =~ s|_|/|g;
    >
    > my @matches = grep(/$item_code_part/, keys %{ $stock_items });
    > foreach my $i (@matches) {
    > $stock_items->{$i}{got_image} = 1;
    > }
    > }
    >
    > The 'find' iterates through two top level directories, 36 next level
    > directories in each of the top level, and about 20k files accross the
    > entire 72 level 2 directories. It then passes the filename to the sub
    > which compares the filename (which is only a partial stock code because
    > it may have several matches) with the hash of stock_items, pulling out
    > matches. Those matches are then iterated over and the items with an
    > available image get a hash element added and set to 1.


    [...]

    > The bottle-neck, as I see it, is running grep 20k times, once for each
    > image found. Can anyone suggest a better way?


    "Doing linear scans over an associative array is like trying to club
    someone to death with a loaded Uzi."

    An obvious suggestion would be to traverse the stock_items keys once
    and build a second hash which maps each item_code_part to an array
    reference containing all the corresponding stock_items keys (or even
    to references to the corresponding stock_items elements), and to use
    this second hash to locate the keys which need to be changed (or the
    hash locations which need to be changed) in constant time.
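    A minimal sketch of that idea, assuming purely for illustration that
    the item_code_part is the stock_items key with a trailing '/<size>'
    component chopped off (the real rule is something only the OP knows):

        # one pass to build the index: item_code_part => [ matching stock_items keys ]
        my %by_code;
        for my $key (keys %{ $stock_items }) {
            (my $code = $key) =~ s{/[^/]*\z}{};   # guessed rule for dropping the size part
            push @{ $by_code{$code} }, $key;
        }

        # the find() callback then does one constant-time lookup instead of a grep
        sub set_flag {
            return unless -f $_;
            (my $code = $_) =~ s/\.jpg\z//i;
            $code = uc $code;
            $code =~ s{_}{/}g;
            $stock_items->{$_}{got_image} = 1 for @{ $by_code{$code} || [] };
        }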

    NB: Someone should probably repost that, since 'Justin who keeps coming
    out of my killfile' has chosen to 'block' himself from seeing my
    answers to his questions for 'clearly indecent behaviour' ...
     
    Rainer Weikusat, Mar 28, 2012
    #2

  3. Justin C

    Tim McDaniel Guest

    In article <>,
    Ben Morrow <> wrote:
    >I would probably turn this into a big pattern match. Something like
    >this:
    >
    > use File::Find::Rule;
    >
    > my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
    > map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
    > File::Find::Rule->file->in(keys %{...});
    >
    > while (my ($item, $entry) = each %$stock_items) {
    > $item =~ $imgs and $entry->{got_image} = 1;
    > }
    >
    >If you're using 5.14 you can get rid of the ugly map block using s///r
    >and tr///r:
    >
    > map tr!_!/!r, map s/\.jpg//r,
    >
    >This assumes the entries in %$stock_items are already hashrefs; if they
    >aren't a 'for (keys %$stock_items)' loop will be easier.


    What's s///r and tr///r? I looked in "man perlop" and "man perlre"
    for a system with 5.14.2, but I didn't see them.

    --
    Tim McDaniel,
     
    Tim McDaniel, Mar 28, 2012
    #3
  4. Ben Morrow <> writes:
    > Quoth Justin C <>:


    [...]

    >> find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
    >>
    >> sub set_flag {
    >> return unless (-f $_ );
    >>
    >> (my $item_code_part = $_) =~ s/\.jpg//;
    >> $item_code_part = uc($item_code_part);
    >> $item_code_part =~ s|_|/|g;
    >>
    >> my @matches = grep(/$item_code_part/, keys %{ $stock_items });

    >
    > Careful: you want \Q there, even if you think you're sure the filenames
    > are all safe.
    >
    >> foreach my $i (@matches) {
    >> $stock_items->{$i}{got_image} = 1;
    >> }
    >> }

    >
    > I would probably turn this into a big pattern match. Something like
    > this:
    >
    > use File::Find::Rule;
    >
    > my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
    > map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
    > File::Find::Rule->file->in(keys %{...});
    >
    > while (my ($item, $entry) = each %$stock_items) {
    > $item =~ $imgs and $entry->{got_image} = 1;
    > }


    Something I should already have written last time: a hash lookup is
    (supposed to be) an O(1) operation. Matching against a set of
    alternative patterns is O(n). In this particular case, this means that
    the algorithm you suggest still scales as badly as the originally
    used one in this respect (run time proportional to the number of
    patterns times the number of stock items), except that it is possibly
    somewhat faster (although I wouldn't want to bet on that blindly).

    The sensible way to do a lot of dictionary lookups is using a
    dictionary, especially considering that Perl has native support for
    that.
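    To put rough numbers on that on one's own machine, a throwaway
    Benchmark sketch along these lines (with invented codes; the exact
    figures will obviously vary) compares one alternation match against
    one hash lookup:

        use Benchmark qw(cmpthese);

        my @codes   = map { sprintf 'CODE/%05d', $_ } 1 .. 20_000;
        my %index   = map { $_ => 1 } @codes;
        my $pattern = join '|', map quotemeta, @codes;
        my $alt     = qr/^(?:$pattern)$/;
        my $probe   = $codes[-1];

        cmpthese(-1, {
            regex_alternation => sub { my $hit = $probe =~ $alt },
            hash_lookup       => sub { my $hit = exists $index{$probe} },
        });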
     
    Rainer Weikusat, Mar 28, 2012
    #4
  5. Rainer Weikusat <> writes:

    > Ben Morrow <> writes:
    >> Quoth Justin C <>:

    >
    > [...]
    >
    >>> find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
    >>>
    >>> sub set_flag {
    >>> return unless (-f $_ );
    >>>
    >>> (my $item_code_part = $_) =~ s/\.jpg//;
    >>> $item_code_part = uc($item_code_part);
    >>> $item_code_part =~ s|_|/|g;
    >>>
    >>> my @matches = grep(/$item_code_part/, keys %{ $stock_items });

    >>
    >> Careful: you want \Q there, even if you think you're sure the filenames
    >> are all safe.
    >>
    >>> foreach my $i (@matches) {
    >>> $stock_items->{$i}{got_image} = 1;
    >>> }
    >>> }

    >>
    >> I would probably turn this into a big pattern match. Something like
    >> this:
    >>
    >> use File::Find::Rule;
    >>
    >> my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
    >> map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
    >> File::Find::Rule->file->in(keys %{...});
    >>
    >> while (my ($item, $entry) = each %$stock_items) {
    >> $item =~ $imgs and $entry->{got_image} = 1;
    >> }


    [...]

    > Matching against a set of alternative patterns is O(n). In this
    > particular case, this means that the algorithm you suggest still
    > scales as badly as the originally used one


    OTOH, I have now learnt why 'autothreading' is supposed to be
    useful. I can imagine the following set of 'optimizations' which led
    to it:

    1. Do it as in the original example. The algorithm is
    quadratic and doesn't scale, performance sucks.

    2. Assemble a giant regexp: The algorithm is still quadratic
    and doesn't scale, performance sucks.

    3. Turn the giant regexp into a suitable kind of search tree:
    Unfortunately, the algorithm is still quadratic and doesn't
    scale, performance sucks.

    4. Go back to 2, use as many cores as available in order to
    try alternatives in parallel: Given that the algorithm
    remains quadratic and doesn't scale, performance still sucks,
    but now it at least flattens the complete system.

    [5. Give up in despair and use a Hadoop cluster. Add more
    nodes whenever the fact that quadratic algorithms don't scale
    and performance sucks because of this becomes too obvious.]

    Steps 1 - 5 also involve a successive increase in code complexity,
    starting from 'more complicated than necessary' and ending with
    'absolutely insanely more complicated than necessary' ...
     
    Rainer Weikusat, Mar 28, 2012
    #5
  6. Christian Winter <> writes:

    [...]


    > I've had similar problems in my day-to-day work (iterating over
    > thousands of files and building relationships with database contents)
    > and found the only truly efficient approach is to avoid repetitive
    > lookups.


    The only 'truly efficient' way to solve such a problem is 'use a
    suitable lookup algorithm'. Otherwise, you are always betting on
    "runtime proportional to n*n won't matter because n will always be
    small".
     
    Rainer Weikusat, Mar 29, 2012
    #6
  7. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Justin C <> writes:
    >>
    >> > my @matches = grep(/$item_code_part/, keys %{ $stock_items });

    >>
    >> An obvious suggestion would be to traverse the stock_items keys once
    >> an build a second hash which maps each item_code_part to an array
    >> reference containing all the corresponding stock_items keys (or even
    >> to a reference to the corresponding stock_items element) and use this
    >> second hash to locate the keys which need to be changed (or the hash
    >> locations which need to be changed) in constant time.

    >
    > Given the problem as stated, I don't believe that's practical.
    > $item_code_part can match the stock_item anywhere along its length, so
    > you would need to enter each stock_item under all of its substrings.
    > That obviously turns into a combinatorial nightmare very fast.
    >
    > If, of course, there is some further constraint we don't know about,
    > like 'the item code part is the section of the stock item up to the
    > third underscore', then the approach you suggest is the right one.


    Just because the posted code behaves in this way doesn't mean that this
    behaviour is strictly necessary to solve the problem; if it were
    otherwise, coding errors wouldn't exist. And in this particular case,
    I would risk a bet on the assumption that the regular expression match
    could be replaced with a comparison of the item_code_part and a
    substring of each stock_items key whose start and length are the same
    for each key and known in advance; IOW, that, given a stock_items key,
    its item_code_part can be calculated without knowing the set of all
    item_code_parts presently corresponding to image files.
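    If that bet pays off, building the lookup table is a single pass; a
    sketch, with $CODE_LEN standing in for the fixed length (the value is
    invented) and the values being references to the stock_items elements
    themselves:

        my $CODE_LEN = 10;    # invented; whatever the real, known-in-advance length is
        my %items_by_code;
        push @{ $items_by_code{ substr($_, 0, $CODE_LEN) } }, $stock_items->{$_}
            for keys %{ $stock_items };

        # given an image's code part, flagging the items is then just
        # $_->{got_image} = 1 for @{ $items_by_code{$code} || [] };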

    Independently of this, another sensible idea would be to turn the
    'stock item key' into an attribute of a data structure, say, an array
    reference, put these data structures into an array, traverse the array
    to locate the first matching item, process that once it is found and
    remove it from the array afterwards (or use a programming language
    designed for people who don't want to be bothered with outlandish
    distinctions like that, eg, PHP).
     
    Rainer Weikusat, Mar 29, 2012
    #7
  8. Justin C

    J. Gleixner Guest

    On 03/28/12 10:24, Justin C wrote:
    > We have a database of thousands of clothing items. Some of the items are
    > almost identical apart from their size. Consequently we use the same
    > image in our web-shop to advertise items of the same style, design and
    > colour.
    >[...]
    > The bottle-neck, as I see it, is running grep 20k times, once for each
    > image found. Can anyone suggest a better way?


    Since you already have data in a DB, I'd suggest looking at
    associating these files, to the items, in the database.

    Maybe store the path to the file, or possibly the image as a BLOB, a
    many to one relationship.
     
    J. Gleixner, Mar 29, 2012
    #8
  9. Rainer Weikusat <> writes:

    [...]

    > 3. Turn the giant regexp into a suitable kind of search tree:
    > Unfortunately, the algorithm is still quadratic and doesn't
    > scale, performance sucks.


    This was an error on my part: the total running time should be
    proportional to the number of stock_items times log(number of images),
    and while that still grows faster than linearly, it is not 'quadratic'
    as the term is commonly used and understood. But for 20k images, the
    running time will still be proportional to the number of stock items
    times roughly 14, much larger than what can be achieved by 'throwing
    memory at the problem' aka 'using a hash table'.
     
    Rainer Weikusat, Mar 29, 2012
    #9
  10. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >>
    >> The algorithm is quadratic and doesn't scale, performance sucks.

    >
    > The first part of that sentence does not imply the second: 'N is usually
    > small' and all that.


    Assuming that the original problem was stated correctly, 'N' is not
    small here. Further, when 'N' is 'small',

    > When you're comparing heavily optimized C like the
    > regex engine with pretty much *any* logic in Perl,
    > 'small' can end up being quite large.


    the overhead of 'pretty much any logic in Perl' shouldn't
    matter. Also, last time I looked, hashes were a datatype built into
    Perl.
     
    Rainer Weikusat, Mar 29, 2012
    #10
  11. Justin C

    Justin C Guest

    On 2012-03-28, Ben Morrow <> wrote:
    >
    > Quoth Justin C <>:
    >>
    >> We have a database of thousands of clothing items. Some of the items are
    >> almost identical apart from their size. Consequently we use the same
    >> image in our web-shop to advertise items of the same style, design and
    >> colour.
    >>
    >> In a program I have to get new images from the art guy's computer I end
    >> up grepping the entire list of items $#(list-of-items) times, there must
    >> be a better way. The file names are exactly the same as the style codes
    >> apart from the size suffix being dropped. I'm using File::Find.
    >>
    >> Here's some code:
    >>
    >> find(\&set_flag, (keys %{ $stock_groups->{text2code} }));
    >>
    >> sub set_flag {
    >> return unless (-f $_ );
    >>
    >> (my $item_code_part = $_) =~ s/\.jpg//;
    >> $item_code_part = uc($item_code_part);
    >> $item_code_part =~ s|_|/|g;
    >>
    >> my @matches = grep(/$item_code_part/, keys %{ $stock_items });

    >
    > Careful: you want \Q there, even if you think you're sure the filenames
    > are all safe.


    Done that, thank you.


    >> foreach my $i (@matches) {
    >> $stock_items->{$i}{got_image} = 1;
    >> }
    >> }

    >
    > I would probably turn this into a big pattern match. Something like
    > this:
    >
    > use File::Find::Rule;
    >
    > my ($imgs) = map qr/$_/, join "|", map "\Q\U$_",
    > map { (my ($x) = /(.*)\.jpg/) =~ s!_!/!g; $x }
    > File::Find::Rule->file->in(keys %{...});
    >
    > while (my ($item, $entry) = each %$stock_items) {
    > $item =~ $imgs and $entry->{got_image} = 1;
    > }


    That takes a bit of understanding, I'll have a read of the docs.


    > If you're using 5.14 you can get rid of the ugly map block using s///r
    > and tr///r:
    >
    > map tr!_!/!r, map s/\.jpg//r,


    Still on 5.10, we follow Debian 'stable' and so won't be upgrading for a
    while.

    Thanks for the suggestions.

    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Mar 30, 2012
    #11
  12. Justin C

    Justin C Guest

    On 2012-03-29, J. Gleixner <> wrote:
    > On 03/28/12 10:24, Justin C wrote:
    >> We have a database of thousands of clothing items. Some of the items are
    >> almost identical apart from their size. Consequently we use the same
    >> image in our web-shop to advertise items of the same style, design and
    >> colour.
    >>[...]
    >> The bottle-neck, as I see it, is running grep 20k times, once for each
    >> image found. Can anyone suggest a better way?

    >
    > Since you already have data in a DB, I'd suggest looking at
    > associating these files, to the items, in the database.
    >
    > Maybe store the path to the file, or possibly the image as a BLOB, a
    > many to one relationship.


    Unfortunately it's not a DB I'm authorised to write to. I can run
    queries that extract data, but any changes to the data must be done
    through the 3rd party supplied interface.


    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Mar 30, 2012
    #12
  13. Justin C

    Dr.Ruud Guest

    On 2012-03-28 17:24, Justin C wrote:

    > We have a database of thousands of clothing items. Some of the items are
    > almost identical apart from their size. Consequently we use the same
    > image in our web-shop to advertise items of the same style, design and
    > colour.


    Are these (hard- or soft-) linked to a single file? If so, then you can
    use that attribute.
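    (As a sketch of what checking that might look like: a hard link shows
    up as a link count greater than one, a symlink via -l/readlink. The
    variable names here are made up.)

        for my $file (@image_files) {
            my $nlink  = (stat $file)[3];                   # hard-link count
            my $target = -l $file ? readlink $file : undef; # symlink target, if any
            # items sharing $target (or an inode with nlink > 1) share one image
        }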

    --
    Ruud
     
    Dr.Ruud, Mar 30, 2012
    #13
  14. Justin C

    J. Gleixner Guest

    On 03/30/12 08:05, Justin C wrote:
    > On 2012-03-29, J. Gleixner<> wrote:
    >> On 03/28/12 10:24, Justin C wrote:
    >>> We have a database of thousands of clothing items. Some of the items are
    >>> almost identical apart from their size. Consequently we use the same
    >>> image in our web-shop to advertise items of the same style, design and
    >>> colour.
    >>> [...]
    >>> The bottle-neck, as I see it, is running grep 20k times, once for each
    >>> image found. Can anyone suggest a better way?

    >>
    >> Since you already have data in a DB, I'd suggest looking at
    >> associating these files, to the items, in the database.
    >>
    >> Maybe store the path to the file, or possibly the image as a BLOB, a
    >> many to one relationship.

    >
    > Unfortunately it's not a DB I'm authorised to write to. I can run
    > queries that extract data, but any changes to the data must be done
    > through the 3rd party supplied interface.


    OK. Then put it in your own database. MySQL, Postgres, SQLite, even a
    DBM file might work.
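    E.g. a throwaway SQLite file via DBI would do (all names here are
    invented):

        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=images.sqlite', '', '',
                               { RaiseError => 1 });
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS item_image (
                item_code  TEXT PRIMARY KEY,
                image_path TEXT NOT NULL
            )
        });

        # one row per item; many items may point at the same image file
        my $ins = $dbh->prepare(
            'INSERT OR REPLACE INTO item_image (item_code, image_path) VALUES (?, ?)');
        $ins->execute($item_code, $image_path);   # for each (item, image) pair found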
     
    J. Gleixner, Mar 30, 2012
    #14
  15. "J. Gleixner" <> writes:
    > On 03/30/12 08:05, Justin C wrote:
    >> On 2012-03-29, J. Gleixner<> wrote:
    >>> On 03/28/12 10:24, Justin C wrote:
    >>>> We have a database of thousands of clothing items. Some of the items are
    >>>> almost identical apart from their size. Consequently we use the same
    >>>> image in our web-shop to advertise items of the same style, design and
    >>>> colour.
    >>>> [...]
    >>>> The bottle-neck, as I see it, is running grep 20k times, once for each
    >>>> image found. Can anyone suggest a better way?
    >>>
    >>> Since you already have data in a DB, I'd suggest looking at
    >>> associating these files, to the items, in the database.
    >>>
    >>> Maybe store the path to the file, or possibly the image as a BLOB, a
    >>> many to one relationship.

    >>
    >> Unfortunately it's not a DB I'm authorised to write to. I can run
    >> queries that extract data, but any changes to the data must be done
    >> through the 3rd party supplied interface.

    >
    > OK. Then put it in your own database. MySQL, Postgres, SQLite, even a
    > DBM file might work.


    This is a 'simple' key <-> value mapping, only complicated by the fact
    that keys can have multiple values associated with them. Provided
    that the code for a particular stock item can be calculated, the
    simple first attempt at a solution would still be to use a Perl hash
    mapping codes to sets of stock item keys. This hash can be generated
    once in advance and then be used for any number of 'fast' lookups. A
    more elaborate design would be to use a flat-file hashed database ('DBM
    file') to store the mappings, update that whenever the set of stock
    items changes and use it to associate presently existing images with
    stock items without recalculating the mappings. An even better idea
    would be to use a persistent database mapping stock items to image
    files and update this database whenever the set of image files
    changes.
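    As a sketch of the middle option, with everything about the key
    handling invented for illustration (code_of() stands in for however
    the code part is really derived from a stock item key):

        use DB_File;                      # GDBM_File or similar would do as well
        use Fcntl qw(O_CREAT O_RDWR);

        tie my %code2keys, 'DB_File', 'stock_codes.db', O_CREAT | O_RDWR, 0644
            or die "cannot tie stock_codes.db: $!";

        # rebuild whenever the set of stock items changes
        %code2keys = ();
        for my $key (keys %{ $stock_items }) {
            my $code = code_of($key);
            $code2keys{$code} = defined $code2keys{$code}
                ? "$code2keys{$code}\0$key"
                : $key;
        }

        # later, for each image file found, the lookup is a single fetch:
        # my @keys = split /\0/, ($code2keys{$code} // '');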
     
    Rainer Weikusat, Mar 30, 2012
    #15
Similar Threads
  1. Ham - I need speed Mr .Net....speed (ASP .Net, Oct 28, 2004); Replies: 6, Views: 2,371; latest: Antony Baula, Oct 29, 2004
  2. efiedler; Replies: 1, Views: 2,130; latest: Tim Ward, Oct 9, 2003
  3. Replies: 2, Views: 2,334; latest: Howard, Apr 28, 2004
  4. Replies: 2, Views: 353; latest: Christopher Benson-Manica, Apr 28, 2004
  5. www; Replies: 2, Views: 592