hash table usage questions


Ted Zlatanov

On Wed, 31 Dec 2008 19:42:57 GMT (e-mail address removed) wrote:


s> Not alot of people want to do your work and not get paid for it. Can
s> you do my work for me?
s> While your Enlish is correct, your grammar punctuates dumb.

I'm glad you found time to complain plaintively
while ignoring my point entirely.
We can imagine how tight your schedule must be
to produce such verbal and mental parsimony.

Ted
 

sln

On Wed, 31 Dec 2008 19:42:57 GMT (e-mail address removed) wrote:



s> Not alot of people want to do your work and not get paid for it. Can
s> you do my work for me?

s> While your Enlish is correct, your grammar punctuates dumb.

I'm glad you found time to complain plaintively
while ignoring my point entirely.
We can imagine how tight your schedule must be
to produce such verbal and mental parsimony.

Ted
For you, 1 second.

sln
 

sln

[ snip: sln has gone off his meds again ]

Your asinine answers to my questions and to other people's posts to
this newsgroup (I checked) spoil the great work done by others in
teaching Perl to beginners.


Simply ignore the jackoffs and pay attention to the other
jackoffs.


I am going to recommend to this group's moderator that they cancel
your membership.
I joined Tad McClellan's group, or will shortly.
Neither of those is possible, as there are no moderators for
this newsgroup, and there is no concept of "membership".

That is how most Usenet newsgroups operate.

What's his raw msg composition?
I think he knows that.

sln
 

sln

Whom are you quoting here? It has been an established custom for over two
decades to name the original author, because on Usenet you cannot assume
that the article you are replying to is visible to everyone else.


That may be difficult because CLP is not moderated.


That may be difficult, too, because there is no membership in Usenet.


Can't comment on that because I can't tell whom you are talking about.
But yes, there are a few nutcases trolling in this NG, just like in
pretty much any other NG. Luckily they are easy to identify and just as
easy to filter.

jue

It may be difficult to understand what it is the OP is trying to
get at here, isn't it?

I didn't know this was a thread on Usenet usage. Now it is, I guess.

sln
 

sln

fc> I am storing a large number of file paths into a hash table's keys
fc> (to avoid duplicate paths) with well-known extensions like .cc,
fc> .cpp, .h,.hpp. If any of the paths is a symbolic link then the link
fc> is stored in the value field.

By "large" do you mean thousands (A) or millions (B)?

fc> My questions are:

fc> 1) Is a custom data structure better than using a hash to store the
fc> file paths?

A: no

B: yes, consider nested hash tables with one level per directory. You
can also use SQLite to manage the data in a single DB file.

fc> 2) I want to remove some of the files from the hash table that don't
fc> match a regular expression (say I am only interested in *.cc files)
fc> a) Is there a smart way to apply this regular expression on the hash
fc> table? My current solution iterates over each item in the hash table
fc> and then stores the keys that don't match the regex in a separate
fc> list. I then iterate over that list and remove each key from the
fc> hash table.

A: use the solutions others have posted

B: you'll need a function to walk the nested hash tables and call a
check function for each entry. Accumulate the matching keys into a temporary
list and then delete them (if you worry that the temporary list will grow too
large, delete the entries in place). With SQLite this is a trivial SQL
statement.

Ted
Can you post a fully working example so that we can test all of your
"verbage"?

sln
 

sln

Unfortunately, this is the norm with that poster. Sometimes I get
confused, because (rarely) he actually does try and offer help (when
he's not trying to pretend he's the smartest, most important person
here... or trying to push his ridiculous parsing engine). Sometimes...
I have hope (but then I see his posts like the one you've quoted).
Simple, I don't pretend to be anything, and magically, I'm not.
Should I pretend to be you, magically I'm nothing.

sln
 

Tim Greer

Simple, I don't pretend to be anything, and magically, I'm not.
Should I pretend to be you, magically I'm nothing.

Well, that's a reply that doesn't make sense. Also, please consider a line break
before you start your reply posts, so everything doesn't appear jumbled
together.
 

sln

Well, that's a reply that doesn't make sense. Also, please consider a line break
before you start your reply posts, so everything doesn't appear jumbled
together.

Happy New Year Tim !!!

sln
 

freesoft12

The other is, you're just plain nuts.
It's not good enough to spew buzzwords and hypotheticals and expect
thought to be applied, on your behalf, without respect #1 and humility #2.

sln

asinine answers again and again...

John
 

Tad J McClellan

[ Please provide a proper attribution line when you quote someone ]


asinine answers again and again...



+-------------------+
| PLEASE DO NOT     |
| FEED THE TROLLS   |
|                   |
|    Thank you,     |
|    Management     |
+-------------------+
 

Ted Zlatanov

On Wed, 31 Dec 2008 21:21:10 GMT (e-mail address removed) wrote:

fc> I am storing a large number of file paths into a hash table's keys
fc> (to avoid duplicate paths) with well-known extensions like .cc,
fc> .cpp, .h,.hpp. If any of the paths is a symbolic link then the link
fc> is stored in the value field.
fc> 1) Is a custom data structure better than using a hash to store the
fc> file paths?
fc> 2) I want to remove some of the files from the hash table that don't
fc> match a regular expression (say I am only interested in *.cc files)
fc> a) Is there a smart way to apply this regular expression on the hash
fc> table? My current solution iterates over each item in the hash table
fc> and then stores the keys that don't match the regex in a separate
fc> list. I then iterate over that list and remove each key from the
fc> hash table.
s> Can you post a fully working example so that we can test all of your
s> "verbage"?

Question 1 was a yes/no question. I also didn't see the need to post
fully working code for case B, since case A was much more likely. If
you or someone else needs help with nested hash tables to represent
directory hierarchies or with using SQLite through Perl, I'll be glad to
assist.
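
A minimal sketch of the nested-hash idea (the helper name and the
one-hash-level-per-directory layout here are illustrative choices, not
something spelled out in the thread):

  use strict;
  use warnings;
  use File::Spec;

  my %tree;   # one nested hash level per directory component

  sub add_path {
      my ($path, $link_target) = @_;
      my @parts = grep { length } File::Spec->splitdir($path);
      my $node  = \%tree;
      # descend, creating one nested hash per directory
      for my $dir (@parts[0 .. $#parts - 1]) {
          $node = $node->{$dir} ||= {};
      }
      # the leaf holds the symlink target (or undef for a plain file)
      $node->{ $parts[-1] } = $link_target;
  }

  add_path('/src/app/main.cc',  undef);
  add_path('/src/app/util.hpp', undef);
  add_path('/src/lib/foo.cc',   '/real/foo.cc');   # a symbolic link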

Question 2 required an answer only in case B. Since, as I mentioned, A
was much more likely, I didn't see the need to spend time writing code
to demonstrate B. If, however, you or someone else needs this code
specifically, I'll gladly write a quick example.
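
A quick sketch of such a walk, assuming the %tree layout from the sketch
above and deleting entries in place:

  # Visit the nested tree, delete leaves whose full path fails the
  # check, and prune directories that end up empty.
  sub prune {
      my ($node, $prefix, $keep) = @_;
      for my $name (keys %$node) {
          my $path  = "$prefix/$name";
          my $child = $node->{$name};
          if (ref $child eq 'HASH') {
              prune($child, $path, $keep);
              delete $node->{$name} unless %$child;   # drop empty dirs
          }
          else {
              delete $node->{$name} unless $keep->($path);
          }
      }
  }

  # keep only the .cc files
  prune(\%tree, '', sub { $_[0] =~ /\.cc$/ });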

Ted
 

xhoster

My Question: My results show that there is quite some time spent in
copying the keys in the hash to the Perl-Tk GUI (once for creating the
filters) and then again, to show the filtered results.

If transporting the data from Perl to Tk is the bottleneck, then changing
how you do the stuff purely internal to Perl *before* you transfer it to Tk
is not going to improve things, unless it changes the contents of the data
being transferred to Tk.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

xhoster

Hi,

I am storing a large number of file paths into a hash table's keys (to
avoid duplicate paths) with well-known extensions
like .cc, .cpp, .h,.hpp. If any of the paths is a symbolic link then
the link is stored in the value field.

My questions are:

1) Is a custom data structure better than using a hash to store the
file paths?

Probably not. You seem to be primarily concerned about size. Making a custom
data structure with tie or whatever is usually going to make things larger
rather than smaller. You *can* make such structures that are more memory
efficient, but to do so you have to program them as if you were
using a lower-level language (or actually using one with XS or Inline). At
which point, I'd re-evaluate whether the whole project wouldn't be better
done in such a language.

It is possible that you would benefit from a custom data structure, but
based on the info you gave us it doesn't seem likely.

3) Does Perl allocate new memory if I were to copy the keys (paths) in
the hash table into a list or is a reference just copied?

If you put the keys into an array, they are copied. If you put them
into another hash, they are done by reference (but the overhead of the
reference is none too small). However, I doubt you should care about this.
If memory is that tight, you'd be better off rethinking the whole design
rather than making a series of fragile hacks of only marginal value.
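
If you do want to measure it, a minimal sketch (assuming the Devel::Size
module from CPAN is installed; the sample data is invented for illustration)
compares the sizes directly:

  use strict;
  use warnings;
  use Devel::Size qw(total_size);

  # a hash with many path-like keys, for illustration only
  my %paths;
  $paths{"/src/dir$_/file$_.cc"} = undef for 1 .. 100_000;

  # copying the keys into an array makes fresh copies of each string
  my @key_list = keys %paths;

  # copying them into another hash can share Perl's internal key
  # storage, but each entry still carries per-entry hash overhead
  my %key_copy;
  @key_copy{ keys %paths } = ();

  printf "original hash: %d bytes\n", total_size(\%paths);
  printf "array of keys: %d bytes\n", total_size(\@key_list);
  printf "second hash:   %d bytes\n", total_size(\%key_copy);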

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

xhoster

Uri Guttman said:
TJM> %h = map /\.cc$/
TJM> ? ($_, $h{$_})
TJM> : (),
TJM> keys %h;

delete @h{ grep !/\.cc$/, keys %h } ;

that's a bit simpler IMO and definitely should be faster. it also uses
delete with a hash slice which is a combo that should be more well
known.

If the OP's worry about size is justified, I'd go with this:

/\.cc$/ or delete $h{$_} while defined( $_ = each %h );

As your method builds two in-memory lists, one with all keys and one with
all keys meeting the criteria.


Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Uri Guttman

x> If the OP's worry about size is justified, I'd go with this:

x> /\.cc$/ or delete $h{$_} while defined( $_ = each %h );

x> As your method builds two in-memory lists, one with all keys and one with
x> all keys meeting the criteria.

sure that could be faster but it may not be as you think. you make N
calls to delete and calls to each in a loop. perl ops are the major
bottleneck in perl speed. if the slice lists are short enough (and they
could still be pretty long), IMO my solution should be faster. but only
benchmarking can tell for sure. my method does use more ram but ram is
cheaper than speed. :)
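
For anyone who wants to run the numbers, a rough Benchmark sketch (the
sample data and sizes are invented for illustration, and the per-run hash
copy is included in both timings):

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  # sample data: mostly non-.cc keys, so there is something to delete
  my %template;
  for my $i (1 .. 50_000) {
      my $ext = $i % 4 ? '.h' : '.cc';
      $template{"/src/file$i$ext"} = undef;
  }

  cmpthese( -3, {
      slice_delete => sub {
          my %h = %template;
          delete @h{ grep !/\.cc$/, keys %h };
      },
      each_delete => sub {
          my %h = %template;
          /\.cc$/ or delete $h{$_} while defined( $_ = each %h );
      },
  } );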

uri
 

xhoster

Uri Guttman said:
x> If the OP's worry about size is justified, I'd go with this:

x> /\.cc$/ or delete $h{$_} while defined( $_ = each %h );

x> As your method builds two in-memory lists, one with all keys and one with
x> all keys meeting the criteria.

sure that could be faster but it may not be as you think.

I didn't think it would be faster (and was surprised to find upon trying
it that it was faster). I was specifically referring to size, not speed.
you make N
calls to delete and calls to each in a loop. perl ops are the major
bottleneck in perl speed. if the slice lists are short enough (and they
could still be pretty long), IMO my solution should be faster. but only
benchmarking can tell for sure. my method does use more ram but ram is
cheaper than speed. :)

Running out of RAM you already own can make the speed you already own a
heck of a lot less speedy, and eventually fall over and die.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Uri Guttman

x> Running out of RAM you already own can make the speed you already own a
x> heck of a lot less speedy, and eventually fall over and die.

as always you have to consider data size and growth in your design
decisions. slurping is great for almost all files but there is a known
set where it is bad (logs, biogenetic, etc.). same with slicing vs
looping over the keys. but then dealing with hashes with millions or
more keys is a problem unto itself. the OP was just filtering some files
from a dir listing so i would expect the max amount of ram used in a
slice to be very small. and i like slicing and speed. :)

uri
 
