hash table usage questions

freesoft12 · Dec 30, 2008

Hi,

I am storing a large number of file paths into a hash table's keys (to
avoid duplicate paths) with well-known extensions
like .cc, .cpp, .h,.hpp. If any of the paths is a symbolic link then
the link is stored in the value field.

My questions are:

1) Is a custom data structure better than using a hash to store the
file paths?

2) I want to remove some of the files from the hash table that don't
match a regular expression (say I am only interested in *.cc files)
a) Is there a smart way to apply this regular expression on the
hash table? My current solution iterates over each item in the hash
table and then stores the keys that don't match the regex in a
separate list. I then iterate over that list and remove each key from
the hash table.

3) Does Perl allocate new memory if I were to copy the keys (paths) in
the hash table into a list or is a reference just copied?

Regards
John

sln · Dec 30, 2008

> Hi,
>
> I am storing a large number of file paths into a hash table's keys (to
> avoid duplicate paths) with well-known extensions
> like .cc, .cpp, .h,.hpp. If any of the paths is a symbolic link then
> the link is stored in the value field.
>
> My questions are:
>
> 1) Is a custom data structure better than using a hash to store the
> file paths?
>
> 2) I want to remove some of the files from the hash table that don't
> match a regular expression (say I am only interested in *.cc files)
> a) Is there a smart way to apply this regular expression on the
> hash table? My current solution iterates over each item in the hash
> table and then stores the keys that don't match the regex in a
> separate list. I then iterate over that list and remove each key from
> the hash table.
>
> 3) Does Perl allocate new memory if I were to copy the keys (paths) in
> the hash table into a list or is a reference just copied?
^^^^^^^^^^^^^

> Regards
> John

Why don't you clearly state what your trying to do instead of grabbing
straws and spewing all the buzzwords in the book.

You obviously need to learn Perl from the beginner position.
You seem to wan't somebody to not only write the code for you, but
provide documentation. As it is now, you exhibit knowledge below what is
necessary to understand a solution should one be provided.

Not alot of people want to do your work and not get paid for it.
Can you do my work for me?

sln

Tim Greer · Dec 30, 2008

> Hi,
>
> I am storing a large number of file paths into a hash table's keys (to
> avoid duplicate paths) with well-known extensions
> like .cc, .cpp, .h,.hpp. If any of the paths is a symbolic link then
> the link is stored in the value field.
>
> My questions are:
>
> 1) Is a custom data structure better than using a hash to store the
> file paths?

By the sound of it, hashes might be fine. How are you originally
gathering the data to process (to obtain the paths), and exactly how
many are there? To avoid duplicates, hashes can be great for checking
that sort of thing.

> 2) I want to remove some of the files from the hash table that don't
> match a regular expression (say I am only interested in *.cc files)

You can easily do that.

> a) Is there a smart way to apply this regular expression on the
> hash table?

You'd probably want to determine if it's something you want before
adding to the hash in the first place. How are you going about
creating/populating the hash?
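
Filtering at insertion time, as suggested above, might look like the following minimal sketch. It assumes the paths arrive one per line and that only *.cc files are wanted; @log_lines, %paths and $want are illustrative names, not taken from the original poster's script.

```perl
use strict;
use warnings;

# Hypothetical input: one path per line, with repeats, as a log might contain.
my @log_lines = (
    "src/main.cc\n",
    "src/util.h\n",
    "src/main.cc\n",
    "lib/parse.cpp\n",
    "lib/lex.cc\n",
);

my $want = qr/\.cc\z/;    # assumed filter: keep only *.cc
my %paths;

for my $line (@log_lines) {
    chomp(my $path = $line);
    next unless $path =~ $want;    # reject before it ever reaches the hash
    # Value is the link target for symlinks, undef otherwise (the OP's design).
    $paths{$path} = (-l $path) ? readlink($path) : undef;
}

print "$_\n" for sort keys %paths;
```

Because rejected paths never enter the hash, no second pass is needed to remove them. This only helps when the filter is known before the data is read, which is the command-line case.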

> My current solution iterates over each item in the hash
> table and then stores the keys that don't match the regex in a
> separate list. I then iterate over that list and remove each key from
> the hash table.

Can you provide the relevant portions of your code? You probably don't
need to iterate over anything, especially if each entry is saved or
rejected when it is first seen, based on the conditions you described
in your example above.

> 3) Does Perl allocate new memory if I were to copy the keys (paths) in
> the hash table into a list or is a reference just copied?

By the sound of it, you shouldn't need to store anything you don't want
to, or copy any keys, but I may have misunderstood? If you have
duplicate keys/another hash, it will use that much more memory. Maybe
I don't understand what you're asking?
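
On question 3 itself: perlfunc documents that keys returns copies of the hash's keys, so the list is made of new scalars and changing them does not touch the hash. How much memory that costs per key is an implementation detail and may differ between perl versions. A small sketch that only demonstrates the independence of the copies (the sample data is made up):

```perl
use strict;
use warnings;

my %h = ('a.cc' => undef, 'b.h' => undef);

my @copy = sort keys %h;     # a new list of scalars, independent of the hash
$_ .= '.bak' for @copy;      # modify the copies only

my @still = sort keys %h;    # the hash's own keys are unchanged

print "@copy\n";
print "@still\n";
```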

Tad J McClellan · Dec 30, 2008

> I am storing a large number of file paths into a hash table's keys (to
> avoid duplicate paths)
>
> 2) I want to remove some of the files from the hash table that don't
> match a regular expression (say I am only interested in *.cc files)
> a) Is there a smart way to apply this regular expression on the
> hash table?

    %h = map /\.cc$/
               ? ($_, $h{$_})
               : (),
         keys %h;
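
The same map-based rebuild as a self-contained sketch, written with the block form of map. The sample paths and the pretend symlink target are invented for illustration.

```perl
use strict;
use warnings;

my %h = (
    'src/main.cc'   => undef,
    'src/util.h'    => undef,
    'lib/parse.cpp' => undef,
    'lib/link.cc'   => 'lib/real.cc',    # pretend symlink target
);

# Rebuild the hash from only the key/value pairs whose key ends in .cc.
# Non-matching keys contribute the empty list, so they simply vanish.
%h = map { /\.cc$/ ? ($_ => $h{$_}) : () } keys %h;

print join(', ', sort keys %h), "\n";
```

Note that this builds a complete new key/value list before assigning it back, so it touches every surviving entry as well as the rejected ones.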

Uri Guttman · Dec 30, 2008

TJM> %h = map /\.cc$/
TJM> ? ($_, $h{$_})
TJM> : (),
TJM> keys %h;

delete @h{ grep !/\.cc$/, keys %h } ;

that's a bit simpler IMO and definitely should be faster. it also uses
delete with a hash slice which is a combo that should be more well
known.

uri
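
The delete-with-hash-slice idiom above, as a runnable sketch on made-up data: grep selects the keys that do not match, and the slice @h{...} hands them all to a single delete.

```perl
use strict;
use warnings;

my %h = (
    'src/main.cc'   => undef,
    'src/util.h'    => undef,
    'lib/parse.cpp' => undef,
    'lib/link.cc'   => 'lib/real.cc',    # pretend symlink target
);

# delete on a hash slice removes every listed key and returns their values.
my @removed_values = delete @h{ grep { !/\.cc$/ } keys %h };

print scalar(@removed_values), " entries removed\n";
print join(' ', sort keys %h), "\n";
```

Only the rejected entries are touched; the surviving pairs stay where they are in the hash.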

Ted Zlatanov · Dec 30, 2008

fc> I am storing a large number of file paths into a hash table's keys
fc> (to avoid duplicate paths) with well-known extensions like .cc,
fc> .cpp, .h,.hpp. If any of the paths is a symbolic link then the link
fc> is stored in the value field.

By "large" do you mean thousands (A) or millions (B)?

fc> My questions are:

fc> 1) Is a custom data structure better than using a hash to store the
fc> file paths?

A: no

B: yes, consider nested hash tables with one level per directory. You
can also use SQLite to manage the data in a single DB file.

fc> 2) I want to remove some of the files from the hash table that don't
fc> match a regular expression (say I am only interested in *.cc files)
fc> a) Is there a smart way to apply this regular expression on the hash
fc> table? My current solution iterates over each item in the hash table
fc> and then stores the keys that don't match the regex in a separate
fc> list. I then iterate over that list and remove each key from the
fc> hash table.

A: use the solutions others have posted

B: you'll need a function to walk the nested hash tables and call a
check function for each entry. Accumulate the entries to remove in a
temporary list and then delete them (if you worry that the temporary
list will grow too large, delete the entries in place). With SQLite
this is a trivial SQL statement.

Ted
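
For case B, the nested layout and the walking function could be sketched roughly as below. This is one possible reading of the suggestion, not code from the thread: put_path, prune and paths are hypothetical helper names, paths are assumed to be relative and '/'-separated, directories are hash references, and files are leaves whose value is undef or a link target. Entries are deleted in place.

```perl
use strict;
use warnings;

# Insert a path: one hash level per directory, file name as the leaf key.
sub put_path {
    my ($tree, $path, $value) = @_;
    my @dirs = split m{/}, $path;
    my $file = pop @dirs;
    for my $dir (@dirs) {
        $tree->{$dir} = {} unless ref($tree->{$dir}) eq 'HASH';
        $tree = $tree->{$dir};
    }
    $tree->{$file} = $value;
}

# Walk the tree, delete leaves whose full path fails $keep, prune empty dirs.
sub prune {
    my ($tree, $keep, $prefix) = @_;
    for my $name (keys %$tree) {    # keys list is taken up front, so delete is safe
        my $full = defined $prefix ? "$prefix/$name" : $name;
        if (ref($tree->{$name}) eq 'HASH') {
            prune($tree->{$name}, $keep, $full);
            delete $tree->{$name} unless %{ $tree->{$name} };
        }
        elsif (!$keep->($full)) {
            delete $tree->{$name};
        }
    }
}

# Flatten the tree back into a sorted list of full paths.
sub paths {
    my ($tree, $prefix) = @_;
    my @out;
    for my $name (sort keys %$tree) {
        my $full = defined $prefix ? "$prefix/$name" : $name;
        push @out, ref($tree->{$name}) eq 'HASH'
            ? paths($tree->{$name}, $full)
            : $full;
    }
    return @out;
}

my %tree;
put_path(\%tree, $_, undef)
    for qw(src/main.cc src/util.h lib/a/parse.cpp lib/a/lex.cc doc/readme.h);

prune(\%tree, sub { $_[0] =~ /\.cc\z/ });

my @left = paths(\%tree);
print "$_\n" for @left;
```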

Ted Zlatanov · Dec 30, 2008

On Tue, 30 Dec 2008 01:34:43 GMT (e-mail address removed) wrote:

s> Why don't you clearly state what your trying to do instead of
s> grabbing straws and spewing all the buzzwords in the book.

There was nothing in the OP's questions that warranted your rudeness.

s> Not alot of people want to do your work and not get paid for it. Can
s> you do my work for me?

You, apparently, assume your time and intelligence are too precious to
waste helping people for free. This is not the right forum for you.

Ted

freesoft12 · Dec 30, 2008

Thanks to all your code suggestions! I will give each of them a try!

Regards
John

freesoft12 · Dec 30, 2008

> You obviously need to learn Perl from the beginner position.
> You seem to wan't somebody to not only write the code for you, but
> provide documentation. As it is now, you exhibit knowledge below what is
> necessary to understand a solution should one be provided.
>
> Not alot of people want to do your work and not get paid for it.
> Can you do my work for me?
>
> sln

Yes, I am a beginner. I am trying to learn Perl by reading and asking
for advice from intermediate & advanced users. Rather than just asking
questions, I posted my Perl script that I created after receiving a
suggestion from one of the answers to one of my prev posts.

Your asinine answers to my questions and to other people's posts to
this newsgroup (i checked) spoils the great work done by others in
teaching Perl to beginners.

I am going to recommend to this group's moderator that they cancel
your membership. Your attitude is detrimental for beginners learning
Perl from this newsgroup.

Regards
John

Jürgen Exner · Dec 30, 2008

Whom are you quoting here? It has been a proven custom for over 2
decades to name the original author because in Usenet you cannot assume
that the article you are replying to is visible to someone else.

> I am going to recommend to this group's moderator

That may be difficult because CLP is not moderated.

> that they cancel your membership.

That may be difficult, too, because there is no membership in Usenet.

> Your attitude is detrimental for beginners learning
> Perl from this newsgroup.

Can't comment on that because I can't tell whom you are talking about.
But yes, there are a few nutcases trolling in this NG, just like in
pretty much any other NG. Luckily they are easy to identify and just as
easy to filter.

jue

Tim Greer · Dec 30, 2008

Ted said:
On Tue, 30 Dec 2008 01:34:43 GMT (e-mail address removed) wrote:

s> Why don't you clearly state what your trying to do instead of
s> grabbing straws and spewing all the buzzwords in the book.

There was nothing in the OP's questions that warranted your rudeness.

s> Not alot of people want to do your work and not get paid for it. Can
s> you do my work for me?

You, apparently, assume your time and intelligence are too precious to
waste helping people for free. This is not the right forum for you.

Ted

Unfortunately, this is the norm with that poster. Sometimes I get
confused, because (rarely) he actually does try and offer help (when
he's not trying to pretend he's the smartest, most important person
here... or trying to push his ridiculous parsing engine). Sometimes...
I have hope (but then I see his posts like the one you've quoted).

freesoft12 · Dec 30, 2008

Here are the answers to the questions that Tim, Ted had asked:

My Perl project description:

1) I get thousands of files (let's call them: TFILES) from a C++
program that prints out all the files, being opened by the program
over a period of time, to a log. Hence there are several TFILES that
are opened many times.

2) My Perl script analyzes the TFILEs and collects & publishes various
statistics about each of the TFILEs.

3) the user can specify one or more filters (containing one or more
regular expressions) so that they can see the statistics about a
subset of TFILES (say, just *.cc and *.cpp files).

4) the filters can be specified in the following 3 ways:
a) on the command line
- All filters are specified on the command line. Hence, I can
populate the hash table with just the TFILEs that match the filter(s)

b) interactively, from a Perl-Tk GUI
- For this case, I need to read in all the TFILEs into the
hash table and show the TFILEs to the user in the GUI. The user then
enters the regular expressions to create a filter file. I apply the
filter file and remove the filtered-out TFILEs from the hash table and
show the reduced hash table to the user again in the GUI.
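
Case (a) above could be sketched as follows: each pattern string is compiled once with qr//, invalid patterns are reported rather than allowed to kill the script, and a path is stored only if at least one filter matches. compile_filters and passes are hypothetical helper names, and the patterns are hard-coded here (rather than read from @ARGV) so the sketch runs as-is.

```perl
use strict;
use warnings;

# Compile each pattern string into a regex object; skip patterns that fail.
sub compile_filters {
    my @compiled;
    for my $pat (@_) {
        my $re = eval { qr/$pat/ };
        if (!defined $re) {
            warn "skipping invalid pattern '$pat': $@";
            next;
        }
        push @compiled, $re;
    }
    return @compiled;
}

# A path passes if it matches at least one filter.
sub passes {
    my ($path, @filters) = @_;
    for my $re (@filters) {
        return 1 if $path =~ $re;
    }
    return 0;
}

# In a real script these strings would come from @ARGV.
my @args    = ('\.cc$', '\.cpp$');
my @filters = compile_filters(@args);

my %paths;
for my $path (qw(src/main.cc src/util.h lib/parse.cpp src/main.cc)) {
    $paths{$path} = undef if passes($path, @filters);
}

print "$_\n" for sort keys %paths;
```

When the patterns do come from the command line, they generally need to be quoted so the shell does not expand characters such as * or $ before Perl sees them.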

My Question: My results show that there is quite some time spent in
copying the keys in the hash to the Perl-Tk GUI (once for creating the
filters) and then again, to show the filtered results.

Regards
John

Tad J McClellan · Dec 30, 2008

[ snip: sln has gone off his meds again ]

> Your asinine answers to my questions and to other people's posts to
> this newsgroup (i checked) spoils the great work done by others in
> teaching Perl to beginners.

Simply ignore the jackoffs and pay attention to the others.

> I am going to recommend to this group's moderator that they cancel
> your membership.

Neither of those is possible, as there are no moderators for
this newsgroup, and there is no concept of "membership".

That is how most Usenet newsgroups operate.

Ted Zlatanov · Dec 31, 2008

fc> Here are the answers to the questions that Tim, Ted had asked:

fc> My Perl project description:

fc> 1) I get thousands of files (lets call them: TFILES) from a C++
fc> program that prints out all the files, being opened by the program
fc> over a period of time, to a log. Hence there are several TFILES that
fc> are opened many times.

You just need a hash with filenames as keys. Anything else I mentioned
is overkill (but you should be aware that some day you may need to redo
things, so you should abstract the storage functionality behind
iterate/get/put functions).
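
One way to read the iterate/get/put suggestion is a small wrapper that hides the hash behind three functions, so the backing store could later be swapped without changing callers. This is a sketch under that assumption; make_store and its interface are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical storage wrapper: callers see only put/get/iterate, not the hash.
sub make_store {
    my %data;
    my %api = (
        put     => sub { my ($path, $value) = @_; $data{$path} = $value; return },
        get     => sub { my ($path) = @_; return $data{$path} },
        iterate => sub {
            my ($callback) = @_;
            $callback->($_, $data{$_}) for sort keys %data;
            return;
        },
    );
    return \%api;
}

my $store = make_store();
$store->{put}->('src/main.cc', undef);
$store->{put}->('lib/link.cc', 'lib/real.cc');
$store->{put}->('src/main.cc', undef);    # duplicate path, stored once

my @seen;
$store->{iterate}->(sub { my ($path, $target) = @_; push @seen, $path });

print "@seen\n";
```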

fc> 2) My Perl script analyzes the TFILEs and collects & publishes various
fc> statistics about each of the TFILEs.

fc> 3) the user can specify one or more filters (containing one or more
fc> regular expressions) so that they can see the statistics about a
fc> subset of TFILES (say, just *.cc and *.cpp files).

fc> 4) the filters can be specified in the foll 3 ways:
fc> a) on the command line
fc> - All filters are specified on the command line. Hence, I can
fc> populate the hash table with just the TFILEs that match the filter(s)

fc> b) interactively, from a Perl-Tk GUI
fc> - For this case, I need to read in all the TFILEs into the
fc> hash table and show the TFILEs to the user in the GUI. The user then
fc> enters the regular expressions to create a filter file. I apply the
fc> filter file and remove the filtered-out TFILEs from the hash table and
fc> show the reduced hash table to the user again in the GUI.

fc> My Question: My results show that there is quite some time spent in
fc> copying the keys in the hash to the Perl-Tk GUI (once for creating the
fc> filters) and then again, to show the filtered results.

Your biggest delay is probably in populating a list widget with the file
names--something Perl can't really improve. As a test, append the file
names to a text area and see how much faster the operation is. I don't
see any of your Perl-Tk code so I can't tell what could be slow, but
eliminating widget updates is a good first step to find performance
bottlenecks in GUIs.

Ted
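
To separate the Perl side from the widget side, one option is to time the key copy and the filter with no GUI involved at all. The sketch below uses synthetic data and arbitrary sizes; it measures only the pure-Perl work, not anything in Perl-Tk.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Synthetic stand-in for the real path hash (20,000 entries, half of them .cc).
my %paths;
for my $i (1 .. 20_000) {
    my $ext = $i % 2 ? 'cc' : 'h';
    $paths{"dir$i/file$i.$ext"} = undef;
}

my $t0      = Time::HiRes::time();
my @all     = sort keys %paths;           # what the GUI would be handed first
my @shown   = grep { /\.cc\z/ } @all;     # what it would be handed after filtering
my $elapsed = Time::HiRes::time() - $t0;

printf "copied %d keys, kept %d, in %.4f s (no widgets involved)\n",
    scalar @all, scalar @shown, $elapsed;
```

If this part takes a small fraction of the time observed in the GUI, the delay is more likely in the widget updates than in the hash handling.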

sln · Dec 31, 2008

Here are the answers to the questions that Tim, Ted had asked:

My Perl project description:

I'm pretty sure I asked the question.

1) I get thousands of files (lets call them: TFILES) from a C++
program

There is no such thing as a C++ program. There are only programs
written in C++ (or other languages).

that prints out all the files, being opened by the program
over a period of time, to a log.

The executable in question prints out entire files to a log file?
What files are they, and where do they come from? Over and over again
huh? And they are .c or .cc or .cpp files to boot. Thousands of times
over and over and over and over again huh? What about the file names,
same thing.. thousands and thousands of times over and over again?

Hence there are several TFILES that
are opened many times.

Thousands and thousands of times, over and over again...

2) My Perl script analyzes the TFILEs and collects & publishes various
statistics about each of the TFILEs.

You havn't got a script, thats why you post here. You never have a script,
thats why you post here. Its pretty clear why you post here...
Now your publishing statistics.

3) the user can specify one or more filters (containing one or more
regular expressions) so that they can see the statistics about a
subset of TFILES (say, just *.cc and *.cpp files).

So the user can specify regular expression filters to parse said thousands
and thousands of said published statistics of said thousands and thousands
of said file's opened by a C++ program. Say like *.c or *.cc or *.cpp or *.cxx..
Not too regy expressionist.

4) the filters can be specified in the foll 3 ways:
a) on the command line
- All filters are specified on the command line. Hence, I can
populate the hash table with just the TFILEs that match the filter(s)

You really can't pass in regular expressions from the command line.
For reasons I won't go into, its extremely limited and virtually useless.
The mechanics for such uselessness far outweights the uselessness itself.

b) interactively, from a Perl-Tk GUI

Perhaps you should do a native program with real controls using C++.

- For this case, I need to read in all the TFILEs into the
hash table and show the TFILEs to the user in the GUI. The user then
enters the regular expressions to create a filter file. I apply the
filter file and remove the filtered-out TFILEs from the hash table and
show the reduced hash table to the user again in the GUI.

But, how do you get all those published statistics for thousands and
thousands of files, over and over and over.

My Question: My results show that there is quite some time spent in
copying the keys in the hash to the Perl-Tk GUI (once for creating the
filters) and then again, to show the filtered results.

Generally, there is no blocking in the control. If you have thousands
and thousands and thousands to populate a list control with, perhaps
a timer will help to populate it with, and/or use messages.

Regards
John

As a long, long, long time C programmer, I notice your focus is on C
source files. Statistically speaking, there are only a few categories
that your intentions fall into. One is Source Control type. If not
actually trying to implement a flavor of your own..., then trying to gleen
results from existing output, possibly a command line version with its own
script language, that spits out reems and reems of data and statistics.

The other is, your just plain nuts.
Its not good enough to spew buzzwords and hypotheticals and expect
thought to be applied, on your behalf, without respect #1 and humility #2.

sln

sln · Dec 31, 2008

On Tue, 30 Dec 2008 01:34:43 GMT (e-mail address removed) wrote:

s> Why don't you clearly state what your trying to do instead of
s> grabbing straws and spewing all the buzzwords in the book.

There was nothing in the OP's questions that warranted your rudeness.

s> Not alot of people want to do your work and not get paid for it. Can
s> you do my work for me?

You, apparently, assume your time and intelligence are too precious to
waste helping people for free. This is not the right forum for you.

Ted

While your Enlish is correct, your grammar punctuates dumb.

sln

Tim Greer · Dec 31, 2008

While your Enlish is correct, your grammar punctuates dumb.

What the hell is Enlish?

sln · Dec 31, 2008

What the hell is Enlish?

I meant En'lish..

sln

Tim Greer · Dec 31, 2008

I meant En'lish..

sln

Maybe you meant Engrish? Anyway, before you call someone on their
grammar, you should note the difference between your and you're in your
very next post. This advice comes at no charge to you, and you're
welcome.

sln · Dec 31, 2008

Maybe you meant Engrish?

Damn, I would never say that, now your marked as an Asian haterrater..

sln
