Lew said:
I meant where specifically. You didn't show me where. You talked a lot
about how you'd go about finding where, but you didn't actually follow
your own advice.
I'm finally back! Sorry for the longer-than-expected delay.
I had thought your questions were hypothetical, of the "how would you
determine such-and-such if you needed to" variety. I didn't realize you
literally meant for me to answer
Well, let's get on with concrete answers.
The ArrayList article in the Java (1.7) API is at this URL:
http://docs.oracle.com/javase/7/docs/api/index.html. From there, scroll
down in the lower left index window to find ArrayList and click on it; the
article appears in the main pane of the browser. With regards to
performance, we get various clues in the introduction to the article (the
part about the Field Summary), including:
"The size, isEmpty, get, set, iterator, and listIterator operations run in
constant time. The add operation runs in amortized constant time, that is,
adding n elements requires O(n) time. All of the other operations run in
linear time (roughly speaking). The constant factor is low compared to that
for the LinkedList implementation.
"Each ArrayList instance has a capacity. The capacity is the size of the
array used to store the elements in the list. It is always at least as
large as the list size. As elements are added to an ArrayList, its capacity
grows automatically. The details of the growth policy are not specified
beyond the fact that adding an element has constant amortized time cost.
"An application can increase the capacity of an ArrayList instance before
adding a large number of elements using the ensureCapacity operation. This
may reduce the amount of incremental reallocation."
Doing the same thing for LinkedList we find:
"All of the operations perform as could be expected for a doubly-linked
list. Operations that index into the list will traverse the list from the
beginning or the end, whichever is closer to the specified index."
That first sentence is a bit evasive if you have no idea how a linked list
would perform but that's what it says....
No. I wasn't talking about designing interfaces. I was talking about
writing a program.
Maybe I should have said that that's where I _thought_ you were going
because we were in the part of the note where we talked about interfaces
and I thought you had more to say on that ;-)
Yes. There's a reason for that. It's because I was talking about
selecting which existing interfaces you want to use.
Fair enough.
Right. Precisely. Just so.
Give me an example of some unit of functionality you'd like to design,
in broad terms, and I'll do just that.
I think that would be an interesting exercise. Let's do one that I've
already done myself and that you (probably) haven't. I'm sure I'll learn
some things by seeing how you approach it.
I have a project (that regularly undergoes polishing) to produce resumes.
More specifically, it creates various formats of my own resume (and only my
own resume.) Each of the resumes I generate contains the exact same
information obtained from a single file but I generate it in several
formats: an HTML version, a Java applet, an ASCII text version, a PDF
version (using iText), and a Word version. I also generate supporting
files, including a CSS file for the HTML version of the resume, and a short
PDF containing contact information for references.
The program that generates the resumes and supporting files is written in
Java. The data file that is parsed to create each format of the resume is a
Java ResourceBundle of the "list" type. NOw, I'm only producing the resumes
in English so I should mention that I am open to the possibility of
creating the resume in additional languages - I have a working knowledge of
two other (human) languages - so I chose to use a Resource Bundle so that I
could easily generate resumes in additional languages. Otherwise, I'd have
probably just used a standard text file or more likely a properties file.
Given the fact that several resumes containing the same data are being
generated there would seem to be obvious opportunities for refactoring and
probably at least one interface. This code has gone through various
permutations but I've used a single interface with a single method in my
existing code. The code works but I'm not at all sure it is designed well.
I'd be very curious to get your take on it. I expect that the design has
major flaws and I'd like to clean those up once I find out what they are.
Does that sound reasonable to you? I'm basically asking you to talk me
through the way you would design it knowing what I've told you. Naturally,
you're free to ask as many questions as you like to understand what I'm
trying to do.
Why do you use more sets than lists?
Most of the things I put in collections should not have duplicates so I
enforce that via Sets. Naturally, if duplicates are appropriate, I'll use a
list.
"You're supposed to answer"? What _is_ the answer? "The API" or
"something you said" isn't an answer. That's like saying, "Where in
the city will I find the post office" and you say, "There's a map
somewhere". I am still left unable to mail a letter.
Sorry, that's actually under-answering.
Yeah, I get that now.
What clues? Where? Show me. I asked a specific question about specific
classes.
Answered above for ArrayList and LinkedList.
"normally"? "performs"? "best"?
Just a generalization I saw in the tutorial for the Java Collection
Classes. Let me get you a reference....
Assuming you're already in the Java 1.7 API, the articles on ArrayList and
LinkedList both have this line: "This call is a member of the Java
Collections Framework". If I click that, then Tutorial, then Interfaces,
then Set Interface, I come to this paragraph which was the basis for my
generalization:
"The Java platform contains three general-purpose Set implementations:
HashSet, TreeSet, and LinkedHashSet. HashSet, which stores its elements in
a hash table, is the best-performing implementation; however it makes no
guarantees concerning the order of iteration. TreeSet, which stores its
elements in a red-black tree, orders its elements based on their values; it
is substantially slower than HashSet. LinkedHashSet, which is implemented
as a hash table with a linked list running through it, orders its elements
based on the order in which they were inserted into the set (insertion-
order). LinkedHashSet spares its clients from the unspecified, generally
chaotic ordering provided by HashSet at a cost that is only slightly
higher."
You described some useful strategies, but let's see what happens when
you apply them to 'ArrayList<E>' and 'LinkedList<E>'.
Answered above (toward the top of this reply).
And since you brought them up, the three 'Set' implementations you
mention.
The questions above are to stimulate thought. The questions that
follow are for you to answer here.
Without asking anyone else, but by research (which you should cite):
As above.
What are the performance differences (if any! - question every
assumption in a question) between:
- 'ArrayList<E>' and 'LinkedList<E>'?
I've cited the API on this already. Would you like me to paraphrase to show
that I understand what they're saying?
- 'HashSet<E>', 'TreeSet<E>' and 'LinkedHashSet<E>'?
I've cited the Tutorial on this already. Would you like me to paraphrase to
show that I understand what they're saying?
What are the differences between:
- 'List<E>' and 'Set<E>'?
Functionally, the big difference is the tolerance for duplicates: Sets
don't permit duplicates but Lists do. As a guy with a relational database
background, I tend to put primary keys on every data structure I conceive
which is why I do more Sets than Lists.
For your claim that "HashSet normally performs best", define:
- "normally"
- "performs"
- "best"
That whole claim was basically a paraphrase of the passage from the
Collections tutorial cited above. Bloch (I assume he's the one that wrote
this tutorial), states clearly that HashSet is the best-performing
implementation of the three (HashSet, TreeSet and LinkedHashSet). Since
HashSet doesn't guarantee the order of iteration, it is presumably cheaper
to build and maintain. Given that people often want data to be in a
predictable order, TreeSet is provided but he warns that it is
substantially slower, presumably since it has to keep things in order as it
constructs the set and that takes more time/effort. LinkedHashSet ends up
being sort of a hybrid of the other two by the sounds of it. It is based on
a hash table like HashSet but has a linked-list running through it to
ensure that elements are kept in the desired order (insertion-order) but,
through the miracle of clever programming, apparently only costs slightly
more than HashSet, rather than being substantially slower like TreeSet.
The paraphrase was a bit sloppy. I shouldn't have said "normally" since it
implies that sometimes what I've said in the preceding paragraph isn't
true. In fact, I assume that paragraph is ALWAYS true.
Basically, if you don't care about the order of the elements of the Set,
you'll find HashSet fastest. If you care about the order, LinkedHashSet
will be faster than TreeSet.
There's nothing to suggest any other factors, such as the size of the
entry, has any effect on the performance of the Set implementation.
Since you get to define these terms, there might not be exactly a
right answer. Instead, justify your definitions to whatever extent you
feel they need it. Be careful - justifications like "the standard
definition" do require citation of which standard.
Have fun.
I await your opinion of my answers. I hope I'm closer to the spirit of what
you had in mind this time.
Naturally, the other approaches I mentioned are still available but I
expect most people would stop with the API and/or tutorial unless they had
an especially challenging issue that they were handling.