Elegant Solution to a Seemingly Simple Problem?

D

Derek Cannon

Hello everyone. It's me: Derek, again! Sorry for writing a novel here,
but I'd really appreciate some help.

I'm still working on the same program -- a way to show valid course
combinations for my school schedule, using an HTML file that contains
all the courses for the semester.

I have a rough draft copy of it working, but I'd like to see an example
of a more elegant coding style than my own.

Here's a (simplified) example of the data I'm working with:

<tr>
<td>Intro to Programming<td> #title
<td>MW</td> #days
<td>9:00am-10:30am</td> #time
<td>Dr. Smith</td> # professor
</tr>
<tr>
<td>Intro to Knitting<td>
<td>TR</td>
<td>9:00am-10:30am</td>
<td>Dr. Mittens</td>
</tr>

Earlier, someone on the forum showed me a very elegant way to collect
this information (I use Nokogiri). It was:

doc = Nokogiri::HTML(open(url))

raw_course_list = doc.css("tr").collect { |row|
row.css("td").collect { |column|
column.text.strip
}
}

This would give me an array of arrays in the format
[[courseA,data,data], [courseB, data, data]].

E.g., in this case it would yield:
[["Intro to Programming", "MW", "9:00am-10:30am", "Dr. Smith"], ["Intro
to Knitting", "TR", "9:00am-10:30am", "Dr. Mittens"]]

This works perfectly, except in 3 main cases.

*** Problem 1: The <tr> does not contain course information. (It's some
irrelevant part of the HTML). In this case, I did the following:
raw_course_data.reject! { |i| i.size != 4 }, would filtered out
non-courses. Note: no tables without course data had the size of one
with course data (in the non-simplified version, the size is actually
much larger).

So, already I think it's ugly coding! It firsts loads ALL <tr> contents
into arrays, then rejects them after creation.

*** Problem 2: In a few cases, some courses do not have specified days
and times yet. In those cases, the course days reads "TBA" (to be
announced), and there is no column for time. Thus, the array of such
courses is 1 less than the normal expected case.

E.g.:

<tr>
<td>Algebra<td>
<td>TBA</td> # notice there is now 1 <td> for day/time now
<td>Dr. Calculator</td>
</tr>

Thus, I create ANOTHER time that Ruby goes back over the elements of
raw_course_list again. This time, the code is put right before problem
1's fix:

raw_course_list.each { |i|
if i.size == 3
i.insert(2, "")
end
}

So again, if an array has a size of 3, I figure it's a valid course,
just with no time assigned, so I create a blank element between the day
and professor, just to satisfy the Course class, which these array
elements of the outer array will ultimately become. E.g. of call:
Course.new(title, day, time, professor)

*** Problem 3: Some rows of the HTML are actually a continued
description of the course in the row above. For example, a course that
has a lab might look like this:

<tr>
<td>Chemistry /w Lab<td>
<td>TR</td>
<td>9:00am-10:30am</td>
<td>Dr. Chemicals</td>
</tr>
<tr>
<td><td> # Empty, since the above row provides the course name
<td>R</td> # Day of the lab
<td>11:00am-12:30pm</td> # Time of the lab
<td>Dr. Chemicals</td> # Lab professor
</tr>

The good news is it's the same length as a normal class. So for this, I
add a bit more code to problem 2's code (the each block), changing the
each method to .each_with_index:

raw_course_list.each_with_index { |i, index|
if i.size == 3
i.insert(2, "")
end

# NEW CODE FOR LABS (still working out the kinks, but hopefully I
won't need this)
# lab will always have a size of 4 and a empty first element so:
if i[0].empty?
# add all the data from the lab to the previous course:
raw_course_list[index-1].push(i.each { |element| element })

# then remove lab from raw_course_list
raw_course_list.pop(index)

# index has to go back one to avoid skipping an element (since we
popped one)
index -= 1
end
}

==================================

So there you have it. Can anyone think of a way where I can improve the
quality and elegance of this code?
 
E

Elliot Crosby-McCullough

[Note: parts of this message were removed to make it a legal post.]

Hello everyone. It's me: Derek, again! Sorry for writing a novel here,
but I'd really appreciate some help.

I'm still working on the same program -- a way to show valid course
combinations for my school schedule, using an HTML file that contains
all the courses for the semester.

I have a rough draft copy of it working, but I'd like to see an example
of a more elegant coding style than my own.

Here's a (simplified) example of the data I'm working with:

<tr>
<td>Intro to Programming<td> #title
<td>MW</td> #days
<td>9:00am-10:30am</td> #time
<td>Dr. Smith</td> # professor
</tr>
<tr>
<td>Intro to Knitting<td>
<td>TR</td>
<td>9:00am-10:30am</td>
<td>Dr. Mittens</td>
</tr>

Earlier, someone on the forum showed me a very elegant way to collect
this information (I use Nokogiri). It was:

doc = Nokogiri::HTML(open(url))

raw_course_list = doc.css("tr").collect { |row|
row.css("td").collect { |column|
column.text.strip
}
}

This would give me an array of arrays in the format
[[courseA,data,data], [courseB, data, data]].

E.g., in this case it would yield:
[["Intro to Programming", "MW", "9:00am-10:30am", "Dr. Smith"], ["Intro
to Knitting", "TR", "9:00am-10:30am", "Dr. Mittens"]]

This works perfectly, except in 3 main cases.

*** Problem 1: The <tr> does not contain course information. (It's some
irrelevant part of the HTML). In this case, I did the following:
raw_course_data.reject! { |i| i.size != 4 }, would filtered out
non-courses. Note: no tables without course data had the size of one
with course data (in the non-simplified version, the size is actually
much larger).

So, already I think it's ugly coding! It firsts loads ALL <tr> contents
into arrays, then rejects them after creation.

*** Problem 2: In a few cases, some courses do not have specified days
and times yet. In those cases, the course days reads "TBA" (to be
announced), and there is no column for time. Thus, the array of such
courses is 1 less than the normal expected case.

E.g.:

<tr>
<td>Algebra<td>
<td>TBA</td> # notice there is now 1 <td> for day/time now
<td>Dr. Calculator</td>
</tr>

Thus, I create ANOTHER time that Ruby goes back over the elements of
raw_course_list again. This time, the code is put right before problem
1's fix:

raw_course_list.each { |i|
if i.size == 3
i.insert(2, "")
end
}

So again, if an array has a size of 3, I figure it's a valid course,
just with no time assigned, so I create a blank element between the day
and professor, just to satisfy the Course class, which these array
elements of the outer array will ultimately become. E.g. of call:
Course.new(title, day, time, professor)

*** Problem 3: Some rows of the HTML are actually a continued
description of the course in the row above. For example, a course that
has a lab might look like this:

<tr>
<td>Chemistry /w Lab<td>
<td>TR</td>
<td>9:00am-10:30am</td>
<td>Dr. Chemicals</td>
</tr>
<tr>
<td><td> # Empty, since the above row provides the course name
<td>R</td> # Day of the lab
<td>11:00am-12:30pm</td> # Time of the lab
<td>Dr. Chemicals</td> # Lab professor
</tr>

The good news is it's the same length as a normal class. So for this, I
add a bit more code to problem 2's code (the each block), changing the
each method to .each_with_index:

raw_course_list.each_with_index { |i, index|
if i.size == 3
i.insert(2, "")
end

# NEW CODE FOR LABS (still working out the kinks, but hopefully I
won't need this)
# lab will always have a size of 4 and a empty first element so:
if i[0].empty?
# add all the data from the lab to the previous course:
raw_course_list[index-1].push(i.each { |element| element })

# then remove lab from raw_course_list
raw_course_list.pop(index)

# index has to go back one to avoid skipping an element (since we
popped one)
index -= 1
end
}

==================================

So there you have it. Can anyone think of a way where I can improve the
quality and elegance of this code?
Do you have any control over the HTML at all? Some semantic HTML classes
would go a long way to simplifying this.
 
D

Derek Cannon

Do you have any control over the HTML at all? Some semantic HTML
classes would go a long way to simplifying this.

The HTML comes from a page on my schools website :(
 
B

Brian Candler

Derek said:
Here's a (simplified) example of the data I'm working with:

Can you post a complete sample page somewhere?

There are lots of ways to identify more precisely which part of the HTML
you want, using CSS selectors. Most easily, if the rows are inside
<table id='courses'> then seomthing like 'table#courses tr' could do it.

But otherwise, you can select the table based on its location in the
page relative to other elements (nth, nth-child).
 
G

gf

For your contemplation:


doc = Nokogiri::HTML(open(url))
raw_course_list = doc.css("tr").collect { |row|
t_row = row.css("td").collect { |column| column.text.strip }
t_row.insert(2, "") if (t_row[1] == "TBA")
}.reject{ |i| i.size != 4 }


This isn't guaranteed to work because we're dealing with pieces of the
HTML you are parsing. Give us a sample with ALL the variations in it
and we can stop shooting in the dark.
 
D

Derek Cannon

Sure, I'll post HTML examples. In this non-simplified version, there are
20 columns per row which are:

availability, course_reference_number, subject, course_number, section,
campus, credit_hours, title, days, time, cap, registered, remaining,
xl_cap, xl_registered, xl_remaining, professor, date, location,
attributes


Here's three specific examples of the HTML that cover all the
possibilities (normal class, course with TBA day, and labs):


<!--1. Normal class-->


<TR>
<TD class="dddefault"><ABBR title="Not available for
registration">NR</ABBR></TD>
<TD class="dddefault"><A
href="https://ggc.gabest.usg.edu/pls/B400...201008&subj_in=ACCT&crse_in=2101&crn_in=80983"
onmouseover="window.status='Detail'; return true"
onfocus="window.status='Detail'; return true"
onmouseout="window.status=''; return true"
onblur="window.status=''; return true">80983</A></TD>
<TD class="dddefault">ACCT</TD>
<TD class="dddefault">2101</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">3.000</TD>
<TD class="dddefault">Intro to Financial Accounting</TD>
<TD class="dddefault">MW</TD>
<TD class="dddefault">08:00 am-09:15 am</TD>
<TD class="dddefault">30</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">30</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1880</TD>
<TD class="dddefault">&nbsp;</TD>
</TR>
<TR>


<!--2. No day/time separation due to class days being "TBA":-->


<TR>
<TD class="dddefault"><ABBR title="Closed">C</ABBR></TD>
<TD class="dddefault"><A
href="https://ggc.gabest.usg.edu/pls/B400...201008&subj_in=BUSA&crse_in=4700&crn_in=81085"
onmouseover="window.status='Detail'; return true"
onfocus="window.status='Detail'; return true"
onmouseout="window.status=''; return true"
onblur="window.status=''; return true">81085</A></TD>
<TD class="dddefault">BUSA</TD>
<TD class="dddefault">4700</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">3.000</TD>
<TD class="dddefault">Selected Topics in Business</TD>
<TD colspan="2" class="dddefault"><ABBR title="To Be
Announced">TBA</ABBR></TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">&nbsp;</TD>
</TR>


<!--3. Class with lab. First row is class, second row is lab details -->


<TR>
<TD class="dddefault"><ABBR title="Not available for
registration">NR</ABBR></TD>
<TD class="dddefault"><A
href="https://ggc.gabest.usg.edu/pls/B400...01008&subj_in=CHEM&crse_in=1151K&crn_in=80073"
onmouseover="window.status='Detail'; return true"
onfocus="window.status='Detail'; return true"
onmouseout="window.status=''; return true"
onblur="window.status=''; return true">80073</A></TD>
<TD class="dddefault">CHEM</TD>
<TD class="dddefault">1151K</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">4.000</TD>
<TD class="dddefault">Survey of Chemistry I w/Lab</TD>
<TD class="dddefault">MF</TD>
<TD class="dddefault">11:00 am-12:15 pm</TD>
<TD class="dddefault">20</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">20</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">David Pursell (<ABBR
title="Primary">P</ABBR>)</TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1400</TD>
<TD class="dddefault">&nbsp;</TD>
</TR>
<TR>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">W</TD>
<TD class="dddefault">11:00 am-01:45 pm</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">&nbsp;</TD>
<TD class="dddefault">David Pursell (<ABBR
title="Primary">P</ABBR>)</TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1290</TD>
<TD class="dddefault">&nbsp;</TD>
</TR>
 
D

Derek Cannon

There are lots of ways to identify more precisely which part of the HTML
you want, using CSS selectors. Most easily, if the rows are inside
<table id='courses'> then seomthing like 'table#courses tr' could do it.

Since my original post, I've been playing around with the code some
more. I made a new way of getting courses that automatically filters out
"non-course" rows. The code is:

table = doc.css("tr").collect { |row|
row.css(".dddefault").collect { |column|
column.text.strip
}
}

This way, "non-courses" appear as empty arrays. I still don't know how
to neatly get rid of the empty arrays... I tried .compact! but that
doesn't seem to work.
doc = Nokogiri::HTML(open(url))
raw_course_list = doc.css("tr").collect { |row|
t_row = row.css("td").collect { |column| column.text.strip }
t_row.insert(2, "") if (t_row[1] == "TBA")
}.reject{ |i| i.size != 4 }

Excellent example, I think this is much better than what I had earlier.
I guess I could now replace your reject with i.empty?, right?

PS - I changed raw_course_list to table to make it more readable.
 
E

Ehsanul Hoque

This way=2C "non-courses" appear as empty arrays. I still don't know how= =20
to neatly get rid of the empty arrays... I tried .compact! but that=20
doesn't seem to work.

Try #flatten! instead. #compact! just gets rid of nil entries=2C and an emp=
ty array is not the same as nil.

- Ehsan
=20
_________________________________________________________________
Hotmail has tools for the New Busy. Search=2C chat and e-mail from your inb=
ox.
http://www.windowslive.com/campaign/thenewbusy?ocid=3DPID28326::T:WLMTAGL:O=
N:WL:en-US:WM_HMP:042010_1=
 
D

David A. Black

Hi --

There are lots of ways to identify more precisely which part of the HTML
you want, using CSS selectors. Most easily, if the rows are inside
<table id='courses'> then seomthing like 'table#courses tr' could do it.

Since my original post, I've been playing around with the code some
more. I made a new way of getting courses that automatically filters out
"non-course" rows. The code is:

table = doc.css("tr").collect { |row|
row.css(".dddefault").collect { |column|
column.text.strip
}
}

This way, "non-courses" appear as empty arrays. I still don't know how
to neatly get rid of the empty arrays... I tried .compact! but that
doesn't seem to work.
doc = Nokogiri::HTML(open(url))
raw_course_list = doc.css("tr").collect { |row|
t_row = row.css("td").collect { |column| column.text.strip }
t_row.insert(2, "") if (t_row[1] == "TBA")
}.reject{ |i| i.size != 4 }

Excellent example, I think this is much better than what I had earlier.
I guess I could now replace your reject with i.empty?, right?

PS - I changed raw_course_list to table to make it more readable.

I think having the condition and the reject be the last things in the
code are going to make it hard to follow it later. I wouldn't rule out
doing something a tiny bit more procedural but maybe a little easier
to parse visually, like this:

doc = Nokogiri::HTML(open(url))
table = []
doc.css("tr").each do |row|
cells = row.css("td").map {|cell| cell.text.strip }
next unless cells.size == 4
next unless cells[1] == "TBA"
cells.insert(2, "")
table << cells
end

You could also extract some methods, and end up with something like:

table = doc.css("tr").
select {|row| valid_row?(row) }.
map {|row| prepare_row(row) }

(The above is all untested.)


David

--
David A. Black, Senior Developer, Cyrus Innovation Inc.

THE Ruby training with Black/Brown/McAnally
COMPLEAT Coming to Chicago area, June 18-19, 2010!
RUBYIST http://www.compleatrubyist.com
 
P

Phrogz

[...]
Earlier, someone on the forum showed me a very elegant way to collect
this information (I use Nokogiri). It was:

doc = Nokogiri::HTML(open(url))

raw_course_list = doc.css("tr").collect { |row|
  row.css("td").collect { |column|
    column.text.strip
  }

}
[...]
This works perfectly, except in 3 main cases.

*** Problem 1: The <tr> does not contain course information. (It's some
irrelevant part of the HTML). In this case, I did the following:
raw_course_data.reject! { |i| i.size != 4 }, would filtered out
non-courses. Note: no tables without course data had the size of one
with course data (in the non-simplified version, the size is actually
much larger).

So, already I think it's ugly coding! It firsts loads ALL <tr> contents
into arrays, then rejects them after creation.
[...]

Generalized, you have an array of values and you want to map a subset
of them to new array. There are (at least) four patterns you can use
to handle this sort of situation:

1) Map the unwanted elements to a 'broken' value and then reject the
broken values later. (What you are doing now.) This can be hard if you
don't have a way of creating a broken value. For example, you might be
mapping all values directly to an object, but you don't have enough
information for the object constructor and no way of making up clearly
spurious values. Further, it's inefficient as you do the work and use
the memory of creating the object only to throw it out later.

2) Map the unwanted elements to nil and compact the array afterwards.
In your case, you'd need to look at the TDs in your row and decide if
you wanted to map the row to the mapping of them or nil. This is
convenient in terms of one-liners, but still slightly inefficient
because you're creating an intermediary array packed with nils that
you don't want. (You should be clear, though, that computational
inefficiency is not always more important than programmer convenience
of code clarity.)

3) Instead of using map (or the same effect under the longer name
'collect', as Robert apparently likes) to create a new array from your
original, explicitly create the new array and push values only as
valid. This is basically the same as above, but without the nil values
and the later compact. For example:
raw_course_list = []
doc.css("tr").each { |row|
tds = row.css("td")
if tds.have_the_values_I_want
raw_course_list << tds.map{ |col| ... }
end
}

4) Use map (collect) on the array as in #1 or #2, but before that do a
pass through your source array and sanitize it. Sanitization might be
mapping values to nil and then compacting (thus very similar to #2),
or fixing values (as in your TBA or continued description case). This
feels cleaner, but note that this has you doing one (or two, in the
case of map+compact) passes on your data before you get around to
mapping it.


Here's (very roughly) what I might do given what you wrote:
# Assuming you're using Ruby 1.9
course_info = []
trs = doc.css('tr')
trs.each.with_index{ |row,i|
tds = row.css('td')
title = ...
prof = ...
days = ...
times = ...
desc = ...
next_row = trs[i+1]
if next_row && next_row.is_a_continuation?
# Add content from next_row to description
# If needed, invalidate next_row so it will be skipped
elsif title && prof && days # If you have all the information you
need
course_info << Course.new( title, prof, days )
end
}

Regardless of the approach you use, remember that even though you're
annoyed that you are 'processing' (in one form or another) invalid
entries, you have to touch every row to find out if you like it or
not. It's up to you for how you detect which are invalid and handle
them.
 
D

Derek Cannon

doc = Nokogiri::HTML(open(url))
table = []
doc.css("tr").each do |row|
cells = row.css("td").map {|cell| cell.text.strip }
next unless cells.size == 4
next unless cells[1] == "TBA"
cells.insert(2, "")
table << cells
end

This is interesting. So nil is returns for elements that activate the
unless statement?

# Assuming you're using Ruby 1.9
course_info = []
trs = doc.css('tr')
trs.each.with_index{ |row,i|
tds = row.css('td')
title = ...
prof = ...
days = ...
times = ...
desc = ...
next_row = trs[i+1]
if next_row && next_row.is_a_continuation?
# Add content from next_row to description
# If needed, invalidate next_row so it will be skipped
elsif title && prof && days # If you have all the information you
need
course_info << Course.new( title, prof, days )
end
}

I like this code a lot, however, is it considered less efficient to look
at the next row to see if it is a lab for the previous row rather than
checking each row to see if it's a lab, and if it is, adding it to the
last row instead? That way, every row won't need to check with the next
row.
 
P

Phrogz

doc = Nokogiri::HTML(open(url))
  table = []
  doc.css("tr").each do |row|
    cells = row.css("td").map {|cell| cell.text.strip }
    next unless cells.size == 4
    next unless cells[1] == "TBA"
    cells.insert(2, "")
    table << cells
  end

This is interesting. So nil is returns for elements that activate the
unless statement?

No. Next inside a block is like an early return. The ruby code:
do_this unless something
is the same as:
unless something
do_this
end
is the same as:
if !something
do_this
end

So what's happening in the above is that you're creating an array
(table), and then looping through the list of rows returned from
Nokogiri. For each row, if certain criteria are met (there aren
exactly four cells, or the second cell is "TBA") you immediately stop
processing that row and move on to the next. ("Next!" shouts the
government worker, dismissing you and moving on.) If you didn't bail
out early, however, you inject an extra entry into the array of
strings and then shove that whole array as a new entry onto the end of
the table array.

Because you're using the pattern of creating an array and
conditionally populating during a traversal, there are no embedded
nils to clean up later.

For comparison, here's similar code that WOULD leave you with a table
with some nil entries (that you'd probably want to #compact
afterwards):

table = doc.css("tr").map do |row|
cells = row.css("td").map {|cell| cell.text.strip }
if cells.size==4 && cells[1]=="TBA"
cells.insert(2, "")
cells
end
end

In the code immediately above, the value of an if statement that
matches is the value of the last expression, while the value of an if
statement that is not matches is nil.

Here's a simpler example showing the difference between map and each-
with-injecting:

irb(main):001:0> digits = (1..9).to_a
=> [1, 2, 3, 4, 5, 6, 7, 8, 9]

irb(main):002:0> odds = []
=> []

irb(main):003:0> digits.each{ |n| if n%2==1 then odds<<n end }
=> [1, 2, 3, 4, 5, 6, 7, 8, 9]

irb(main):004:0> odds
=> [1, 3, 5, 7, 9]

irb(main):005:0> evens = digits.map{ |n| if n%2==0 then n end }
=> [nil, 2, nil, 4, nil, 6, nil, 8, nil]

irb(main):006:0> evens.compact
=> [2, 4, 6, 8]
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top