Dynamically creating webpages via Ruby


Phil H.

I'm just getting started with Ruby and have very little programming
background, so hopefully this is a simple problem to solve.

I'm writing a program that takes a number of URLs from already existing
webpages as input (the number of these URLs can vary), extracts some
data from these pages (such as links to pictures, Google Maps data,
etc.), then builds two new webpages based on this extracted data. One
page is an index with a list of links to the second page, which contains
all my extracted data.

The way it works now is basically one big loop. My input URLs are stored
in an array; I pop the last element off the array, send it through the
loop to extract data and generate HTML, and so on until my array of
input URLs is empty.

During this loop I am only generating HTML for the body of the webpages.
The beginning of each webpage is written before the loop starts, and the
end of each webpage is written after the loop finishes.

This was working because the pre-body and post-body parts of my webpages
didn't need any of the extracted data from my input URLs; they were just
really basic HTML. And the body of the webpages needed ONLY the
extracted data. That isn't the case anymore.

Long story short, I need to separate the data extraction from the
webpage building, and this is turning out to be harder than I thought,
since the number of input URLs can change at any time.

What I want to do is extract all the data for all the input URLs first
and store that info somehow, then use a webpage-building method of some
kind to reference the extracted data and generate my webpages. That
seems simple enough, but because my list of input URLs is always
changing, I'm not sure how to dynamically create the number of objects I
need to store the extracted info.

Any tips? Thanks in advance.
 

Josh Cheek


Hi, I don't really see why the changing number of URLs is causing you
problems. This should be in a loop, as you said, and a loop will iterate
over all of them regardless of how many there are.

I can understand the appeal of separating the extraction of data from
the building of the page, but in this case I think storing it in an
intermediate form is unnecessary. I would suggest simply doing these
steps one after the other: first extract all the data, then build the
page. That way you don't need to save it to a file and run a second
script to read it back in and do stuff with it.

I don't know what you are trying to do with this data, but here is an
example: https://gist.github.com/714817 It iterates over an array of
URLs, opens those pages, pulls all the links out of them, then builds an
HTML document where each page is displayed as a paragraph with a link to
the page, followed by an unordered list of all the links that page
contains. No storing in files necessary.
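In case the gist link ever dies, here's a condensed sketch of the same idea. I've used hard-coded HTML strings in place of the open-uri fetches and a naive regexp instead of a real parser, just so it runs anywhere; for real pages you'd want open-uri plus Nokogiri.

```ruby
# Stand-ins for pages fetched with open-uri; the gist opens real URLs.
pages = {
  'http://example.com/a' => '<html><body><a href="http://one">1</a><a href="http://two">2</a></body></html>',
  'http://example.com/b' => '<html><body><a href="http://three">3</a></body></html>',
}

# Step 1: extract all the data first (here: every link on each page).
extracted = pages.map do |url, page_html|
  links = page_html.scan(/<a href="([^"]+)"/).flatten # naive; use Nokogiri for real pages
  [url, links]
end

# Step 2: build the whole document in one pass over the extracted data.
html = "<html><body>\n"
extracted.each do |url, links|
  html << "<p><a href='#{url}'>#{url}</a></p>\n<ul>\n"
  links.each { |link| html << "  <li><a href='#{link}'>#{link}</a></li>\n" }
  html << "</ul>\n"
end
html << "</body></html>"

puts html
```

Because the extraction finishes before any HTML is written, step 2 is free to interleave static and dynamic markup however it likes.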
 

Phil H.

Josh Cheek wrote in post #963735:
Hi, I don't really see why the number of URLs changing is causing you
problems. This should be in a loop, as you said, and a loop will iterate
over all of them regardless of their size.

I can understand the appeal of separating extraction of data from
building of data, but in this case, I think that storing it in an
intermediate form is unnecessary. I would suggest simply doing these
steps one after the other. First extract all data, then build the page.
Then you don't need to save it in a file and go run a second script to
read it in and do stuff with it.

I don't know what you are trying to do with this data, but here is an
example https://gist.github.com/714817 It iterates over an array of
URLs, opens those pages, pulls all the links out of them, then builds an
html document where each page is displayed in a paragraph with a link to
the page followed by an unordered list of all the links that page
contains. No storing in files necessary.

Thanks for the response. The reason I want to approach it this way is
that I've added another element to the webpages I'm creating which
requires JavaScript... basically the webpages need to be a bit more
complicated. This messes up my original approach because I can't just
generate the body of the HTML from the top down like I used to. The
dynamic content of the webpages now needs to be mixed with static
content, whereas before the dynamic content was isolated to the body of
the webpages, so I could write everything before and after the body
outside of the loop, since it didn't require the dynamic content I was
extracting from my input URLs. That way I wasn't re-writing a bunch of
static HTML on every iteration of the loop... Hopefully that makes
sense.

So if I continue with my current approach I would need to be able to
jump around to different parts of the webpages as they are being
written, to avoid re-writing the static HTML. I know that could probably
be done somehow, but I was hoping to avoid it by separating the data
extraction and page building processes.
 

Phillip Gawlowski

I could write everything before and after the body outside of the loop
since it didn't require dynamic content I was extracting from my input
urls. That way I wasn't re-writing a bunch of static HTML every
iteration of the loop... Hopefully that makes sense.

Well, if your JavaScript is a function that takes arguments, you could
write out something like this:

[url1, url2].each do |url|
  # My HTML and JS is rusty, so this is pseudo-code
  puts "<div onclick=\"myfunction('#{argument1}', '#{argument2}', '#{url.link}')\">#{url.name}</div>"
end

With argument1, argument2, and url.link being the arguments your
JavaScript needs to do its dynamic magic, and url.name being, well, the
link text of your URL, for example (the methods url accepts depend on
how you gather the URLs, obviously).
So if I continue with my current approach I would need to be able to
jump around to different parts of the webpages as they are being written
to avoid re-writing the static HTML. I know that could probably be
done somehow, but I was hoping I could just avoid that by separating the
data extraction and page building processes.

Sometimes, reformulating the problem helps, too. If you could provide
abbreviated examples of what you want to achieve, we can give you more
specific advice, I guess.

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
 

Phil H.

Sometimes, reformulating the problem helps, too. If you could provide
abbreviate examples of what you want to achieve, we can give you more
specific advice, I guess.


I probably should have just started with an example. After typing out a
simplified version of my code, I think I can pretty easily re-arrange my
loop(s) to achieve what I need. For the sake of giving this thread
resolution, here's a simplified version of what I am currently doing:

=begin code

input_urls = ['www.url1.com', 'www.url2.com', 'www.url3.com']

index_file = File.new('index.html', 'w+')
index_file.puts %&<html><head><title>Page title</title></head><body>&

source = []
# source is actually an array in my real code; my input array actually
# has more than just urls in it, but I'm trying to keep this simpler

input_urls.length.times do |x|

  source << input_urls.pop

  extract_urls = Url_finder.new("#{source}")
  extracted_urls = extract_urls.method_to_parse_for_certain_urls

  index_file.puts %&<a href="search#{x}.html">search #{x}</a>&
  search_file = File.new("search#{x}.html", 'w+')

  search_file.puts %&<html><head><title>searchpage</title></head><body><center>&

  extracted_urls.length.times do |i|

    content = Content_finder.new("#{extracted_urls[i]}")

    pics = content.method_to_find_pic_links # pics is an array of urls

    pics.length.times do |y|
      search_file.puts %&<img src="#{pics[y]}">&
    end

  end

  search_file.puts %&</center></body></html>&
  search_file.close
end

index_file.puts %&</body></html>&
index_file.close

=end code

With the new JavaScript content I want to add, I need to put non-static
code in this line:

search_file.puts %&<html><head><title>searchpage</title></head><body><center>&

The new JavaScript code needs to reference the loop iteration number...

So, I guess all I need to do is start extracting content (via my
Content_finder class) earlier in the loop, plus some other minor
adjustments... although I haven't really thought it out yet. I've never
tried it, but I assume I could access my outer loop variable 'x' from
the inner loop? The dependency between my index.html and
search'x'.html files is what seems to be the confusing part for me.
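To be concrete, this is the kind of access I mean (throwaway names, not my real code; in Ruby an inner block can see the outer block's variable):

```ruby
# Outer block variable stays visible inside nested blocks.
results = []
2.times do |x|     # outer loop counter
  3.times do |i|   # inner loop counter
    results << "x=#{x}, i=#{i}" # x is accessible here
  end
end
puts results
```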

I'd also still love to hear any ideas on a good way to separate the
data extraction and webpage creation parts of this. As I said earlier,
I have very little programming background, but this loop makes me think
I'm creating too many dependencies, and maybe there's a totally
different way to go about achieving the same end result. But, judging
by the replies so far, maybe I am going about this the best way and just
need to re-think my loop.
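Here's roughly the shape I'm imagining for that separation, as a hypothetical sketch. The extraction is faked with made-up picture links standing in for my real Url_finder/Content_finder classes, and PageData is an invented Struct, but the two-pass structure is the point:

```ruby
# One pass collects data into plain objects, a second pass turns
# those objects into HTML.
PageData = Struct.new(:source_url, :pic_links)

# Pass 1: extraction only. The array grows to match however many
# input URLs there are, so nothing needs to be sized in advance.
def extract(input_urls)
  input_urls.map do |url|
    pics = ["#{url}/pic1.jpg", "#{url}/pic2.jpg"] # stand-in for my real extraction
    PageData.new(url, pics)
  end
end

# Pass 2: page building only. It can look at every PageData up front,
# so static and dynamic HTML can be interleaved freely.
def build_index(pages)
  links = pages.each_with_index.map do |pg, x|
    "<a href=\"search#{x}.html\">#{pg.source_url}</a>"
  end
  "<html><body>#{links.join("\n")}</body></html>"
end

pages = extract(['www.url1.com', 'www.url2.com'])
puts build_index(pages)
```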
 
