Regular Expression interesting problem

Arun Kumar · Mar 28, 2009

Hi,
I'm learning about regular expressions right now for an HTML-scraping
assignment, and I've run into a problem. Below are two different HTML
tags.

<link
href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
/>

<link rel="alternate" type="application/rss+xml" title="YouTube - Top
Favorites Today"
href="http://gdata.youtube.com/feeds/base...tes?client=ytapi-youtube-index&time=today&v=2">

What I want is to capture the href URL when the type is
"application/rss+xml". It seems simple, but the position of the 'type'
attribute creates the problem: in the first tag 'type' comes after
href, and in the second it comes before it. It seems like an
interesting problem to me, but I need help solving it. Please help me.

Regards
Arun Kumar

Eric Hodel · Mar 28, 2009

I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more
appropriate like StringScanner from strscan.rb. `ri StringScanner`
will get you started.

PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.

PPS: There's no need to post twice.
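
A minimal sketch of the staged scan described in the PS, using StringScanner. The helper name rss_hrefs and the attribute pattern are assumptions made for illustration, not code from the thread, and the pattern only handles quoted attribute values:

```ruby
require 'strscan'

# Hypothetical helper (not from the thread): returns the href of every
# <link> tag whose type attribute is application/rss+xml.
def rss_hrefs(html)
  scanner = StringScanner.new(html)
  hrefs = []
  # Jump to just past each "<link" that is followed by whitespace.
  while scanner.skip_until(/<link(?=\s)/i)
    attrs = {}
    # Consume name="value" or name='value' pairs one at a time.
    while scanner.scan(/\s+([a-z][\w:-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')/i)
      attrs[scanner[1].downcase] = scanner[2] || scanner[3]
    end
    # Consume the end of the tag: optional whitespace, optional "/", then ">".
    scanner.scan(/\s*\/?>/)
    if attrs['type'] == 'application/rss+xml' && attrs['href']
      hrefs << attrs['href']
    end
  end
  hrefs
end

sample = '<link rel="alternate" type="application/rss+xml" href="http://example.com/feed.xml">'
p rss_hrefs(sample)  # ["http://example.com/feed.xml"]
```

Because the attributes are collected into a Hash before anything is tested, it does not matter whether type comes before or after href.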

Arun Kumar · Mar 28, 2009

Eric said:
I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more
appropriate like StringScanner from strscan.rb. `ri StringScanner`
will get you started.

Nokogiri is a good option, but I have to use net/http for my assignment;
it is compulsory.

PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.

As you said, I have to check for the tag, i.e. '<link', first and then
check the attributes. But the position of the type attribute is still
the problem.

Thanks for your quick reply.

Regards
Arun Kumar

James Coglan · Mar 28, 2009

2009/3/28 Arun Kumar said:
Nokogiri is a good option, but I have to use net/http for my assignment;
it is compulsory.

As you said, I have to check for the tag, i.e. '<link', first and then
check the attributes. But the position of the type attribute is still
the problem.

Looks like you will need to parse in stages -- I can't get String#scan to
capture everything using a single regex, though there's every chance I've
screwed up the expression somehow:

'<link type="application" href="http://google.com" rel="alternate" />'.scan(
  /<([^\s]+)(?:\s+([^\s]+)="([^"]*)")*\s*\/?>/i)
#=> [['link', 'rel', 'alternate']]
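
A repeated capture group keeps only the text from its last repetition, which is why only the final attribute comes back above. A hedged sketch of the two-stage approach, reusing the same test string; the variable names are illustrative only:

```ruby
tag = '<link type="application" href="http://google.com" rel="alternate" />'

# Stage 1: capture the tag name and the raw, unparsed attribute text.
name, raw_attrs = tag.match(/<(\S+)((?:\s+\S+="[^"]*")*)\s*\/?>/i).captures

# Stage 2: scan the attribute text; each match yields a [name, value] pair.
attrs = raw_attrs.scan(/(\S+)="([^"]*)"/)

p name   # "link"
p attrs  # [["type", "application"], ["href", "http://google.com"], ["rel", "alternate"]]
```

With every pair available, the type can be checked and the href read regardless of their order in the tag.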

Robert Klemme · Mar 28, 2009

PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.

I'd probably rather scan for each <link> tag and then analyze it, i.e.

doc.scan %r{<link[^>]*>}i do |link|
  if %r{(?i:type)=["']application/rss\+xml["']} =~ link
    ...
  end
end

Note that the scanning RX is weak.

But I agree, rather use the proper tool for the job.

Cheers

robert
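
One way the elided loop body could be filled in, as a sketch only: the method name feed_urls and the href pattern are assumptions, and the tag pattern is as weak as noted above (a ">" inside an attribute value would cut the tag short).

```ruby
# Hypothetical helper: collects the href of each RSS <link> tag in doc.
def feed_urls(doc)
  urls = []
  doc.scan(%r{<link[^>]*>}i) do |link|
    # Skip tags that are not RSS feeds.
    next unless %r{\btype\s*=\s*["']application/rss\+xml["']}i =~ link
    # href is looked up on its own, so attribute order does not matter.
    href = link[/\bhref\s*=\s*["']([^"']*)["']/i, 1]
    urls << href if href
  end
  urls
end

doc = '<link href="http://example.com/a.xml" type="application/rss+xml">' \
      '<link type="text/css" href="style.css">'
p feed_urls(doc)  # ["http://example.com/a.xml"]
```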

Sean O'Halpin · Mar 28, 2009

In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.

Arun Kumar · Mar 28, 2009

Sean said:
In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.

Hi,
You have completely misunderstood me. I'm working as a software engineer
trainee right now. The first problem that I had has been solved; this
is a new assignment. To be frank, I have just 2 weeks of experience
in Ruby and there is nobody here who has any knowledge of Ruby.
That is why I'm asking a favour of this community. I'm sorry if I'm
troubling you guys so much.

Regards
Arun Kumar . C. M.

7stud -- · Mar 28, 2009

Sean said:
In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

That was my first thought when I read the other post.

Anyway, split() rules the world--not regexes.

Sean O'Halpin · Mar 28, 2009

Arun said:
Hi,
You have completely misunderstood me. I'm working as a software engineer
trainee right now. The first problem that I had has been solved; this
is a new assignment. To be frank, I have just 2 weeks of experience
in Ruby and there is nobody here who has any knowledge of Ruby.
That is why I'm asking a favour of this community. I'm sorry if I'm
troubling you guys so much.

If I have misrepresented you, then you have my sincerest apologies.
However, you have not really represented your own position terribly
well. Now that we know you are a trainee with little experience who is
currently specifically being trained in regular expressions, it makes
more sense that you cannot use REXML, etc. But this was not clear from
your previous posts.

By the way, you are more likely to get a positive response if you at
least show how far you have got with the problem yourself before
coming to the list.

And to make up for my grouchy mood this morning, here's my contribution:

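require 'pp'

# data is assumed to hold the HTML source as a String
hashes = []
data.scan(/<link[^>]+?>/) do |link|
  # turn each tag's name="value" pairs into a Hash
  hashes << Hash[*link.scan(/([a-z]+)=["']?([^"]+)["']?/).flatten]
end
pp hashes.select{ |hash| hash["type"] == "application/rss+xml" }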

But I have no idea if this will meet the requirements of your
assignment or if you will understand it.

Regards,
Sean
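
To go from the attribute hashes to the URLs originally asked for, one more mapping step is enough. The sketch below applies the same two patterns to the two tags from the question; the variable names and the use of Array#to_h are assumptions made for illustration:

```ruby
# Sample input: the two <link> tags from the original question.
data = <<~HTML
  <link
  href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
  rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
  />

  <link rel="alternate" type="application/rss+xml" title="YouTube - Top
  Favorites Today"
  href="http://gdata.youtube.com/feeds/base...tes?client=ytapi-youtube-index&time=today&v=2">
HTML

# One Hash of attribute name => value per <link> tag.
links = data.scan(/<link[^>]+?>/).map do |link|
  link.scan(/([a-z]+)=["']?([^"]+)["']?/).to_h
end

# Keep only the RSS links, then pull out their href values.
rss_urls = links.select { |attrs| attrs["type"] == "application/rss+xml" }
                .map { |attrs| attrs["href"] }

puts rss_urls
```

Like the original pattern, this assumes double-quoted attribute values; single-quoted or unquoted values would need a stricter attribute regex.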
