Regular Expression interesting problem

Arun Kumar

Hi,
I'm learning about regular expressions right now for an HTML-scraping
assignment, and I've run into a problem. Below are two
different HTML tags.

<link
href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
/>

<link rel="alternate" type="application/rss+xml" title="YouTube - Top
Favorites Today"
href="http://gdata.youtube.com/feeds/base...tes?client=ytapi-youtube-index&time=today&v=2">

What I want is to capture the href URL if the type is
"application/rss+xml". It seems simple, but the position of
the 'type' attribute creates the problem: in the first tag 'type' comes
after href, and in the second it comes before it. It seems like an
interesting problem to me, but I need help solving it. Please help me.

Regards
Arun Kumar
 
Eric Hodel

I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more
appropriate like StringScanner from strscan.rb. `ri StringScanner`
will get you started.

PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.
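
A minimal sketch of that scanning order with StringScanner, in case it helps. The helper name link_attributes and the sample markup are made up for illustration (the second URL is a placeholder, not the real YouTube feed), and it assumes attribute values are always quoted:

```ruby
require 'strscan'

# Hypothetical helper: returns one attribute hash per <link> tag.
def link_attributes(html)
  scanner = StringScanner.new(html)
  links = []
  # Jump to each "<link" that is followed by whitespace.
  while scanner.skip_until(/<link(?=\s)/i)
    attrs = {}
    loop do
      scanner.skip(/\s+/)
      # Stop at the end of the tag ("/>" or ">") or at end of input.
      break if scanner.scan(%r{/?>}) || scanner.eos?
      # name="value" or name='value'; the value may span lines.
      if scanner.scan(/([\w:-]+)\s*=\s*(["'])(.*?)\2/m)
        attrs[scanner[1].downcase] = scanner[3]
      else
        scanner.getch # unexpected character: skip it and keep going
      end
    end
    links << attrs
  end
  links
end

# Sample input (assumption): one tag with type after href, one with type before.
html = <<~HTML
  <link
  href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
  rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
  />
  <link rel="alternate" type="application/rss+xml" title="Placeholder
  Feed" href="http://example.org/feed?a=1&b=2">
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

rss_links = link_attributes(html).select { |attrs| attrs["type"] == "application/rss+xml" }
feeds = rss_links.map { |attrs| attrs["href"] }
puts feeds
```

Because every attribute ends up in a hash first, the order of type and href inside the tag stops mattering.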

PPS: There's no need to post twice.
 
Arun Kumar

Eric said:
I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more
appropriate like StringScanner from strscan.rb. `ri StringScanner`
will get you started.

Nokogiri is a good option, but using net/http is compulsory for my
assignment.

Eric said:
PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.

As you said, I have to check for the tag (i.e. '<link') first and then
check the attributes. But the position of the type attribute is still the
problem.

Thanks for your quick reply.

Regards
Arun Kumar
 
James Coglan

2009/3/28 Arun Kumar said:
Nokogiri is a good option, but using net/http is compulsory for my
assignment.

As you said, I have to check for the tag (i.e. '<link') first and then
check the attributes. But the position of the type attribute is still the
problem.

Looks like you will need to parse in stages -- I can't get String#scan to
capture everything using a single regex, though there's every chance I've
screwed up the expression somehow:

'<link type="application" href="http://google.com" rel="alternate" />'.scan(
  /<([^\s]+)(?:\s+([^\s]+)="([^"]*)")*\s*\/?>/i)
#=> [["link", "rel", "alternate"]]
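
For what it's worth, a repeated group like (...)* only keeps its last iteration, which is why the result above holds just the final attribute. If the href is all that is needed, one possible workaround is a lookahead that checks for the type attribute anywhere in the tag before the href is captured. A rough sketch; the constant name RSS_HREF and the sample markup are invented for illustration (the second URL is a placeholder), and it assumes quoted attribute values with no '>' inside them:

```ruby
# Extended mode (x) allows whitespace and comments inside the pattern.
RSS_HREF = %r{
  <link\b                                               # start of a link tag
  (?=[^>]*\btype\s*=\s*["']application/rss\+xml["'])    # type may sit anywhere before ">"
  [^>]*?\bhref\s*=\s*(["'])(.*?)\1                      # group 2 is the href value
}xmi

# Sample input (assumption): type after href, type before href, and a non-RSS link.
html = <<~HTML
  <link
  href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
  rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
  />
  <link rel="alternate" type="application/rss+xml" title="Placeholder
  Feed" href="http://example.org/feed?a=1&b=2">
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

# scan yields [quote, url] pairs because the pattern has two groups.
urls = html.scan(RSS_HREF).map { |_quote, url| url }
puts urls
```

The lookahead consumes nothing, so the href can sit on either side of the type attribute.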
 
Robert Klemme

PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.

I'd probably rather scan for each <link> tag and then analyze it, i.e.

doc.scan %r{<link[^>]*>}i do |link|
  if %r{(?i:type)=["']application/rss\+xml["']} =~ link
    ...
  end
end

Note that the scanning regex is weak; for example, it would stop early at
a '>' inside an attribute value.
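
One way the elided body might be filled in, if the goal is just the list of feed URLs. The method name rss_hrefs and the sample input are illustrative only, and the same caveat about the weak scanning pattern applies:

```ruby
# Hypothetical helper: returns the href of every RSS <link> tag in doc.
def rss_hrefs(doc)
  hrefs = []
  # Stage 1: grab each whole <link ...> tag.
  doc.scan(%r{<link[^>]*>}i) do |link|
    # Stage 2: skip tags whose type is not RSS, wherever type appears.
    next unless %r{(?i:type)=["']application/rss\+xml["']} =~ link
    # Stage 3: pull the href out of the same tag (nil if it has none).
    href = link[%r{(?i:href)=["']([^"']*)["']}, 1]
    hrefs << href if href
  end
  hrefs
end

# Sample input (assumption): placeholder URLs, type on either side of href.
doc = <<~HTML
  <link href="http://example.com/a.xml" rel="alternate" type="application/rss+xml" />
  <link rel="alternate" type="application/rss+xml" href="http://example.com/b.xml">
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

puts rss_hrefs(doc)
```

Since the type check and the href lookup each run against the whole tag, neither depends on where the other attribute sits.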

But I agree, rather use the proper tool for the job.

Cheers

robert
 
Sean O'Halpin

In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.
 
Arun Kumar

Sean said:
In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.

Hi,
You have completely misunderstood me. I'm working as a software engineer
trainee right now. The first problem I had has been solved; this
is a new assignment. To be frank, I have just 2 weeks of experience
in Ruby, and there is nobody here who has any knowledge of Ruby.
That is why I'm asking a favour of this community. I'm sorry if I'm
troubling you guys so much.

Regards
Arun Kumar . C. M.
 
7stud --

Sean said:
In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

That was my first thought when I read the other post.

Anyway, split() rules the world--not regexes.
 
Sean O'Halpin

If I have misrepresented you, then you have my sincerest apologies.
However, you have not really represented your own position terribly
well. Now that we know you are a trainee with little experience who is
currently specifically being trained in regular expressions, it makes
more sense that you cannot use REXML, etc. But this was not clear from
your previous posts.

By the way, you are more likely to get a positive response if you at
least show how far you have got with the problem yourself before
coming to the list.

And to make up for my grouchy mood this morning, here's my contribution:

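require 'pp'

# data is assumed to hold the page's HTML as a String.
hashes = []
data.scan(/<link[^>]+?>/) do |link|
  # Turn the tag's name="value" pairs into a hash (expects double quotes).
  hashes << Hash[*link.scan(/([a-z]+)=["']?([^"]+)["']?/).flatten]
end

# Keep only the links whose type marks them as RSS feeds.
pp hashes.select { |hash| hash["type"] == "application/rss+xml" }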

But I have no idea if this will meet the requirements of your
assignment or if you will understand it.
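
A variation on the same idea, in case attribute values use single quotes or span lines: the backreference makes the closing quote match the opening one. The sample data and variable names are made up, and this is only a sketch:

```ruby
# Sample input (assumption): placeholder URLs with mixed quoting styles.
data = <<~HTML
  <link rel='alternate' type='application/rss+xml' href='http://example.org/a.xml'>
  <link href="http://example.org/b.xml" type="application/rss+xml" title="B
  feed" />
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

# One hash of attributes per <link> tag.
hashes = data.scan(/<link\b[^>]*>/i).map do |link|
  # Each match is [name, quote, value]; \2 must repeat the opening quote.
  pairs = link.scan(/([a-z-]+)\s*=\s*(["'])(.*?)\2/mi)
  pairs.map { |name, _quote, value| [name.downcase, value] }.to_h
end

rss = hashes.select { |hash| hash["type"] == "application/rss+xml" }
hrefs = rss.map { |hash| hash["href"] }
puts hrefs
```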

Regards,
Sean
 
