regex failing

noon

I'm running an XMLHttpRequest to get the site's source code and then
applying the regex

xhr.responseText.split(/<body[^>]*>((?:.|\n)*)<\/body>/i)[1]

Works for google.com. Fails on yahoo.com and imdb.com pages (e.g.
http://imdb.com/title/tt0482606/).

Can someone help me tweak this, or give insight as to why it's
failing? I can't spot it.
 
Erwin Moller

noon wrote:
I'm running an XMLHttpRequest to get the site's source code and then
applying the regex

xhr.responseText.split(/<body[^>]*>((?:.|\n)*)<\/body>/i)[1]

Works for google.com. Fails on yahoo.com and imdb.com pages (e.g.
http://imdb.com/title/tt0482606/).

Can someone help me tweak this, or give insight as to why it's
failing? I can't spot it.

Maybe...
You didn't mention what it is you WANT your regex to do.
And you didn't say what 'failing' is. An error? An unexpected result?

Regards,
Erwin Moller
 
noon

That information might help, huh. I want it to strip everything
in between the body tags. The error was that I was receiving either nothing
or the entire HTML, including the head tags etc. I seem to have since
got it working with this code:

xhr.responseText.split(/<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi)[1];

Though improvement suggestions are welcome
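
A likely reason the revised pattern behaves differently: in JavaScript, `.` does not match any of the line terminators \n, \r, \u2028 or \u2029, so `(?:.|\n)*` stops at the first carriage return. The sketch below illustrates this with a made-up CRLF response (`sampleCrlf` is an assumption, not the actual yahoo.com or imdb.com source). When the pattern does not match at all, split() returns a one-element array, so index 1 is undefined.

```javascript
// Line terminators that `.` does not match (without the `s` flag).
const terminators = ["\n", "\r", "\u2028", "\u2029"];
const dotMatches = terminators.map((ch) => /^.$/.test(ch));

// Hypothetical sample response with CRLF line endings (an assumption).
const sampleCrlf = "<html><body>\r\nhello\r\n</body></html>";

// The first pattern from the thread: handles \n but not \r.
const originalPattern = /<body[^>]*>((?:.|\n)*)<\/body>/i;
// The revised pattern from the thread: lists all four terminators.
const revisedPattern = /<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/i;

const originalResult = originalPattern.exec(sampleCrlf); // no match
const revisedResult = revisedPattern.exec(sampleCrlf);   // matches

// With no match, split() returns the whole input as the only element.
const splitNoMatch = sampleCrlf.split(originalPattern);
```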
 
Thomas 'PointedEars' Lahn

noon said:
That information might help, huh. I want it to strip everything
in between the body tags. The error was that I was receiving either nothing
or the entire HTML, including the head tags etc. I seem to have since
got it working with this code:

xhr.responseText.split(/<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi)[1];

With

foo<body>...</body>bar

this would give you

...

But you wanted to *strip* everything *in between*, _not_ split.
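
As a rough illustration of the point above: when the separator pattern contains capturing groups, split() in current engines splices the captured text into the result array, which is why index 1 holds the body content. The sample string is the one used above; the variable names are only for illustration.

```javascript
const sample = "foo<body>...</body>bar";
const parts = sample.split(/<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi);

// parts[0] is the text before the match, parts[1] is capture group 1,
// and the last element is the text after the match. The inner group
// adds one more element in between.
const beforeBody = parts[0];
const insideBody = parts[1];
const afterBody = parts[parts.length - 1];
```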
Though improvement suggestions are welcome

... = xhr.responseText.match(/<body(?:|\s+[^>]*)>((.|\s)*)<\/body>/i)[1];

is largely equivalent to your code in this case and more efficient.
However, IMHO that is still _not_ stripping everything in between but
*matching* everything in between, which is probably what you meant to say.
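
A minimal sketch of the match-based approach, written so that the body content is the first capture group: the attribute part is a non-capturing group and `[\s\S]` stands in for "any character, including line terminators". The helper name `extractBody` and the null return for a missing body are assumptions for illustration, not something from the code above.

```javascript
// Hypothetical helper: returns the text between <body ...> and the
// last </body>, or null when no body element is found.
function extractBody(html) {
  // (?:\s[^>]*)? : optional attributes, not captured
  // ([\s\S]*)    : everything up to the last </body>, captured as group 1
  const m = html.match(/<body(?:\s[^>]*)?>([\s\S]*)<\/body>/i);
  return m ? m[1] : null;
}

const withAttributes = extractBody(
  '<html><head></head><body class="x">\r\nhi\r\n</body></html>'
);
const withoutBody = extractBody("<html><head></head></html>");
```

In the thread's setting this would be called as extractBody(xhr.responseText). Checking for null first avoids a TypeError when the response has no body element.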

Note that (X)HTML is a context-sensitive language which cannot be parsed
with one regular expression (defining a regular language) alone. In your
case it should work because a Valid (X)HTML document MUST NOT have more
than one `body' element.
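
To illustrate why the single-body assumption matters, here is a small sketch with a deliberately invalid, made-up input containing two `body' elements. The greedy pattern runs from the first opening tag to the last closing tag, while a lazy quantifier stops at the first closing tag; neither is a real parse.

```javascript
// Invalid markup on purpose: two body elements (hypothetical input).
const invalidDoc = "<body>first</body><p>stray</p><body>second</body>";

// Greedy: captures up to the LAST </body>.
const greedyContent = invalidDoc.match(/<body(?:\s[^>]*)?>([\s\S]*)<\/body>/i)[1];

// Lazy: captures up to the FIRST </body>.
const lazyContent = invalidDoc.match(/<body(?:\s[^>]*)?>([\s\S]*?)<\/body>/i)[1];
```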


PointedEars
 
