Getting a value that follows string.find()

englishkevin110

I know the title doesn't make much sense, but I didn't know how to explain my problem.

Anywho, I've opened a page's source with urllib:
starturlsource = starturlopen.read()
string.find(starturlsource, '<a href="/profile.php?id=')
I used string.find to find a specific area in the page's source.
I want to store what comes after ?id= in a variable.
Can someone help me with this?
 
Joel Goldstick

Look up urlparse for your answer.

Joel Goldstick

Aside from the fact that I really want a pony, and you seem to want
your work done for you, look here:

http://stackoverflow.com/questions/11600681/parse-query-part-from-url
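
For the query-string part on its own, here is a minimal sketch of what that link describes. It assumes Python 3 (in Python 2 the same two functions live in the urlparse module), and the href value and the id 12345 are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href value, as it might appear in the page source.
href = "/profile.php?id=12345&ref=ts"

# urlparse splits the URL into components; .query is the part after '?'.
query = urlparse(href).query

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(query)
profile_id = params["id"][0]

print(profile_id)
```

This only handles the URL once you have it; finding the link inside the page is a separate step.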

I may have been too quick in my reading of your question. You wanted
to get the value of the parameter, but also to find the URL in the
page. You want to do this without parsing, if I understand you. The
good news is there is a module called Beautiful Soup that will do the
parsing for you. The tutorial is way better than excellent, and you
will be up and running in less than half an hour from downloading the
module:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/
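
Beautiful Soup is a third-party package, so as a rough stand-in here is the same idea sketched with the standard library's html.parser instead. This is not Beautiful Soup's API. It assumes Python 3, and the class name ProfileLinkParser and the page fragment are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ProfileLinkParser(HTMLParser):
    """Collects the id parameter of every <a href="/profile.php?id=..."> link."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlparse(href)
        if parts.path == "/profile.php":
            # parse_qs returns a list of values per parameter name.
            self.ids.extend(parse_qs(parts.query).get("id", []))


# Made-up page fragment for illustration.
page = '<p><a href="/profile.php?id=42">Kevin</a> <a href="/about.php">About</a></p>'

parser = ProfileLinkParser()
parser.feed(page)
print(parser.ids)
```

Letting a parser find the tags means attribute order, extra whitespace, or single quotes in the markup should not break the lookup the way a fixed search string would.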
 
Dave Angel

Python 3.3.0 (default, Mar 7 2013, 00:24:38)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'find'

There is no find function in the string module [1]. But assuming
starturlsource is a str, you could do:

pattern = '<a href="/profile.php?id='
index = starturlsource.find(pattern)

index will then be -1 if there's no match, or have a non-negative value
if a match is found.

In the latter case, you can extract the next 17 characters with

newstr = starturlsource[index+len(pattern):index+len(pattern)+17]

You are of course making several assumptions about the web page, which
are perfectly reasonable since it's a page under your control. Or is
it?
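
Put together, the two steps above might look like the sketch below. The helper name text_after and the sample string are made up for illustration, and the width of 17 is only the assumed length of the id.

```python
def text_after(source, pattern, width):
    """Return the `width` characters following the first `pattern`, or None."""
    index = source.find(pattern)
    if index == -1:
        # No match: bail out instead of slicing from a bogus position.
        return None
    start = index + len(pattern)
    return source[start:start + width]


# Made-up page fragment with a 17-character id.
sample = 'junk <a href="/profile.php?id=10000123456789012">Kevin</a> junk'
pattern = '<a href="/profile.php?id='

print(text_after(sample, pattern, 17))
```

The -1 check matters: without it, index + len(pattern) is still a valid slice position, so the slice would quietly return unrelated text rather than fail.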


[1] Assuming Python 3.3 since you omitted stating the version you're
using. But even in Python 2.7, using the string.find function is
deprecated in favor of the str method.
 
Steven D'Aprano

[fixing Joel's top-posting]
I don't want to do any kind of HTML parsing.


What you are doing *is* HTML parsing, or at least a half-baked, fragile,
likely to go wrong form of parsing.

But if you insist, the algorithm is simple: after calling find(), you
have the offset to the search string. You know the length of the search
string. Therefore you can calculate the index of the first character that
follows the search string:

text = "blah blah blah blah spam spam... blah blah blah blah..."
needle = "spam spam"  # what we search for

i = text.find(needle)
if i == -1:
    print("not found")
else:
    print(text[i+len(needle):])


Of course, the problem is, you need to know not just the *start* offset
of the bit that follows, but the *ending* offset as well. Which brings
you into the realm of half-arsed parsing.
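
One way to get the ending offset, sketched under the assumption that the value runs up to the closing double quote of the href attribute. The helper name value_between and the sample text are invented for illustration.

```python
def value_between(text, needle, terminator='"'):
    """Return the text between `needle` and the next `terminator`, or None."""
    start = text.find(needle)
    if start == -1:
        return None
    start += len(needle)
    # Search for the terminator only from the end of the needle onward.
    end = text.find(terminator, start)
    if end == -1:
        return None
    return text[start:end]


# Made-up page fragment for illustration.
sample = 'blah <a href="/profile.php?id=12345">Kevin</a> blah'

print(value_between(sample, '<a href="/profile.php?id='))
```

This is still fragile: if the link carries further parameters (say &ref=...), they end up in the result too, and single-quoted attributes are not matched at all.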
 
John Gordon

starturlsource = starturlopen.read()

match_string = '<a href="/profile.php?id='

# str.find returns -1 when the substring is absent
match_index = starturlsource.find(match_string)

if match_index != -1:
    # everything after the match, up to the end of the page
    url = starturlsource[match_index + len(match_string):]
else:
    print('not found')
 
