How to grab a number from inside a .html file using regex

Νίκος · Aug 7, 2010

Hello guys! Need your precious help again!

In every html file I have, on the very first line, a page_id for
counter purposes, in the format of a comment like this:





and so on. Every html file has its own page_id.

How can I grab that string representation of a number from inside
the .html file using regex and convert it to an integer value?

# ==============================
# open current html template and get the page ID number
# ==============================

f = open( '/home/webville/public_html/' + page )

#read first line of the file
firstline = f.readline()

page_id = re.match( '', firstline )
print ( page_id )

Νίκος · Aug 7, 2010

I also don't know what's wrong with this line:

host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0]

hostmatch = re.search('cyta', host)

if cookie.has_key('visitor') != 'nikos' or hostmatch is None:
    # do stuff

The 'stuff' never gets executed, while I want it to be, as long as I
don't have a regex match!

MRAB · Aug 7, 2010

Νίκος said:
Hello guys! Need your precious help again!

In every html file I have, on the very first line, a page_id for
counter purposes, in the format of a comment like this:





and so on. Every html file has its own page_id.

How can I grab that string representation of a number from inside
the .html file using regex and convert it to an integer value?

# ==============================
# open current html template and get the page ID number
# ==============================

f = open( '/home/webville/public_html/' + page )

#read first line of the file
firstline = f.readline()

page_id = re.match( '', firstline )
print ( page_id )

Use group capture:

page_id = re.match(r'', firstline).group(1)
print(page_id)
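
A minimal sketch of the group-capture idea. The comment format and the pattern are not visible in this thread, so the `<!-- 42 -->` layout, the pattern itself, and the helper name `extract_page_id` are all assumptions made for illustration. Note that `group(1)` always returns the captured digits as a string; the `int()` call is what does the conversion.

```python
import re

# Assumed comment format: <!-- 42 -->  (hypothetical; adjust to the real template)
PAGE_ID_RE = re.compile(r'<!--\s*(\d+)\s*-->')

def extract_page_id(first_line):
    """Return the page ID as an int, or None if the line has no ID comment."""
    match = PAGE_ID_RE.match(first_line)
    if match is None:
        return None
    # group(1) is always a str; int() turns it into a number
    return int(match.group(1))

print(extract_page_id('<!-- 42 -->\n'))  # 42
print(extract_page_id('<html>\n'))       # None
```

Checking for `None` before calling `group(1)` avoids an `AttributeError` on files whose first line lacks the comment.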

MRAB · Aug 7, 2010

Νίκος said:
I also don't know what's wrong with this line:

host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0]

hostmatch = re.search('cyta', host)

if cookie.has_key('visitor') != 'nikos' or hostmatch is None:
    # do stuff

The 'stuff' never gets executed, while I want it to be, as long as I
don't have a regex match!

Try printing out repr(host). Does it contain "cyta"?

Νίκος · Aug 7, 2010

Νίκος said:
Νίκος said:

I also don't know what's wrong with this line:

host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0]

hostmatch = re.search('cyta', host)

if cookie.has_key('visitor') != 'nikos' or hostmatch is None:
    # do stuff

The 'stuff' never gets executed, while I want it to be, as long as I
don't have a regex match!

Try printing out repr(host). Does it contain "cyta"?

Yes, it does contain it, as print showed!

Is something wrong with this line in logic or syntax?

if cookie.has_key('visitor') != 'nikos' or re.search('cyta', host) is None:
    # do database stuff

Νίκος · Aug 7, 2010

Use group capture:

    page_id = re.match(r'', firstline).group(1)
    print(page_id)

Worked like a charm! Thanks a lot!

So the match method here not only searched for the string representation
of the number but also converted it to an integer as well?

Does r stand for "retrieve the string" here?

And group?

When a regex searches a .txt file and retrieves something from it, does it
always retrieve it as a string? Or can it get it as a number as well?

Thomas Jollans · Aug 7, 2010

Worked like a charm! Thanks a lot!

So the match method here not only searched for the string representation
of the number but also converted it to an integer as well?

Does r stand for "retrieve the string" here?

r"xyz" is a raw string literal. That means that backslash escapes are
turned off -- r'\n' == '\\n'
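
The equality above can be checked directly. This is a small sketch; the variable names are only for illustration.

```python
# A raw string keeps the backslash as an ordinary character.
raw = r'\n'
escaped = '\\n'
newline = '\n'

print(raw == escaped)   # True: both are a backslash followed by 'n'
print(len(raw))         # 2
print(len(newline))     # 1: a single newline character
```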

MRAB · Aug 7, 2010

Νίκος said:
Νίκος said:

I also don't know what's wrong with this line:
host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0]
hostmatch = re.search('cyta', host)
if cookie.has_key('visitor') != 'nikos' or hostmatch is None:
    # do stuff
The 'stuff' never gets executed, while I want it to be, as long as I
don't have a regex match!

Try printing out repr(host). Does it contain "cyta"?

Yes, it does contain it, as print showed!

Is something wrong with this line in logic or syntax?

if cookie.has_key('visitor') != 'nikos' or re.search('cyta', host) is None:
    # do database stuff

You said "i want them to be as long as i dont have regex match".

re.search('cyta', host) will return None if there's no match, but you
said "Yes it does contain it", so there _is_ a match, therefore:

hostmatch is None

is False.
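
A short sketch of that behaviour. The hostnames are made up for illustration, and `host_matches` is a hypothetical helper name, not something from the original script.

```python
import re

def host_matches(host):
    """True when 'cyta' occurs anywhere in the hostname."""
    return re.search('cyta', host) is not None

# Both hostnames are invented examples.
print(re.search('cyta', 'dsl-1.cyta.example') is None)  # False: there is a match
print(re.search('cyta', 'other.example') is None)       # True: no match
print(host_matches('dsl-1.cyta.example'))               # True
```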

MRAB · Aug 7, 2010

Νίκος said:
Worked like a charm! Thanks a lot!

So the match method here not only searched for the string representation
of the number but also converted it to an integer as well?

Does r stand for "retrieve the string" here?

And group?

When a regex searches a .txt file and retrieves something from it, does it
always retrieve it as a string? Or can it get it as a number as well?

The 'r' prefix makes it a 'raw string literal'. That means that the
string literal won't treat backslashes as special. Before raw string
literals were added to the Python language I would have needed to write:

''

instead.

(Actually, that's not strictly true in this case, because \d doesn't
have a special meaning in Python strings, but it's a good idea to use raw
string literals habitually when writing regexes in order to reduce the
chance of forgetting them when they _are_ necessary. Well, that's what I
think, anyway.)
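
One case where the raw prefix _is_ necessary is `\b`, which means backspace in an ordinary string literal but a word boundary in a regex. The sketch below uses a made-up sample string to show the difference.

```python
import re

text = 'page 42'

plain = '\b42\b'    # '\b' here is the backspace character (ASCII 8)
raw = r'\b42\b'     # '\b' here reaches the regex engine as a word boundary

print(len(plain), len(raw))           # 4 6
print(re.search(plain, text))         # None: the text has no backspaces
print(re.search(raw, text).group())   # 42
```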

Νίκος · Aug 7, 2010

re.search('cyta', host) will return None if there's no match, but you
said "Yes it does contain it", so there _is_ a match, therefore:

    hostmatch is None

is False.

The code block inside the if statement must be executed ONLY if the
'visitor' cookie is not set in the client's browser or the hostname
of the client doesn't contain the string 'cyta'.

# ======================================
# do not increment the counter if a Cookie is already set in the visitor's browser
# ======================================

if cookie.has_key('visitor') != 'nikos' or re.search('cyta', host) is None:

I still don't get it

Νίκος · Aug 7, 2010

The 'r' prefix makes it a 'raw string literal'. That means that the
string literal won't treat backslashes as special. Before raw string
literals were added to the Python language I would have needed to write:

    ''

instead.

(Actually, that's not strictly true in this case, because \d doesn't
have a special meaning in Python strings, but it's a good idea to use raw
string literals habitually when writing regexes in order to reduce the
chance of forgetting them when they _are_ necessary. Well, that's what I
think, anyway.)

Couldn't agree more!

As the saying goes, better safe than sorry!

Thomas Jollans · Aug 7, 2010

cookie.has_key('visitor') != 'nikos'

This is always True. has_key returns a bool, which is never equal to any
string, even 'nikos'.
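
A small sketch of this pitfall. The thread's code uses Python 2's `has_key()`, which was removed in Python 3, so the `in` operator stands in for it here. Using `http.cookies.SimpleCookie` for the `cookie` object is an assumption, since the thread does not show how the cookie was created.

```python
from http.cookies import SimpleCookie

# Assumed cookie setup; the original script's cookie object is not shown.
cookie = SimpleCookie()
cookie.load('visitor=nikos')

has_visitor = 'visitor' in cookie      # a bool, like the old has_key()
print(has_visitor != 'nikos')          # True: a bool never equals a str
print((not has_visitor) != 'nikos')    # True as well, whatever the cookie holds

# Comparing the cookie's value is what actually tests for 'nikos'.
print(cookie['visitor'].value == 'nikos')   # True
```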

MRAB · Aug 7, 2010

Thomas said:
This is always True. has_key returns a bool, which is never equal to any
string, even 'nikos'.

I missed that bit!

Anyway, the OP said "the 'stuff' never gets executed". Kinda puzzling...

Νίκος · Aug 7, 2010

This is always True. has_key returns a bool, which is never equal to any
string, even 'nikos'.

if cookie.has_key('visitor') or re.search('cyta', host) is None:

addresses the problem.

Thanks a lot, Thomas and MRAB, for ALL your help!
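
For comparison, here is a sketch of the condition as it was described in words earlier in the thread: run the block only if the 'visitor' cookie is not set, or the hostname does not contain 'cyta'. It tests for the cookie's absence, whereas the final `if` line in this thread tests for its presence; which one is intended depends on the counter logic, which the thread does not show. The helper name `should_count`, the plain dicts, and the hostnames are all hypothetical stand-ins.

```python
import re

def should_count(cookie, host):
    """Run the counter block if the visitor cookie is absent
    or the hostname does not contain 'cyta'."""
    return 'visitor' not in cookie or re.search('cyta', host) is None

# Plain dicts and made-up hostnames stand in for the real CGI values.
print(should_count({}, 'dsl-1.cyta.example'))                     # True: no cookie
print(should_count({'visitor': 'nikos'}, 'other.example'))        # True: no 'cyta'
print(should_count({'visitor': 'nikos'}, 'dsl-1.cyta.example'))   # False: both present
```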
