Using filepath method to identify an .html page

Ferrous Cranus

Hello, I decided to switch from embedding an ID string in each .html file to using the file's path to identify it:

# =================================
# open current html template and get the page ID number
# =================================

import re

f = open( page )

# read first line of the file
firstline = f.readline()

# find the ID of the file and store it
pin = re.match( r'<!-- (\d+) -->', firstline ).group(1)
# =================================

This is what I used to have.

Now, can you please help me switch to a filepath-based identifier?
I am having trouble writing it.
 
Steven D'Aprano

Ferrous Cranus said:
Hello, I decided to switch from embedding an ID string in each .html file
to using the file's path to identify it:

What do you think "the filepath" means, and how do you think you would
grab it?

I can only guess you mean the full path to the file, like:

/home/steve/mypage.html

C:\My Documents\mypage.html


Is that what you mean?

Ferrous Cranus said:
# open current html template and get the page ID number
f = open( page )
# read first line of the file
firstline = f.readline()
# find the ID of the file and store it
pin = re.match( r'<!-- (\d+) -->', firstline ).group(1)

This is what I used to have.

Now, can you please help me switch to a filepath-based identifier? I am
having trouble writing it.

I don't understand the question.
 
Ferrous Cranus

# ==================================================
# produce a hash based on the html page's filepath and convert it to an integer, which will be used to identify the page itself.
# ==================================================

pin = int( hashlib.md5( htmlpage ) )


I just tried that but it produced an error.
What am I doing wrong?
 
Ferrous Cranus

# ==================================================
# produce a hash string based on the html page's filepath and convert it to an integer, which will then be used to identify the page itself
# ==================================================

pin = int( hashlib.md5( htmlpage ) )

This fails. Why?

htmlpage = a string representing the absolute path of the requested .html file
hashlib.md5( htmlpage ) = conversion of the above string to a hashed string
int( hashlib.md5( htmlpage ) ) = conversion of the above hashed string to a number

Why does this fail?
 
Lele Gaifax

Ferrous Cranus said:
pin = int( hashlib.md5( htmlpage ) )

This fails. Why?

htmlpage = a string representing the absolute path of the requested .html file
hashlib.md5( htmlpage ) = conversion of the above string to a hashed string

No, that statement does not "convert" one string into another; it
returns an "md5 HASH object":
<md5 HASH object @ 0xb76dbcf0>

By consulting the hashlib documentation [1], you can learn about that
object's methods.

Ferrous Cranus said:
int( hashlib.md5( htmlpage ) ) = conversion of the above hashed string to a number

Why does this fail?

Because in general you can't "convert" an arbitrary object instance (in
your case a hashlib.HASH instance) to an integer value. Why do you need
an integer? Isn't hexdigest() what you want?
acbd18db4cc2f85cedef654fccc4a4d8

Do yourself a favor and learn to use the interpreter to test your
snippets line by line; most problems will find an easy answer :)

ciao, lele.

[1] http://docs.python.org/2.7/library/hashlib.html#module-hashlib
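
A minimal sketch of the distinction Lele describes, assuming Python 3 (where md5() needs bytes, so the path is encoded first; the thread itself used Python 2.7). The path value is made up for illustration:

```python
import hashlib

# Hypothetical path, used only for illustration.
htmlpage = "/home/nikos/public_html/index.html"

# md5() returns a hash object, not a string.
# Python 3 requires bytes, hence encode() (an assumption; the thread used 2.7).
h = hashlib.md5(htmlpage.encode("utf-8"))
print(type(h).__name__)

# hexdigest() turns the hash object into a 32-character hexadecimal string.
digest = h.hexdigest()
print(digest)

# The digest quoted in the post above is the MD5 of the string "foo".
print(hashlib.md5(b"foo").hexdigest())
```

The hash object itself is only a handle; the usable value comes from hexdigest() (text) or digest() (raw bytes).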
 
Chris Angelico

Ferrous Cranus said:
# ==================================================
# produce a hash based on the html page's filepath and convert it to an integer, which will be used to identify the page itself.
# ==================================================

pin = int( hashlib.md5( htmlpage ) )


I just tried that but it produced an error.
What am I doing wrong?

First and foremost, here's what you're doing wrong: You're saying "it
produced an error". Python is one of those extremely helpful languages
that tells you, to the best of its ability, exactly WHAT went wrong,
WHERE it went wrong, and - often - WHY it failed. For comparison, I've
just tonight been trying to fix up a legacy accounting app that was
written in Visual BASIC back when that wouldn't get scorn heaped on
you from the whole world. When we fire up one particular module, it
bombs with a little message box saying "File not found". That's all.
Just one little message, and the application terminates (uncleanly, at
that). What file? How was it trying to open it? I do know that it
isn't one of its BTrieve data files, because when one of THEM isn't
found, the crash looks different (but it's still a crash). My current
guess is that it's probably a Windows DLL file or something, but it's
really not easy to tell...

ChrisA
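
To make Chris's point concrete, here is a small sketch of how the failing line can be run so that the full error text is available to paste into a post. It assumes Python 3 and a made-up path; the exact wording of the message differs between Python versions, so none is shown here:

```python
import hashlib
import traceback

htmlpage = "/home/nikos/public_html/index.html"  # hypothetical path
error_type = None

try:
    # The line from the question (with encode() added for Python 3).
    pin = int(hashlib.md5(htmlpage.encode("utf-8")))
except TypeError as exc:
    error_type = type(exc).__name__
    # The traceback says WHAT failed, WHERE, and usually WHY.
    traceback.print_exc()
```

Running the line in the interactive interpreter prints the same traceback without needing the try/except.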
 
Dave Angel

Ferrous Cranus said:
# ==================================================
# produce a hash string based on the html page's filepath and convert it to an integer, which will then be used to identify the page itself
# ==================================================

pin = int( hashlib.md5( htmlpage ) )

This fails. Why?

htmlpage = a string representing the absolute path of the requested .html file
hashlib.md5( htmlpage ) = conversion of the above string to a hashed string
int( hashlib.md5( htmlpage ) ) = conversion of the above hashed string to a number

Why does this fail?

Is your copy/paste broken? It could be useful to actually show in what
way it "fails."

The md5 method produces a "HASH object", not a string. So int() cannot
process that.

To produce a digest string from the hash object, you want to call its
hexdigest() method. The result is a string of hexadecimal digits, so you
cannot just call int() on it, since int() defaults to decimal.

To convert a hex string to an int, you need the extra parameter of int:

int(mystring, 16)

Now, see if you can piece it together.
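
For readers following along, Dave's clues can be pieced together roughly as below. This is a sketch assuming Python 3; the function name path_to_int and the example path are hypothetical, not from the thread:

```python
import hashlib

def path_to_int(path):
    """Return the MD5 digest of path as a (very large) integer."""
    # md5() needs bytes in Python 3, so encode the string first.
    hex_digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    # The second argument tells int() to parse the string as base 16.
    return int(hex_digest, 16)

pin = path_to_int("/home/nikos/public_html/index.html")
print(pin)
```

The result can be as large as 2**128 - 1, so it is nowhere near a short number, and two different paths can in principle still hash to the same value.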
 
Ferrous Cranus

On Tuesday, 22 January 2013 2:29:21 PM UTC+2, Dave Angel wrote:
Is your copy/paste broken? It could be useful to actually show in what
way it "fails."

The md5 method produces a "HASH object", not a string. So int() cannot
process that.

To produce a digest string from the hash object, you want to call
hexdigest() method. The result of that is a hex literal string. So you
cannot just call int() on it, since that defaults to decimal.

To convert a hex string to an int, you need the extra parameter of int:

int(mystring, 16)

Now, see if you can piece it together.


htmlpage = a string representing the absolute path of the requested .html file


What I want to do is associate a number with an html page's absolute path, so that I can use that number for my database relations instead of the BIG absolute path string.

So to get an integer out of a string I would just have to type:

pin = int( htmlpage )

But would that be unique?
 
Ferrous Cranus

On Tuesday, 22 January 2013 2:47:16 PM UTC+2, Ferrous Cranus wrote:
htmlpage = a string representing the absolute path of the requested .html file

What I want to do is associate a number with an html page's absolute path, so that I can use that number for my database relations instead of the BIG absolute path string.

So to get an integer out of a string I would just have to type:

pin = int( htmlpage )

But would that be unique?

Another error, even without hashing anything. Please visit http://superhost.gr to view it.
 
Chris Angelico

Ferrous Cranus said:
What I want to do is associate a number with an html page's absolute path, so that I can use that number for my database relations instead of the BIG absolute path string.

So to get an integer out of a string I would just have to type:

pin = int( htmlpage )

But would that be unique?

The absolute path probably isn't that big. Just use it. Any form of
hashing will give you a chance of a collision.

ChrisA
 
Steven D'Aprano

Ferrous Cranus said:
htmlpage = a string representing the absolute path of the requested
.html file


That is a very misleading name for a variable. The contents of the
variable are not an HTML page, but a file name.

htmlpage = "/home/steve/my-web-page.html" # Bad variable name.

filename = "/home/steve/my-web-page.html" # Better variable name.


Ferrous Cranus said:
What I want to do is associate a number with an html page's absolute
path, so that I can use that number for my database relations instead
of the BIG absolute path string.

Firstly, don't bother. What you consider "BIG", your database will
consider trivially small. What is it, 100 characters long? 200? Unlikely
to be 300, since I think many file systems don't support paths that long.
But let's say it is 300 characters long.

That's likely to be 600 bytes, or a bit more than half a kilobyte. Your
database won't even notice that.

Ferrous Cranus said:
So to get an integer out of a string I would just have to type:

pin = int( htmlpage )

No, that doesn't work. int() does not convert arbitrary strings into
numbers. What made you think that this could possibly work?

What do you expect int("my-web-page.html") to return? Should it return 23
or 794 or 109432985462940911485 or 42?

Ferrous Cranus said:
But would that be unique?

Wrong question.


Just tell your database to make the file name an indexed field, and it
will handle giving every path a unique number for you. You can then
forget all about that unique number, because it is completely irrelevant
to you, and safely use the path while the database treats it in the
fastest and most efficient fashion necessary.
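
Steven's suggestion might look roughly like the sketch below. It uses the standard-library sqlite3 module as a stand-in so the example runs anywhere; the table name, column names, and the record_hit helper are all hypothetical, and the SQL for another database would differ in detail:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# UNIQUE makes the database build an index on 'page';
# INTEGER PRIMARY KEY gives every new path a small number automatically.
conn.execute(
    "CREATE TABLE pages ("
    " pin INTEGER PRIMARY KEY,"
    " page TEXT NOT NULL UNIQUE,"
    " hits INTEGER NOT NULL DEFAULT 0)"
)

def record_hit(conn, page):
    """Add the page if it is new, bump its hit counter, and return its pin."""
    conn.execute("INSERT OR IGNORE INTO pages (page) VALUES (?)", (page,))
    conn.execute("UPDATE pages SET hits = hits + 1 WHERE page = ?", (page,))
    row = conn.execute("SELECT pin FROM pages WHERE page = ?", (page,)).fetchone()
    return row[0]

pin_a = record_hit(conn, "/home/nikos/public_html/index.html")
pin_b = record_hit(conn, "/home/nikos/public_html/about.html")
pin_a_again = record_hit(conn, "/home/nikos/public_html/index.html")
print(pin_a, pin_b, pin_a_again)
```

The database hands out the numbers and guarantees they are unique, so no hashing is needed and no collision is possible.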
 
Ferrous Cranus

On Tuesday, 22 January 2013 3:04:41 PM UTC+2, Steven D'Aprano wrote:
What do you expect int("my-web-page.html") to return? Should it return 23
or 794 or 109432985462940911485 or 42?

Just tell your database to make the file name an indexed field, and it
will handle giving every path a unique number for you. You can then
forget all about that unique number, because it is completely irrelevant
to you, and safely use the path while the database treats it in the
fastest and most efficient fashion necessary.

This counter.py will run in a shared hosting environment, so absolute paths are BIG and expected to look like this:

/home/nikos/public_html/varsa.gr/articles/html/files/index.html

In addition to that, my counter.py script maintains details in a database table that stores information for each and every webpage requested.

My 'visitors' database has 2 tables:

pin --- page ---- hits (that's to store general information for all html pages)

pin <-refers to-> page

pin ---- host ---- hits ---- useros ---- browser ---- date (that's to store detailed information for all html pages)

(thousands of records to hold every page's information)


'pin' has to be a number because if I used the column 'page' instead, just imagine the database capacity needed to hold detailed information for each and every .html page requested by visitors!!!

So I really, really need to associate a (4-digit integer <=> html page's absolute path)

Maybe it can be done by creating a MySQL association between the two columns, but I don't know how such a thing can be done (if it can).

So, that's why I need to get a "unique" number out of a string. Please help.
 
Chris Angelico

Ferrous Cranus said:
I expected a unique number to be produced from the given string, so I could have a (number <=> string) relation. What does int( somestring ) really return? I don't have IDLE to test.

Just run python without any args, and you'll get interactive mode. You
can try things out there.

Ferrous Cranus said:
This counter.py will run in a shared hosting environment, so absolute paths are BIG and expected to look like this:

/home/nikos/public_html/varsa.gr/articles/html/files/index.html

That's not big. Trust me, modern databases work just fine with unique
indexes like that. The most common way to organize the index is with a
binary tree, so the database has to look through log(N) entries.
That's like figuring out if the two numbers 142857 and 857142 are the
same; you don't need to look through 1,000,000 possibilities, you just
need to look through the six digits each number has.

Ferrous Cranus said:
'pin' has to be a number because if I used the column 'page' instead, just imagine the database capacity needed to hold detailed information for each and every .html page requested by visitors!!!

Not that bad actually. I've happily used keys easily that long, and
expected the database to ensure uniqueness without costing
performance.

Ferrous Cranus said:
So I really, really need to associate a (4-digit integer <=> html page's absolute path)

Is there any chance that you'll have more than 10,000 pages? If so, a
four-digit number is *guaranteed* to have duplicates. And if you
research the Birthday Paradox, you'll find that any sort of hashing
algorithm is likely to produce collisions a lot sooner than that.

Ferrous Cranus said:
Maybe it can be done by creating a MySQL association between the two columns, but I don't know how such a thing can be done (if it can).

So, that's why I need to get a "unique" number out of a string. Please help.

Ultimately, that unique number would end up being a foreign key into a
table of URLs and IDs. So just skip that table and use the URLs
directly - much easier. In this instance, there's no value in
normalizing.

ChrisA
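
The Birthday Paradox remark can be checked with a few lines of arithmetic. The sketch below assumes an idealized hash that spreads pages uniformly over the 10,000 possible four-digit values; the function name is hypothetical:

```python
def collision_probability(n_items, n_slots=10_000):
    """Chance that at least two of n_items share a slot, assuming uniform hashing."""
    p_all_unique = 1.0
    for i in range(n_items):
        # The (i+1)-th item must avoid the i slots already taken.
        p_all_unique *= (n_slots - i) / n_slots
    return 1.0 - p_all_unique

for n in (10, 100, 200, 1000):
    print(n, round(collision_probability(n), 3))
```

Under this model a collision becomes more likely than not well before 200 pages, far below the 10,000 pages at which one is guaranteed.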
 
Ferrous Cranus

On Tuesday, 22 January 2013 4:33:03 PM UTC+2, Chris Angelico wrote:
Just run python without any args, and you'll get interactive mode. You
can try things out there.

That's not big. Trust me, modern databases work just fine with unique
indexes like that. The most common way to organize the index is with a
binary tree, so the database has to look through log(N) entries.
That's like figuring out if the two numbers 142857 and 857142 are the
same; you don't need to look through 1,000,000 possibilities, you just
need to look through the six digits each number has.

Not that bad actually. I've happily used keys easily that long, and
expected the database to ensure uniqueness without costing
performance.

Is there any chance that you'll have more than 10,000 pages? If so, a
four-digit number is *guaranteed* to have duplicates. And if you
research the Birthday Paradox, you'll find that any sort of hashing
algorithm is likely to produce collisions a lot sooner than that.

Ultimately, that unique number would end up being a foreign key into a
table of URLs and IDs. So just skip that table and use the URLs
directly - much easier. In this instance, there's no value in
normalizing.

ChrisA

I insist, or perhaps am compelled, on using a key to associate a number with a filename.
Would you help, please?

I don't know how this is supposed to be written. I just know I need this:

number = function_that_returns_a_number_out_of_a_string( absolute_path_of_a_html_file)

Would someone help me write that in Python? We are talking about 1 line of code here....
 
Dave Angel

Ferrous Cranus said:
I insist, or perhaps am compelled, on using a key to associate a number with a filename.
Would you help, please?

I don't know how this is supposed to be written. I just know I need this:

number = function_that_returns_a_number_out_of_a_string( absolute_path_of_a_html_file)

Would someone help me write that in Python? We are talking about 1 line of code here....

I gave you every piece of that code in my last response. So you're not
willing to compose the line from the clues?
 
Chris Angelico

Ferrous Cranus said:
I insist, or perhaps am compelled, on using a key to associate a number with a filename.
Would you help, please?

I don't know how this is supposed to be written. I just know I need this:

number = function_that_returns_a_number_out_of_a_string( absolute_path_of_a_html_file)

Would someone help me write that in Python? We are talking about 1 line of code here....

def function_that_returns_a_number_out_of_a_string(string, cache=[]):
    # The mutable default argument is deliberate: the list persists between calls.
    return (cache.index(string) if string in cache
            else (cache.append(string) or len(cache)-1))

That will work perfectly, as long as you don't care how long the
numbers end up, and as long as you have a single Python script doing
the work, and as long as you make sure you save and load that cache
any time you shut down the script, and so on.

It will also, and rightly, be decried as a bad idea. But hey, you did
specify that it be one line of code. For your real job, USE A DATABASE
COLUMN.

ChrisA
 
Ferrous Cranus

On Tuesday, 22 January 2013 5:05:49 PM UTC+2, Dave Angel wrote:
I gave you every piece of that code in my last response. So you're not
willing to compose the line from the clues?

I cannot.
I don't even know yet if hashing needs to be used for what I need.

The only things I know are:

a) I only need to get a number out of a string (an absolute path)
b) That number needs to be unique, because "that" number is an indicator to the actual html file.

Would you help me write this in Python?

Why the hell does

pin = int ( '/home/nikos/public_html/index.html' )

fail? Because it has slashes in it?
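
On this last question, Steven D'Aprano's earlier point applies: int() only parses text that already spells out an integer, so the slashes are no more of a problem than the letters. A short sketch that can be pasted into the interpreter, where try_int is a hypothetical helper written only for this illustration:

```python
def try_int(text):
    """Return int(text), or None if text does not spell an integer."""
    try:
        return int(text)
    except ValueError:
        return None

print(try_int("42"))                                  # digits only: parses
print(try_int("  42\n"))                              # surrounding whitespace is allowed
print(try_int("/home/nikos/public_html/index.html"))  # not a number
print(try_int("ff"))                                  # hex digits, but base 10 is the default
print(int("ff", 16))                                  # an explicit base makes it parse
```

int() reads a number that is already written in the string; it does not invent one for arbitrary text.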
 
