Uniquely identifying each and every HTML template

Ferrous Cranus

I use this .htaccess file to rewrite every .html request to counter.py

# =================================================================================================================
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^/?(.+\.html) /cgi-bin/counter.py?htmlpage=$1 [L,PT,QSA]
# =================================================================================================================



The counter.py script creates, stores, increments and displays a counter for each webpage of every website I have.
It's supposed to identify each webpage by a <!-- Number --> comment and then do its database stuff from there:

# =================================================================================================================
# open current html template and get the page ID number
# =================================================================================================================
import re

# 'page' holds the htmlpage value passed in by the rewrite rule above
with open( '/home/nikos/public_html/' + page ) as f:
    # read first line of the file
    firstline = f.readline()

# find the ID of the file and store it (raises AttributeError if the first line has no ID comment)
pin = re.match( r'<!-- (\d+) -->', firstline ).group(1)
# =================================================================================================================

It works as expected, and you can see it working by viewing http://superhost.gr (the counter is at the bottom of the page).

What is the problem, you ask?
The problem is that I have to insert, on the very first line of every .html template of mine, a unique string containing a number, like:

index.html <!-- 1 -->
somefile.html <!-- 2 -->
other.html <!-- 3 -->
nikos.html <!-- 4 -->
cool.html <!-- 5 -->

to help counter.py identify each webpage in a unique way.

Well... there are about 1000 .html files inside my DocumentRoot, and I cannot edit ALL of them, of course!
Some of them were created with Notepad++, some with Dreamweaver and some others with the Joomla CMS.
Even if I could embed a number in every HTML page, it would be a very tedious task, and what if a change were in order? Edit them ALL again? Of course not.

My question is HOW am I supposed to identify each and every HTML webpage I have without editing them to embed a string containing a number? In other words, without altering their contents.

Or perhaps by modifying them a bit... but in an automatic way?

Thank you ALL in advance.
 
John Gordon

Ferrous Cranus said:
The problem is that I have to insert, on the very first line of every .html template of mine, a unique string containing a number, like:
index.html <!-- 1 -->
somefile.html <!-- 2 -->
other.html <!-- 3 -->
nikos.html <!-- 4 -->
cool.html <!-- 5 -->
to help counter.py identify each webpage in a unique way.

Instead of inserting unique content in every page, can you use the
document path itself as the identifier?
 
Dave Angel

I don't understand the problem. A trivial Python script could scan
through all the files in the directory, checking which ones are missing
the identifier, and rewriting the file with the identifier added.
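
A rough sketch of such a script, under some assumptions: the function names `read_page_id` and `tag_missing_pages` are made up for illustration, the ID comment is expected on the first line as in the original post, and new IDs simply continue from the highest one already in use. Files are handled as bytes so that pages in any encoding are not altered beyond the added line.

```python
import os
import re

# Matches the ID comment from the original post, e.g. "<!-- 42 -->".
ID_PATTERN = re.compile(rb'<!--\s*(\d+)\s*-->')


def read_page_id(path):
    """Return the ID found on the first line of the file, or None."""
    with open(path, 'rb') as f:
        match = ID_PATTERN.match(f.readline())
    return int(match.group(1)) if match else None


def tag_missing_pages(root):
    """Prepend '<!-- N -->' to every .html file under root that lacks an ID.

    Returns a dict mapping each newly tagged path to the ID it received.
    """
    html_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith('.html'):
                html_files.append(os.path.join(dirpath, name))

    existing = {path: read_page_id(path) for path in html_files}
    used = [pid for pid in existing.values() if pid is not None]
    next_id = max(used, default=0) + 1

    assigned = {}
    for path in html_files:
        if existing[path] is not None:
            continue  # already tagged, leave the file untouched
        with open(path, 'rb') as f:
            content = f.read()
        header = ('<!-- %d -->\n' % next_id).encode('ascii')
        with open(path, 'wb') as f:
            f.write(header + content)
        assigned[path] = next_id
        next_id += 1
    return assigned
```

Running `tag_missing_pages('/home/nikos/public_html')` a second time should change nothing, since every page already carries an ID by then. It does not help if an external tool later regenerates the files without the comment.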

So, since you didn't come to that conclusion, there must be some other
reason you don't want to edit the files. Is it that the real sources
are elsewhere (e.g. Dreamweaver), and whenever one recompiles those
sources, these files get replaced (without identifiers)?

If that's the case, then I figure you have about 3 choices:

1) use the file path as your key, instead of requiring a number
2) use a hash of the page (e.g. md5) as your key. Of course, this could
mean that you get a new value whenever the page is updated. That's good
in many situations, but you don't give enough information to know if
that's desirable for you or not.
3) Keep an external list of filenames, and their associated id numbers.
The database would be a good place to store such a list, in a separate
table.
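
To make option 3 a little more concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for whatever database counter.py really talks to. The table name `pages`, its columns, and the helper names are assumptions chosen for illustration. The idea is only that the ID lives in a table next to the path, not inside the HTML file.

```python
import sqlite3


def open_registry(db_path=':memory:'):
    """Open (or create) the lookup table that maps file paths to page IDs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages ('
        ' id INTEGER PRIMARY KEY AUTOINCREMENT,'
        ' path TEXT NOT NULL UNIQUE,'
        ' hits INTEGER NOT NULL DEFAULT 0)'
    )
    return conn


def page_id_for(conn, path):
    """Return the ID registered for path, creating a row on first sight."""
    row = conn.execute('SELECT id FROM pages WHERE path = ?', (path,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute('INSERT INTO pages (path) VALUES (?)', (path,))
    conn.commit()
    return cur.lastrowid


def count_hit(conn, path):
    """Increment the counter of the page stored at path and return it."""
    pid = page_id_for(conn, path)
    conn.execute('UPDATE pages SET hits = hits + 1 WHERE id = ?', (pid,))
    conn.commit()
    return conn.execute('SELECT hits FROM pages WHERE id = ?', (pid,)).fetchone()[0]
```

With this layout the HTML files never need to be edited, and a rename can be handled by updating the `path` column of the existing row, provided something tells the database that the rename happened.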
 
Ferrous Cranus

On Friday, 18 January 2013 10:59:17 PM UTC+2, John Gordon wrote:
Instead of inserting unique content in every page, can't you use the
document path itself as the identifier?

No, I cannot, because it would mess things up later when I, for example:

1. mv name.html othername.html (document's filename altered)
2. mv name.html /subfolder/name.html (document's path altered)

Hence, new database counters would be created in each of the above cases.
 
Ferrous Cranus

On Saturday, 19 January 2013 12:09:28 AM UTC+2, Dave Angel wrote:
I don't understand the problem. A trivial Python script could scan
through all the files in the directory, checking which ones are missing
the identifier, and rewriting the file with the identifier added.

So, since you didn't come to that conclusion, there must be some other
reason you don't want to edit the files. Is it that the real sources
are elsewhere (e.g. Dreamweaver), and whenever one recompiles those
sources, these files get replaced (without identifiers)?

Exactly. Files get modified/updated, so the embedded identifier will go missing each time. Relying on code embedded in the HTML template content is therefore not practical.

If that's the case, then I figure you have about 3 choices:
1) use the file path as your key, instead of requiring a number

No, I cannot, because it would mess things up later when I, for example:

1. mv name.html othername.html (document's filename altered)
2. mv name.html /subfolder/name.html (document's filepath altered)

Hence, a new database counter would be created for each of the above actions, so I would have 2 counters for the same file, and the latter one would start from zero.

Pros: If the file's contents get updated, that won't affect the counter.
Cons: If the filepath is altered, duplication will happen.

2) use a hash of the page (eg. md5) as your key. of course this could
mean that you get a new value whenever the page is updated. That's good
in many situations, but you don't give enough information to know if
that's desirable for you or not.

That sounds nice! A hash is a mathematical algorithm that produces a unique number after analyzing each file's contents? But then again, what if the HTML template gets updated? That update will create a new hash for the file, hence another counter will be created for the same file: the same end result as solution (1).

Pros: If the filepath is altered, that won't affect the counter.
Cons: If the file's contents get updated, duplication will happen.
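
The two sets of pros and cons above can be checked with a small experiment. The sketch below is only an illustration: `path_key`, `content_key` and `demo` are made-up names, and MD5 is used because it was the digest suggested in the thread. A digest is not strictly guaranteed to be unique, although collisions between ordinary pages are very unlikely.

```python
import hashlib
import os
import tempfile


def path_key(path, root):
    """Key a page by its path relative to the document root."""
    return os.path.relpath(path, root)


def content_key(path):
    """Key a page by the MD5 digest of its bytes."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()


def demo():
    """Report which key survives a rename and which survives an edit."""
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, 'name.html')
        new = os.path.join(root, 'othername.html')
        with open(old, 'w') as f:
            f.write('<html> Hello </html>')
        p0, h0 = path_key(old, root), content_key(old)

        os.rename(old, new)  # filename altered, contents untouched
        p1, h1 = path_key(new, root), content_key(new)

        with open(new, 'w') as f:  # contents altered, filename untouched
            f.write('<html> Hello, people </html>')
        p2, h2 = path_key(new, root), content_key(new)

    return {
        'path key survives rename': p0 == p1,
        'hash key survives rename': h0 == h1,
        'path key survives edit': p1 == p2,
        'hash key survives edit': h1 == h2,
    }
```

Calling `demo()` shows each key surviving exactly one of the two changes, so neither of them alone covers a page that is both renamed and edited.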

3) Keep an external list of filenames, and their associated id numbers.
The database would be a good place to store such a list, in a separate table.

I did not understand that solution.


We need to find a way so that even IF:

(filepath gets modified && file contents get modified) simultaneously, the counter will STILL retain its value.
 
Dave Angel

Ferrous Cranus said:
We need to find a way so that even IF:

(filepath gets modified && file contents get modified) simultaneously, the counter will STILL retain its value.

You don't yet have a programming problem, you have a specification
problem. Somehow, you want a file to be considered "the same" even when
it's moved, renamed and/or modified. So all files are the same, and you
only need one id.

Don't pick a mechanism until you have a self-consistent spec.
 
Dennis Lee Bieber

Ferrous Cranus said:
We need to find a way so that even IF:

(filepath gets modified && file contents get modified) simultaneously, the counter will STILL retain its value.

The only viable solution that /I/ can visualize is one in which any
operation ON the template file MUST ALSO operate on the counter... That
is, you only operate on the templates using a front-end application that
automatically links the counter information every time...
 
Ferrous Cranus

On Saturday, 19 January 2013 11:32:41 PM UTC+2, Dennis Lee Bieber wrote:
The only viable solution that /I/ can visualize is one in which any
operation ON the template file MUST ALSO operate on the counter... That
is, you only operate on the templates using a front-end application that
automatically links the counter information every time...

CANNOT BE DONE, because every HTML template has been written with a different tool:

Dreamweaver
Joomla
Notepad++
 
Ferrous Cranus

On Saturday, 19 January 2013 11:00:15 AM UTC+2, Dave Angel wrote:
You don't yet have a programming problem, you have a specification
problem. Somehow, you want a file to be considered "the same" even when
it's moved, renamed and/or modified. So all files are the same, and you
only need one id.

Don't pick a mechanism until you have a self-consistent spec.


I do have the specification.

An .html page must retain its database counter value even if it is:

(renamed && moved && contents altered)


[original attributes of the file]:

filename: index.html
filepath: /home/nikos/public_html/
contents: <html> Hello </html>

[get modified to]:

filename: index2.html
filepath: /home/nikos/public_html/folder/subfolder/
contents: <html> Hello, people </html>


The file is still the same, even though its attributes got modified.
We want the counter.py script to still be able to "identify" the .html page, and hence its counter value, so that the counter gets increased properly.
 
Chris Angelico

Ferrous Cranus said:
An .html page must retain its database counter value even if it is:

(renamed && moved && contents altered)

Then you either need to tag them in some external way, or have some
kind of tracking operation - for instance, if you require that all
renames/moves be done through a script, that script can update its
pointer. Otherwise, you need magic, and lots of it.

ChrisA
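
The tracking operation mentioned above could look roughly like the following sketch. It assumes that page counters live in a table keyed by the page's current path. The `pages` schema and the `tracked_move` helper are hypothetical, and sqlite3 stands in for the real database. It only works if every rename or move goes through this helper.

```python
import os
import shutil
import sqlite3

# Hypothetical schema: one row per page, keyed by its current path.
SCHEMA = (
    'CREATE TABLE IF NOT EXISTS pages ('
    ' id INTEGER PRIMARY KEY,'
    ' path TEXT NOT NULL UNIQUE,'
    ' hits INTEGER NOT NULL DEFAULT 0)'
)


def tracked_move(conn, src, dst):
    """Move src to dst on disk, then repoint the matching row to dst.

    Returns True if exactly one row was updated, False if src was not
    registered (the file is still moved in that case).
    """
    parent = os.path.dirname(dst)
    if parent:
        os.makedirs(parent, exist_ok=True)  # allow moves into new subfolders
    shutil.move(src, dst)
    cur = conn.execute('UPDATE pages SET path = ? WHERE path = ?', (dst, src))
    conn.commit()
    return cur.rowcount == 1
```

Because the row keeps its `id` and `hits`, the counter carries over to the new location. A plain `mv` in a shell or a rename through a control panel bypasses the helper and leaves the table pointing at the old path.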
 
Ferrous Cranus

On Monday, 21 January 2013 9:20:15 AM UTC+2, Chris Angelico wrote:
Then you either need to tag them in some external way, or have some
kind of tracking operation - for instance, if you require that all
renames/moves be done through a script, that script can update its
pointer. Otherwise, you need magic, and lots of it.

ChrisA

This Python script acts upon websites that other people use, and
every HTML template has been written with a different tool (Notepad++, Dreamweaver, Joomla).

Renames and moves are performed by the website owners, either through shell access or through cPanel.

That being said, I have no control over HOW and WHEN users alter their HTML pages.
 
Chris Angelico

Ferrous Cranus said:
This Python script acts upon websites that other people use, and
every HTML template has been written with a different tool (Notepad++, Dreamweaver, Joomla).

Renames and moves are performed by the website owners, either through shell access or through cPanel.

That being said, I have no control over HOW and WHEN users alter their HTML pages.

Then I recommend investing in some magic. There's an old-established
business JW Wells & Co, Family Sorcerers. They've a first-rate
assortment of magic, and for raising a posthumous shade with effects
that are comic, or tragic, there's no cheaper house in the trade! If
anyone anything lacks, he'll find it all ready in stacks, if he'll
only look in on the resident Djinn, number seventy, Simmery Axe!

Seriously, you're asking for something that's beyond the power of
humans or computers. You want to identify that something's the same
file, without tracking the change or having any identifiable tag.
That's a fundamentally impossible task.

ChrisA
 
Ferrous Cranus

On Monday, 21 January 2013 11:31:24 AM UTC+2, Chris Angelico wrote:
Then I recommend investing in some magic. There's an old-established
business JW Wells & Co, Family Sorcerers. They've a first-rate
assortment of magic, and for raising a posthumous shade with effects
that are comic, or tragic, there's no cheaper house in the trade! If
anyone anything lacks, he'll find it all ready in stacks, if he'll
only look in on the resident Djinn, number seventy, Simmery Axe!

Seriously, you're asking for something that's beyond the power of
humans or computers. You want to identify that something's the same
file, without tracking the change or having any identifiable tag.
That's a fundamentally impossible task.

No, it is difficult but not impossible.
It just cannot be done by tagging the file by:

1. filename
2. filepath
3. hash (math algorithm producing a string based on the file's contents)

We need another way to identify the file WITHOUT using the above attributes.
 
Oscar Benjamin

Ferrous Cranus said:
No, it is difficult but not impossible.
It just cannot be done by tagging the file by:

1. filename
2. filepath
3. hash (math algorithm producing a string based on the file's contents)

We need another way to identify the file WITHOUT using the above attributes.

This is a very old problem (still unsolved I believe):
http://en.wikipedia.org/wiki/Ship_of_Theseus


Oscar
 
