hash from long url to short url

joe

Hello, does anyone know how to write a function to generate a tiny URL with
letters and numbers only? Something almost always unique. Thanks.
 
Malcolm McLean

joe said:
Hello, does anyone know how to write a function to generate a tiny URL with
letters and numbers only? Something almost always unique. Thanks.
It's inherently impossible to collapse 36^N unique URLs to 36^(N/4) unique
tiny urls.
However there's a hash function on my websites. It's in one of the free
chapters under Basic Algorithms. I recommend that you split the string into
2, and generate 2 unsigned longs. The chance of a collision is so low as to
be negligible. Then use the modulus operation to reduce to alphanumeric.
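
A minimal sketch of the hash-then-modulus idea, assuming the Bernstein string hash (h = h * 33 + c, starting from 5381). The names hash_string and to_base36 are made up for illustration, and the split into two halves is left out; each half could be passed to hash_string separately.

```c
#include <stddef.h>

/* Bernstein string hash: h = h * 33 + c, starting from 5381.
   Assumed here for illustration; another string hash could be swapped in. */
unsigned long hash_string(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Reduce h to ndigits base-36 characters (0-9, then a-z) by repeated
   modulus and division, least significant digit first.
   out must have room for ndigits + 1 chars. */
void to_base36(unsigned long h, char *out, size_t ndigits)
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    size_t i;
    for (i = 0; i < ndigits; i++) {
        out[i] = digits[h % 36];
        h /= 36;
    }
    out[ndigits] = '\0';
}
```

Six base-36 characters cover 36^6 = 2,176,782,336 values, which roughly matches a 32-bit unsigned long.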
 
santosh

Malcolm said:
It's inherently impossible to collapse 36^N unique URLs to 36^(N/4)
unique tiny urls.
However there's a hash function on my websites. It's in one of the
free chapters under Basic Algorithms. I recommend that you split the
string into 2, and generate 2 unsigned longs. The chance of a
collision is so low as to be negligible. Then use the modulus
operation to reduce to alphanumeric.

Also a Google search seems to show up a lot of TinyURL generators though
not, AFAICS, in C. Maybe the OP can study the code and rewrite it in C.
 
Ben Bacarisse

Malcolm McLean said:
It's inherently impossible to collapse 36^N unique URLs to 36^(N/4)
unique tiny urls.
However there's a hash function on my websites. It's in one of the
free chapters under Basic Algorithms. I recommend that you split the
string into 2, and generate 2 unsigned longs. The chance of a
collision is so low as to be negligible.

I have a feeling this is a bad idea[1]. The Bernstein hash function
(which is the one on your site) uses unsigned long but will work just
as well with any unsigned integer type. If the OP has access to
longer integers, it seems safer to simply use a longer integer
than to generate two hashes from two parts of the string.

[1] I have no formal argument in support of this, just the feeling
that, since URLs often have similar parts you are wasting the hash
function's mixing ability if you split the string. Anyway, even if
there is no reason to worry here, why take the risk -- unless, of
course, you don't have longer integer types.
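
As a sketch of the single-wider-integer alternative, the same Bernstein recurrence can run over the whole string in a 64-bit type. This assumes the implementation provides uint64_t from <stdint.h> (C99); unsigned long long would serve equally well. The name hash64 is made up for illustration.

```c
#include <stdint.h>

/* Bernstein hash over the whole string, accumulated in 64 bits.
   Every character, prefix and tail alike, feeds the one result. */
uint64_t hash64(const char *s)
{
    uint64_t h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}
```

A full 64-bit value needs up to 13 base-36 characters, so a shorter tiny URL means keeping only some of those digits.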
 
Malcolm McLean

Ben Bacarisse said:
I have a feeling this is a bad idea[1]. The Bernstein hash function
(which is the one on your site) uses unsigned long but will work just
as well with any unsigned integer type. If the OP has access to
longer integers, it seems safer to simply use a longer integer
than to generate two hashes from two parts of the string.

[1] I have no formal argument in support of this, just the feeling
that, since URLs often have similar parts you are wasting the hash
function's mixing ability if you split the string. Anyway, even if
there is no reason to worry here, why take the risk -- unless, of
course, you don't have longer integer types.
You might well be right. If two URLs share the same introduction, which is
extremely plausible, then effectively you are wasting a long.
 
CBFalconer

joe said:
Hello, does anyone know how to write a function to generate a tiny URL with
letters and numbers only? Something almost always unique. Thanks.

Yes. No. You're welcome.
 
Flash Gordon

Malcolm McLean wrote, On 22/02/08 15:23:
Ben Bacarisse said:
I have a feeling this is a bad idea[1]. The Bernstein hash function
(which is the one on your site) uses unsigned long but will work just
as well with any unsigned integer type. If the OP has access to
longer integers, it seems safer to simply use a longer integer
than to generate two hashes from two parts of the string.

[1] I have no formal argument in support of this, just the feeling
that, since URLs often have similar parts you are wasting the hash
function's mixing ability if you split the string. Anyway, even if
there is no reason to worry here, why take the risk -- unless, of
course, you don't have longer integer types.
You might well be right. If two URLs share the same introduction, which
is extremely plausible, then effectively you are wasting a long.

There is also a significant possibility of two URLs sharing a
significant tail. E.g.
www.somesite/area/somewhere?pageid=1234&action=read&format=prettyformat
 
joe

Maybe I could get the char value of each letter, multiply it by its
position in the string, and come up with a number. Just an idea.
 
Ben Bacarisse

Maybe I could get the char value of each letter, multiply it by its
position in the string, and come up with a number. Just an idea.

[Top posting corrected]. That is one way, but there has been a lot of
study of how to turn a string into a number and your method has no
particular advantage over others that have been found to be useful.
The advice you've had to use a known hash function is good advice.
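
For comparison, here is a sketch of the position-weighted sum in C. The name position_sum is made up, positions are counted from 1, and the example values below assume ASCII.

```c
/* Sum of (character value * 1-based position in the string). */
unsigned long position_sum(const char *s)
{
    unsigned long sum = 0, pos = 1;
    while (*s != '\0')
        sum += (unsigned char)*s++ * pos++;
    return sum;
}
```

In ASCII, "ca" gives 99*1 + 97*2 = 293 and "ab" gives 97*1 + 98*2 = 293, so even two-character strings can collide. The results also stay small: a 100-character string of printable ASCII sums to well under a million, so the output uses only a sliver of the range of an unsigned long.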
 
Paul Hsieh

Hello, does anyone know how to write a function to generate a tiny URL with
letters and numbers only? Something almost always unique. Thanks.

As mentioned by santosh, in practice you can essentially try to
duplicate what the site tinyurl.com does. Either way, it
requires that you solve the more general problem of mapping arbitrarily
long strings to fixed-size hashes that are "almost always unique". So let
us solve this latter problem before we address the whole problem.

To map a set of variable-length strings to something "pseudo-unique"
fundamentally requires an understanding of "The Birthday
Paradox" (look it up on Wikipedia). However, to make a long story
short, you should study "secure hashing". Open source implementations
of SHA-256 and WHIRLPOOL exist, which are probably the "safest" bets
right now. The upshot of this is that you can transform variable-length
strings to a fixed 256 bits (SHA-256) or 512 bits (WHIRLPOOL) in a way
that gives even an expert adversary no known practical method of finding
collisions.
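
To put a rough number on the birthday effect for short codes, this sketch computes the expected number of colliding pairs when n URLs are hashed uniformly into 36^k possible k-character codes, that is n(n-1)/2 divided by 36^k. Uniform hashing is an assumption, and the name expected_collisions is made up for illustration.

```c
/* Expected number of colliding pairs among n items spread uniformly
   over 36^k codes: n(n-1)/2 pairs, each colliding with chance 1/36^k. */
double expected_collisions(double n, int k)
{
    double space = 1.0;
    int i;
    for (i = 0; i < k; i++)
        space *= 36.0;
    return n * (n - 1.0) / 2.0 / space;
}
```

With 6-character codes (36^6 = 2,176,782,336 possibilities), 100,000 URLs already give an expected count above 2, which is why a bare hash truncated to a few characters cannot be relied on to be unique.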

Now, for something like a URL, however, you are going to need to
transform the output to text. So use hex encoding, or Base-64 (whose
standard alphabet adds '+' and '/' to the letters and digits, so replace
or drop those two if you need strictly alphanumeric output), to turn the
binary into text. Let's call this a "long encoding" of the URL.
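
A minimal sketch of the hex option, assuming the digest is already available as an array of bytes (computing the secure hash itself is left to an existing library). The name hex_encode is made up for illustration.

```c
#include <stddef.h>

/* Encode len digest bytes as lowercase hex, two characters per byte.
   out must have room for 2 * len + 1 chars. */
void hex_encode(const unsigned char *digest, size_t len, char *out)
{
    static const char hex[] = "0123456789abcdef";
    size_t i;
    for (i = 0; i < len; i++) {
        out[2 * i]     = hex[(digest[i] >> 4) & 0x0F];
        out[2 * i + 1] = hex[digest[i] & 0x0F];
    }
    out[2 * len] = '\0';
}
```

A 32-byte SHA-256 digest becomes 64 hex characters this way, all of them letters or digits.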

Now as for making it tiny, the only way I can see how to do this is to
basically retain a database of all such hashings done and always
return the shortest possible truncated version of the "long
encoding" (which is calculated as described above) that is unique.
Then you can just make the URL something like: www.smalladdress.com/<URL>
and you would probably use some Apache configuration trick to turn
this into a call to php or cgi which takes the <URL> and looks it up
in your DB and produces the original URL as a redirect. I assume that
this, roughly speaking, is what tinyurl.com does.
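
The "shortest unique truncation" step might be sketched as below, with a small in-memory table standing in for the database. Everything here is hypothetical (the names, the fixed table size, the linear search), and a real service would also store the original URL next to each key and first check whether that URL is already registered.

```c
#include <string.h>

#define MAX_ENTRIES 1024
#define MAX_KEY 64

static char keys[MAX_ENTRIES][MAX_KEY + 1];  /* stand-in for the database */
static size_t nkeys = 0;

/* Return 1 if key has already been handed out, 0 otherwise. */
static int key_taken(const char *key)
{
    size_t i;
    for (i = 0; i < nkeys; i++)
        if (strcmp(keys[i], key) == 0)
            return 1;
    return 0;
}

/* Store and return the shortest prefix of long_code that is not yet
   in the table. Returns NULL if the table is full or every prefix
   (up to MAX_KEY characters) is already taken. */
const char *assign_short_key(const char *long_code)
{
    size_t len = strlen(long_code), n;
    char candidate[MAX_KEY + 1];

    if (nkeys == MAX_ENTRIES)
        return NULL;
    if (len > MAX_KEY)
        len = MAX_KEY;
    for (n = 1; n <= len; n++) {
        memcpy(candidate, long_code, n);
        candidate[n] = '\0';
        if (!key_taken(candidate)) {
            strcpy(keys[nkeys], candidate);
            return keys[nkeys++];
        }
    }
    return NULL;
}
```

Keys handed out early are very short and later ones grow only as prefixes start to clash, so the table lookup, not the hash alone, is what guarantees uniqueness.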
 
Antoninus Twink

CBFalconer and Default User actively contributing to CLC in their own
inimitable manner again. Nice.

Yes - with "contributors" like these, who needs trolls?
 
