Convert text string, e.g. 'Peter', into integer ID

Justus Ohlhaver · Nov 12, 2008

Hello,

is there any method to quickly convert a text string such as 'Peter'
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I'm looking for a
method that can do the computation.

Thanks,
Justus

Sarcar, Shourya C (GE Healthcare) · Nov 12, 2008

code = 0
"Peter".each_byte { |b| code += b } # will not generate unique values, though

Hugh Sasse · Nov 12, 2008

Justus said:
is there any method to quickly convert a text string such as 'Peter'
into an integer? (And vice versa?)

I am not asking about just changing the format (to_i). I'm looking for a
method that can do the computation.

What properties do you want the string:integer relationship to have?
The problem is too unbounded at the moment.

["Peter", "James", "John", "Andrew"].index("Peter")

satisfies the constraints given so far.

Hugh
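
A minimal sketch of the array idea above, assuming the full set of names is known in advance (the constant NAMES is made up for this illustration). The position in the list serves as the integer, and the mapping works in both directions:

```ruby
# A fixed list of known names; the position in the list serves as the ID.
NAMES = ["Peter", "James", "John", "Andrew"].freeze

id = NAMES.index("Peter") # string -> integer
name = NAMES[id]          # integer -> string

puts id   # 0
puts name # Peter

# Strings that are not in the list have no ID at all.
p NAMES.index("Paul") # nil
```

This only works for a closed set of strings, which is why the question about the required properties matters.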

Peter Szinek · Nov 12, 2008

Justus said:
is there any method to quickly convert a text string such as 'Peter'
into an integer? (And vice versa?)

There are a lot of easy ways to do the string -> integer part - e.g.

1) def x(s); return 42; end # now you can call x on your string and it
will convert it to an integer
2) "Peter"[0]
3) "Peter".object_id
4) "Peter".hash

The integer -> string part is the real problem: it's impossible to do
it with methods 1) - 3). With some extra effort (storing the result in
a lookup table) it should be possible to accomplish it with 4), unless
you have some extra requirements.

The question is, what do you really want to achieve? What properties
should the mapping have?

Cheers
Peter
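
A rough sketch of option 4) combined with a lookup table. The names LOOKUP, to_id and from_id are invented for this example. Two caveats: in current Ruby versions String#hash differs from one interpreter run to the next, so these IDs are only meaningful inside a single process; and since Ruby 1.9 option 2), "Peter"[0], returns the string "P" rather than an integer ("Peter".getbyte(0) gives the byte value).

```ruby
# Hypothetical lookup table mapping hash values back to strings.
LOOKUP = {}

def to_id(str)
  id = str.hash
  existing = LOOKUP[id]
  # Two different strings may share a hash value; refuse to overwrite.
  raise "hash collision for #{str.inspect}" if existing && existing != str
  LOOKUP[id] = str.dup.freeze
  id
end

def from_id(id)
  LOOKUP[id]
end

id = to_id("Peter")
puts from_id(id) # Peter
```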

Justus Ohlhaver · Nov 12, 2008

Hugh said:
What properties do you want the string:integer relationship to have?
The problem is too unbounded at the moment.

["Peter", "James", "John", "Andrew"].index("Peter")

satisfies the constraints given so far.

Thanks for your help. No, I need to turn the string, which would usually
be a headline ('Man lands on the moon'), into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.
Justus

Todd Benson · Nov 12, 2008

Justus said:
Thanks for your help. No, I need to turn the string, which would usually
be a headline ('Man lands on the moon'), into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.

Why does it have to be a number? Your db should already index.

Todd

Ken Bloom · Nov 12, 2008

Justus said:
is there any method to quickly convert a text string such as 'Peter'
into an integer? (And vice versa?)

Does "Peter".hash do what you want?

--Ken

Peter Szinek · Nov 12, 2008

Justus said:
Thanks for your help. No, I need to turn the string, which would usually
be a headline ('Man lands on the moon'), into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.

If you just want to check for duplicates, store a SHA1 hash of
the string
(an ActiveRecord example follows; obviously you can use any ORM or plain SQL):

require "digest/sha1"

headline = "Man lands on the moon"
# hexdigest returns a printable 40-character string
Article.create(:headline => headline,
               :uniq_hash => Digest::SHA1.hexdigest(headline))

...
later
...

you want to decide whether new_headline already exists:

Article.create( ... ) unless
  Article.find_by_uniq_hash(Digest::SHA1.hexdigest(new_headline))

HTH,
Peter
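
The same duplicate check can be tried without a database. In this sketch a Set stands in for the uniq_hash column, and store_headline is a made-up helper name; a real application would query an indexed column instead.

```ruby
require "digest/sha1"
require "set"

# Stand-in for the database column holding the digests (an assumption
# made for this sketch only).
seen_digests = Set.new

# Returns true if the headline was new and has been recorded,
# false if the same headline was stored before.
store_headline = lambda do |headline|
  digest = Digest::SHA1.hexdigest(headline)
  !seen_digests.add?(digest).nil?
end

puts store_headline.call("Man lands on the moon") # true
puts store_headline.call("Man lands on the moon") # false
```

Set#add? returns nil when the element is already present, which is what makes the second call report a duplicate.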

Jan Friedrich · Nov 12, 2008

Justus Ohlhaver said:
Thanks for your help. No, I need to turn the string, which would usually
be a headline ('Man lands on the moon'), into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.

I would do something like this with an MD5 hash:

require 'digest/md5'
Digest::MD5.hexdigest('Peter')
# => "6fa95b1427af77b3d769ae9cb853382f"

Regards
Jan
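
Since the original question asks for an integer, the hex digest can be read as one. This is only a sketch of one possible conversion; the variable names are illustrative.

```ruby
require "digest/md5"

hex = Digest::MD5.hexdigest("Peter")

# Interpret the 32 hex digits as one integer of up to 128 bits.
id = hex.to_i(16)

# The conversion is reversible back to the digest, not back to "Peter".
restored = id.to_s(16).rjust(32, "0")

puts id
puts restored == hex # true
```

The result needs up to 128 bits, so it does not fit a typical 64-bit integer column.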

Sebastian Hungerecker · Nov 12, 2008

Peter said:
4) "Peter".hash

The integer -> string part is the real problem; it's impossible to do
it with methods 1) - 3), with some extra effort (storing the result in
a lookup table) it should be possible to accomplish it with 4) unless
you have some extra requirements.

Hash values are not unique. Two different strings can have the same hash
value, so you can't get the string back from the hash alone.

HTH,
Sebastian
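
The collision problem is easiest to see with a deliberately simple hash such as the byte sum suggested earlier in the thread. The helper name byte_sum is invented for this sketch; String#hash collides far less often, but the principle is the same.

```ruby
# Hypothetical helper: add up the byte values of a string.
def byte_sum(str)
  code = 0
  str.each_byte { |b| code += b }
  code
end

a = byte_sum("Peter")
b = byte_sum("Petre") # same letters, different order

puts a      # 512
puts a == b # true: two different strings, one integer
```

Given only 512 there is no way to tell which of the two strings produced it.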

Sarcar, Shourya C (GE Healthcare) · Nov 12, 2008

You need to generate a SHA1 or MD5 hash from your string.

require 'digest/md5'
d = Digest::MD5.new
d.update("if you believe it, they have put a man on the moon")
uniq = d.hexdigest


Todd Benson · Nov 12, 2008

Jan said:
I would do something like this with an MD5 hash:

require 'digest/md5'
Digest::MD5.hexdigest('Peter')
# => "6fa95b1427af77b3d769ae9cb853382f"

Unless I'm missing something here, strings are just numbers in order.
Why encode/encrypt?

Most db's should handle natural keys.

If absolutely necessary to store as number strings (I can't see why),
look at #pack and #unpack.

Todd
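
A sketch of what the #pack / #unpack route might look like. Folding the bytes into a single integer is an assumption about what is wanted here, not something spelled out in the post.

```ruby
bytes = "Peter".unpack("C*")
p bytes # [80, 101, 116, 101, 114]

# Fold the bytes into one integer, most significant byte first.
number = bytes.inject(0) { |acc, b| (acc << 8) | b }

# Reverse the process: peel the bytes off again and pack them into a string.
out = []
n = number
while n > 0
  out.unshift(n & 0xff)
  n >>= 8
end
restored = out.pack("C*")

puts restored # Peter
```

The integer grows by 8 bits per character, and leading zero bytes would be lost on the way back, so this is reversible only for strings that do not start with a NUL byte.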

Justus Ohlhaver · Nov 12, 2008

Todd said:
Unless I'm missing something here, strings are just numbers in order.
Why encode/encrypt?

Most db's should handle natural keys.

If absolutely necessary to store as number strings (I can't see why),
look at #pack and #unpack.

Todd

Thanks a lot everybody. I'm really impressed by the quick and useful
feedback.

Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

Justus

Jan Friedrich · Nov 12, 2008

Todd Benson said:
Unless I'm missing something here, strings are just numbers in order.
Why encode/encrypt?

You're right:
'Peter'.to_i(36) # => 42681699

But you will run into problems with longer strings if your DBMS doesn't
have arbitrary-length integers.

Todd Benson said:
Most db's should handle natural keys.

This is right. I didn't address the db part in my post.

Todd Benson said:
If absolutely necessary to store as number strings (I can't see why),
look at #pack and #unpack.

This would also not work for longer strings (see above).

HTH,
Jan
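
A few quick checks of the base-36 idea, as a sketch only. Besides the size problem, it ignores case and stops parsing at the first character that is not a letter or digit, so it is not a good fit for headlines.

```ruby
id = "Peter".to_i(36)
puts id          # 42681699
puts id.to_s(36) # peter (the original capitalisation is lost)

# Case is ignored, so these two strings collide.
same_case = "PETER".to_i(36) == id
puts same_case # true

# Parsing stops at the first character that is not a base-36 digit,
# so everything after the space is silently dropped.
truncated = "Man lands on the moon".to_i(36) == "Man".to_i(36)
puts truncated # true

# Thirteen characters can already overflow a signed 64-bit column.
big = ("z" * 13).to_i(36)
puts big > 2**63 - 1 # true
```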

Justus Ohlhaver · Nov 12, 2008

Ken said:
Does "Peter".hash do what you want?

--Ken

Yes, thanks it seems to do what I need, except for two possible
limitations:

1. According to Sebastian (above) 'Hash values are not unique. Two
different strings can have the same hash value'

2. It may not serve my original purpose, which is speeding up database
queries.

Thanks again,
Justus

Todd Benson · Nov 12, 2008

Justus said:
Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

To be honest, I know almost nothing about mysql. I will say, however,
that you should try natural keys and see how the performance works
(testing). PostgreSQL, for example, claims you gain no more
performance on any natural key (be it integer, character, otherwise).
The true bottleneck is almost always in the application. But, I don't
know your exact situation.

From what you have said, it seems like you are looking for a primary
key that's unique and fast. Most db's that are set up correctly do a
"behind-the-scenes" lookup for your key; which means that there is an
ID (number) assigned to your element. The search is definitely what
people are concerned about, but having a string turned into a number
won't help you there, unless it's like a password or something.

If you want the string compacted, then follow some of the other suggestions.

hth,
Todd

Rolando Abarca · Nov 12, 2008

Justus said:
Todd, I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

You should just index the column you need, and if you want that column
to be unique, create a unique constraint:

create index <index_name> on <table_name> (<name_of_the_column>)

Something like that should work. To create the unique constraint, look
in your database's documentation.
Adding a column and computing a hash from the string is going to be slower
than having an index at the database level.

Regards,

Giampiero Zanchi · Nov 12, 2008

Justus said:
Thanks for your help. No, I need to turn the string, which would usually
be a headline ('Man lands on the moon'), into a unique number. The
purpose is to speed up my database queries when checking whether an
entry with the same headline already exists in the db.

Viewing your problem from a theoretical point of view, something similar
exists in compression theory, namely arithmetic coding. Of course
that is not convenient for your purposes.

Brian Candler · Nov 12, 2008

Justus said:
I was told that searching by integers instead of strings would
speed up performance when using large mysql tables. Is that not so?

This is what's known as "premature optimisation".

The general rule is: build your application first. If it has performance
problems, profile it. Only after profiling to determine where it
*really* is slow, then modify it. Even the most experienced of
architects and programmers often get it wrong, when they rely on their
intuition as to where optimisation actually benefits you.

I'd say that if you have a few hundred thousand rows in a table, and the
column you are searching on is indexed, then I doubt you will get any
noticeable speed improvement searching on an integer rather than a
string column.

However, note that it is common practice in database applications to
have an integer primary key assigned from a sequence.

1 : "Peter"
2 : "James"
3 : "John"
4 : "Andrew"
... etc

In that way, if you happen to know the ID of the row you want already,
you can jump to it by ID. But if you want to search for it by name -
which might return 0, 1 or more results - you can do that efficiently
too as long as that column is indexed.
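
As an illustration of that layout, here is a small in-memory stand-in for such a table. The class NameTable and its method names are hypothetical; a real database does the same job with a sequence and an index.

```ruby
# Hypothetical in-memory stand-in for a table with a sequence-assigned
# integer primary key and an index on the name column.
class NameTable
  def initialize
    @rows = {}                                  # id   -> name
    @name_index = Hash.new { |h, k| h[k] = [] } # name -> ids
    @next_id = 1
  end

  def insert(name)
    id = @next_id
    @next_id += 1
    @rows[id] = name
    @name_index[name] << id
    id
  end

  def find(id)
    @rows[id]
  end

  # May return zero, one or several IDs.
  def find_by_name(name)
    @name_index.fetch(name, [])
  end
end

table = NameTable.new
%w[Peter James John Andrew Peter].each { |n| table.insert(n) }

puts table.find(2)            # James
p table.find_by_name("Peter") # [1, 5]
p table.find_by_name("Paul")  # []
```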

What you seem to be asking for is to allocate the IDs in such a way that
given the string, you can calculate the ID off-line without performing a
database search. But to avoid the possibility of two strings giving the
same integer, you would have to use a strong cryptographic hash
like SHA1. This will give you a 160-bit integer (a value below 2^160),
which is very large; so large that actually just storing the string (and
searching on it) will likely be more efficient anyway. Furthermore, the integers
themselves will effectively be randomly distributed, rather than a
linear sequence, so the same sort of tree index and lookup will be
required.
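
To put the size argument in numbers, this sketch compares the SHA1 digest of the example headline from this thread with the headline itself.

```ruby
require "digest/sha1"

headline = "Man lands on the moon"

raw = Digest::SHA1.digest(headline)             # binary digest
id  = Digest::SHA1.hexdigest(headline).to_i(16) # the same digest as an integer

puts raw.bytesize      # 20 bytes, i.e. 160 bits
puts headline.bytesize # 21 bytes
puts id < 2**160       # true
```

For a short headline like this one the digest is barely smaller than the string it replaces.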

Ken Bloom · Nov 12, 2008

Justus said:
Yes, thanks it seems to do what I need, except for two possible
limitations:

1. According to Sebastian (above) 'Hash values are not unique. Two
different strings can have the same hash value'

2. It may not serve my original purpose, which is speeding up database
queries.

I'm going to suggest what Todd Benson and Rolando Abarca suggested, which
is to just work with strings in the database. Don't bother with computing
some kind of (possibly unique) hash. Use a CREATE INDEX statement to
index the headline field, and you'll probably never notice a speed
difference between your roundabout method and feeding in the string
directly to the database.

--Ken
