sqlite3 error

Paul Rubin

Steve Holden said:
You'd think some standards body would have worked on this, wouldn't
you. I couldn't think of a Google search string that would lead to
such information, though. Maybe other, more determined, readers can do
better.

Librarians have to deal with this all the time, I expect. I believe
library catalogs often use two separate fields, one for some
representation of the name intended for lexicographic sorting, and
another for display to the user. There's an exercise in Knuth vol 3
showing a bunch of examples, using both author names and book titles.
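
A minimal sketch of that two-field idea, using Python's standard sqlite3 module. The table name, column names and the sort keys themselves are made up for illustration; how a sort key is derived would depend on whatever filing convention applies locally.

```python
import sqlite3

# In-memory database; "person", "display_name" and "sort_key" are
# hypothetical names chosen for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (display_name TEXT NOT NULL, sort_key TEXT NOT NULL)"
)

# The sort keys are entered by hand here; nothing derives them automatically.
rows = [
    ("Rembrandt van Rijn", "rijn, rembrandt van"),
    ("Hendrik van Rooyen", "van rooyen, hendrik"),
    ("Jon Jonsson", "jon jonsson"),
]
conn.executemany("INSERT INTO person VALUES (?, ?)", rows)


def names_in_filing_order(connection):
    """Return display names ordered by the separate sort key."""
    cursor = connection.execute(
        "SELECT display_name FROM person ORDER BY sort_key"
    )
    return [name for (name,) in cursor]


ordered = names_in_filing_order(conn)
```

The display field is never parsed or rearranged; only the sort field decides where an entry is filed.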
 
Lawrence D'Oliveiro

Two problems so far:
(1) If you then assume that you should print the phone directory in
order of family name, that's not appropriate in some places e.g.
Iceland; neither is addressing Jon Jonsson as "Mr Jonsson", and BTW it
can be their mother's name e.g. if she has more fame or recognition
than their father.

Your bringing up the phone directory is a good point. That's probably the
most widely-consulted list of people's names around, so it's worthwhile
following whatever conventions are laid out in each country/region's phone
books. (There's also the electoral roll, I suppose, but I would assume that
follows the same sorts of conventions as the phone book.)
(2) Arabic names: you may or may not have their father's name. You
might not even have the [usually only one] given name. For example: the
person who was known as Abu Musab al-Zarqawi: this means "father of
Musab, the man from Zarqa [a city in Jordan]". You may have the family
name as well as the father's and grandfather's given name. You can have
the occupation, honorifics, nicknames. For a brief overview, read this:
http://en.wikipedia.org/wiki/Arabic_names

The question for me is: which part of his name would he share in common with
his brothers and sisters? That's the part I would call the "family name".

One might raise the issue of using names to trace genealogies, origins etc,
but that's not my concern here. I'm just trying to come up with a way to
represent names of individuals, in such a way that they can easily be found
(as in, for example, the local phone book).
Not a good idea, IMHO. Consider "Nguyen Van Tran" vs "Rembrandt van
Rijn". Would you peel the Da off Da Costa but not the D' off
D'Oliveiro? What do you do with the bod who fills in a form as Dermot
O'Sullivan one month and Diarmaid Ó Súilleabháin the next?

The obvious question is, what does the local phone book do? How do the Dutch
phone books deal with all the "vans", and the Irish ones with all the "O"
or "O'"s? Do they put them under V and O respectively, or do they ignore
that part and look at the rest of the family name (which would make more
sense to me)?
 
Hendrik van Rooyen

8<--------------------------------------
In the days of paper filing (I actually took Shorthand, and a
Business Machines & Filing course in High School to avoid Phys.Ed.) the
training for things like oriental names was to choose one for "surname".
This is where the real papers would be stored. However, one was also
taught to create cross-reference entries under the other names --
basically single cards of the form:

sort, name
see name, sort

I'll concede I doubt if any common database system is designed to
include that concept <G>

This sort of thing reminds me of a parts inventory system that I worked on -
many moons ago - it had a replacement list consisting of <old part number> <new
part number> - and you had to follow these links through to the bitter end, as
it could happen multiple times, avoiding infinite loops as you go - on a
"mainframe" with 64 kilobytes of core memory - not a job for recursion...

But luckily we had disks - I would hate to have to do this on a tape drive only
machine...
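
For what it's worth, a rough Python sketch of that kind of link-following, done with a loop rather than recursion. The dictionary of replacements and the part numbers are invented for the example; the original system obviously looked nothing like this.

```python
def resolve_part(part, replacements):
    """Follow old -> new links until a part has no replacement.

    Raises ValueError if the links loop back on themselves.
    """
    seen = {part}
    while part in replacements:
        part = replacements[part]
        if part in seen:
            # We have been here before, so the chain never ends.
            raise ValueError(f"replacement loop at {part!r}")
        seen.add(part)
    return part


# Hypothetical data: A100 was replaced twice; B1 and B2 point at each other.
replacements = {"A100": "A200", "A200": "A300", "B1": "B2", "B2": "B1"}
latest = resolve_part("A100", replacements)
```

The set of visited parts is what stops the infinite loop; a part with no entry in the list simply resolves to itself.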

- Hendrik
 
Hendrik van Rooyen

8<--------------------------------------------------------
I wonder if we need another "middle" field for holding the "bin/binte" part
(could also hold, e.g. "Van" for those names that use this).

NOOOOO! - I think of my surname as "van Rooyen" - it's only a string with a space
in it - and it's peculiar in that the first letter is not capitalised....

And I am sure that the people called "von Kardorff" would not agree either...

- Hendrik van Rooyen
 
Steve Holden

Lawrence said:
In message <[email protected]>, John
Machin wrote:

Two problems so far:
(1) If you then assume that you should print the phone directory in
order of family name, that's not appropriate in some places e.g.
Iceland; neither is addressing Jon Jonsson as "Mr Jonsson", and BTW it
can be their mother's name e.g. if she has more fame or recognition
than their father.


Your bringing up the phone directory is a good point. That's probably the
most widely-consulted list of people's names around, so it's worthwhile
following whatever conventions are laid out in each country/region's phone
books. (There's also the electoral roll, I suppose, but I would assume that
follows the same sorts of conventions as the phone book.)

(2) Arabic names: you may or may not have their father's name. You
might not even have the [usually only one] given name. For example: the
person who was known as Abu Musab al-Zarqawi: this means "father of
Musab, the man from Zarqa [a city in Jordan]". You may have the family
name as well as the father's and grandfather's given name. You can have
the occupation, honorifics, nicknames. For a brief overview, read this:
http://en.wikipedia.org/wiki/Arabic_names


The question for me is: which part of his name would he share in common with
his brothers and sisters? That's the part I would call the "family name".

One might raise the issue of using names to trace genealogies, origins etc,
but that's not my concern here. I'm just trying to come up with a way to
represent names of individuals, in such a way that they can easily be found
(as in, for example, the local phone book).

Not a good idea, IMHO. Consider "Nguyen Van Tran" vs "Rembrandt van
Rijn". Would you peel the Da off Da Costa but not the D' off
D'Oliveiro? What do you do with the bod who fills in a form as Dermot
O'Sullivan one month and Diarmaid Ó Súilleabháin the next?


The obvious question is, what does the local phone book do? How do the Dutch
phone books deal with all the "vans", and the Irish ones with all the "O"
or "O'"s? Do they put them under V and O respectively, or do they ignore
that part and look at the rest of the family name (which would make more
sense to me)?
Don't forget the UK, where the Scots are accommodated by filing Mc
before Mac everywhere except the 'phone book, where IIRC they are
treated as equivalent.

regards
Steve
 
Lawrence D'Oliveiro

Hendrik van Rooyen said:
8<--------------------------------------------------------


NOOOOO! - I think of my surname as "van Rooyen" - it's only a string with a
space in it - and it's peculiar in that the first letter is not
capitalised....

And I am sure that the people called "von Kardorff" would not agree
either...

So do the Dutch phone books have a lot of entries under V, then?

It just seems less efficient to me, that's all.
 
John Machin

Steve said:
Don't forget the UK, where the Scots are accommodated by filing Mc
before Mac everywhere except the 'phone book, where IIRC they are
treated as equivalent.

Same/similar phone book treatment here in Australia -- Mc is treated as
though it were spelled Mac. An interesting application of the
decorate-sort-undecorate technique :)
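
A small sketch of what that could look like in Python. The rule used here (a leading "Mc" is filed as if it were "Mac") is only my reading of the phone-book convention described above, and the function name and sample surnames are made up.

```python
def filing_key(surname):
    """Lower-case the surname and file a leading 'Mc' as 'Mac'."""
    lowered = surname.lower()
    if lowered.startswith("mc"):
        lowered = "mac" + lowered[2:]
    return lowered


surnames = ["MacDonald", "McAdam", "Mabbott", "McEwan", "Mace", "MacAdam"]

# Decorate: pair each name with its filing key. The index breaks ties,
# so names with equal keys keep their original relative order.
decorated = [(filing_key(name), index, name) for index, name in enumerate(surnames)]
# Sort on the decoration.
decorated.sort()
# Undecorate: throw the keys away again.
undecorated = [name for _key, _index, name in decorated]
```

These days `sorted(surnames, key=filing_key)` does the same job in one line; the long form just shows the three steps.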
 
John J. Lee

Steve Holden said:
You'd think some standards body would have worked on this, wouldn't
you. I couldn't think of a Google search string that would lead to
such information, though. Maybe other, more determined, readers can do
better.

I suppose very few projects actually deal with more than a handful of
languages or cultures, but it does surprise me how hard it is to find
out about this kind of thing -- especially given that open source
projects often end up with all kinds of weird and wonderful localised
versions.

On a project that involved 9 localisations, just trying to find
information on the web about standard collation of diacritics
(accented characters) in English, German, and Scandinavian languages
was more difficult than I'd expected. And I'm glad I was incompetent
to deal with the Japanese collation :)

I found this useful list of things to worry about, though, which does
say a little about personal names, though not in detail (and provides
some useful Googling terms for somebody determined):

http://developers.sun.com/dev/gadc/des_dev/i18ntaxonomy/i18n_taxonomy.pdf


which I found from this page:

http://www.i18nguy.com/guidelines.html


John
 
John J. Lee

John Machin said:
This is all a bit OT. Before we close the thread down

Do you have a warrant for that?
, let me leave
you with one warning:
Beware of enthusiastic maintenance programmers on a mission to clean up
the dirty names in your database:
E.g. (1) "Karim bin Md" may not appreciate getting a letter addressed
to "Dr Karim Bin" (Md is an abbreviation of Muhammad).
E.g. (2) Billing job barfs on a customer who has no given names and no
family name. Inspection reveals that he is over-endowed in the title
department: "Mr Earl King".
[...]

Heh.

I guess the people who really know about that kind of thing are the
"record linkage" people (this one is a project worked on by c.l.py's
own Tim Churches, and has produced some Python code):

http://datamining.anu.edu.au/projects/linkage.html


John
 
John Machin

John said:
Do you have a warrant for that?

I have some signed-but-otherwise-blank warrants, but I'm saving them
for other threads :)
, let me leave
you with one warning:
Beware of enthusiastic maintenance programmers on a mission to clean up
the dirty names in your database:
E.g. (1) "Karim bin Md" may not appreciate getting a letter addressed
to "Dr Karim Bin" (Md is an abbreviation of Muhammad).
E.g. (2) Billing job barfs on a customer who has no given names and no
family name. Inspection reveals that he is over-endowed in the title
department: "Mr Earl King".
[...]

Heh.

Heh indeed. This behaviour seems to be endemic. Another true story from
a 3rd post-cleanup cleanup assignment: Looking at the "country"
component of addresses: WALES? Users suggested it be changed to "UK"
to conform with ISO standard, UPU conventions, etc. However glancing at
other address components, one found intriguing things like "C/o Prince
of Hospital". The same "algorithm" had migrated a handful of clients
from Coromandel Valley to Oman, and a considerable number from the
Melbourne suburb of Chadstone to Chad.
I guess the people who really know about that kind of thing are the
"record linkage" people (this one is a project worked on by c.l.py's
own Tim Churches, and has produced some Python code):

http://datamining.anu.edu.au/projects/linkage.html

The project is heavily into probabilistic methods. Given enough
correctly tagged data to work on, "Earl" and "King" are much more
likely to drop into a name slot than a title slot.

Cheers,
John
 
Hendrik van Rooyen

So do the Dutch phone books have a lot of entries under V, then?

It just seems less efficient to me, that's all.

Don't know about what happens in Holland - my ancestors came over here to South
Africa a long time ago -
a mixed up kid I am - Dutch and French from the time of the revocation of the
edict of Nantes...
And yes, here the phone books are sorted that way - the "van Rensburg"s precede
the "van Rooyen"s. And what is worse, there are a lot of "van der"s too - two
spaces in the string like "van der Merwe" who are preceded by "van der Bank" -
"van" basically means "from" - like the German "von" - but in Germany it's an
appellation applied to the nobility - and in my name it makes no sense as
"Rooyen" is not a place - it's a strange archaic derivative of the colour red -
"rooij" in Dutch, spelt "rooi" in Afrikaans - and the "der" is an archaic form
of "the" - (and modern "the" in German, if yer male) ...

And that lot completely ignores other animals like the "Janse van Rensburg"s,
who go in amongst the "J"s...

HTH - Hendrik
 
Roel Schroeven

Hendrik van Rooyen wrote:
Don't know about what happens in Holland - my ancestors came over here to South
Africa a long time ago -
a mixed up kid I am - Dutch and French from the time of the revocation of the
edict of Nantes...
And yes, here the phone books are sorted that way - the "van Rensburg"s precede
the "van Rooyen"s. And what is worse, there are a lot of "van der"s too - two
spaces in the string like "van der Merwe" who are preceded by "van der Bank" -
"van" basically means "from" - like the German "von" - but in Germany it's an
appellation applied to the nobility - and in my name it makes no sense as
"Rooyen" is not a place - it's a strange archaic derivative of the colour red -
"rooij" in Dutch, spelt "rooi" in Afrikaans - and the "der" is an archaic form
of "the" - (and modern "the" in German, if yer male) ...

It's the same here in Belgium. Except that our Van is with a capital V
in most cases; if it's a lower v it either indicates nobility or a Dutch
name.

I don't see it as a problem. I prefer having Van Straeten and Van
Stralen next to each other than having them mixed up with names without
Van like this:
Straeten, Van
Straetmans
Stralen, Van

For me the string as a whole is the name; the parts separated don't have
much meaning.
 
Theerasak Photha

Just because most Western designers of databases do it wrong doesn't mean
that a) you should do it wrong, or b) they will continue to do it wrong
into the future, as increasing numbers of those designers come from Asian
and other non-Western backgrounds.

Family name comes last in some Asian countries as well. :)

It might also be prudent to consider that, e.g., some Tamils only have
a last name for legal purposes and traditionally go by a single name.
Lots of possibilities to consider.
I wonder if we need another "middle" field for holding the "bin/binte" part
(could also hold, e.g. "Van" for those names that use this).

Also 'da' for Portuguese, which means roughly same as
Nederlands/Vlaams. Maybe. As usual: IANAE.
There would also need to be a flag field to indicate the canonical ordering
for writing out the full name: e.g. family-name-first, given-names-first.
Do we need something else for the Vietnamese case?

Good question, but IIRC, family name comes first followed by any other
given names, just as in a literary index written in English: e.g.,
Truman, Harry S

What if you're Ho Chi Minh? Do you get to list aliases indefinitely? LOL

-- Theerasak
 
Steve Holden

Theerasak said:
Family name comes last in some Asian countries as well. :)

It might also be prudent to consider that, e.g., some Tamils only have
a last name for legal purposes and traditionally go by a single name.
Lots of possibilities to consider.




Also 'da' for Portuguese, which means roughly same as
Nederlands/Vlaams. Maybe. As usual: IANAE.




Good question, but IIRC, family name comes first followed by any other
given names, just as in a literary index written in English: e.g.,
Truman, Harry S

What if you're Ho Chi Minh? Do you get to list aliases indefinitely? LOL
It seems like some sort of free text search on a "full name" field looks
like the only realistic globally-acceptable (?) option.

regards
Steve
 
Jorge Godoy

Theerasak Photha said:
Also 'da' for Portuguese, which means roughly same as
Nederlands/Vlaams. Maybe. As usual: IANAE.

It looks like the same but at least here in Brasil it isn't considered for
sorting ("da Silva" should be sorted under "Silva", "de Souza" under "Souza",
"de Melo" under "Melo" and so on) and it is even stripped in some cases.
 
Jorge Godoy

Steve Holden said:
It seems like some sort of free text search on a "full name" field looks like
the only realistic globally-acceptable (?) option.

This is what we opted to do. Normalization to this level wouldn't add much
since there are a lot of "Smith"s that aren't relatives.

If finding relatives is something you need to do, demand that the mother's
name be filled in and put an optional field for father's name. Then compare
and ask if there's something between two people with the same mother /
father. (Remember about people with the same name! There are a lot of "John
Smith" or "José da Silva" around :))
 
Piet van Oostrum

Roel Schroeven said:
RS> It's the same here in Belgium. Except that our Van is with a capital V in
RS> most cases; if it's a lower v it either indicates nobility or a Dutch name.
RS> I don't see it as a problem. I prefer having Van Straeten and Van Stralen
RS> next to each other than having them mixed up with names without Van like
RS> this:
RS> Straeten, Van
RS> Straetmans
RS> Stralen, Van

In Holland it is sorted without the 'van' 'de' etc.
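
To make the difference between the two conventions concrete, here is a rough Python sketch. The particle list is illustrative and far from complete, both key functions are my own invention, and real directories surely have more rules than this.

```python
# Illustrative only: a real list would be longer and locale-specific.
PARTICLES = {"van", "von", "de", "der", "den", "da"}


def key_without_particles(surname):
    """File the name under its first non-particle word."""
    words = surname.split()
    # Drop leading particles, but never strip the name down to nothing.
    while len(words) > 1 and words[0].lower() in PARTICLES:
        words = words[1:]
    return " ".join(words).lower()


def key_whole_string(surname):
    """Treat the whole surname, particles included, as one string."""
    return surname.lower()


surnames = ["Van Stralen", "Straetmans", "van der Merwe", "Van Straeten", "de Souza"]
without_particles = sorted(surnames, key=key_without_particles)
whole_string = sorted(surnames, key=key_whole_string)
```

With the first key, Van Straeten, Straetmans and Van Stralen end up interleaved as in the quoted example; with the second, all the "van" names stay together.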
 
Theerasak Photha

In Holland it is sorted without the 'van' 'de' etc.

Which was my original point in mentioning similar Portuguese names. :)

BTW, do Dutch/Flemish family names now follow the trend of dropping
declension, as seen in both languages (dialects?) in general: e.g.,
'de' instead of 'der'?

-- Theerasak
 
Leo Kislov

John said:
I suppose very few projects actually deal with more than a handful of
languages or cultures, but it does surprise me how hard it is to find
out about this kind of thing -- especially given that open source
projects often end up with all kinds of weird and wonderful localised
versions.

On a project that involved 9 localisations, just trying to find
information on the web about standard collation of diacritics
(accented characters) in English, German, and Scandinavian languages
was more difficult than I'd expected.

As far as I understand unicode.org has become the central(?) source of
locale information: http://unicode.org/cldr/ Did you use it?
 
Roel Schroeven

Theerasak Photha wrote:
Which was my original point in mentioning similar Portuguese names. :)

BTW, do Dutch/Flemish family names now follow the trend of dropping
declension, as seen in both languages (dialects?) in general: e.g.,
'de' instead of 'der'?

My observation is that in general names keep hanging on to archaic forms
much longer than normal language. Examples:

- A very common name around here is Hendrickx. In normal language, the
'ckx' construction is replaced with 'ks'.
- 'Straat' (English: 'street') (or 'straten' in multiple) used to be
written 'straet' ('straeten'); in names it is still written like that:
'Verstraeten' is a common name. 'Verstraten' exists too though, but is
less common I think.

But I can't think of that many names with 'der', so maybe the
declensions have been dropped already.
 
