Properly encoding "Project Gutenburg 1913 Webster Unabridged Dictionary".

Daniel Pitts · Sep 20, 2007

So, I've spent all day working on this. Funfun...

Back story: Project Gutenburg creates free ebooks from content that is
now in the public domain, including the "1913 Webster Unabridged
Dictionary". The problem with this particular work (pgw050*.txt) is
that it uses a very "odd" character set, and an almost-XML markup (it
may be valid SGML, but I wouldn't bank on it).

It's part DOS extended ASCII, plus some proprietary character
codes.

My goal:
I'd like to get this into a form that is easily processed by a
program. I think the best way to do this is to put it into a robust
XML format. This would involve cleaning up the markup to be more
valid XML, as well as processing some of the character codes into
nicer forms. I've already written a program that will read the
original texts, and re-encode the files as UTF-8, using appropriate
character substitution when possible.
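
A minimal sketch of what such a re-encoding step might look like in Java. The class name and the substitution table are hypothetical; the single entry shown (byte 0xD8 mapped to U+2016, DOUBLE VERTICAL LINE) is an illustrative guess, not the real mapping from the transcribers' character-code documentation. Unmapped high bytes become U+FFFD so that nothing is silently mis-decoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical re-encoder: one input byte per character, Unicode out.
final class WebsterRecoder {
    // Placeholder table; a real one would be filled in from the
    // transcribers' character-code documentation.
    private static final Map<Integer, String> SUBSTITUTIONS = new HashMap<>();
    static {
        SUBSTITUTIONS.put(0xD8, "\u2016"); // assumed: double vertical bar
    }

    // 7-bit ASCII passes through, mapped high bytes are substituted,
    // and anything else becomes U+FFFD (the replacement character).
    static String decode(byte[] raw) {
        StringBuilder out = new StringBuilder(raw.length);
        for (byte b : raw) {
            int code = b & 0xFF; // bytes are signed in Java
            if (code < 0x80) {
                out.append((char) code);
            } else {
                out.append(SUBSTITUTIONS.getOrDefault(code, "\uFFFD"));
            }
        }
        return out.toString();
    }

    // Convenience wrapper that returns the UTF-8 bytes of the decoded text.
    static byte[] toUtf8(byte[] raw) {
        return decode(raw).getBytes(StandardCharsets.UTF_8);
    }
}
```

Keeping the table in one place means the later decision (glyph, entity, or tag) only changes the right-hand side of each entry.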

At this point, I'm not sure if I'd be better off converting their
custom "entities" into the equivalent UTF-8 encoded characters, or if
it would be better to convert all entities and non-standard characters
into some sort of XML encoded entities.

Anyone have suggestions on what would be the most useful way to go?

Hunter Gratzner · Sep 20, 2007

So, I've spent all day working on this. Funfun...

Back story: Project Gutenburg

It's Gutenberg, not Gutenburg.

create free ebooks from content that is
now in the public domain, including the "1913 Webster Unabridged
Dictionary". The problem with this particular work (pgw050*.txt), is

Thanks for not providing a link to the file, so we are saved from
having to have a look at it.

Jeff Higgins · Sep 20, 2007

Daniel said:
So, I've spent all day working on this. Funfun...

Back story: Project Gutenburg creates free ebooks from content that is
now in the public domain, including the "1913 Webster Unabridged
Dictionary". The problem with this particular work (pgw050*.txt) is
that it uses a very "odd" character set, and an almost-XML markup (it
may be valid SGML, but I wouldn't bank on it).

It's part DOS extended ASCII, plus some proprietary character
codes.

My goal:
I'd like to get this into a form that is easily processed by a
program. I think the best way to do this is to put it into a robust
XML format. This would involve cleaning up the markup to be more
valid XML, as well as processing some of the character codes into
nicer forms. I've already written a program that will read the
original texts, and re-encode the files as UTF-8, using appropriate
character substitution when possible.

Whew. After a quick read of webfont.asc and tagset.web I can feel
your pain. I think the main problem here is that the typesetter's /style/
conveys so much information. For instance:

216 d8 Ø <par/ double vertical bar (short length; the long
length is the graphics character 186)
This precedes words marked with a double vertical bar in
the original dictionary, signifying that the word was
adopted directly into English without modification of
the spelling.

For myself, I suppose the question would be: Do I want my
/program/ to understand and/or act upon the fact that character
code 0xd8 signifies the above, or is it strictly for a /human/ reader's
consumption? If the former, an XML tag would probably be appropriate;
if the latter, maybe an appropriate glyph is sufficient.

Daniel Pitts · Sep 20, 2007

It's Gutenberg, not Gutenburg.

I actually knew that, but my fingers decided to do what they wanted,
not what I wanted

Thanks for not providing a link to the file, so we are saved from
having to have a look at it.

Ah, indeed.

Thanks for the constructive response.

Jeff Higgins provided the link in a reply:
<http://www.gutenberg.org/dirs/etext96/pgw050ab.txt>
Thanks Jeff!

Thanks,
Daniel.

Roedy Green · Sep 20, 2007

At this point, I'm not sure if I'd be better off converting their
custom "entities" into the equivalent UTF-8 encoded characters, or if
it would be better to convert all entities and non-standard characters
into some sort of XML encoded entities.

Perhaps the way to go is to devise a font that renders these odd
characters correctly. Then the text could be easily manipulated
programmatically with tiny mods to existing software. Then you could
even publish it as a PDF document.

Your problem then becomes political, talking some skilled type
designer into donating her skills in return for some exposure.

Roedy Green · Sep 20, 2007

Your problem then becomes political, talking some skilled type
designer into donating her skills in return for some exposure.

If you have some high res scans of the original text, your job is not
designing a font, but the much easier job of "stealing" the font from
the original samples. I looked into a similar problem circa 1990 to
"steal" Chinese fonts from hand painted fonts on mechanical optical
typesetters. The tools were primitive -- interactively defining
Bezier curves with Adobe tools.

There are people who will create you a font from a sample of your
handwriting or printing for a nominal charge. Perhaps one of them has
the tools and skills to solve your problem.

Jeff Higgins · Sep 20, 2007

Another thought strikes me. Have you looked at any of the many
"dictionary markup" languages already out there? Have you seen
the GNU CIDE?
http://www.ibiblio.org/webster/

Daniel Pitts · Sep 20, 2007

Whew. After a quick read of webfont.asc and tagset.web I can feel
your pain. I think the main problem here is that the typesetter's /style/
conveys so much information. For instance:

216 d8 Ø <par/ double vertical bar (short length; the long
length is the graphics character 186)
This precedes words marked with a double vertical bar in
the original dictionary, signifying that the word was
adopted directly into English without modification of
the spelling.

For myself, I suppose the question would be: Do I want my
/program/ to understand and/or act upon the fact that character
code 0xd8 signifies the above, or is it strictly for a /human/ reader's
consumption? If the former, an XML tag would probably be appropriate;
if the latter, maybe an appropriate glyph is sufficient.

Thanks for the reply. My main goal is to retain as much semantic
meaning as possible for the program to understand. So if I understand
your point, I should convert it to XML tags to maintain that
information...

This brings up a related point. In XML, can "&blah;" entities have
semantic meaning associated with them? Or are they only replacements
for otherwise difficult-to-represent characters? That makes a
difference between using &directlyAdopted; and <directly-adopted/>.

Thanks,
Daniel.

Jeff Higgins · Sep 20, 2007

Daniel Pitts wrote:

Whew. After a quick read of webfont.asc and tagset.web I can feel
your pain. I think the main problem here is that the typesetter's /style/
conveys so much information. For instance:

216 d8 Ø <par/ double vertical bar (short length; the long
length is the graphics character 186)
This precedes words marked with a double vertical bar in
the original dictionary, signifying that the word was
adopted directly into English without modification of
the spelling.

For myself, I suppose the question would be: Do I want my
/program/ to understand and/or act upon the fact that character
code 0xd8 signifies the above, or is it strictly for a /human/ reader's
consumption? If the former, an XML tag would probably be appropriate;
if the latter, maybe an appropriate glyph is sufficient.

Thanks for the reply. My main goal is to retain as much semantic
meaning as possible for the program to understand. So if I understand
your point, I should convert it to XML tags to maintain that
information...

This brings up a related point. In XML, can "&blah;" entities have
semantic meaning associated with them? Or are they only replacements
for otherwise difficult-to-represent characters? That makes a
difference between using &directlyAdopted; and <directly-adopted/>

Well, if you're asking me personally, I'd have to say I'm no XML expert
and that the best I could do is to point you to the appropriate part
of the spec, sorry.
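
One way to explore the entity-versus-element question is to run both forms through a parser and see what survives. The sketch below uses the JDK's DOM parser; the `hw` element, the sample text, and the entity's replacement value are hypothetical choices made only for illustration. With the parser's default settings a general entity is expanded into its replacement text, so the application sees an ordinary character, while the empty element remains a node that a program can query.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Compares an entity reference with an empty element after parsing.
final class EntityVersusElement {
    // Hypothetical sample: the entity is declared in an internal DTD subset.
    static final String WITH_ENTITY =
        "<!DOCTYPE hw [<!ENTITY directlyAdopted \"&#x2016;\">]>"
        + "<hw>&directlyAdopted;word</hw>";
    // The same headword, with the marker expressed as an element.
    static final String WITH_ELEMENT = "<hw><directly-adopted/>word</hw>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    // Text content of the root element, with entities already expanded.
    static String text(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // Number of <directly-adopted/> marker elements in the document.
    static int markerCount(String xml) {
        return parse(xml).getElementsByTagName("directly-adopted").getLength();
    }
}
```

In this sketch `markerCount(WITH_ELEMENT)` finds the marker, whereas for `WITH_ENTITY` the only trace left is a character inside the text, which suggests an element (or attribute) is the safer carrier for meaning a program should act on.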

Thanks,
Daniel.

Daniel Pitts · Sep 20, 2007

Another thought strikes me. Have you looked at any of the many
"dictionary markup" languages already out there? Have you seen
the GNU CIDE?
http://www.ibiblio.org/webster/

Heh, same source material, but it looks like more care was taken in
the translation to *machine readable* format. I'll check it out.
Thanks for the pointer. (Searching for Public Domain Dictionary
doesn't turn up as many relevant hits as it should.)

RedGrittyBrick · Sep 21, 2007

Roedy said:
Perhaps the way to go is to devise a font that renders these odd
characters correctly. Then the text could be easily manipulated
programmatically with tiny mods to existing software. Then you could
even publish it as a PDF document.

Your problem then becomes political, talking some skilled type
designer into donating her skills in return for some exposure.

The purpose of a dictionary is semantic. The actual glyphs are
comparatively unimportant. The intellectual accomplishment does not lie
mainly in the choice of symbols.

If you want to reproduce the beautiful typography of the original, use
high quality image scans.

Otherwise I'd translate the glyphs to something semantically or visually
close in the Unicode character set.

I think I'd try for a purely semantic markup in XML. Then create a
stylesheet that would render it in XHTML (say) and which would introduce
glyphs and fonts as close to the original as possible. That way, if
Unicode ever gets extended to include some of the odd characters used in
the original, you only have to amend the stylesheet.

So I'd represent the "double vertical bar" as an attribute of a tag.
e.g. <word spelling="adopted"> The stylesheet could insert a glyph
visually close to "double vertical bar".

In particular, I'd translate markup like "<universbold>" into
<exposition> or <shape-description> or something. I'm pretty sure
Webster didn't compose his dictionary with LaserJet fonts in mind
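
The attribute-plus-stylesheet idea above can be sketched with the XSLT support that ships in the JDK. Everything here is an assumption made for illustration: the `word` element and `spelling="adopted"` attribute follow the example in the post, while the `span` output, the class name, and the choice of U+2016 as the "visually close" glyph are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Renders semantic markup to presentation markup via an XSLT 1.0 stylesheet.
final class HeadwordRenderer {
    // The glyph lives only in the stylesheet, not in the dictionary data.
    private static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"word\">"
        + "<span class=\"headword\">"
        + "<xsl:if test=\"@spelling='adopted'\">&#x2016;</xsl:if>"
        + "<xsl:value-of select=\".\"/>"
        + "</span>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms one <word> element and returns the serialized result.
    static String render(String wordXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(wordXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("rendering failed", e);
        }
    }
}
```

Swapping in a better glyph later would then be a one-line change to the stylesheet, leaving the semantic XML untouched.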

Jeff Higgins · Sep 21, 2007

Daniel said:
So, ...

I must thank you for posting this article. After having read
your post I spent some time browsing the WWW on the subject
and found a lot of interesting stuff. Here are links to two things
that I found particularly interesting.

I rediscovered the Princeton University WordNet project.
<http://wordnet.princeton.edu/>

And through that link discovered a most wonderful (free)
dictionary utility program for the Windows platform:

WordWeb 5 for Windows
<http://wordweb.info/free/>

This program allows me to place my mouse cursor over a word
in any other program and with a CTRL + right click bring up
a useful dictionary/thesaurus already opened to the word under
the cursor!! How neat! I've tried it in my newsreader "Outlook"
and in IE7 and OpenOffice Writer, even Eclipse. How's it do that?

Anyway, this is not a commercial advertisement, I am not
in any way associated with the above mentioned organizations.

Thanks,
JH

Daniel Pitts · Sep 21, 2007

The purpose of a dictionary is semantic. The actual glyphs are
comparatively unimportant. The intellectual accomplishment does not lie
mainly in the choice of symbols.

If you want to reproduce the beautiful typography of the original, use
high quality image scans.

Otherwise I'd translate the glyphs to something semantically or visually
close in the Unicode character set.

I think I'd try for a purely semantic markup in XML. Then create a
stylesheet that would render it in XHTML (say) and which would introduce
glyphs and fonts as close to the original as possible. That way, if
Unicode ever gets extended to include some of the odd characters used in
the original, you only have to amend the stylesheet.

So I'd represent the "double vertical bar" as an attribute of a tag.
e.g. <word spelling="adopted"> The stylesheet could insert a glyph
visually close to "double vertical bar".

In particular, I'd translate markup like "<universbold>" into
<exposition> or <shape-description> or something. I'm pretty sure
Webster didn't compose his dictionary with LaserJet fonts in mind

Heh. He probably was using a BubbleJet

But seriously. I'd like to keep the original intent (the
transcriber's, not necessarily Webster's), and then in a later stage
of the processing, convert it to the more semantic meaning, and
probably ignore the rendering of that information. My personal
use case actually only cares about the relationships between words, and
the part of speech. For instance, I'd like to be able to recognize
Ran, Run, and Runs as different tenses of the same word, and
Leaf/Leaves as different inflections of the same word.
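
A rough sketch of the kind of lookup this use case implies, assuming the inflected forms have already been extracted from the dictionary entries. The class and method names are hypothetical and the sample data is hand-entered. A form maps to a set of headwords rather than a single one, because some forms belong to more than one word ("leaves" is a form of both "leaf" and "leave").

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Maps each inflected form to the headword(s) it can belong to.
final class LemmaIndex {
    private final Map<String, Set<String>> lemmasByForm = new HashMap<>();

    // Registers a headword together with its inflected forms.
    void add(String lemma, String... forms) {
        register(lemma, lemma);
        for (String form : forms) {
            register(form, lemma);
        }
    }

    private void register(String form, String lemma) {
        lemmasByForm.computeIfAbsent(key(form), k -> new TreeSet<>())
                    .add(key(lemma));
    }

    // Returns every headword the form may belong to (empty if unknown).
    Set<String> lemmasOf(String form) {
        return lemmasByForm.getOrDefault(key(form), Collections.emptySet());
    }

    // Case-insensitive keys, so "Ran" and "ran" are the same form.
    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

After `add("run", "ran", "runs")`, the calls `lemmasOf("Ran")` and `lemmasOf("Runs")` both yield the set containing "run"; choosing between several candidate headwords is left to whatever uses the index.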

Actually, that's not quite my "ultimate" goal. The ultimate goal is to
create an English Imperative Sentence parser to use in a text
adventure game. I just figured I might as well do something useful
for the community while I'm at it (in this case, semanticize the
dictionary). Although it appears that gcide_xml may have done what I
wanted to do already.

John W. Kennedy · Sep 21, 2007

Daniel said:
Actually, that's not quite my "ultimate" goal. The ultimate goal is to
create an English Imperative Sentence parser to use in a text
adventure game.

I cannot find that you have ever participated in rec.arts.int-fiction.
Assuming this to be true, then it is highly likely you have no idea of
what you are getting into. Most fundamentally, you can't do a useful I-F
parser (assuming that, by "parser", you mean more than a mere lexer)
unless it is integrated with the world model. And you're also going to
have to create a descriptive language and a compiler for it.

Please study Inform 6, Inform 7 (they are completely different), TADS 2,
TADS 3, Hugo, and Adrift, and then see if A) you really have anything
new to contribute to the state of the art, and B) you have the time to
produce it. I would estimate that any new system offering a significant
improvement on existing tools should take about ten man-years to do from
scratch. You'll also probably need at least two collaborators, a test
writer, and a documentation writer. At a minimum, don't try to create
your own tests; you need a dedicated adversary, because this problem
domain is rife with edge and corner cases.

--
John W. Kennedy
"The whole modern world has divided itself into Conservatives and
Progressives. The business of Progressives is to go on making mistakes.
The business of the Conservatives is to prevent the mistakes from being
corrected."
-- G. K. Chesterton

Daniel Pitts · Sep 22, 2007

I cannot find that you have ever participated in rec.arts.int-fiction.

Indeed, I have not.

Assuming this to be true, then it is highly likely you have no idea of
what you are getting into. Most fundamentally, you can't do a useful I-F
parser (assuming that, by "parser", you mean more than a mere lexer)
unless it is integrated with the world model. And you're also going to
have to create a descriptive language and a compiler for it.

Actually, my plan is to describe the world model with Java objects
(hence this being a Java group)

Please study Inform 6, Inform 7 (they are completely different), TADS 2,
TADS 3, Hugo, and Adrift, and then see if A) you really have anything
new to contribute to the state of the art, and B) you have the time to
produce it.

A) If I don't have anything worthwhile to contribute, at least I'll
have gained knowledge. This isn't about bettering existing tools and
platforms, but about bettering myself. I will take a look at those
you suggested, but I'll probably continue on with my project anyway.
I do have *some* experience working on a Lima M.U.D.

I would estimate that any new system offering a significant
improvement on existing tools should take about ten man-years to do from
scratch. You'll also probably need at least two collaborators, a test
writer, and a documentation writer. At a minimum, don't try to create
your own tests; you need a dedicated adversary, because this problem
domain is rife with edge and corner cases.

Agreed. The part that I find the most difficult to model, parse, and
query is the complex relationships that can occur amongst several
objects. It's easy enough to say that a bowl is on a table, but what
about an apple between the banana and the orange in the bowl on the
wooden table?

Every journey starts with but a footstep. It may take 10 man years to
complete, but if I don't start on my own, I'll never know. I'm 26, so
if this is a project that takes me until I'm 36, I'll still be young
enough to enjoy the results. In any case, if this DOES get to a
point where I think it might become something useful to the community,
I'm sure I will be able to find plenty of collaborators.

Thanks for the pointers both to the existing projects, and to the raif
group. I'm sure I will find it invaluable as I go on.

Cheers,
Daniel.

Patricia Shanahan · Sep 22, 2007

Daniel Pitts wrote:
....

Agreed. The part that I find the most difficult to model, parse, and
query is the complex relationships that can occur amongst several
objects. It's easy enough to say that a bowl is on a table, but what
about an apple between the banana and the orange in the bowl on the
wooden table?

I think there are far more basic issues. Here's a classic example of the
context-sensitivity of the English language: "Time flies like an arrow."

If it is advice from a senior researcher to a junior researcher in an
entomology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
modifies how to go about timing flies.

If it is a comment on how fast time seems to go by, "time" is a noun,
"flies" is a verb, and "like an arrow" modifies how time flies.

Patricia

RedGrittyBrick · Sep 22, 2007

Patricia said:
Daniel Pitts wrote:
...

I think there are far more basic issues. Here's a classic example of the
context-sensitivity of the English language: "Time flies like an arrow."

If it is advice from a senior researcher to a junior researcher in an
entomology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
modifies how to go about timing flies.

If it is a comment on how fast time seems to go by, "time" is a noun,
"flies" is a verb, and "like an arrow" modifies how time flies.

Time flies like an arrow.
Fruit flies like a banana.
- Groucho Marx

Daniel Pitts · Sep 22, 2007

Daniel Pitts wrote:

...

I think there are far more basic issues. Here's a classic example of the
context-sensitivity of the English language: "Time flies like an arrow."

If it is advice from a senior researcher to a junior researcher in an
entomology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
modifies how to go about timing flies.

If it is a comment on how fast time seems to go by, "time" is a noun,
"flies" is a verb, and "like an arrow" modifies how time flies.

Patricia

I actually have a plan on how to handle context, but that particular
sentence is not imperative in the second sense that you provided.
Since I'm narrowing the scope of sentence types down to imperative,
that helps eliminate _some_ ambiguous situations. Indeed, most
languages (including programming) are somewhat sensitive to context.

For example, the Java "sentence":
s+=10;

could mean "Increase the int 's' by 10.", or "append '10' to the
String 's'". It could even be an error if "s" isn't numeric or a
String.
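
The two readings can be shown side by side. This is only a small illustration; the class and method names are made up, and the starting values are arbitrary.

```java
// The same statement text, "s += 10;", under two different declarations of s.
final class PlusEqualsContext {
    static int onInt() {
        int s = 5;
        s += 10;   // numeric addition
        return s;
    }

    static String onString() {
        String s = "5";
        s += 10;   // string concatenation: the int 10 is converted to "10"
        return s;
    }

    // With a declaration such as "boolean s = true;", the statement
    // "s += 10;" is rejected by the compiler.
}
```

The compiler resolves the meaning entirely from the declared type of `s`, which is the "context" being discussed.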

The only reason that isn't considered a problem in Java is that it's
"easy" to determine the context of a statement (scoping rules are
specific and well-defined). On the other hand, "Get the other key"
depends on context that would be harder to model in a computer.
Especially after a few interactions...

"You see a red key and a blue key."
Look at the red key
"The key is red."
Look at the other key
"The other key is blue."
Get the other key. <-- Does other point to the other other key, or to
the original other key?
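
One possible policy for the exchange above, sketched in Java: keep track of the most recently mentioned object, and let "other" mean the single in-scope candidate that is not that object. This is just one answer to the question posed, not an established technique from any interactive-fiction system, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Resolves "the other X" relative to the most recently mentioned object.
final class ReferentTracker {
    private final List<String> candidates;
    private String focus; // most recently mentioned object, or null

    ReferentTracker(List<String> candidates) {
        this.candidates = new ArrayList<>(candidates);
    }

    // An explicit mention ("the red key") moves the focus to that object.
    String mention(String name) {
        if (!candidates.contains(name)) {
            throw new IllegalArgumentException("not in scope: " + name);
        }
        focus = name;
        return name;
    }

    // "The other one" is whatever is left once the focus is excluded;
    // resolving it moves the focus, so repeated calls alternate.
    String other() {
        List<String> rest = new ArrayList<>(candidates);
        rest.remove(focus);
        if (focus == null || rest.size() != 1) {
            throw new IllegalStateException("\"other\" is ambiguous here");
        }
        focus = rest.get(0);
        return focus;
    }
}
```

Under this policy the final "Get the other key" would pick up the red key, since the blue key became the focus when it was looked at. A different policy (for example, "other" stays bound to its first resolution) would give the opposite answer.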

It's been my experience with interactive fiction that the sentence
interpreters tend to need you to be very specific. I'm sure there are
some out there that have forms of context handling, but I want to
experiment on my own to see how I would go about it.

Initially, I think contextual information will have to be provided by
the world-view designer, with a little help about the "obvious"
context. Eventually, if the imperative sentence parser becomes good
enough, I would consider expanding the scope of it so that the parser
understood other types of sentences, and could glean information about
the current context simply by the descriptions involved.

John W. Kennedy · Sep 22, 2007

Daniel said:
Agreed. The part that I find the most difficult to model, parse, and
query is the complex relationships that can occur amongst several
objects. It's easy enough to say that a bowl is on a table, but what
about an apple between the banana and the orange in the bowl on the
wooden table?

You're still looking at the purely linguistic problems. But there's more
to it than that. For example, what about a cabinet with a closed door,
but which also has a flat surface on top? What if the door is made of
glass? What if it's made of smoky glass, but there's a switch that can
turn on an interior light? All these things have to be handled by the
world model, but -- they also drag in your parser's disambiguator.
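
To make the cabinet example concrete, here is a minimal sketch of how such an object might be modelled with plain Java objects, as proposed earlier in the thread. All names are hypothetical and the rules are simplified guesses at reasonable behaviour, not taken from any existing interactive-fiction system.

```java
// A cabinet that is both a container (behind a door) and a supporter (flat top).
final class Cabinet {
    enum Door { OPAQUE, CLEAR_GLASS, SMOKY_GLASS }

    private final Door door;
    private boolean open;
    private boolean interiorLightOn;

    Cabinet(Door door) {
        this.door = door;
    }

    void setOpen(boolean open) {
        this.open = open;
    }

    void setInteriorLight(boolean on) {
        this.interiorLightOn = on;
    }

    // Things on the flat top are always in plain view.
    boolean topVisible() {
        return true;
    }

    // Can the player see what is inside?
    boolean interiorVisible() {
        if (open) {
            return true;
        }
        switch (door) {
            case CLEAR_GLASS: return true;
            case SMOKY_GLASS: return interiorLightOn;
            default:          return false;
        }
    }

    // Reaching inside requires an open door, whatever the player can see.
    boolean interiorReachable() {
        return open;
    }
}
```

The link to the parser is that a disambiguator would presumably consult `interiorVisible()` to decide whether an object inside the cabinet is a candidate for "look at the key", and `interiorReachable()` for "get the key".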

John W. Kennedy · Sep 22, 2007

Patricia said:
Daniel Pitts wrote:
....

I think there are far more basic issues. Here's a classic example of the
context-sensitivity of the English language: "Time flies like an arrow."

If it is advice from a senior researcher to a junior researcher in an
entomology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
modifies how to go about timing flies.

If it is a comment on how fast time seems to go by, "time" is a noun,
"flies" is a verb, and "like an arrow" modifies how time flies.

And if it is an observation by a surrealist, "time" is an adjective,
"flies" is a noun, "like" is a verb, and "an arrow" is the direct object.

Here's a worse one: "It's a pretty little girls school". I count six
parsings.
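
The "Time flies like an arrow" readings discussed above can be reproduced mechanically by giving each word several possible parts of speech and enumerating the combinations. The sketch below is a toy: the lexicon is hand-written to cover only the readings mentioned in this thread, the names are hypothetical, and no grammar is applied to weed out impossible combinations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Enumerates every part-of-speech assignment for a sequence of words.
final class TagEnumerator {
    enum Pos { NOUN, VERB, ADJECTIVE, PREPOSITION }

    // Toy lexicon covering only the readings mentioned in the thread.
    static final Map<String, List<Pos>> LEXICON = new HashMap<>();
    static {
        LEXICON.put("time", Arrays.asList(Pos.NOUN, Pos.VERB, Pos.ADJECTIVE));
        LEXICON.put("flies", Arrays.asList(Pos.NOUN, Pos.VERB));
        LEXICON.put("like", Arrays.asList(Pos.PREPOSITION, Pos.VERB));
    }

    // Returns every combination of one tag per word (a Cartesian product).
    static List<List<Pos>> taggings(String... words) {
        List<List<Pos>> results = new ArrayList<>();
        results.add(new ArrayList<>());
        for (String word : words) {
            List<Pos> options = LEXICON.get(word);
            if (options == null) {
                throw new IllegalArgumentException("unknown word: " + word);
            }
            List<List<Pos>> extended = new ArrayList<>();
            for (List<Pos> prefix : results) {
                for (Pos tag : options) {
                    List<Pos> next = new ArrayList<>(prefix);
                    next.add(tag);
                    extended.add(next);
                }
            }
            results = extended;
        }
        return results;
    }
}
```

For "time flies like" this yields 3 x 2 x 2 = 12 candidate taggings, among them the three readings from the thread (verb-noun-preposition, noun-verb-preposition, adjective-noun-verb); a real parser would need grammar and context to prune the rest.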
