kanjidic parser in Perl?

David Alexander Ranvig

| Does anyone know of a parser for Jim Breen's kanjidic written in
| Perl?

<URL: http://search.cpan.org> is a nice tool for finding all things
Perl. Maybe you can use the module Lingua::JP::Kanjidic by Simon
Cozens?
 
Ben Bullock

David said:
| Does anyone know of a parser for Jim Breen's kanjidic written in
| Perl?

| <URL: http://search.cpan.org> is a nice tool for finding all things
| Perl. Maybe you can use the module Lingua::JP::Kanjidic by Simon
| Cozens?

Thanks for the tip. I had a look at it, and it seems to need some work.
Doesn't parse all the fields in the dictionary yet, unfortunately. I'll
try editing it up a bit.

Ben.
 
Ben Bullock

| Why parse kanjidic, when there is an XML edition available? (see
| http://www.csse.monash.edu.au/~jwb/kanjidic2/index.html)

I read that page yesterday and saw the comment

"At this stage the KANJIDIC2 file is officially released, but please
understand that it is still early days for the project and changes in the
structure may occur, so don't assume anything is set in concrete if you use
the file in a project."

So, I assumed the format of kanjidic is more stable.

Can you tell us if the format of the XML kanjidic is likely to change enough
to break existing software?

Also, while I'm at it, a small erratum. On both the kanjidic and kanjidic2
documentation pages, De Roo's kanji book is listed as being published by
Bojinsha, but this should be "Bonjinsha".

| There are lashings of XML parsers.

Not to mention lashings of ginger beer.

I don't know anything about XML, but surely the same amount of parsing work
is required for either format.

In the end I copied out an old C file from my former cjdic project which
contained most of the codes for kanjidic, and edited it to parse kanjidic
completely. The next job is to plug the information into MySQL.

In case anyone's interested, I'm planning to add a kanji dictionary function
to the slj FAQ. For example, on the list of kokuji,

http://www.sljfaq.org/afaq/kokuji-list.html,

I plan to actually remove all the information in the list except the
characters themselves, and make each character just a link to my dictionary
lookup function. Then the user can click the character to find information
if he or she is interested, or click a link to show the information he or
she is interested in.
 
jwb

| I read that page yesterday and saw the comment
| "At this stage the KANJIDIC2 file is officially released, but please
| understand that it is still early days for the project and changes in the
| structure may occur, so don't assume anything is set in concrete if you use
| the file in a project."

True. In fact the next release will have some DTD changes, and the data
will be in a slightly different format.

| So, I assumed the format of kanjidic is more stable.

Um. Well only up to a point. The order and content of everything between
the first two fields and the start of the readings is not fixed. What
will not change is the one or two letter codes on each field. New ones are
often created in the D* group.

| Can you tell us if the format of the XML kanjidic is likely to change enough
| to break existing software?

No. I don't know anything about the "existing software". Since kanjidic2
has few entity types and makes a lot of use of attributes, I am told by
XML people that it is kinder on parsers than, say, JMdict.

| Also, while I'm at it, a small erratum. On both the kanjidic and kanjidic2
| documentation pages, De Roo's kanji book is listed as being published by
| Bojinsha, but this should be "Bonjinsha".

Thanks.

| In the end I copied out an old C file from my former cjdic project which
| contained most of the codes for kanjidic, and edited it to parse kanjidic
| completely. The next job is to plug the information into MySQL.

Should be OK provided you didn't assume an order.
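
To illustrate the point about not assuming an order, here is a minimal sketch in Python (not the C code mentioned above, and not a complete parser). It assumes that the first two fields are the kanji and its JIS code, that every other coded field starts with a run of uppercase ASCII letters, that readings are kana tokens, and that meanings sit inside braces. The sample line is hand-made for illustration, and the function and key names are hypothetical.

```python
import re

# Hand-made line in the general shape of a kanjidic entry (illustrative only).
SAMPLE_LINE = "亜 3021 U4e9c B1 C7 G8 S7 N43 P4-7-1 Yya4 ア つ.ぐ T1 や つぎ {Asia} {rank next}"


def parse_kanjidic_line(line):
    """Parse one kanjidic-style line without assuming any field order."""
    # Meanings are wrapped in braces and may contain spaces, so pull them out first.
    meanings = re.findall(r"\{([^}]*)\}", line)
    rest = re.sub(r"\{[^}]*\}", " ", line)

    tokens = rest.split()
    entry = {
        "kanji": tokens[0],   # first field: the character itself
        "jis": tokens[1],     # second field: JIS code
        "codes": {},          # field code -> list of values, keyed by code only
        "readings": [],
        "nanori": [],
        "meanings": meanings,
    }

    target = entry["readings"]
    for token in tokens[2:]:
        match = re.match(r"([A-Z]+)(.*)", token)
        if match is None:
            # No leading ASCII uppercase letters, so treat the token as a reading.
            target.append(token)
        elif match.group(1) == "T":
            # Assumption: a T marker switches to name readings.
            # This sketch does not distinguish T1 from T2.
            target = entry["nanori"]
        else:
            code, value = match.groups()
            # A code may occur more than once, so collect values in a list.
            entry["codes"].setdefault(code, []).append(value)
    return entry


if __name__ == "__main__":
    print(parse_kanjidic_line(SAMPLE_LINE))
```

Because the fields are looked up by code rather than by position, reordering the coded fields on the line gives the same result, and a newly added code simply shows up as another key.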
 
Ben Finney

In sci.lang.japan, Ben Bullock said:
| Not to mention lashings of ginger beer.

Hurrah!

| I don't know anything about XML, but surely the same amount of
| parsing work is required for either format.

With the important difference that XML documents can be robustly
checked for validity, and then parsed into a traversable document
tree, by any XML library without specific knowledge of KANJIDIC.

And XML libraries, for better or worse, are firmly entrenched in just
about any programming environment these days.
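
As a rough illustration of that point, the sketch below uses Python's standard xml.etree.ElementTree module to turn a small XML fragment into a tree and walk it. The fragment is hand-written and only shaped like a KANJIDIC2 entry: the element and attribute names are assumptions for the example and should be checked against the current DTD, and the summarize function is hypothetical. Note that this module only checks that the document is well-formed; checking validity against the DTD would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element and attribute names are assumptions,
# not copied from the KANJIDIC2 DTD.
SAMPLE_XML = """<kanjidic2>
  <character>
    <literal>亜</literal>
    <misc>
      <grade>8</grade>
      <stroke_count>7</stroke_count>
    </misc>
    <reading_meaning>
      <rmgroup>
        <reading r_type="ja_on">ア</reading>
        <reading r_type="ja_kun">つ.ぐ</reading>
        <meaning>Asia</meaning>
      </rmgroup>
    </reading_meaning>
  </character>
</kanjidic2>"""


def summarize(xml_text):
    """Return one small dict per <character> element in the document."""
    # fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xml_text)
    entries = []
    for character in root.iter("character"):
        entries.append({
            "literal": character.findtext("literal"),
            "stroke_count": character.findtext("misc/stroke_count"),
            # Select readings by attribute value rather than by position.
            "on_readings": [
                r.text for r in character.iter("reading")
                if r.get("r_type") == "ja_on"
            ],
            "meanings": [m.text for m in character.iter("meaning")],
        })
    return entries


if __name__ == "__main__":
    print(summarize(SAMPLE_XML))
```

The library does the tokenizing and tree building; the only dictionary-specific knowledge left in the code is the handful of element and attribute names being looked up.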
 
