kanjidic parser in Perl?

Ben Bullock · Aug 14, 2005

Does anyone know of a parser for Jim Breen's kanjidic written in Perl?

David Alexander Ranvig · Aug 14, 2005

| Does anyone know of a parser for Jim Breen's kanjidic written in
| Perl?

<URL: http://search.cpan.org> is a nice tool for finding all things
perl. Maybe you can use the module Lingua::JP::Kanjidic by Simon
Cozens?

Ben Bullock · Aug 14, 2005

David said:
| | Does anyone know of a parser for Jim Breen's kanjidic written in
| | Perl?
|
| <URL: http://search.cpan.org> is a nice tool for finding all things
| perl. Maybe you can use the module Lingua::JP::Kanjidic by Simon
| Cozens?

Thanks for the tip. I had a look at it, and it seems to need some work.
Doesn't parse all the fields in the dictionary yet, unfortunately. I'll
try editing it up a bit.

Ben.

jwb · Aug 14, 2005

Ben Bullock said:
| Does anyone know of a parser for Jim Breen's kanjidic written in Perl?

Why parse kanjidic, when there is an XML edition available? (see
http://www.csse.monash.edu.au/~jwb/kanjidic2/index.html)
There are lashings of XML parsers.

Ben Bullock · Aug 14, 2005

| Why parse kanjidic, when there is an XML edition available? (see
| http://www.csse.monash.edu.au/~jwb/kanjidic2/index.html)

I read that page yesterday and saw the comment

"At this stage the KANJIDIC2 file is officially released, but please
understand that it is still early days for the project and changes in the
structure may occur, so don't assume anything is set in concrete if you use
the file in a project."

So, I assumed the format of kanjidic is more stable.

Can you tell us if the format of the XML kanjidic is likely to change enough
to break existing software?

Also, while I'm at it, a small erratum. On both the kanjidic and kanjidic2
documentation pages, De Roo's kanji book is listed as being published by
Bojinsha, but this should be "Bonjinsha".

| There are lashings of XML parsers.

Not to mention lashings of ginger beer.

I don't know anything about XML, but surely the same amount of parsing work
is required for either format.

In the end I copied out an old C file from my former cjdic project which
contained most of the codes for kanjidic, and edited it to parse kanjidic
completely. The next job is to plug the information into MySQL.

In case anyone's interested, I'm planning to add a kanji dictionary function
to the slj FAQ. For example, on the list of kokuji,

http://www.sljfaq.org/afaq/kokuji-list.html,

I plan to remove all the information in the list except the characters
themselves, and make each character a link to my dictionary lookup
function. Users can then click a character to look it up, or click a link
to show just the information they are interested in.

jwb · Aug 14, 2005

| I read that page yesterday and saw the comment
|
| "At this stage the KANJIDIC2 file is officially released, but please
| understand that it is still early days for the project and changes in the
| structure may occur, so don't assume anything is set in concrete if you use
| the file in a project."

True. In fact the next release will have some DTD changes, and the data
will be in a slightly different format.

| So, I assumed the format of kanjidic is more stable.

Um. Well, only up to a point. The order and content of everything between
the first two fields and the start of the readings is not fixed. What
will not change is the one- or two-letter code on each field. New ones are
often created in the D* group.

| Can you tell us if the format of the XML kanjidic is likely to change enough
| to break existing software?

No. I don't know anything about the "existing software". Since kanjidic2
has few entity types and makes a lot of use of attributes, I am told by
XML people that it is kinder on parsers than, say, JMdict.

| Also, while I'm at it, a small erratum. On both the kanjidic and kanjidic2
| documentation pages, De Roo's kanji book is listed as being published by
| Bojinsha, but this should be "Bonjinsha".

Thanks.

| In the end I copied out an old C file from my former cjdic project which
| contained most of the codes for kanjidic, and edited it to parse kanjidic
| completely. The next job is to plug the information into MySQL.

Should be OK provided you didn't assume an order.
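
To illustrate the point about not assuming an order, here is a minimal sketch (in Python rather than Perl, purely for illustration) of a line parser that keys every field on its leading code letters. The function name parse_kanjidic_line, the layout of the returned dictionary, the "leading uppercase letters" heuristic for splitting a code from its value, and the sample line are all assumptions made for this example. The real file has more codes and special cases than this handles.

```python
import re
from collections import defaultdict

# Meanings are wrapped in braces and may contain spaces, so pull them out first.
MEANING_RE = re.compile(r"\{([^}]*)\}")
# Heuristic (an assumption): a coded field is a run of uppercase ASCII
# letters followed by a value, e.g. "U4e9c" -> ("U", "4e9c").
CODE_RE = re.compile(r"([A-Z]+)(.+)")


def parse_kanjidic_line(line):
    """Parse one legacy-format line without relying on field order."""
    meanings = MEANING_RE.findall(line)
    tokens = MEANING_RE.sub(" ", line).split()
    entry = {
        "kanji": tokens[0],   # first field: the character itself
        "jis": tokens[1],     # second field: its JIS code
        "codes": defaultdict(list),
        "readings": [],
        "special_readings": defaultdict(list),
        "meanings": meanings,
    }
    section = None
    for token in tokens[2:]:
        match = CODE_RE.fullmatch(token)
        if match is None:
            # Not a coded field, so treat it as a reading.
            if section is None:
                entry["readings"].append(token)
            else:
                entry["special_readings"][section].append(token)
        elif match.group(1) == "T":
            # T1, T2, ... mark the readings that follow as special ones.
            section = token
        else:
            entry["codes"][match.group(1)].append(match.group(2))
    return entry


# Illustrative sample line, not copied from the real file.
sample = "亜 3021 U4e9c B1 G8 S7 Wa ア つ.ぐ T1 や {Asia} {rank next}"
entry = parse_kanjidic_line(sample)
print(entry["codes"]["U"], entry["readings"], entry["meanings"])
```

Because fields are looked up by code rather than by position, shuffling the tokens between the JIS code and the first reading gives the same result.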

Ben Finney · Aug 14, 2005

In sci.lang.japan, Ben Bullock said:
| Not to mention lashings of ginger beer.

Hurrah!

| I don't know anything about XML, but surely the same amount of
| parsing work is required for either format.

With the important difference that XML documents can be robustly
checked for validity, and then parsed into a traversable document
tree, by any XML library without specific knowledge of KANJIDIC.

And XML libraries, for better or worse, are firmly entrenched in just
about any programming environment these days.
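
As a rough sketch of what "without specific knowledge of KANJIDIC" can look like, the following uses Python's standard xml.etree.ElementTree to turn a small fragment into a tree and walk it generically. The element and attribute names in SAMPLE follow my understanding of the KANJIDIC2 layout and are an assumption; they may not match the release discussed in this thread. The walk helper is hypothetical. Note that ElementTree only checks well-formedness; validating against the DTD would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Assumed KANJIDIC2-style fragment, for illustration only.
SAMPLE = """<kanjidic2>
  <character>
    <literal>亜</literal>
    <codepoint><cp_value cp_type="ucs">4e9c</cp_value></codepoint>
    <reading_meaning>
      <rmgroup>
        <reading r_type="ja_on">ア</reading>
        <reading r_type="ja_kun">つ.ぐ</reading>
        <meaning>Asia</meaning>
        <meaning>rank next</meaning>
      </rmgroup>
    </reading_meaning>
  </character>
</kanjidic2>"""


def walk(element, path=""):
    """Yield (path, attributes, text) for every element carrying data.

    Nothing here knows any KANJIDIC2 tag name: the structure is
    discovered from the document itself.
    """
    here = f"{path}/{element.tag}"
    text = (element.text or "").strip()
    if text or element.attrib:
        yield here, dict(element.attrib), text
    for child in element:
        yield from walk(child, here)


root = ET.fromstring(SAMPLE)  # raises ParseError if not well-formed
rows = list(walk(root))
for path, attrs, text in rows:
    print(path, attrs, text)
```

The same walk function would work unchanged on any other well-formed XML file, which is the practical difference from a hand-written parser for the legacy format.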
