Information, code, or reading about machine-based textual analysis and classification?

Guest

I'm hoping someone here can point the way toward a fairly specialized topic.
I have a large amount of content that needs to be classified. Irrelevant or
"uninteresting" (by our criteria) articles need to be discarded.
"Interesting" articles should be tagged with a number of points of metadata.

As much as possible, I would like machines to do this work. Fairly dumb
ways of doing this might include using our existing databases/metadata as
keyword collections and classifying based on brute-force scans and matches
against our content (e.g. strings "Department of Laboratory Medicine",
"Dr. James Fine"). Smarter ways may have been suggested by some of the
presentations at the recent "Google Developer Day" in Mountain View:
programmatic analysis of seed texts led to mechanisms of analysis that
seemed much more efficient than raw text scanning.
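
To make the brute-force idea concrete, here is a minimal sketch in Python of scanning an article for known phrases and returning tags. The keyword table and the tag names are hypothetical placeholders; in practice they would be loaded from the existing databases/metadata.

```python
import re

# Hypothetical keyword collection: phrase -> metadata tag.
# In practice this would be loaded from existing databases/metadata.
KEYWORDS = {
    "Department of Laboratory Medicine": "dept:lab-medicine",
    "Dr. James Fine": "person:james-fine",
}

def tag_article(text, keywords=KEYWORDS):
    """Return the sorted tags whose phrase occurs in text (case-insensitive)."""
    # Collapse runs of whitespace so line breaks inside a phrase still match.
    normalized = re.sub(r"\s+", " ", text).lower()
    return sorted(
        tag for phrase, tag in keywords.items()
        if phrase.lower() in normalized
    )
```

An article that yields no tags at all could be treated as a candidate for disposal. Plain substring matching like this will miss variants such as "James Fine, MD", which is one reason it counts as the "dumb" approach.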

I am speculating based on no real knowledge, but I would imagine it would be
possible to develop some kind of "relevance index" for an item as compared
to an existing body of text, and keep or dump based on a threshold. More
interestingly, maybe I have classified a thousand articles as, say, "UW
biomedical research," and there is an algorithmic means by which we could
assess the "UW biomedical research-ness" of an unknown text. That would be
very useful.
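
One simple way such a "relevance index" might look, as a sketch only: build a word-count profile from the already-classified articles, then score an unknown text by cosine similarity against that profile and keep or dump it based on a threshold. The function names and the 0.2 threshold below are illustrative assumptions, not taken from any particular library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_profile(seed_texts):
    """Sum word counts over all seed articles of one category."""
    profile = Counter()
    for text in seed_texts:
        profile.update(tokenize(text))
    return profile

def relevance(text, profile):
    """Cosine similarity between the text's word counts and the profile (0 to 1)."""
    counts = Counter(tokenize(text))
    dot = sum(n * profile[word] for word, n in counts.items())
    norm = (math.sqrt(sum(n * n for n in counts.values()))
            * math.sqrt(sum(n * n for n in profile.values())))
    return dot / norm if norm else 0.0

def keep(text, profile, threshold=0.2):
    """Keep the article if its score reaches the (arbitrary) threshold."""
    return relevance(text, profile) >= threshold
```

Raw counts are dominated by common words such as "the", so real use would likely need stop-word removal or TF-IDF weighting, and the threshold would have to be tuned on articles whose classification is already known. A trained classifier such as naive Bayes is a common next step beyond this kind of similarity score.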

Are there resources or readings I can be looking at? Are there any
pre-existing libraries or frameworks or tools that could ease this task? The
content lives in MS SQL Server 2005 or can be placed in it; Index Server is
installed on my servers; maybe there are things within these tools that can
help.

Thanks in advance for any leads you can offer.

-KF
 
