Clustering text-documents in bundles

exhuma.twn

Hi,

This *is* off-topic but with python being a language with a somewhat
scientific audience, I might get lucky ;)
I have a set of documents (helpdesk tickets, in fact) and I would like
to automatically group them into bundles so I can visualise some
statistics based on their content.

A while ago I wrote a very simple clustering library which can cluster
just about anything for which you can calculate some form of distance.
Meaning: you supply a function that calculates a numeric value given
two objects (helpdesk request text bodies in this case). The more
closely the two objects are related, the smaller the returned value,
with 0.0 meaning that the two objects are identical.

Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But maybe there is a well-tested and
useful algorithm already available?
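
Roughly the sort of thing I have in mind, just as a sketch (the
noise-word list and the tokenisation are only placeholders, nothing
tested):

import math
import re
from collections import Counter

# Placeholder noise-word list -- a real one would be much longer.
STOP_WORDS = set("the a an and or is are to of in on for with".split())

def word_counts(text):
    """Count the non-noise words in a chunk of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def distance(text_a, text_b):
    """Return 0.0 for identical word distributions and 1.0 for texts
    sharing no words (cosine distance over plain word counts)."""
    a, b = word_counts(text_a), word_counts(text_b)
    if not a or not b:
        return 1.0
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

The clustering library would then just call distance() on pairs of
ticket bodies.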

Text processing is a very blurry area for me. I don't expect a
ready-made solution right away. Maybe just some pointers as to *what*
I can google for. I'll pick the rest up from there.

Eventually I would like to be able to say: "This set of texts contains
20 requests dealing with emails, 30 requests dealing with Office
Applications and 210 requests dealing with databases". I suppose
labelling the different bundles will have to be done manually, but
since I will aim for no more than 10 bundles anyway, that's OK.
 
Paul Hankin

exhuma.twn said:
Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But maybe there is a well-tested and
useful algorithm already available?

A good distance between two chunks of text is the number of changes
you have to make to one to transform it into the other. You should
look at 'difflib', with which you should be able to code up this sort
of distance (although the details will depend on just what your text
looks like).
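
For example, something along these lines (untested, and whether you
compare word lists or raw character strings will depend on your
tickets):

import difflib

def distance(text_a, text_b):
    # ratio() is 1.0 for identical sequences and 0.0 for completely
    # different ones, so 1.0 - ratio() is 0.0 when the texts match.
    # Comparing word lists rather than raw strings keeps long tickets
    # from scoring on coincidental character overlap.
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
    return 1.0 - matcher.ratio()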
 
Paul Rubin

exhuma.twn said:
Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But maybe there is a well-tested and
useful algorithm already available?

There's a huge field of text mining that attempts to do things like
this. See http://en.wikipedia.org/wiki/Latent_semantic_analysis for
some info about one approach. Manning & Schütze's book "Foundations of
Statistical Natural Language Processing" (http://nlp.stanford.edu/fsnlp/)
is a standard reference about text processing. They also have a new
one about information retrieval (downloadable as a PDF) that looks
very good: <http://informationretrieval.org>.
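
If you just want a feel for the LSA idea before digging into the
books: build a term-document count matrix, take a truncated SVD, and
compare the documents in the reduced space. A rough sketch (assumes
numpy; the tokenisation and the number of dimensions k are only
placeholders):

import re
import numpy

def lsa_vectors(texts, k=10):
    """Term-document count matrix -> truncated SVD -> one k-dimensional
    vector per document."""
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))
    counts = numpy.zeros((len(vocab), len(texts)))
    for col, text in enumerate(texts):
        for word in tokenize(text):
            counts[vocab[word], col] += 1
    u, s, vt = numpy.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    # Each column of diag(s[:k]) . vt[:k] is one document in the reduced
    # "concept" space; distances there depend less on exact wording.
    return numpy.dot(numpy.diag(s[:k]), vt[:k]).T

You could then feed those rows into whatever distance function your
clustering library already uses, instead of raw word counts.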
 
exhuma.twn

Paul Rubin said:
There's a huge field of text mining that attempts to do things like
this. See http://en.wikipedia.org/wiki/Latent_semantic_analysis for
some info about one approach. Manning & Schütze's book "Foundations of
Statistical Natural Language Processing" (http://nlp.stanford.edu/fsnlp/)
is a standard reference about text processing. They also have a new
one about information retrieval (downloadable as a PDF) that looks
very good: <http://informationretrieval.org>.

Thanks a lot. This gives me some bed-time reading.
 
