Clustering text-documents in bundles

exhuma.twn

Hi,

This *is* off-topic but with python being a language with a somewhat
scientific audience, I might get lucky ;)
I have a set of documents (helpdesk tickets, in fact) and I would like
to automatically group them into bundles so I can visualise some
statistics based on their content.

A while ago I wrote a very simple clustering library which can cluster
just about anything for which you can calculate some form of distance.
Meaning: you supply a function that calculates a numeric value given
two objects (helpdesk request text bodies in this case). The more
closely the two objects are related, the smaller the returned value,
with 0.0 meaning that the two objects are identical.

Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But maybe there is a well-tested and
useful algorithm already available?
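
Roughly the sort of thing I have in mind, just as a sketch (the
noise-word list and the tokenisation are only placeholders, nothing
tested):

import math
import re
from collections import Counter

# Placeholder noise-word list -- a real one would be much longer.
STOP_WORDS = set("the a an and or is are to of in on for with".split())

def word_counts(text):
    """Count the non-noise words in a chunk of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def distance(text_a, text_b):
    """Return 0.0 for identical word distributions and 1.0 for texts
    sharing no words (cosine distance over plain word counts)."""
    a, b = word_counts(text_a), word_counts(text_b)
    if not a or not b:
        return 1.0
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

The clustering library would then just call distance() on pairs of
ticket bodies.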

Text processing is a very blurry area for me. I don't expect a
ready-made solution right away. Maybe just some pointers as to *what*
I can google for. I'll pick the rest up from there.

Eventually I would like to be able to say: "This set of texts contains
20 requests dealing with emails, 30 requests dealing with Office
Applications and 210 requests dealing with databases". I suppose
labelling the different bundles will have to be done manually, but
since I will aim for no more than 10 bundles anyway, that's OK.
 
Paul Hankin

exhuma.twn said:
Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But maybe there is a well-tested and
useful algorithm already available?

A good distance between two chunks of text is the number of changes
you have to make to one to transform it into the other. You should
look at 'difflib', with which you should be able to code up this sort
of distance (although the details will depend on just what your text
looks like).
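
For example, something along these lines (untested, and whether you
compare word lists or raw character strings will depend on your
tickets):

import difflib

def distance(text_a, text_b):
    # ratio() is 1.0 for identical sequences and 0.0 for completely
    # different ones, so 1.0 - ratio() is 0.0 when the texts match.
    # Comparing word lists rather than raw strings keeps long tickets
    # from scoring on coincidental character overlap.
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
    return 1.0 - matcher.ratio()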
 
Paul Rubin

exhuma.twn said:
Is it possible to calculate a distance between two chunks of text? I
suppose one could do a simple word count on the chunks (removing
common noise words, of course) and go from there, maybe even assigning
different weights to words. But maybe there is a well-tested and
useful algorithm already available?

There's a huge field of text mining that attempts to do things like
this. See http://en.wikipedia.org/wiki/Latent_semantic_analysis for
some info about one approach. Manning & Schütze's book "Foundations of
Statistical Natural Language Processing" (http://nlp.stanford.edu/fsnlp/)
is a standard reference about text processing. They also have a new
one about information retrieval (downloadable as a PDF) that looks
very good: <http://informationretrieval.org>.
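
If you just want a feel for the LSA idea before digging into the
books: build a term-document count matrix, take a truncated SVD, and
compare the documents in the reduced space. A rough sketch (assumes
numpy; the tokenisation and the number of dimensions k are only
placeholders):

import re
import numpy

def lsa_vectors(texts, k=10):
    """Term-document count matrix -> truncated SVD -> one k-dimensional
    vector per document."""
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))
    counts = numpy.zeros((len(vocab), len(texts)))
    for col, text in enumerate(texts):
        for word in tokenize(text):
            counts[vocab[word], col] += 1
    u, s, vt = numpy.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    # Each column of diag(s[:k]) . vt[:k] is one document in the reduced
    # "concept" space; distances there depend less on exact wording.
    return numpy.dot(numpy.diag(s[:k]), vt[:k]).T

You could then feed those rows into whatever distance function your
clustering library already uses, instead of raw word counts.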
 
exhuma.twn

Paul Rubin said:
There's a huge field of text mining that attempts to do things like
this. See http://en.wikipedia.org/wiki/Latent_semantic_analysis for
some info about one approach. Manning & Schütze's book "Foundations of
Statistical Natural Language Processing" (http://nlp.stanford.edu/fsnlp/)
is a standard reference about text processing. They also have a new
one about information retrieval (downloadable as a PDF) that looks
very good: <http://informationretrieval.org>.

Thanks a lot. This gives me some bed-time reading.
 
