General natural language analysis question: where do I start?

Discussion in 'Perl Misc' started by Ted Byers, Jun 2, 2010.

  1. Ted Byers

    Ted Byers Guest

    At this point, I don't even know what sort of query to submit to
    Google to find resources that would help automate a solution to this
    problem. I can do it manually, but that is quite tedious as I have a
    couple thousand distinct strings to process, and for all I know, I
    could have thousands more a month from now.

    This is a business problem: each record is a company's own
    free-text description of its business. E.g.:

    "barber & hair salon"
    "barber /beauty salon"
    "barber college"
    "barber salon"
    "barber school"
    "barber shop "
    "barber shop & hair salon"
    "barber shop and beauty salon"
    "barber shop"
    "barber shop/ bar & grill "
    "barber shop/ hair salon"
    "barber shop/natural hair salon"
    "barbershop"
    "barbershop/hair salon"
    "hair salon "
    "hair salon "
    "hair salon & day spa"
    "hair salon and spa"
    "hair salon"
    "hair salon, nails, tanning, products, bistro, crafts & food
    consignments"
    "hair salon, spa, herbal clinic, boutique all in 1"
    "hair salon/ club"
    "hair salon/ spa"
    "hair salon/nail shop"
    "hair school"
    "hair store"
    "hair studio and hair product distribution"
    "hair supply store"

    What I need to do is reduce the number of "business types" in the data
    to a few rational choices. I can tell from visual inspection that
    the businesses with most of the labels listed above can be grouped as
    "personal grooming services". However, the school/college type
    businesses would not belong in such a group, and neither would those
    with the last three labels.

    This task, as I said, is rather easy to handle manually, but tedious
    and time-consuming.

    The question is, "Is there a Perl package or other resource that would
    make this task something I can automate?" Or, if you have experience
    with this sort of thing, can you advise on a suitable Google search
    that will produce more useful information than random noise? I ask
    here because this strikes me as the kind of task that Perl would be
    particularly good at (I have already made a start, using Perl, on
    cleaning up the data: e.g. removing irrelevant characters, fixing
    spelling mistakes, &c.).
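
    A minimal sketch of that kind of cleanup pass, assuming the goal is only to collapse trivially different spellings (case, stray whitespace, "&" versus "and", spacing around "/") before any grouping is attempted. The normalize() helper and the "barbershop" rule are hypothetical illustrations, not part of the original data handling:

```perl
use strict;
use warnings;

# Normalize one free-text business description so that trivially
# different spellings collapse to the same string.
sub normalize {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/&/ and /g;                      # treat "&" and "and" alike
    $s =~ s{\s*/\s*}{ / }g;                 # uniform spacing around slashes
    $s =~ s/\bbarbershop\b/barber shop/g;   # hypothetical spelling rule
    $s =~ s/\s+/ /g;                        # collapse runs of whitespace
    $s =~ s/^\s+|\s+$//g;                   # trim both ends
    return $s;
}

my @raw = (
    'barber shop ', 'barbershop', 'Barber Shop',
    'hair salon & day spa', 'hair salon/ spa', 'hair salon/spa',
);

# Keep the first occurrence of each normalized string.
my %seen;
my @distinct = grep { !$seen{$_}++ } map { normalize($_) } @raw;
print "$_\n" for @distinct;
```

    Running the strings through a pass like this first shrinks the list that has to be classified, whether by hand or by program.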

    Any information you can provide would be appreciated.

    Thanks

    Ted
    Ted Byers, Jun 2, 2010
    #1

  2. ccc31807

    ccc31807 Guest

    On Jun 2, 12:10 pm, Ted Byers <> wrote:

    Ted,

    Looking at your data, I see that every row contains either 'barber' or
    'hair' and that it would be trivial to filter your data according to
    this criterion, like this maybe:

    push @grooming, $_ if $_ =~ /(barber|hair)/;

    Obviously, you need some sets of eyes to decide if a 'barber school'
    or 'hair supply store' should be included. My approach might be to use
    automation to do some gross sorting and use humans to fine tune your
    data.

    At the same time, you might develop some heuristics to improve your
    automation, realizing that you can't depend on automation for absolute
    perfection.
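
    A minimal sketch of that split between gross automated sorting and human fine-tuning, assuming an ordered list of regex rules where the first match wins and anything unmatched is set aside for review. The group names 'training' and 'retail' and the patterns themselves are illustrative assumptions, not categories taken from the data:

```perl
use strict;
use warnings;

# Ordered rules: the first pattern that matches decides the group.
# More specific exclusions (schools, stores) come before the broad
# barber/hair rule so they are not swallowed by it.
my @rules = (
    [ qr/\b(?:school|college)\b/i,            'training' ],
    [ qr/\b(?:store|supply|distribution)\b/i, 'retail' ],
    [ qr/\b(?:barber|hair)/i,                 'personal grooming services' ],
);

# Return the group for a description, or undef if no rule matches.
sub classify {
    my ($desc) = @_;
    for my $rule (@rules) {
        my ($pattern, $group) = @$rule;
        return $group if $desc =~ $pattern;
    }
    return;    # no rule matched: leave it for a human
}

my (%grouped, @review);
for my $desc ('barber shop', 'barber college', 'hair supply store',
              'barbershop/hair salon', 'bar & grill') {
    my $group = classify($desc);
    if (defined $group) { push @{ $grouped{$group} }, $desc }
    else                { push @review, $desc }
}

print "$_: @{ $grouped{$_} }\n" for sort keys %grouped;
print "needs review: @review\n";
```

    Each time a human resolves something from the review list, the decision can be fed back as a new rule, which is one way the heuristics could improve over time.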

    CC
    ccc31807, Jun 2, 2010
    #2

  3. Ted Byers

    Ted Byers Guest

    On Jun 2, 5:12 pm, ccc31807 <> wrote:
    > On Jun 2, 12:10 pm, Ted Byers <> wrote:
    >
    > Ted,
    >
    > Looking at your data, I see that every row contains either 'barber' or
    > 'hair' and that it would be trivial to filter your data according to
    > this criterion, like this maybe:
    >
    > push @grooming, $_ if $_ =~ /(barber|hair)/;
    >
    > Obviously, you need some sets of eyes to decide if a 'barber school'
    > or 'hair supply store' should be included. My approach might be to use
    > automation to do some gross sorting and use humans to fine tune your
    > data.
    >
    > At the same time, you might develop some heuristics to improve your
    > automation, realizing that you can't depend on automation for absolute
    > perfection.
    >
    > CC

    Thanks.

    I had noticed that, but the list was just one illustrative selection.
    Going through the rest of the data since I originally posted, I found
    other items that ought to be grouped with barber shops but that
    include neither "hair" nor "barber". I have, in fact, a file with
    almost 3000 records covering every imaginable kind of business, and
    some for which I have no idea what the business actually does.

    As we're looking at a "simple" classification with something of the
    order of 100 logical groups, it would be at least as time consuming to
    manually come up with a filter for each group as it is to simply
    manually reclassify each using any decent spreadsheet. I was hoping
    that there was a package, with a dictionary, that was able to produce
    a relation between a set of phrases and a set of synonymous words that
    would accelerate the process.
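
    A minimal sketch of the dictionary idea, assuming a hand-maintained hash that maps keywords to groups and a simple vote count per description. The %group_for table, its group names, and the vote() helper are hypothetical; no existing package or ready-made dictionary is implied:

```perl
use strict;
use warnings;

# Hypothetical hand-maintained dictionary: keyword => group.
my %group_for = (
    barber => 'personal grooming services',
    hair   => 'personal grooming services',
    salon  => 'personal grooming services',
    nails  => 'personal grooming services',
    spa    => 'personal grooming services',
    bar    => 'food and drink',
    grill  => 'food and drink',
    bistro => 'food and drink',
);

# Count keyword hits per group; return [group, hits] pairs,
# most hits first (ties broken alphabetically).
sub vote {
    my ($desc) = @_;
    my %votes;
    for my $word (lc($desc) =~ /[a-z]+/g) {
        my $group = $group_for{$word} or next;
        $votes{$group}++;
    }
    return map  { [ $_, $votes{$_} ] }
           sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
}

for my $desc ('beauty salon and day spa', 'barber shop/ bar & grill',
              'tattoo parlour') {
    my @ranked = vote($desc);
    my $guess  = @ranked ? "$ranked[0][0] ($ranked[0][1] hits)" : 'UNKNOWN';
    print "$desc => $guess\n";
}
```

    With roughly 100 groups the table still has to be built by hand, but one keyword entry covers every description that contains it, and descriptions with no hits or with votes for several groups are flagged for a closer look.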

    Thanks again,

    Ted
    Ted Byers, Jun 3, 2010
    #3
  4. Jürgen Exner

    Ted Byers <> wrote:
    >At this point, I don't even know what sort of query to submit to
    >google to find resources to help find an automated solution to this
    >problem. I can do it manually, but that is quite tedious as I have a
    >couple thousand distinct strings to process, and for all I know, I
    >could have thousands more a month from now.
    >
    >This is a business problem, in that the data represents company data
    >in which the company has provided a description of the business.
    >E.g.:
    >
    >What I need to do is reduce the number of "business types" in the data
    >to a few rational choices.


    There are people who have done that already. You can find their
    classification and "business types" in any yellow pages book.

    >I can tell, from visual inspection, that
    >the businesses with most of the above listed labels, can be grouped as
    >"personal grooming services". However, the school/college type
    >businesses would not be appropriately included in such a group.
    >Neither would those with the last three labels be appropriately
    >included in such a group.


    No chance but to manually classify them. You might be able to automate
    some of it (e.g. for "barber shop"), but otherwise you need semantic
    knowledge.

    jue
    Jürgen Exner, Jun 3, 2010
    #4
  5. Peter J. Holzer

    On 2010-06-03 01:17, Jürgen Exner <> wrote:
    > Ted Byers <> wrote:
    >>At this point, I don't even know what sort of query to submit to
    >>google to find resources to help find an automated solution to this
    >>problem. I can do it manually, but that is quite tedious as I have a
    >>couple thousand distinct strings to process, and for all I know, I
    >>could have thousands more a month from now.
    >>
    >>This is a business problem, in that the data represents company data
    >>in which the company has provided a description of the business.
    >>E.g.:


    [list with a few surprises deleted]


    >>What I need to do is reduce the number of "business types" in the data
    >>to a few rational choices.

    >
    > There are people who have done that already. You can find their
    > classification and "business types" in any yellow pages book.


    AIUI coming up with a classification isn't the problem. Assigning free
    text descriptions to classes is.

    >>I can tell, from visual inspection, that the businesses with most of
    >>the above listed labels, can be grouped as "personal grooming
    >>services". However, the school/college type businesses would not be
    >>appropriately included in such a group. Neither would those with the
    >>last three labels be appropriately included in such a group.

    >
    > No chance but to manually classify them. You might be able to automate
    > some of it (e.g. for "barber shop"), but otherwise you need semantic
    > knowledge.


    I agree with that. It may help to have a program which tries to guess
    the classification of each term and lets the user manually override it.
    The guess could be implemented with Bayesian logic or something
    similar.
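
    A minimal sketch of such a guess-then-override program, assuming a small multinomial naive Bayes classifier with add-one smoothing written in core Perl (5.10 or later for the // operator). The training pairs, the short group names, and the %override table are illustrative assumptions; CPAN modules such as Algorithm::NaiveBayes could fill the same role:

```perl
use strict;
use warnings;

my (%word_count, %total_words, %doc_count, %vocab, %override);
my $docs = 0;

# Split a description into lowercase words.
sub tokens { return lc($_[0]) =~ /[a-z]+/g }

# Record one human-confirmed (description, group) pair.
sub train {
    my ($desc, $group) = @_;
    $doc_count{$group}++;
    $docs++;
    for my $w (tokens($desc)) {
        $word_count{$group}{$w}++;
        $total_words{$group}++;
        $vocab{$w} = 1;
    }
}

# Most probable group under naive Bayes (log scores, add-one smoothing).
sub guess {
    my ($desc) = @_;
    my $v = scalar(keys %vocab) || 1;
    my ($best, $best_score);
    for my $group (sort keys %doc_count) {
        my $score = log($doc_count{$group} / $docs);
        for my $w (tokens($desc)) {
            my $n = $word_count{$group}{$w} // 0;
            $score += log(($n + 1) / (($total_words{$group} // 0) + $v));
        }
        ($best, $best_score) = ($group, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# A manual decision, when present, always wins over the guess.
sub classify {
    my ($desc) = @_;
    return $override{$desc} // guess($desc);
}

train('barber shop',          'grooming');
train('hair salon',           'grooming');
train('beauty salon and spa', 'grooming');
train('barber college',       'training');
train('hair school',          'training');
train('hair supply store',    'retail');

printf "%s => %s\n", $_, classify($_) for 'barber school', 'natural hair salon';
```

    Every manual override can also be passed to train(), so the guesses should get better as more of the list is confirmed by hand.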

    hp
    Peter J. Holzer, Jun 3, 2010
    #5
