Heuristically processing documents

Discussion in 'Python' started by BJörn Lindqvist, Mar 19, 2009.

  1. I have a large set of documents in various text formats. I know that
    each document contains its author's name, email and phone number.
    Sometimes it also contains the author's home address.

    The task is to extract the name, email and phone number from as many
    documents as possible. Since the documents are not in a specific
    format, you have to do a lot of guessing, and approximate results are
    fine.

    For example, to find the email you can use a simple regexp. If there
    is a match you can be certain that it is the author's email. But what
    algorithms can you use to figure out the other information?

    --
    Best regards, Björn
    BJörn Lindqvist, Mar 19, 2009
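
The "simple regexp" idea for emails mentioned in the post could look roughly like the sketch below. The pattern, the `find_email` helper and the sample text are illustrative assumptions, not something from the original post. The pattern is deliberately loose and does not try to cover every valid address.

```python
import re

# Deliberately simple pattern (an assumption for illustration):
# a local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_email(text):
    """Return the first email-like string in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


# Hypothetical sample document text.
sample = "Written by Jane Doe <jane.doe@example.org>, phone 555-0100."
print(find_email(sample))
```

Taking only the first match assumes the author's address is the first one in the document. If a document can contain several addresses, `EMAIL_RE.findall(text)` returns all candidates so they can be ranked by some other heuristic.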