Heuristically processing documents

Discussion in 'Python' started by BJörn Lindqvist, Mar 19, 2009.

  1. I have a large set of documents in various text formats. I know that
    each document contains its author's name, email and phone number.
    Sometimes it also contains the author's home address.

    The task is to extract the name, email and phone number from as many
    documents as possible. Since the documents are not in a specific
    format, you have to do a lot of guessing, and approximate results are
    fine.

    For example, to find the email you can use a simple regexp. If there
    is a match, you can be certain that it is the author's email. But what
    algorithms can you use to figure out the other information?
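
The email step mentioned above could look roughly like this. It is only a minimal sketch under stated assumptions: the pattern is a deliberately simple approximation rather than a full address validator, and the names `EMAIL_RE` and `find_emails` are hypothetical, chosen for illustration.

```python
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 matcher):
# a local part, "@", then a domain containing at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return candidate email addresses in order of first appearance, without duplicates."""
    candidates = []
    for match in EMAIL_RE.findall(text):
        if match not in candidates:
            candidates.append(match)
    return candidates


# Made-up sample text, for illustration only.
sample = "Written by Jane Doe <jane.doe@example.org>, tel. 555-0100."
print(find_emails(sample))
```

Returning every candidate instead of only the first match leaves the decision open for documents that happen to mention more than one address.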

    --
    Regards, Björn
     
    BJörn Lindqvist, Mar 19, 2009