Is there a need for semi-automatic text-to-XML conversion?

Discussion in 'XML' started by Francesc, Dec 23, 2005.

  1. Francesc

    Francesc Guest

    Hi, I'm new to this newsgroup and I'd like to ask some questions and
    get opinions on this subject.

    I'm developing an application focused on a very specific task: cleaning
    and labelling text documents with user-defined structural tags (title,
    cite, date, paragraph, itemList, ...). It performs the typical
    pre-processing tasks needed in computational linguistics so that
    statistical tools can be applied to large corpora.

    But I'm worried that this field is too small/specific. I chose it
    because it's a field that I know and where I had some contacts, *but*
    I'm not sure whether university research departments are able to spend
    money on purchasing software, or whether they are too used to the
    free/open source world.

    For this reason I'm looking for some other field where the task of
    adding structural labels to text is needed (specifically, converting
    unstructured, format-oriented documents to structured,
    function-oriented XML documents). Maybe some area of publishing, but I
    think they would not be interested in "small" desktop applications.

    Has any of you worked for or heard about a business with
    this kind of need? Do you think there is demand for legacy
    document conversion in small businesses?

    Some info about the application:

    - Importing from the main document formats (TXT, HTML, RTF, others?).
    - GUI-based interactive labelling (active learning techniques,
    similar to OCR programs).
    - Interactive labelling is used to "train" the program by automatic
    induction of statistical rules (based on textual, lexical,
    typographical and structural properties of the block).
    - After training, the labeller can be used for batch processing in a
    fully autonomous mode.
    - Exporting to user-defined XML (any standard? DocBook? TEI?)
    - A lot of small cleaning and normalization tasks: removing headers,
    de-hyphenation, reconstruction of paragraphs with broken lines,
    removing non-textual or decorative elements (such as ASCII art), ...
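
    The cleaning tasks in the last item, and the kind of block properties
    a trained labeller might look at, could be sketched roughly as below.
    This is only an illustration in Python, not the application's code:
    the function names, the regular expressions and the particular
    features are all assumptions made for the example.

```python
import re

def is_decorative(line):
    """True for non-empty lines with no letters or digits (rules, ASCII-art borders)."""
    stripped = line.strip()
    return bool(stripped) and not any(ch.isalnum() for ch in stripped)

def dehyphenate(text):
    """Join words split by a hyphen at a line break ('docu-' + 'ment').

    Naive on purpose: a genuine compound broken at the hyphen would also
    be joined, so a real tool would need a dictionary or statistics.
    """
    return re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)

def rebuild_paragraphs(text):
    """Drop decorative lines, de-hyphenate, and merge broken lines per block."""
    # Decorative lines become blank lines so they still separate paragraphs.
    lines = ["" if is_decorative(ln) else ln.rstrip() for ln in text.splitlines()]
    cleaned = dehyphenate("\n".join(lines))
    blocks = re.split(r"\n\s*\n", cleaned)
    return [" ".join(b.split()) for b in blocks if b.strip()]

def block_features(block):
    """A few hypothetical typographical/structural features of one block."""
    return {
        "n_words": len(block.split()),
        "all_caps": block.isupper(),
        "ends_with_period": block.rstrip().endswith("."),
        "starts_with_bullet": block.lstrip().startswith(("-", "*")),
    }

if __name__ == "__main__":
    raw = "=====\nLegacy docu-\nment conversion\nis hard.\n\n*****\nSecond para-\ngraph here.\n"
    for para in rebuild_paragraphs(raw):
        print(para, block_features(para))
```

    Feature dictionaries like these are what a statistical labeller would
    be trained on; the labelling step itself is left out here.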

    I think that legacy document conversion may be a need for many
    businesses, but I haven't been able to find them. Maybe some of you
    can give me a clue?

    Thanks very much in advance.
     
    Francesc, Dec 23, 2005
    #1

  2. Francesc

    Guest

    Francesc wrote:
    > Hi, I'm new to this newsgroup and I'd like to ask some questions and
    > get opinions on this subject.
    >
    > EDITED FOR BREVITY
    >
    > I think that legacy document conversion may be a need for many
    > businesses, but I haven't been able to find them. Maybe some of you
    > can give me a clue?
    >
    > Thanks very much in advance.


    You might want to look at companies like Exegenix (www.exegenix.com).

    There are a number of vendors who provide XML conversion software.
     
    , Dec 27, 2005
    #2

  3. Francesc

    Guest

    Thanks for the link. This program seems to be very similar to what I'm
    doing but, as is often the case, the company is focused on "big
    services". I wonder why there are no "small desktop applications" to
    help taggers automate their labelling tasks.

    The good news is that if there is a market large enough to sustain a
    big company like Exegenix, there is surely a market to sustain a small
    company like mine. :)

    Francesc
     
    , Dec 28, 2005
    #3
  4. Francesc

    Guest

    Francesc wrote:

    > I'm worried that this field is too small/specific. I chose it
    > because it's a field that I know and where I had some contacts, *but*
    > I'm not sure whether university research departments are able to spend
    > money on purchasing software, or whether they are too used to the
    > free/open source world.


    Yes, Francesc, such research efforts are undertaken by universities and
    publishing houses around the world. The single most important reason
    for this is that XML is customizable to a very large extent (compared
    to HTML). Many such organizations have spent (and are still spending)
    money and effort on this front. But I haven't heard of any commercial
    application that can convert text to XML in a single pass.

    > For this reason I'm looking for some other field where the task of
    > adding structural labels to text is needed (specifically, converting
    > unstructured, format-oriented documents to structured,
    > function-oriented XML documents). Maybe some area of publishing, but I
    > think they would not be interested in "small" desktop applications.


    That sounds interesting, and yes, there may be publications (I am
    not sure about this) that need to convert unstructured
    information to XML. But again, isn't XML format-oriented itself? The
    basic purpose of text-to-XML conversion (in publishing houses and
    universities) is that the XML documents are added to a data bank in
    which they can be searched and sorted. I believe this is possible
    only by structuring them.
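
    As a small illustration of that point, once documents carry
    structural tags they can be sorted and searched by element. The
    sketch below uses Python's standard xml.etree.ElementTree; the
    <corpus> and <doc> wrappers and the sample content are made up for
    the example, while title, date and paragraph are tags named earlier
    in the thread.

```python
import xml.etree.ElementTree as ET

# Made-up sample data, only to have something to query.
SAMPLE = """<corpus>
  <doc><title>Legacy conversion</title><date>2005-12-23</date>
       <paragraph>Converting text to XML.</paragraph></doc>
  <doc><title>Active learning</title><date>2004-02-13</date>
       <paragraph>Training a labeller interactively.</paragraph></doc>
</corpus>"""

def titles_sorted_by_date(xml_text):
    """Return document titles ordered by their <date> element (ISO dates sort as text)."""
    root = ET.fromstring(xml_text)
    docs = sorted(root.findall("doc"), key=lambda d: d.findtext("date", default=""))
    return [d.findtext("title") for d in docs]

def docs_mentioning(xml_text, word):
    """Return titles of documents with a <paragraph> containing the word (case-insensitive)."""
    root = ET.fromstring(xml_text)
    needle = word.lower()
    return [
        d.findtext("title")
        for d in root.iter("doc")
        if any(needle in (p.text or "").lower() for p in d.iter("paragraph"))
    ]

if __name__ == "__main__":
    print(titles_sorted_by_date(SAMPLE))
    print(docs_mentioning(SAMPLE, "xml"))
```

    Neither query would be possible on the same text without the tags,
    short of guessing where titles and dates are.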

    > Has any of you worked for or heard about a business with
    > this kind of need? Do you think there is demand for legacy
    > document conversion in small businesses?


    I have also been part of one such organization.

    > - A lot of small cleaning and normalization tasks: removing headers,
    > de-hyphenation, reconstruction of paragraphs with broken lines,
    > removing non-textual or decorative elements (such as ASCII art), ...


    These issues can be taken care of by defining a macro in MS Word. The
    flow would then be directed through MS Word itself (Word to text, text
    to XML).
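
    The last step of that flow, text to XML, might look something like
    the sketch below once the text has been exported from Word. It is
    written in Python with the standard xml.etree.ElementTree module and
    assumes that blank lines separate blocks. The rule "first block is
    the title, the rest are paragraphs" and the <document> root are
    placeholders for the example, not a real labelling method.

```python
import xml.etree.ElementTree as ET

def text_to_xml(text):
    """Wrap blank-line-separated blocks of plain text in XML elements."""
    # Collapse the line breaks inside each block into single spaces.
    blocks = [" ".join(b.split()) for b in text.split("\n\n") if b.strip()]
    root = ET.Element("document")  # hypothetical root element name
    for i, block in enumerate(blocks):
        # Placeholder rule: a real tool would decide the tag per block.
        tag = "title" if i == 0 else "paragraph"
        ET.SubElement(root, tag).text = block
    # ElementTree escapes &, < and > in the text for us.
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(text_to_xml("My Title\n\nFirst line\nsecond line.\n\nTom & Jerry"))
```

    Building the tree with a library rather than concatenating strings
    keeps the output well-formed even when the text contains markup
    characters.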

    > I think that legacy document conversion may be a need for many
    > businesses....


    The most likely candidates (in my view) would be publications that
    place heavy emphasis on typography (those may be typography-oriented
    too), or those that prefer keeping their text in undefined/spontaneous
    structures. Such magazines may not be archiving/publishing their
    issues for recall/research purposes.

    Thanking you,

    Manu Stanley

    A journey of a thousand miles must begin with a single step.
     
    , Dec 29, 2005
    #4