Re: There is need for Text to XML semi/automatic conversion?

Discussion in 'XML' started by Peter Flynn, Dec 23, 2005.

  1. Peter Flynn

    Peter Flynn Guest

    wrote:

    > Hi, I'm new to this newsgroup and I want to ask some questions and
    > get opinions about this subject.
    >
    > I'm developing an application focused on a very specific task: cleaning
    > and labelling text documents with user-defined structural tags (title,
    > cite, date, paragraph, itemList, ...). It performs the typical
    > pre-processing tasks needed in computational linguistics in order to
    > work with big corpora using statistical tools.


    "User-defined"? Is there a standard for corpus linguistics? Like TEI?

    > But I'm worried that this field may be too small/specific. I chose it
    > because it's a field that I know and where I have some contacts, *but*
    > I'm not sure if research departments of universities are able to spend
    > money on software, or maybe they are too used to the free/open
    > source world.


    Some of them have money, some don't. But IMHE they are well used to
    using Open Source software, and there is plenty available.

    > For this reason I'm looking for some other field where the task of
    > adding structural labels to text is needed (specifically converting
    > unstructured and format-oriented documents to structured
    > function-oriented XML documents). Maybe some area of publishing, but
    > I think that they will not be interested in "small" desktop
    > applications.


    Who is "they"?

    > Please, have any of you worked for or heard about some business with
    > this kind of need? Do you think that there is demand for legacy
    > document conversion in small business?


    Lots of us work in or close to this field. There certainly is a demand
    for this, but it's very small, especially in small businesses. It is
    currently faster and cheaper to send the whole corpus to a company in
    the Indian subcontinent or on the Pacific Rim and have it rekeyed or
    scanned into XML there. In general, companies are not interested in
    legacy documents, and there is little or no business case for them to
    be. It's different if someone is suing you over some documented event
    that happened in the distant past, but in those cases I suspect the
    companies are only too glad for the documents to remain inaccessible.

    If there was any interest in preserving them, they wouldn't have used
    WordPerfect, Lotus, or Word formats (or whatever) to store them in
    in the first place, would they? :)

    Academic research, especially literary and historical research; some
    library projects; and some publishing-oriented preservation projects
    are more likely to have a demand for this software -- but they don't
    have large sums of money to spend on it, and it is arguable that if
    they are publicly-funded they should perhaps not be spending that
    money this way. And in those fields there are a lot of people who are
    very expert in doing these conversions.

    You seem to be confused about your objective: you say "...in small
    business" but in the preceding paragraph you say that "...they will
    not be interested in 'small' desktop applications."

    > Some info about the application:
    >
    > - Importing from the main document formats (TXT, HTML, RTF, others?).


    Those are three very unlikely candidates as there is already
    software to handle them in many ways.

    Legacy obsolete binary wordprocessing and DTP formats are the hardest
    to deal with, especially when they reside on obsolete media.

    > - GUI Based for interactive labelling (active learning techniques,
    > similar to the OCR programs).


    I just posted about this the other day: see the thread "looking for a
    mentor" in c.t.x (Message-ID <> et seq.)

    > - Interactive labelling used to "train" the program by automatic
    > induction of statistical rules (based on textual, lexical,
    > typographical and structural properties of the block).


    The IR people have been trying to do this for decades.
    I may be biased in favour of markup, but I really don't see any progress
    here.
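
    To make the "training from labelled samples" idea concrete, here is a
    minimal sketch, not the original poster's method. It assumes a naive
    Bayes classifier over a handful of made-up boolean block features; the
    feature names, the sample blocks and the labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def block_features(block):
    """Hypothetical typographical/lexical features of one text block."""
    text = block.strip()
    return {
        "short": len(text.split()) <= 8,
        "all_caps": text.isupper(),
        "ends_with_period": text.endswith("."),
        "starts_with_bullet": text.startswith(("-", "*")),
    }


def train(samples):
    """Count labels and (feature, value) pairs from (text, label) samples."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for text, label in samples:
        label_counts[label] += 1
        for item in block_features(text).items():
            feature_counts[label][item] += 1
    return label_counts, feature_counts


def classify(model, text):
    """Return the most probable label under naive Bayes."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for item in block_features(text).items():
            # Laplace smoothing: each feature has two possible values.
            score += math.log((feature_counts[label][item] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labelled blocks, as an interactive session might produce.
samples = [
    ("INTRODUCTION", "title"),
    ("RESULTS AND DISCUSSION", "title"),
    ("This corpus was collected from newspaper articles over two years.", "paragraph"),
    ("We describe the cleaning steps applied before tagging the text.", "paragraph"),
    ("- remove page headers", "item"),
    ("- join broken lines", "item"),
]
model = train(samples)
print(classify(model, "CONCLUSIONS"))
```

    A toy like this only separates very clean cases; real documents are
    far messier, which is where the difficulty discussed in this thread
    lies.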

    > - After training, the labeller can be used for batch processing in a
    > fully autonomous mode.


    This is what DynaTag's batch mode does (see reference to thread above).

    > - Exporting to user-defined XML (any standard? DocBook? TEI?)


    Very, very hard to do in the first pass, because the sequence and
    structure may simply not match. Much easier if you use an interim
    markup structure, made for the job, and do a final conversion to
    the target vocabulary afterwards.
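
    The two-stage approach can be sketched roughly as follows. This is an
    illustration only: the interim tag names and the TEI-like target names
    in the mapping are assumptions, and a real second pass would also have
    to reorder and regroup elements, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from interim labels to a TEI-like target vocabulary.
INTERIM_TO_TARGET = {
    "doc": "body",
    "title": "head",
    "para": "p",
    "itemList": "list",
    "entry": "item",
}


def to_target(elem):
    """Recursively copy an interim tree, renaming tags via the mapping."""
    new = ET.Element(INTERIM_TO_TARGET.get(elem.tag, elem.tag), dict(elem.attrib))
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(to_target(child))
    return new


# First pass (not shown): the tagger emits loose interim markup.
interim = ET.fromstring(
    "<doc><title>Report</title><para>First paragraph.</para>"
    "<itemList><entry>one</entry><entry>two</entry></itemList></doc>"
)

# Second pass: convert to the target vocabulary.
result = ET.tostring(to_target(interim), encoding="unicode")
print(result)
```

    Keeping the mapping in a separate table means the same tagging pass
    can feed more than one target vocabulary.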

    > - A lot of small cleaning and normalization tasks: removing headers,
    > de-hyphenation, reconstruction of paragraphs with broken lines,
    > removing non-textual or decorative elements (such as ASCII art), ...


    Yes, very useful, and something that a lot of conversion software
    is very bad at.
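
    As a rough illustration of three of these clean-up steps (header
    removal, de-hyphenation, paragraph reconstruction), here is a naive
    sketch. The header pattern is an assumption, and the de-hyphenation
    rule will wrongly join genuine compounds such as "well-" + "known"
    split across lines, so treat it as a starting point only.

```python
import re


def clean_text(raw, header_pattern=r"^Page \d+$"):
    """Return a list of cleaned paragraphs from hard-wrapped plain text."""
    # Drop lines that look like running headers (pattern is an assumption).
    lines = [ln.rstrip() for ln in raw.splitlines()]
    lines = [ln for ln in lines if not re.match(header_pattern, ln.strip())]
    text = "\n".join(lines)
    # Naive de-hyphenation: join a word split across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single line breaks are joined.
    paragraphs = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


sample = (
    "Page 1\n"
    "Legacy docu-\n"
    "ment conversion is\n"
    "mostly manual.\n"
    "\n"
    "Page 2\n"
    "\n"
    "Second para-\n"
    "graph here.\n"
)
cleaned = clean_text(sample)
print(cleaned)
```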

    > I think that legacy document conversion may be a need for many
    > businesses, but I'm not able to find them; maybe some of you can give
    > me a clue?


    Let us know if you find any businesses who are interested. With the
    obvious caveats already mentioned, legacy documents simply are not
    interesting for businesses.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Dec 23, 2005
    #1

  2. Francesc

    Francesc Guest

    Hi Peter,

    Thanks very much for your reply; it has been hard to get any feedback,
    and your answer has been very useful. I'll try to explain myself a
    little better; I'm sorry if my grammar is confusing.


    >> "User-defined"? Is there a standard for corpus linguistics? Like TEI?


    You are right, most linguists annotate corpora with subsets of TEI,
    but I mean that the output labels can be defined freely by the user:
    the output is not locked to some specific XML vocabulary such as
    DocBook or TEI, and the mapping can be defined in a very flexible way.


    >> Some of them have money, some don't. But IMHE they are well used to
    >> using Open Source software, and there is plenty available.


    Yes, I've found "plenty" of tools to translate tags, almost all doing
    one-to-one mapping and just a few mapping from "styles" to structural
    labels; but all of them were far from the main feature: the capacity
    to be *trained*, just from sample cases, to tag specific blocks, going
    from typographical/lexical clues to structural labels.


    > function-oriented XML documents). Maybe some area of publishing, but
    > I think that they will not be interested in "small" desktop
    > applications.


    >> Who is "they"?


    Sorry, I meant the "publishing guys". What I've heard is that
    publishing companies that work with XML use big applications (full
    document-handling systems) and that these companies won't be attracted
    to, and won't trust, a small desktop application.


    >> Lots of us work in or close to this field. There certainly is a
    >> demand for this, but it's very small, especially in small businesses.


    OK, I've been thinking, and I accept that small businesses see no
    value in their old documents; it makes sense. Maybe I could focus on
    businesses related to text/documentation handling. I've learned that
    there are some small companies doing business by offering services to
    big publishing companies; specifically, one of these services seems to
    be migration of texts to XML (such as books or dictionaries). Do you
    know anything about them?


    >> currently faster and cheaper to send the whole corpus to a company in
    >> the Indian subcontinent or on the Pacific Rim and have it rekeyed or
    >> scanned into XML there. In general, companies are not interested in


    I've heard something about it, but I understood that this was used as
    a replacement for the OCR phase, from paper documents.


    >> If there was any interest in preserving them, they wouldn't have
    >> used WordPerfect, Lotus, or Word formats (or whatever) to store
    >> them in in the first place, would they? :)


    Not in a rational world, but I suspect that there are/were too many
    people trusting in Word :)


    >> library projects; and some publishing-oriented preservation projects
    >> are more likely to have a demand for this software -- but they don't
    >> have large sums of money to spend on it, and it is arguable that if


    You are right, that was my fear, although the software is not of the
    "large sums of money" kind.


    >> You seem to be confused about your objective: you say "...in
    >> small business" but in the preceding paragraph you say that
    >> "...they will not be interested in 'small' desktop applications."


    Not so much in objective as in grammar; I hope that the previous
    paragraph clears up the referent of the anaphora. :)

    I know that I'm building a powerful but "small desktop application",
    and I know that I'm focusing on "small business" because of the size
    of my company, which is a uISV (one or two people). But, yes, I'm very
    confused about which businesses are actually tagging text documents
    with XML tags.


    >> Those are three very unlikely candidates as there is already
    >> software to handle them in many ways.


    Sure? Software that handles "typical user-produced documents"? Without
    styles, and even with spaces used as tabulation and line breaks at the
    end of each line ... :)


    >> Legacy obsolete binary wordprocessing and DTP formats are the hardest
    >> to deal with, especially when they reside on obsolete media.


    Yes, but you are pointing to another business, "document format
    converters", and that is a different thing, isn't it?


    >> I just posted about this the other day: see the thread "looking for a
    >> mentor" in c.t.x (Message-ID <> et seq.)


    I've read it, but I had no luck finding the program. Maybe this
    business niche expired some years ago and I'm too late? Everybody has
    already translated their valuable documents to XML...


    >> The IR people have been trying to do this for decades.
    >> I may be biased in favour of markup, but I really don't see any progress


    Are you talking about Information Retrieval? I'm biased in favour of
    markup, but most documents in the world are not marked up; this was
    the reason that I thought (wrongly?) that it would be nice to develop
    an automatic structural tagger... :(


    >> Very, very hard to do in the first pass, because the sequence
    >> and structure may simply not match. Much easier if you use an
    >> interim markup structure, made for the job, and do a final
    >> conversion to the target vocabulary afterwards.


    Interesting, I'll think about it.

    > - A lot of small cleaning and normalization tasks: removing headers,
    >> Yes, very useful, and something that a lot of conversion software
    >> is very bad at.


    I agree.


    >> Let us know if you find any businesses who are interested.
    >> With the obvious caveats already mentioned, legacy documents
    >> simply are not interesting for businesses.


    Your message was as useful as it was sad... now I'm thinking about the
    two ways left to recycle the application:

    1) The business of XML tagging services for publishing houses: I don't
    know what tools they are using, but surely not trainable, automatic
    ones that can quickly tag hundreds/thousands of pages.

    2) The business of HTML to XML mapping: a lot of content has been
    published in HTML during the last 10 years, and surely there are
    people trying to recover these articles and this information...


    Well, as you can see I'm a bit worried; if you could elaborate a little
    more on my reply and spend a bit more of your time, I'll again be very
    grateful to you. By the way, Happy Christmas! :)

    Francesc
    Francesc, Dec 24, 2005
    #2

  3. Peter Flynn

    Peter Flynn Guest

    Francesc wrote:

    >>> "User-defined"? Is there a standard for corpus linguistics? Like
    >>> TEI?

    >
    > You are right, most linguists annotate corpora with subsets of
    > TEI, but I mean that the output labels can be defined freely by the
    > user: the output is not locked to some specific XML vocabulary such
    > as DocBook or TEI, and the mapping can be defined in a very flexible
    > way.


    Both TEI and DocBook provide attributes which can be used to do this.
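
    The post does not name the attributes, so as an assumption: in TEI the
    `type` attribute (for example on `seg`) and in DocBook the common
    `role` attribute are the usual places to carry a user-defined label.
    A minimal sketch, with namespaces omitted and a hypothetical helper
    name:

```python
import xml.etree.ElementTree as ET


def labelled_block(vocabulary, label, text):
    """Wrap text in a standard element carrying a user-defined label."""
    if vocabulary == "tei":
        # TEI: the label goes in the type attribute of a generic segment.
        elem = ET.Element("seg", {"type": label})
    elif vocabulary == "docbook":
        # DocBook: the label goes in the common role attribute.
        elem = ET.Element("para", {"role": label})
    else:
        raise ValueError(f"unknown vocabulary: {vocabulary}")
    elem.text = text
    return ET.tostring(elem, encoding="unicode")


tei_xml = labelled_block("tei", "cite", "quoted passage")
docbook_xml = labelled_block("docbook", "cite", "quoted passage")
print(tei_xml)
print(docbook_xml)
```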

    > Yes, I've found "plenty" of tools to translate tags, almost all
    > doing one-to-one mapping and just a few mapping from "styles" to
    > structural labels; but all of them were far from the main feature:
    > the capacity to be *trained*, just from sample cases, to tag specific
    > blocks, going from typographical/lexical clues to structural labels.


    Yes, to get trainability you probably have to pay.
    And in my experience, all the reliable programs require the Word input
    to be done with named styles from a template.

    >>> Who is "they"?

    >
    > Sorry, I meant the "publishing guys". What I've heard is that
    > publishing companies that work with XML use big applications (full
    > document-handling systems) and that these companies won't be
    > attracted to, and won't trust, a small desktop application.


    Many large publishers have sunk large sums of money (and the reputation
    of individuals) into big systems, and don't like seeing small, cheap
    systems overtake them technologically. But others are more flexible and
    are prepared to consider changes...but only on foot of proper business
    and financial planning.

    > OK, I've been thinking, and I accept that small businesses see no
    > value in their old documents; it makes sense. Maybe I could focus on
    > businesses related to text/documentation handling. I've learned that
    > there are some small companies doing business by offering services
    > to big publishing companies; specifically, one of these services
    > seems to be migration of texts to XML (such as books or
    > dictionaries). Do you know anything about them?


    India and countries further east have many hundreds of such companies
    producing excellent work. There are also a few still in Europe and N.
    America, but the economics of what is largely a manual operation have
    dictated a shift in this business to low-labour-cost areas.

    >>> currently faster and cheaper to send the whole corpus to a company
    >>> in the Indian subcontinent or on the Pacific Rim and have it rekeyed
    >>> or scanned into XML there. In general, companies are not interested

    >
    > I've heard something about it, but I understood that this was used
    > as a replacement for the OCR phase, from paper documents.


    That as well. But most of the companies I have dealt with will accept
    any form of input and create any form of output.

    >>> If there was any interest in preserving them, they wouldn't have
    >>> used WordPerfect, Lotus, or Word formats (or whatever) to store
    >>> them in in the first place, would they? :)

    >
    > Not in a rational world, but I suspect that there are/were too many
    > people trusting in Word :)


    That's a very polite way to put it :)

    > I know that I'm building a powerful but "small desktop application",
    > and I know that I'm focusing on "small business" because of the size
    > of my company, which is a uISV (one or two people). But, yes, I'm
    > very confused about which businesses are actually tagging text
    > documents with XML tags.


    Documentation, especially for multiple target (eg web/print) delivery.
    Repositories for mining, ditto.
    Companies tagging repository information for government services
    (eg corpora of legislation)
    Publishers (books, journals, etc)

    >>> Those are three very unlikely candidates as there is already
    >>> software to handle them in many ways.

    >
    > Sure? Software that handles "typical user-produced documents"?
    > Without styles, and even with spaces used as tabulation and line
    > breaks at the end of each line


    Not without styles (see my note above). Documents such as you describe
    are the hardest to deal with, and typically get sent out to the service
    companies mentioned above for conversion.

    >>> Legacy obsolete binary wordprocessing and DTP formats are the
    >>> hardest to deal with, especially when they reside on obsolete media.

    >
    > Yes, but you are pointing to another business, "document format
    > converters", and that is a different thing, isn't it?


    Not really. It's all conversion.

    >>> I just posted about this the other day: see the thread "looking for
    >>> a mentor" in c.t.x (Message-ID <> et
    >>> seq.)

    >
    > I've read it, but I had no luck finding the program. Maybe this
    > business niche expired some years ago and I'm too late? Everybody
    > has already translated their valuable documents to XML...


    I spoke to Red Bridge just over a year ago about DynaTag and they said
    they still supplied it then. But maybe only as part of a larger system.

    >>> The IR people have been trying to do this for decades.
    >>> I may be biased in favour of markup, but I really don't see any
    >>> progress

    >
    > Are you talking about Information Retrieval? I'm biased in favour of
    > markup, but most documents in the world are not marked up; this was
    > the reason that I thought (wrongly?) that it would be nice to develop
    > an automatic structural tagger... :(


    I agree, and yes it would be nice to develop this program, but the
    problems of deducing structure from non-explicit documents are very
    hard to overcome.

    > Your message was as useful as it was sad... now I'm thinking about
    > the two ways left to recycle the application:
    >
    > 1) The business of XML tagging services for publishing houses: I
    > don't know what tools they are using, but surely not trainable,
    > automatic ones that can quickly tag hundreds/thousands of pages.


    Some of these tools are proprietary and secret (in-house only).
    Some are based on sophisticated programmable editor interfaces
    (eg Emacs, Miles)
    Some are very expensive commercial systems.

    The exhibition area of the XML Conference a few weeks ago in Atlanta
    would have contained numerous stands from companies offering this
    kind of solution (I wasn't there this year, but this has been true
    in the past).

    > 2) The business of HTML to XML mapping: a lot of content has been
    > published in HTML during the last 10 years, and surely there are
    > people trying to recover these articles and this information...


    Yes, and a lot of people made the discovery a few years ago that their
    HTML content was virtually worthless for re-use as it stood, and have
    now changed to XHTML or some other flavour of XML. But there is still a
    lot more which is too badly done to be meaningfully converted.

    > Well, as you can see I'm a bit worried; if you could elaborate a
    > little more on my reply and spend a bit more of your time, I'll
    > again be very grateful to you. By the way, Happy Christmas! :)


    I'm just recovering from the celebrations :)

    If you have a good way to solve the problem then just do it. Many
    thousands of people will be very grateful :)

    ///Peter
    Peter Flynn, Dec 25, 2005
    #3
  4. Peter Flynn

    Guest

    Hi Peter,

    Thanks very much for your reply.

    I agree that deducing the implicit structure from typography and layout
    is not an easy task, and for this reason it was an interesting
    challenge.

    I've been working on machine learning applied to linguistic labelling
    (Part-Of-Speech tagging, Named Entity Recognition, Syntactic Parsing,
    ...) and after noticing that pre-processing was the "ugly sister" of
    the field, I started using the same machine-learning techniques to
    clean documents and apply structural labels.

    This is the reason that, although it's a hard task, I think that
    interesting results can be obtained. I was not worried about success
    in solving the task, but about the existence of a "business niche"
    for the task.

    I've been working on it for more than a year, and I'll still need
    another half to full year to have the first trainable version. I've
    invested too much time and effort to stop now (I just wanted to shift
    the target task to be nearer to some niche). At this point I just have
    to finish the program and wait for the response of the market. I don't
    need thousands, just a few hundred "happy and able-to-pay" people in
    the world... :)

    By the way, if you don't mind I'll save your blog in order to contact
    you again when I've finished my beta version; maybe you'll want to
    play with it. :)

    Thanks very much for everything,

    Francesc
    , Dec 28, 2005
    #4
