parsing to XML

Discussion in 'Perl Misc' started by steeve_dun@SoftHome.net, Oct 6, 2005.

  1. Guest

    Hi everybody,
    I have a document that includes definitions.
    What I want is to parse the document and save these definitions in an
    XML document.
    Is there a simple way to do so?
    Thank you!

    Example:
    #### beginning of document ####
    \glossary{HTML} {HyperText Markup Language} is the lingua franca for
    publishing hypertext on the \glossary {WWW}{World Wide Web}
    #### end of document ####

    What I want is something like:

    #### beginning of xml output ####
    <glossary>
    <definition>
    <word> HTML </word>
    <meaning> HyperText Markup Language </meaning>
    </definition>
    <definition>
    <word> WWW </word>
    <meaning> World Wide Web </meaning>
    </definition>
    </glossary>
    #### end of xml output ####


    ---steeve
    , Oct 6, 2005
    #1

  2. Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > Hi everybody,
    > I have a document that includes definitions.
    > What I want is to parse the document and save these definitions in an
    > XML document.
    > Is there a simple way to do so?
    > Thank you!
    >
    > Example:
    > #### beginning of document ####
    > \glossary{HTML} {HyperText Markup Language} is the lingua franca for
    > publishing hypertext on the \glossary {WWW}{World Wide Web}
    > #### end of document ####


    Your example doesn't show the variability of the data. Examples never
    do, they only ever give a lower bound. There can always be a variant
    that doesn't happen to appear in the example.

    Can a "definition" span lines? Assuming that it can, you can't process
    the text line-wise without major trickery. You'll need all of it in
    memory. Here is a method that extracts the definitions from the text
    and puts them in a hash:

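    # A single-quoted here-doc takes backslashes literally, so one is enough.
    my $text = <<'END_TEXT';
    \glossary{HTML} {HyperText Markup Language} is the lingua franca for
    publishing hypertext on the \glossary {WWW}{World Wide Web}
    END_TEXT

    # Each match yields (word, meaning); the flat list fills the hash.
    # Literal braces are escaped; newer perls may warn about a bare "{".
    my %definition_for = $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g;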
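
    If the text lives in a file, one common way to get all of it into
    memory is to undefine $/ while reading. A minimal sketch, assuming
    Perl 5.8 or later (for the in-memory filehandle used here as a
    stand-in for a real file). The helper name slurp is hypothetical, and
    collapsing whitespace inside a multi-line definition is an added
    assumption about what the result should look like:

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle into one string.
sub slurp {
    my ($fh) = @_;
    local $/;               # no record separator: one read returns it all
    my $content = <$fh>;
    return $content;
}

# Stand-in for a real file: an in-memory filehandle over a string whose
# second argument is split across two lines.
my $source = "\\glossary{HTML}\n{HyperText Markup\nLanguage} is the lingua franca\n";
open my $fh, '<', \$source or die "cannot open in-memory file: $!";
my $text = slurp($fh);
close $fh;

# Same pattern as above; [^}]* also matches newlines, so a definition
# that spans lines is still captured.
my @pairs = $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g;

# Assumption: line breaks inside a definition are not significant.
s/\s+/ /g for @pairs;

print "$_\n" for @pairs;
```

    For a real file, replace the in-memory open with something like
    open my $fh, '<', $filename (where $filename is whatever your script
    uses).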

    Generating XML from the hash is probably a job for one of the XML modules.
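
    For output as small as the one requested, the XML can also be built
    by hand. A minimal sketch, not a replacement for a proper module such
    as XML::Writer from CPAN: the sub names xml_escape and glossary_xml
    are hypothetical, and the pairs are kept in a list instead of a hash
    on the assumption that the order of appearance should survive (hash
    keys come back in no particular order).

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;      # must run first, or it would re-escape the others
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn a flat (word, meaning, word, meaning, ...) list into XML text.
sub glossary_xml {
    my @pairs = @_;
    my $xml = "<glossary>\n";
    while (my ($word, $meaning) = splice @pairs, 0, 2) {
        $xml .= "  <definition>\n"
              . "    <word>" . xml_escape($word) . "</word>\n"
              . "    <meaning>" . xml_escape($meaning) . "</meaning>\n"
              . "  </definition>\n";
    }
    return $xml . "</glossary>\n";
}

my $text = '\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}';

# In list context, /g returns every captured (word, meaning) in order.
my @pairs = $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g;

print glossary_xml(@pairs);
```

    The escaping step matters as soon as a definition contains "&" or
    "<"; without it the output would not be well-formed.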

    Anno
    Anno Siegel, Oct 6, 2005
    #2

  3. Guest

    Thank you very much, that was very helpful !


    , Oct 7, 2005
    #3
  4. Guest

    On 6 Oct 2005 02:42:04 -0700, wrote:

    >Hi everybody,
    >I have a document that includes definitions.
    >What I want is to parse the document and save these definitions in an
    >XML document.
    >Is there a simple way to do so?
    >Thank you!
    >
    >Example:
    >#### beginning of document ####
    >\glossary{HTML} {HyperText Markup Language} is the lingua franca for
    >publishing hypertext on the \glossary {WWW}{World Wide Web}
    >#### end of document ####
    >
    >what I want is something like:
    >
    >#### beginning of xml output ####
    ><glossary>
    ><definition>
    > <word> HTML </word>
    > <meaning> HyperText Markup Language </meaning>
    ></definition>
    ><definition>
    > <word> WWW </word>
    > <meaning> World Wide Web </meaning>
    ></definition>
    ></glossary>
    >#### end of xml output ####
    >
    >
    >---steeve

    I don't understand.
    If you have a "structure" in mind, you don't show it.
    XML is purely "structure" driven:
    input "to" a structure, output "from" a structure.
    That's the definition of "simple" XML. If you want to
    get into "complex" (nested) XML (and I don't know anybody who
    does), that is beyond the scope of a question here, it seems.

    So if you have a simple XML "structure" in mind (with attributes),
    then what would that be? You have to separate populating that
    structure from the generation of "simple" XML output.
    A schema can then be generated once you know what you want to do.
    The largest software houses, including Microsoft, use simple XML
    because the XML is just a medium to transport structured data.
    There should be none of the ambiguity that "nested" HTML could have.
    You don't want to go down that road.

    Also, I don't know what you mean by "definitions" in that (HTML?)
    document. Just what is it you're trying to accomplish?
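
    The separation described above can be sketched as two independent
    steps: one sub fills a plain Perl structure, another renders it. This
    is only an illustration. The sub names collect_definitions and
    render_glossary are hypothetical, and putting the word in an
    attribute is one possible reading of "with attributes", not something
    the original poster asked for.

```perl
use strict;
use warnings;

# Step 1: populate a structure (an ordered list of hashes) from the text.
sub collect_definitions {
    my ($text) = @_;
    my @defs;
    while ($text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g) {
        push @defs, { word => $1, meaning => $2 };
    }
    return @defs;
}

# Escape text for element content or a double-quoted attribute value.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Step 2: render the structure; this sub knows nothing about the input format.
sub render_glossary {
    my @defs = @_;
    my $xml = "<glossary>\n";
    for my $d (@defs) {
        $xml .= sprintf qq{  <definition word="%s">%s</definition>\n},
            xml_escape($d->{word}), xml_escape($d->{meaning});
    }
    return $xml . "</glossary>\n";
}

my @defs = collect_definitions('\glossary{WWW}{World Wide Web}');
print render_glossary(@defs);
```

    Because the two steps only share the list of hashes, either one can
    be swapped out (a different input syntax, a different XML layout)
    without touching the other.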
    , Oct 9, 2005
    #4
