need help-how to parse references

Discussion in 'Perl Misc' started by susan, Jul 25, 2003.

  1. susan

    susan Guest

    Hello everyone:

    I am a Perl beginner. I am practicing to parse a list of different
    references. The list looks like the reference list that follows a
    paper. Every reference has a different number of authors, and most
    references are either books or journal articles. I would like to separate each
    field, for example, the result I assume looks like:
    author article name journal name or book name volume#...year
    Alison Balter Access 2000 development 1999

    I feel it is hard to find a regular expression to separate them. Does
    anyone know where I can find more information?

    Thanks.

    Susan
    susan, Jul 25, 2003
    #1

  2. susan <> wrote:


    > I am practicing to parse a list of different
    > references.



    If you show us several good examples of your data, we would
    probably be able to help you.

    But you didn't, so we can't.

    Have you seen the Posting Guidelines that are posted here frequently?


    > for example, the result I assume looks like:

    ^^^^^^^^^^
    ^^^^^^^^^^
    > author article name journal name or book name volume#...year
    > Alison Balter Access 2000 development 1999



    So that is the output you want from your program?

    What does the input look like?

    We cannot parse data that we know nothing about.

    If that _is_ meant to be your input, then why must you "assume"
    what it looks like?

    We must know the input with great precision if we are to devise
    a way to process it. "Assuming" what the input looks like will
    not result in an answer that is useable in real life.


    > I feel it is hard to find a regular expression to separate them.



    Maybe you do not need a regular expression to separate them.

    Maybe you could use some other approach...


    > Does
    > anyone know where I can find more information?



    ... but without knowing what you have, and how you want to
    transform it, we cannot advise one way or the other.


    Your post does not contain the information we need to answer your question.

    Show us some example input. (one record is not good enough)

    Show us some desired output (for that same data).

    Tell (and show) us anything you know about the format of the input data:

    Can fields be "missing" or "empty"? How can you tell when they are?
    Do the fields always line up in columns?
    Is there some separator between each column?

    If you can do something like that, then we would have a really
    good chance of being able to help you with your problem.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jul 25, 2003
    #2

  3. susan

    susan Guest

    Hello friends,

    I am sorry I didn't provide enough information about the input. Here
    is the example of my text file for the references:

    REFERENCES
    Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
    Cell,2nd edition, Garland Publishing, New York,1989.
    Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
    Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
    Sci.Am.243, 100-125(1980).
    Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
    S.,Fluorescence Quen-
    ching: A tool for Single-Molecule Protein-Folding Study,
    Natl.Acad.Sci.19,14,41-
    64(2000).

    I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
    { print "$Line\n"}
    This will keep all the references having a single line. But I don't
    know how to tell the computer to consider from "Alberts..." to "1989."
    is only one citation. Furthermore, I want to separate the
    information; the output should look like:

    1st author 2nd author Article_Name Journal_or_BookName
    Alberts,B Bray,D Molecular Biology of the Cell
    Van Holde Chromatic
    Bauer,W.R Crick,F.H.C Supercoiled DNA Sci.Am.

    I plan to parse the 1st author, article name and journal name first
    since they provide the basic information. The final goal is to parse
    all the information.

    Thanks for your advice.

    Susan


    susan, Jul 26, 2003
    #3
  4. Sam Holden

    Sam Holden Guest

    On 25 Jul 2003 17:36:27 -0700, susan <> wrote:
    > Hello friends,
    >
    > I am sorry I didn't provide enough information about the input. Here
    > is the example of my text file for the references:
    >
    > REFERENCES
    > Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
    > Cell,2nd edition, Garland Publishing, New York,1989.
    > Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
    > Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
    > Sci.Am.243, 100-125(1980).
    > Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
    > S.,Fluorescence Quen-
    > ching: A tool for Single-Molecule Protein-Folding Study,
    > Natl.Acad.Sci.19,14,41-
    > 64(2000).
    >
    > I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
    > { print "$Line\n"}
    > This will keep all the references having a single line. But I don't
    > know how to tell the computer to consider from "Alberts..." to "1989."
    > is only one citation. Furthermore, I want to separate the
    > information; the output should look like:


    The output part should be easy when compared to extracting the data.

    For extracting the data I suspect you may be out of luck for a purely
    automated system - the data is designed for humans and even then there
    are probably cases that are ambiguous (for humans, let alone machines).

    This is the sort of problem for which the "human in the loop" approach tends
    to be best. Parse as best you can, hopefully give a "score" to the parse and
    let a human check the results.

    References have the nice property of being cited in multiple places, and
    there are also indexes like CiteSeer. So if you parse out something you
    have found before, it is more likely to be correct, and if a CiteSeer
    search for your parsed result is successful you probably got it right too.
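
A minimal sketch of the "have I found this before?" check, assuming each parsed reference has already been reduced to a first author, a year and a title. The names `reference_key` and `times_seen_before` are hypothetical, and the key is only a crude normalization (case, spacing and punctuation are ignored); it does not query CiteSeer or any other index.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;    # normalized key => how many times it has been parsed so far

# Build a comparison key that ignores case, spacing and punctuation.
# (Hypothetical helper: the choice of fields is an assumption.)
sub reference_key {
    my ($first_author, $year, $title) = @_;
    my $key = lc( join '|', $first_author, $year, $title );
    $key =~ s/[^a-z0-9|]+//g;
    return $key;
}

# Return how many earlier parses produced the same key, then record this one.
sub times_seen_before {
    my $key = reference_key(@_);
    return $seen{$key}++;
}

for my $parsed (
    [ 'Alberts,B',   1989, 'Molecular Biology of the Cell' ],
    [ 'ALBERTS, B.', 1989, 'Molecular biology of the cell' ],
) {
    my $n = times_seen_before(@$parsed);
    print "$parsed->[0]: seen $n time(s) before\n";
}
```

A parse whose key has been seen before could be given a higher score before it goes to the human checker; a key that never repeats is not necessarily wrong, only unconfirmed.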

    Authors have a reasonably consistent format (Lastname, Initials,). Publishers,
    journals, proceedings and the like can be covered by enumerating the
    known ones (which should cover a large majority of possibilities). And a year
    will usually end the reference. So it should be easy to get something
    which works on the vast majority of references (after all, nothing you can
    do will make the system work on an incorrect reference - and they exist...)
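
The heuristics above can be sketched roughly as follows. This is only an illustration under stated assumptions: authors are taken to be a single-word last name, a comma, then initials; the year is taken to be the last four digits (optionally in parentheses) before the final dot; and the "score" is just a count of which of those two pieces were found. `parse_reference` and the hash keys are hypothetical names, and references that break the assumptions (such as "Van Holde." or "T.m") simply end up with fewer fields and a lower score for a human to check.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One author: optional "and", a single-word last name, a comma, then initials.
my $author = qr/
    (?:and\s+)?                    # "and" before the final author
    ([A-Z][A-Za-z'-]+)             # last name (single word only)
    ,\s*
    ((?:[A-Z]\.\s?)*[A-Z]\.?)      # initials: "B.", "W. R.", "F.H.C"
    ,\s*
/x;

sub parse_reference {
    my ($rec) = @_;
    $rec =~ s/\s+/ /g;             # join wrapped lines into one line
    $rec =~ s/^ | $//g;            # trim the ends

    # Peel authors off the front for as long as the pattern keeps matching.
    my @authors;
    while ( $rec =~ /\G$author/gc ) {
        my ( $last, $init ) = ( $1, $2 );
        $init =~ s/\s+//g;         # "W. R." -> "W.R."
        $init =~ s/\.$//;          # "W.R."  -> "W.R"
        push @authors, "$last,$init";
    }
    my $rest = substr( $rec, pos($rec) || 0 );

    # A year, with or without parentheses, just before the final dot.
    my ($year) = $rest =~ /\(?(\d{4})\)?\.$/;

    # Crude score: one point for finding authors, one for finding a year.
    my $score = 0;
    $score++ if @authors;
    $score++ if defined $year;

    return { authors => \@authors, year => $year, rest => $rest, score => $score };
}

my $ref = parse_reference(
    "Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the\n"
  . "Cell,2nd edition, Garland Publishing, New York,1989." );
print "authors: ", join( ' | ', @{ $ref->{authors} } ), "\n";
print "year:    ", ( defined $ref->{year} ? $ref->{year} : '?' ), "\n";
print "score:   $ref->{score}\n";
```

Whatever is left in `rest` (title, journal or publisher, pages) is not split further here; that is where a list of known journals and publishers would come in.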

    As an aside:

    I'm amazed that academia hasn't worked out an ID system with publishers. Page
    numbers suck, and I've seen at least one great study of incorrect references
    spreading through a population. That study interpreted it as a symptom of people
    giving references they haven't actually read - I interpret it as people copying
    the reference data for a paper they have read from another paper (I've done that
    more than once). The actual proceedings, etc. have ISBNs. Giving each paper an
    ID and then requiring that references have [ISBN.ID] after the human readable
    text would make life *so* much easier.

    [snip TOFU - please don't do that]

    --
    Sam Holden
    Sam Holden, Jul 26, 2003
    #4
  5. susan <> wrote:


    > Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
    > Cell,2nd edition, Garland Publishing, New York,1989.



    Ends with 4 digits and a dot.


    > Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
    > Sci.Am.243, 100-125(1980).



    Ends with open paren, 4 digits, close paren and a dot.


    > I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
    > { print "$Line\n"}
    > This will keep all the references having a single line. But I don't
    > know how to tell the computer to consider from "Alberts..." to "1989."



    Do you know how to tell _us_ how to unambiguously determine the
    end of a record?

    We need to know what must be done before we can write code
    for you that will do it.


    > is only one citation.



    If this description fits your data, then you can separate out
    the records easily enough:

    Every record ends with either
    5 chars: 4 digits and a dot
    or
    7 chars: open paren, 4 digits, close paren and a dot


    ----------------------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;

    $_ = '
    Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
    Cell,2nd edition, Garland Publishing, New York,1989.
    Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
    Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
    Sci.Am.243, 100-125(1980).
    Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
    S.,Fluorescence Quen-
    ching: A tool for Single-Molecule Protein-Folding Study,
    Natl.Acad.Sci.19,14,41-
    64(2000).
    ';


    #while ( /^([A-Z].*?(\d{4}|\(\d{4}\))\.)$/gmsx ) {

    while ( /^( # start of line, start of memory
    [A-Z].*? # starts with upper case letter
    ( \d{4} | \(\d{4}\) ) # 4 digits with or without parens
    \. # dot
    )$ # end of memory, end of line
    /gmsx ) { # gym sox (gimsox), according to Damian Conway

    print "$1\n------\n";
    }
    ----------------------------------------
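
Each record captured by the loop above still contains the line breaks of the original file. A minimal sketch of tidying one record before any further processing, assuming that a hyphen at a line end between two lowercase letters is a word break ("Quen-" / "ching") and that a hyphen followed by a digit on the next line is a page range ("41-" / "64"). `tidy_record` is a hypothetical helper name, and a genuinely hyphenated compound that happens to wrap at its hyphen would lose the hyphen under this rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub tidy_record {
    my ($rec) = @_;
    $rec =~ s/(?<=[a-z])-\s*\n\s*(?=[a-z])//g;   # "Quen-\nching" -> "Quenching"
    $rec =~ s/-\s*\n\s*(?=\d)/-/g;               # "41-\n64"      -> "41-64"
    $rec =~ s/\s*\n\s*/ /g;                      # other line breaks -> one space
    $rec =~ s/^\s+|\s+$//g;                      # trim the ends
    return $rec;
}

my $record = "Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,\n"
           . "S.,Fluorescence Quen-\n"
           . "ching: A tool for Single-Molecule Protein-Folding Study,\n"
           . "Natl.Acad.Sci.19,14,41-\n"
           . "64(2000).\n";

print tidy_record($record), "\n";
```

Inside the loop this would be called as `tidy_record($1)` before printing or splitting the record.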


    > Furthermore, I want to separate the
    > information,



    You're on your own with that one.

    It is more an Artificial Intelligence question than a Perl question.

    The info is already hamburger. You cannot make steak out of it. :-(


    > (Tad McClellan) wrote in message news:<>...



    [ snip a bit of TOFU ]


    >> Have you seen the Posting Guidelines that are posted here frequently?



    Have you done that yet?

    Please do. Thanks.



    [ snip some more unlovely TOFU ]

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jul 26, 2003
    #5
