need help-how to parse references

susan

Hello everyone:

I am a Perl beginner. I am practicing to parse a list of different
references. The list looks like the references that follow a paper. In
the list, every reference has a different number of authors. Most
references are either books or journals. I would like to separate each
field, for example, the result I assume looks like:
author article name journal name or book name volume#...year
Alison Balter Access 2000 development 1999

I feel it is hard to find a regular expression to separate them. Can
anyone advise me where I can find more information?

Thanks.

Susan
 
Tad McClellan

> I am practicing to parse a list of different
> references.


If you show us several good examples of your data, we would
probably be able to help you.

But you didn't, so we can't.

Have you seen the Posting Guidelines that are posted here frequently?

> for example, the result I assume looks like:
                            ^^^^^^
> author article name journal name or book name volume#...year
> Alison Balter Access 2000 development 1999


So that is the output you want from your program?

What does the input look like?

We cannot parse data that we know nothing about.

If that _is_ meant to be your input, then why must you "assume"
what it looks like?

We must know the input with great precision if we are to devise
a way to process it. "Assuming" what the input looks like will
not result in an answer that is useable in real life.

> I feel it is hard to find a regular expression to separate them.


Maybe you do not need a regular expression to separate them.

Maybe you could use some other approach...

> Can
> anyone advise me where I can find more information?


.... but without knowing what you have, and how you want to
transform it, we cannot advise one way or the other.


Your post does not contain the information we need to answer your question.

Show us some example input. (one record is not good enough)

Show us some desired output (for that same data).

Tell (and show) us anything you know about the format of the input data:

Can fields be "missing" or "empty"? How can you tell when they are?
Do the fields always line up in columns?
Is there some separator between each column?

If you can do something like that, then we would have a really
good chance of being able to help you with your problem.
 
susan

Hello friends,

I am sorry I didn't provide enough information about the input. Here
is the example of my text file for the references:

REFERENCES
Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
Cell,2nd edition, Garland Publishing, New York,1989.
Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
Sci.Am.243, 100-125(1980).
Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
S.,Fluorescence Quen-
ching: A tool for Single-Molecule Protein-Folding Study,
Natl.Acad.Sci.19,14,41-
64(2000).

I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
{ print "$Line\n"}
This will keep all the references having a single line. But I don't
know how to tell the computer to consider from "Alberts..." to "1989."
is only one citation. Furthermore, I want to separate the
information, output should look like:

1st author 2nd author Article_Name Journal_or_BookName
Alberts,B Bray,D Molecular Biology of the Cell
Van Holde Chromatic
Bauer,W.R Crick,F.H.C Supercoiled DNA Sci.Am.

I plan to parse the 1st author, article name and journal name first
since they provide basic information. The final goal is to parse
all the information.

Thanks for your advice.

Susan
 
Sam Holden

> This will keep all the references having a single line. But I don't
> know how to tell the computer to consider from "Alberts..." to "1989."
> is only one citation. Furthermore, I want to separate the
> information, output should look like:

The output part should be easy when compared to extracting the data.

For extracting the data I suspect you may be out of luck for a purely
automated system - the data is designed for humans and even then there
are probably cases that are ambiguous (for humans, let alone machines).

This is the sort of problem for which the "human in the loop" approach tends
to be best. Parse as best you can, hopefully give a "score" to the parse and
let a human check the results.
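
A minimal sketch of that "score and let a human check" idea, in Perl. Everything here is an assumption made for illustration: the field names, the one-point-per-field scoring, and the review threshold of 3 are hypothetical, and the two sample parses are hand-written stand-ins for whatever a real parser would produce.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scoring: one point per field the parser managed to fill.
# The field names and the threshold are illustrative assumptions.
my @FIELDS    = qw(authors title source year);
my $THRESHOLD = 3;

# Count how many of the expected fields are defined and non-empty.
sub parse_score {
    my ($parsed) = @_;
    return scalar grep { defined $parsed->{$_} && length $parsed->{$_} } @FIELDS;
}

# A parse scoring below the threshold goes to a human.
sub needs_review {
    my ($parsed) = @_;
    return parse_score($parsed) < $THRESHOLD ? 1 : 0;
}

# Hand-written example parses standing in for real parser output.
my @parses = (
    { authors => 'Van Holde', title => 'Chromatic',
      source  => 'Springer-Verlag', year => 1989 },
    { authors => 'Bauer, W. R.', title => undef, source => undef, year => 1980 },
);

for my $p (@parses) {
    printf "%-14s score=%d %s\n", $p->{authors}, parse_score($p),
        needs_review($p) ? 'CHECK BY HAND' : 'ok';
}
```

A count of filled fields is about the crudest score possible. It only says that something was extracted, not that it is right, which is why the low scorers still go to a person.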

References have the nice property of being referenced in multiple places, and
there are also things like CiteSeer, so if you find something you've found before
it's more likely to be correct, and if a CiteSeer search for your parsed
result is successful you probably got it right too.

Authors have a reasonably consistent format (Lastname, Initials,); publishers,
journals, proceedings and the like can be covered by enumerating the
known ones (which should cover a large majority of possibilities). And a year
reference will usually end the reference. So it should be easy to get something
which works on the vast majority of references (after all, nothing you can
do will make the system work on an incorrect reference - and they exist...)
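
Those three heuristics can be sketched roughly as follows. This is only an illustration under stated assumptions: `guess_fields` is a hypothetical helper, the list of known sources holds just the names that appear in the sample data, and the author pattern expects a clean "Lastname, I.," form, so it will stop early on entries such as "Van Holde.," or "Crick, F.H.C,".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative list only: a real one would enumerate many more names.
my @KNOWN_SOURCES = ('Garland Publishing', 'Springer-Verlag', 'Sci.Am.');

# guess_fields() is a hypothetical helper, not part of any module.
sub guess_fields {
    my ($ref) = @_;
    my %f = ( authors => [] );

    # Authors: a leading run of "Lastname, Initials," items,
    # with an optional "and" before the last one.
    while ( $ref =~ /\G(?:and\s+)?([A-Z][a-z]+),\s*((?:[A-Z]\.\s?)+),\s*/gc ) {
        push @{ $f{authors} }, "$1, $2";
    }

    # Source: the first known publisher or journal found anywhere in the text.
    for my $known (@KNOWN_SOURCES) {
        if ( index( $ref, $known ) >= 0 ) {
            $f{source} = $known;
            last;
        }
    }

    # Year: 4 digits, with or without parens, just before the final dot.
    ( $f{year} ) = $ref =~ /\(?(\d{4})\)?\.\s*$/;

    return \%f;
}

my $f = guess_fields(
    'Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the '
  . 'Cell,2nd edition, Garland Publishing, New York,1989.' );

print "authors: ", join( '; ', @{ $f->{authors} } ), "\n";
print "source:  $f->{source}\n";
print "year:    $f->{year}\n";
```

Whatever sits between the last author and the known source is then a candidate for the title, but deciding where the title really ends is exactly the part that needs the human check.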

As an aside:

I'm amazed that academia hasn't worked out an ID system with publishers. Page
numbers suck, and I've seen at least one great study of incorrect references
spreading through a population. That study interpreted it as a symptom of people
giving references they haven't actually read; I interpret it as copying
the reference data of a read paper from another paper (I've done that
more than once). The actual proceedings, etc. have ISBNs. Giving each paper an
ID and then requiring that references have [ISBN.ID] after the human readable
text would make life *so* much easier.

[snip TOFU - please don't do that]
 
Tad McClellan

> Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
> Cell,2nd edition, Garland Publishing, New York,1989.


Ends with 4 digits and a dot.

> Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
> Sci.Am.243, 100-125(1980).


Ends with open paren, 4 digits, close paren and a dot.

> I wrote: if (($Line =~/^[A-Z](.*)\d{4}(\))*\.$/)and ($Line =~/(\.,)/))
> { print "$Line\n"}
> This will keep all the references having a single line. But I don't
> know how to tell the computer to consider from "Alberts..." to "1989."


Do you know how to tell _us_ how to unambiguously determine the
end of a record?

We need to know what must be done before we can write code
for you that will do it.

> is only one citation.


If this description fits your data, then you can separate out
the records easily enough:

Every record ends with either
5 chars: 4 digits and a dot
or
7 chars: open paren, 4 digits, close paren and a dot


----------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

$_ = '
Alberts, B., Bray, D., and Lewis, J., Molecular Biology of the
Cell,2nd edition, Garland Publishing, New York,1989.
Van Holde.,Chromatic, Springer-Verlag, New York, 1989.
Bauer, W. R., Crick, F.H.C, and White, J. H., Supercoiled DNA,
Sci.Am.243, 100-125(1980).
Zhuang,X.,Ha,T.,Kim,H.D.,Cartner,T.m, Lebeit,S., and Chi,
S.,Fluorescence Quen-
ching: A tool for Single-Molecule Protein-Folding Study,
Natl.Acad.Sci.19,14,41-
64(2000).
';


#while ( /^([A-Z].*?(\d{4}|\(\d{4}\))\.)$/gmsx ) {

while ( /^(                      # start of line, start of memory
          [A-Z].*?               # starts with upper case letter
          ( \d{4} | \(\d{4}\) )  # 4 digits with or without parens
          \.                     # dot
        )$                       # end of memory, end of line
       /gmsx ) {                 # gym sox (gimsox), according to Damian Conway

    print "$1\n------\n";
}
----------------------------------------
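
Each record matched this way still contains the line breaks of the input, including words split by a hyphen at the end of a line ("Quen-" / "ching"). A minimal sketch of one way to flatten a record into a single line follows. `unwrap_record` is a hypothetical helper name, and the rule it applies is an assumption about this data, not something the input guarantees: a trailing hyphen between a letter and a lowercase letter is treated as a split word, and any other trailing hyphen (such as the page range "41-" / "64") is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# unwrap_record() is a hypothetical helper: it joins the physical lines
# of one record into a single line.
sub unwrap_record {
    my ($record) = @_;
    $record =~ s/(?<=[A-Za-z])-\n(?=[a-z])//g;  # "Quen-\nching" -> "Quenching"
    $record =~ s/-\n/-/g;                       # "41-\n64"      -> "41-64"
    $record =~ s/\s*\n\s*/ /g;                  # other line breaks -> one space
    return $record;
}

my $record = "S.,Fluorescence Quen-\n"
           . "ching: A tool for Single-Molecule Protein-Folding Study,\n"
           . "Natl.Acad.Sci.19,14,41-\n"
           . "64(2000).";

print unwrap_record($record), "\n";
```

Calling it on `$1` inside the loop, before any attempt to split the record into fields, would keep later patterns from having to cope with embedded newlines.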

> Furthermore, I want to separate the
> information,


You're on your own with that one.

It is more an Artificial Intelligence question than a Perl question.

The info is already hamburger. You cannot make steak out of it. :-(

(e-mail address removed) (Tad McClellan) wrote in message news:<[email protected]>...


[ snip a bit of TOFU ]



Have you done that yet?

Please do. Thanks.



[ snip some more unlovely TOFU ]
 
