Suggestions for how to approach this problem?

John Salerno

I figured I might give myself a little project to make my life at work
easier, so here's what I want to do:

I have a large list of publication citations that are numbered. The
numbers are simply typed in with the rest of the text. What I want to do
is remove the numbers and then put bullets instead. Now, this alone
would be easy enough, with a little Python and a little work by hand,
but the real issue is that because of the way these citations were
typed, there are often line breaks at the end of each line -- in other
words, the person didn't just let the line flow to the next line, they
manually pressed Enter. So inserting bullets at this point would put a
bullet at each line break.

So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it. So I'm
hoping I could get an idea or two for approaching this. I figure regular
expressions will be needed, and maybe it would be good to remove the
line breaks first and *not* remove a line break that comes before the
numbers (because that would be the proper place for one), and then
finally remove the numbers.

Thanks.
 
John Salerno

John Salerno wrote:

typed, there are often line breaks at the end of each line

Also, there are sometimes tabs used to indent the subsequent lines of
citation, but I assume with that I can just replace the tab with a space.
 
Marc 'BlackJack' Rintsch

I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the desired output.

Ciao,
Marc 'BlackJack' Rintsch
 
John Salerno

Marc said:
I think I have a vague idea of what the input looks like, but it would be
helpful if you showed some example input and the desired output.

Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the
colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

As you can see, any single citation is broken over several lines as a
result of a line break. I want it to look like this:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects
of the inhibitors of DNA synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year
period. Annals Surg. 166:947-955.

Now, since this is pasted, it might not even look good to you. But in
the second example, the numbers are meant to be bullets and so the
indentation would happen automatically (in Word). But for now they are
just typed.
 
Necmettin Begiter

Also, there are sometimes tabs used to indent the subsequent lines of
citation, but I assume with that I can just replace the tab with a space.

Is this what the text looks like?

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers
there is either a newline or a tab), the logic might be this simple:

Get the numbers at the beginning of the line. Check for \n and \t after the
number; if either exists, remove it or replace it with a space or
whatever you prefer, and there you have it. Also, how are the records
separated? By empty lines? If so, \n\n is an empty line in a string, like
this:
"""
some text here\n
\n
some other text here\n
"""
 
Dave Hansen

Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.

Questions:

1) Do the citation numbers always begin in column 1?

2) Are the citation numbers always followed by a period and then at
least one whitespace character?

If so, I'd probably use a regular expression like ^[0-9]+\.[ \t] to
find the beginning of each cite. Then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single character, separating
each cite with a newline.
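
As a rough sketch of that idea, assuming the whole file fits in memory as one string: the pattern below is the one above with a + added so that one or two spaces (or tabs) after the period both match, and the whitespace collapsing is done with str.split and str.join rather than an explicit state machine. The names CITE_START and split_citations are made up for illustration.

```python
import re

# Start of a citation: digits in column 1, a period, then spaces or tabs.
CITE_START = re.compile(r'^[0-9]+\.[ \t]+', re.MULTILINE)

def split_citations(text):
    """Return one whitespace-normalised string per citation."""
    chunks = CITE_START.split(text)
    # chunks[0] is whatever precedes the first number (usually empty).
    # " ".join(chunk.split()) turns every run of spaces, tabs and
    # newlines inside a citation into a single space.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

sample = (
    "1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n"
    "irradiated\n"
    "bacteriophage T2. J. Bacteriol. 87:1330-1338.\n"
    "2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n"
    "factor. Lancet 2:1138.\n"
)
citations = split_citations(sample)
print("\n".join(citations))   # one citation per line, numbers removed
```

One caveat: a wrapped line that happens to begin with a number and a period would be treated as the start of a new citation, so the result is worth checking by eye.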

Final formatting can be done with paragraph styles in Word.

HTH,
-=Dave
 
James Stroud

If you can count on the person not skipping any numbers in the
citations, you can take an "AI" approach to weed out the rare case
where a number followed by a period starts a line in the middle of a
citation. This is not failsafe: say you were on citation 33, it
mentioned chapter 34, and that 34 happened to start a new line. But
then again, even a human would take a little time to figure that one
out, and probably wouldn't be 100% accurate either. I'm sure there is
an AI word for the type of parser that could parse something like this
unambiguously, and I'm sure that it has been proven to be impossible
to create:

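import re

# `lines` is the list of text lines from the citation file.
records = []
record = None
counter = 1    # the number the next citation is expected to carry
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
    m = regex.search(aline)
    if m is not None:
        # Keep aline intact: if the number is not the expected one,
        # the whole line is a continuation and must be kept as-is.
        recnum, rest = m.groups()
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [rest.strip()]
            counter += 1
            continue
    record.append(aline.strip())

if record is not None:
    records.append(record)

# Join the pieces of each citation with single spaces.
records = [" ".join(r) for r in records]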


py> import re
py> records = []
py> record = None
py> counter = 1
py> regex = re.compile(r'^(\d+)\. (.*)')
py> for aline in lines:
...     m = regex.search(aline)
...     if m is not None:
...         recnum, rest = m.groups()
...         if int(recnum) == counter:
...             if record is not None:
...                 records.append(record)
...             record = [rest.strip()]
...             counter += 1
...             continue
...     record.append(aline.strip())
...
py> if record is not None:
...     records.append(record)
...
py> records = [" ".join(r) for r in records]
py> records

['Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.',
'Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
factor. Lancet 2:1138.',
'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
resistance factors in Enterobacteriaceae. 34. The specific effects of
the inhibitors of DNA synthesis on the transfer of R factor and F
factor. Med. Biol. (Tokyo) 73:79-83.',
'Levy, S.B. (1967) Blood safari into Kenya. The New Physician
16:50-54.',
'Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
diverticular disease of the colon: Evaluation of an eleven-year period.
Annals Surg. 166:947-955.']


James
 
John Salerno

Necmettin said:
Is this what the text looks like?

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers
there is either a newline or a tab), the logic might be this simple:

They all seem to be a little different. One consistency is that each
number is followed by two spaces. There is nothing separating each
reference except a single newline, which I want to preserve. But within
each reference there might be a combination of spaces, tabs, or newlines.
 
John Salerno

Dave said:
Questions:

1) Do the citation numbers always begin in column 1?

Yes, that's one consistency at least. :)

2) Are the citation numbers always followed by a period and then at
least one whitespace character?

Yes, it seems to be either one or two whitespace characters.

If so, I'd probably use a regular expression like ^[0-9]+\.[ \t] to
find the beginning of each cite. Then I would output each cite
through a state machine that would reduce consecutive whitespace
characters (space, tab, newline) into a single character, separating
each cite with a newline.

Interesting idea! I'm not sure what a "state machine" is, but it sounds
like you are suggesting that I more or less separate each reference,
process it, and then rewrite it to a new file in the cleaner format?
That might work pretty well.
 
John Salerno

James said:
If you can count on the person not skipping any numbers in the
citations, you can take an "AI" approach to hopefully weed out the rare
circumstance that a number followed by a period starts a line in the
middle of the citation.

I don't think any numbers are skipped, but there are some cases where a
number is followed by a period within a citation. But this might not
matter since each reference number begins at the start of the line, so I
could use the RE to start at the beginning.
 
John Salerno

John said:
So I need to remove the line breaks too, but of course not *all* of them
because each reference still needs a line break between it.

After doing a bit of search and replace for tabs with my text editor, I
think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a
citation (and replace them with a single space). That is, those that are
not followed by the start of a new numbered citation. This seems to
involve a look-ahead RE, but I'm not sure how to write those. This is
what I came up with:


\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the +
should be inside it or not!)
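
A note on the look-ahead, offered as a sketch rather than something tested on the full file: (?=...) is a positive look-ahead, so \n(?=(\d)+) matches the newlines that are followed by a digit, which are the ones to keep. The negative form is (?!...). The parentheses around \d are not needed, since \d+ already means "one or more digits". The function name below is hypothetical, and the pattern assumes every citation starts with digits, a period and at least one whitespace character.

```python
import re

def join_wrapped_lines(text):
    # Replace a newline with a space unless the next line starts with
    # digits, a period and whitespace (i.e. looks like a new citation).
    return re.sub(r'\n(?!\d+\.\s)', ' ', text)

sample = (
    "1.  Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n"
    "irradiated\n"
    "bacteriophage T2. J. Bacteriol. 87:1330-1338.\n"
    "2.  Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n"
    "factor. Lancet 2:1138."
)
joined = join_wrapped_lines(sample)
print(joined)
```

Two things to watch for: a newline at the very end of the text is also turned into a space, and a wrapped line inside a citation that happens to begin with something like "34. " would keep its line break.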
 
James Stroud

I included code in my previous post that will parse the entire bib,
making use of the numbering and eliminating the most probable, but still
fairly rare, potential ambiguity. You might want to check out that code,
as my testing it showed that it worked with your example.

James
 
John Salerno

James said:
I included code in my previous post that will parse the entire bib,
making use of the numbering and eliminating the most probable, but still
fairly rare, potential ambiguity. You might want to check out that code,
as my testing it showed that it worked with your example.

Thanks. It looked a little involved so I hadn't started to work through
it yet, but I'll do that now before I actually try to write something
from scratch. :)
 
John Salerno

James said:
import re
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
    m = regex.search(aline)
    if m is not None:
        # Keep aline intact in case this is not the expected number.
        recnum, rest = m.groups()
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [rest.strip()]
            counter += 1
            continue
    record.append(aline.strip())

if record is not None:
    records.append(record)

records = [" ".join(r) for r in records]

What do I need to do to get this to run against the text that I have? Is
'lines' meant to be a list of the lines from the original citation file?
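
From the way the loop uses it, 'lines' does look like a sequence of the file's lines, one string per line. A minimal sketch of one way to build it, assuming the citations have been saved as a plain-text file; the helper name read_lines and the filename citations.txt are placeholders, not anything from the posts above:

```python
def read_lines(path):
    """Return the file's lines as a list of strings, without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()

# Hypothetical usage ("citations.txt" is an assumed filename):
#   lines = read_lines("citations.txt")
#   ... run the loop quoted above on `lines` ...
#   print("\n".join(records))   # one citation per line
```

Iterating over the open file object directly would also work, since the loop strips each line before storing it.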
 
