Suggestions for how to approach this problem?

Discussion in 'Python' started by John Salerno, May 8, 2007.

  1. John Salerno

    John Salerno Guest

    I figured I might give myself a little project to make my life at work
    easier, so here's what I want to do:

    I have a large list of publication citations that are numbered. The
    numbers are simply typed in with the rest of the text. What I want to do
    is remove the numbers and then put bullets instead. Now, this alone
    would be easy enough, with a little Python and a little work by hand,
    but the real issue is that because of the way these citations were
    typed, there are often line breaks at the end of each line -- in other
    words, the person didn't just let the line flow to the next line, they
    manually pressed Enter. So inserting bullets at this point would put a
    bullet at each line break.

    So I need to remove the line breaks too, but of course not *all* of them
    because each reference still needs a line break between it. So I'm
    hoping I could get an idea or two for approaching this. I figure regular
    expressions will be needed, and maybe it would be good to remove the
    line breaks first and *not* remove a line break that comes before the
    numbers (because that would be the proper place for one), and then
    finally remove the numbers.

    Thanks.
     
    John Salerno, May 8, 2007
    #1

  2. John Salerno

    John Salerno Guest

    John Salerno wrote:


    > typed, there are often line breaks at the end of each line


    Also, there are sometimes tabs used to indent the subsequent lines of
    citation, but I assume with that I can just replace the tab with a space.
     
    John Salerno, May 8, 2007
    #2

  3. In <4640ca5b$0$23392$>, John Salerno wrote:

    > I have a large list of publication citations that are numbered. The
    > numbers are simply typed in with the rest of the text. What I want to do
    > is remove the numbers and then put bullets instead. Now, this alone
    > would be easy enough, with a little Python and a little work by hand,
    > but the real issue is that because of the way these citations were
    > typed, there are often line breaks at the end of each line -- in other
    > words, the person didn't just let the line flow to the next line, they
    > manually pressed Enter. So inserting bullets at this point would put a
    > bullet at each line break.
    >
    > So I need to remove the line breaks too, but of course not *all* of them
    > because each reference still needs a line break between it. So I'm
    > hoping I could get an idea or two for approaching this. I figure regular
    > expressions will be needed, and maybe it would be good to remove the
    > line breaks first and *not* remove a line break that comes before the
    > numbers (because that would be the proper place for one), and then
    > finally remove the numbers.


    I think I have a vague idea of what the input looks like, but it would be
    helpful if you showed some example input and the wanted output.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, May 8, 2007
    #3
  4. John Salerno

    John Salerno Guest

    Marc 'BlackJack' Rintsch wrote:

    > I think I have a vague idea of what the input looks like, but it would be
    > helpful if you showed some example input and the wanted output.


    Good idea. Here's what it looks like now:

    1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
    irradiated
    bacteriophage T2. J. Bacteriol. 87:1330-1338.
    2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
    factor. Lancet 2:1138.
    3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
    resistance factors in
    Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
    synthesis on the
    transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
    4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
    5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
    diverticular disease of the
    colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

    As you can see, any single citation is broken over several lines as a
    result of a line break. I want it to look like this:

    1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
    irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
    2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
    factor. Lancet 2:1138.
    3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
    resistance factors in Enterobacteriaceae. 34. The specific effects
    of the inhibitors of DNA synthesis on the
    transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
    4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
    16:50-54.
    5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
    diverticular disease of the colon: Evaluation of an eleven-year
    period. Annals Surg. 166:947-955.

    Now, since this is pasted, it might not even look good to you. But in
    the second example, the numbers are meant to be bullets and so the
    indentation would happen automatically (in Word). But for now they are
    just typed.
     
    John Salerno, May 8, 2007
    #4
  5. On Tuesday 08 May 2007 22:23:31 John Salerno wrote:
    > John Salerno wrote:
    > > typed, there are often line breaks at the end of each line

    >
    > Also, there are sometimes tabs used to indent the subsequent lines of
    > citation, but I assume with that I can just replace the tab with a space.


    Is this what the text looks like:

    123
    some information

    124 some other information

    126(tab here)something else

    If this is the case (the numbers are at the beginning, and after the numbers
    there is either a newline or a tab), the logic might be this simple:

    Get the numbers at the beginning of the line. Check for \n and \t after the
    number; if either exists, remove it or replace it with a space or
    whatever you prefer, and there you have it. Also, how are the records
    separated? By empty lines? If so, \n\n is an empty line in a string, like
    this:
    """
    some text here\n
    \n
    some other text here\n
    """
     
    Necmettin Begiter, May 8, 2007
    #5
  6. John Salerno

    Dave Hansen Guest

    On May 8, 3:00 pm, John Salerno <> wrote:
    > Marc 'BlackJack' Rintsch wrote:
    > > I think I have a vague idea of what the input looks like, but it would be
    > > helpful if you showed some example input and the wanted output.

    >
    > Good idea. Here's what it looks like now:
    >
    > 1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
    > irradiated
    > bacteriophage T2. J. Bacteriol. 87:1330-1338.
    > 2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
    > factor. Lancet 2:1138.
    > 3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
    > resistance factors in
    > Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
    > synthesis on the
    > transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.


    Questions:

    1) Do the citation numbers always begin in column 1?

    2) Are the citation numbers always followed by a period and then at
    least one whitespace character?

    If so, I'd probably use a regular expression like ^[0-9]+\.[ \t] to
    find the beginning of each cite. Then I would output each cite
    through a state machine that would reduce consecutive whitespace
    characters (space, tab, newline) into a single character, separating
    each cite with a newline.
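
    The approach above can be sketched roughly as follows. This is only a
    minimal sketch, not a definitive implementation: the function name
    split_cites, the sample list, and the tab in the sample are assumptions
    made for illustration, and the "state machine" is reduced to a single
    piece of state (the cite currently being collected).

```python
import re

# Pattern from the post: digits, a period, then a space or tab, at line start.
CITE_START = re.compile(r"^[0-9]+\.[ \t]")


def split_cites(lines):
    """Group raw lines into cites, then squeeze whitespace inside each one."""
    cites = []
    current = None  # state: None until the first cite has started
    for line in lines:
        if CITE_START.match(line):
            # A new cite begins: store the previous one, if any.
            if current is not None:
                cites.append(current)
            current = [line]
        elif current is not None:
            # Continuation line of the cite being collected.
            current.append(line)
    if current is not None:
        cites.append(current)
    # str.split() with no arguments splits on any run of spaces, tabs, newlines.
    return [" ".join(" ".join(chunk).split()) for chunk in cites]


# Sample taken from the thread; the leading tab is an assumed example of the
# tab indentation mentioned earlier.
sample = [
    "1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n",
    "irradiated\n",
    "\tbacteriophage T2. J. Bacteriol. 87:1330-1338.\n",
    "2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n",
    "factor. Lancet 2:1138.\n",
]
cleaned = split_cites(sample)
print("\n".join(cleaned))
```

    Each element of cleaned is one cite on a single line, so writing them out
    joined by "\n" gives one paragraph per cite.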

    Final formatting can be done with paragraph styles in Word.

    HTH,
    -=Dave
     
    Dave Hansen, May 8, 2007
    #6
  7. John Salerno

    James Stroud Guest

    John Salerno wrote:
    > Marc 'BlackJack' Rintsch wrote:
    > Here's what it looks like now:
    >
    > 1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
    > irradiated
    > bacteriophage T2. J. Bacteriol. 87:1330-1338.
    > 2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
    > factor. Lancet 2:1138.
    > 3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
    > resistance factors in
    > Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA
    > synthesis on the
    > transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
    > 4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
    > 16:50-54.
    > 5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
    > diverticular disease of the
    > colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.
    >
    > As you can see, any single citation is broken over several lines as a
    > result of a line break. I want it to look like this:
    >
    > 1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
    > irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
    > 2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
    > factor. Lancet 2:1138.
    > 3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
    > resistance factors in Enterobacteriaceae. 34. The specific effects
    > of the inhibitors of DNA synthesis on the
    > transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
    > 4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician
    > 16:50-54.
    > 5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
    > diverticular disease of the colon: Evaluation of an eleven-year
    > period. Annals Surg. 166:947-955.
    >
    > Now, since this is pasted, it might not even look good to you. But in
    > the second example, the numbers are meant to be bullets and so the
    > indentation would happen automatically (in Word). But for now they are
    > just typed.


    If you can count on the person not skipping any numbers in the
    citations, you can take an "AI" approach to hopefully weed out the rare
    circumstance that a number followed by a period starts a line in the
    middle of the citation. This is not failsafe, say if you were on
    citation 33 and it was in chapter 34 and that 34 happened to start a new
    line. But, then again, even a human would take a little time to figure
    that one out--and probably wouldn't be 100% accurate either. I'm sure
    there is an AI word for the type of parser that could parse something
    like this unambiguously and I'm sure that it has been proven to be
    impossible to create:

    import re
    records = []
    record = None
    counter = 1  # number the next citation is expected to carry
    regex = re.compile(r'^(\d+)\. (.*)')
    for aline in lines:
        m = regex.search(aline)
        if m is not None:
            recnum, rest = m.groups()
            if int(recnum) == counter:
                # Expected number: close the previous record, start a new one.
                if record is not None:
                    records.append(record)
                record = [rest.strip()]
                counter += 1
                continue
        # Continuation line; an unexpected leading number stays in the text.
        record.append(aline.strip())

    if record is not None:
        records.append(record)

    records = [" ".join(r) for r in records]


    py> import re
    py> records = []
    py> record = None
    py> counter = 1
    py> regex = re.compile(r'^(\d+)\. (.*)')
    py> for aline in lines:
    ...     m = regex.search(aline)
    ...     if m is not None:
    ...         recnum, rest = m.groups()
    ...         if int(recnum) == counter:
    ...             if record is not None:
    ...                 records.append(record)
    ...             record = [rest.strip()]
    ...             counter += 1
    ...             continue
    ...     record.append(aline.strip())
    ...
    py> if record is not None:
    ...     records.append(record)
    ...
    py> records = [" ".join(r) for r in records]
    py> records

    ['Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray
    irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.',
    'Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R
    factor. Lancet 2:1138.',
    'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic
    resistance factors in Enterobacteriaceae. 34. The specific effects of
    the inhibitors of DNA synthesis on the transfer of R factor and F
    factor. Med. Biol. (Tokyo) 73:79-83.',
    'Levy, S.B. (1967) Blood safari into Kenya. The New Physician
    16:50-54.',
    'Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of
    diverticular disease of the colon: Evaluation of an eleven-year period.
    Annals Surg. 166:947-955.']


    James
     
    James Stroud, May 8, 2007
    #7
  8. John Salerno

    John Salerno Guest

    Necmettin Begiter wrote:

    > Is this what the text looks like:
    >
    > 123
    > some information
    >
    > 124 some other information
    >
    > 126(tab here)something else
    >
    > If this is the case (the numbers are at the beginning, and after the numbers
    > there is either a newline or a tab), the logic might be this simple:


    They all seem to be a little different. One consistency is that each
    number is followed by two spaces. There is nothing separating each
    reference except a single newline, which I want to preserve. But within
    each reference there might be a combination of spaces, tabs, or newlines.
     
    John Salerno, May 9, 2007
    #8
  9. John Salerno

    John Salerno Guest

    Dave Hansen wrote:

    > Questions:
    >
    > 1) Do the citation numbers always begin in column 1?


    Yes, that's one consistency at least. :)

    > 2) Are the citation numbers always followed by a period and then at
    > least one whitespace character?


    Yes, it seems to be either one or two whitespaces.

    > find the beginning of each cite. then I would output each cite
    > through a state machine that would reduce consecutive whitespace
    > characters (space, tab, newline) into a single character, separating
    > each cite with a newline.


    Interesting idea! I'm not sure what a "state machine" is, but it sounds
    like you are suggesting that I more or less separate each reference,
    process it, and then rewrite it to a new file in the cleaner format?
    That might work pretty well.
     
    John Salerno, May 9, 2007
    #9
  10. John Salerno

    John Salerno Guest

    James Stroud wrote:

    > If you can count on the person not skipping any numbers in the
    > citations, you can take an "AI" approach to hopefully weed out the rare
    > circumstance that a number followed by a period starts a line in the
    > middle of the citation.


    I don't think any numbers are skipped, but there are some cases where a
    number is followed by a period within a citation. But this might not
    matter since each reference number begins at the start of the line, so I
    could use the RE to start at the beginning.
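
    A small illustration of that point, as a sketch only: the text variable
    is an excerpt of citations 3 and 4 from earlier in the thread, and the
    pattern is an assumed variant of the one suggested above. With
    re.MULTILINE, ^ anchors at the start of every line, so a number in the
    middle of a line is ignored.

```python
import re

# Excerpt of citations 3 and 4 as posted earlier in the thread.
text = (
    "3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic\n"
    "resistance factors in Enterobacteriaceae. 34. The specific effects\n"
    "of the inhibitors of DNA synthesis on the\n"
    "transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.\n"
    "4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.\n"
)

# With re.MULTILINE, ^ matches at the start of each line, not just the string.
starts = re.findall(r"^(\d+)\.[ \t]+", text, flags=re.MULTILINE)
print(starts)  # the "34." inside citation 3 is not at a line start
```

    This does not help if an in-citation number happens to land at the start
    of a line, which is the case the counter-based code is meant to catch.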
     
    John Salerno, May 9, 2007
    #10
  11. John Salerno

    John Salerno Guest

    John Salerno wrote:

    > So I need to remove the line breaks too, but of course not *all* of them
    > because each reference still needs a line break between it.


    After doing a bit of search and replace for tabs with my text editor, I
    think I've narrowed down the problem to just this:

    I need to remove all newline characters that are not at the end of a
    citation (and replace them with a single space). That is, those that are
    not followed by the start of a new numbered citation. This seems to
    involve a look-ahead RE, but I'm not sure how to write those. This is
    what I came up with:


    \n(?=(\d)+)

    (I can never remember if I need parentheses around '\d' or if the +
    should be inside it or not!)
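
    For comparison, a hedged sketch of the "not followed by" version. The
    pattern above, \n(?=(\d)+), is a positive look-ahead, so it matches the
    newlines that *are* followed by a digit; a negative look-ahead, (?!...),
    matches the ones that are not. The parentheses around \d are not needed:
    \d+ already means "one or more digits". The \Z alternative is an added
    assumption so that the final newline of the text is left alone.

```python
import re

# Sample taken from the thread (tabs already replaced, as described above).
text = (
    "1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n"
    "irradiated\n"
    "bacteriophage T2. J. Bacteriol. 87:1330-1338.\n"
    "2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R\n"
    "factor. Lancet 2:1138.\n"
)

# Replace a newline with a space unless it is followed by
# "digits, period, whitespace" (the start of a citation) or by the end of
# the string.
joined = re.sub(r"\n(?!\d+\.\s|\Z)", " ", text)
print(joined)
```

    Like any purely pattern-based approach, this would still split a citation
    wrongly if a line inside it happened to begin with a number and a period.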
     
    John Salerno, May 9, 2007
    #11
  12. John Salerno

    James Stroud Guest

    John Salerno wrote:
    > John Salerno wrote:
    >
    >> So I need to remove the line breaks too, but of course not *all* of
    >> them because each reference still needs a line break between it.

    >
    >
    > After doing a bit of search and replace for tabs with my text editor, I
    > think I've narrowed down the problem to just this:
    >
    > I need to remove all newline characters that are not at the end of a
    > citation (and replace them with a single space). That is, those that are
    > not followed by the start of a new numbered citation. This seems to
    > involve a look-ahead RE, but I'm not sure how to write those. This is
    > what I came up with:
    >
    >
    > \n(?=(\d)+)
    >
    > (I can never remember if I need parentheses around '\d' or if the +
    > should be inside it or not!)


    I included code in my previous post that will parse the entire bib,
    making use of the numbering and eliminating the most probable, but still
    fairly rare, potential ambiguity. You might want to check out that code,
    as my testing it showed that it worked with your example.

    James
     
    James Stroud, May 9, 2007
    #12
  13. John Salerno

    John Salerno Guest

    James Stroud wrote:

    > I included code in my previous post that will parse the entire bib,
    > making use of the numbering and eliminating the most probable, but still
    > fairly rare, potential ambiguity. You might want to check out that code,
    > as my testing it showed that it worked with your example.


    Thanks. It looked a little involved so I hadn't started to work through
    it yet, but I'll do that now before I actually try to write something
    from scratch. :)
     
    John Salerno, May 10, 2007
    #13
  14. John Salerno

    John Salerno Guest

    James Stroud wrote:

    > import re
    > records = []
    > record = None
    > counter = 1
    > regex = re.compile(r'^(\d+)\. (.*)')
    > for aline in lines:
    >     m = regex.search(aline)
    >     if m is not None:
    >         recnum, rest = m.groups()
    >         if int(recnum) == counter:
    >             if record is not None:
    >                 records.append(record)
    >             record = [rest.strip()]
    >             counter += 1
    >             continue
    >     record.append(aline.strip())
    >
    > if record is not None:
    >     records.append(record)
    >
    > records = [" ".join(r) for r in records]


    What do I need to do to get this to run against the text that I have? Is
    'lines' meant to be a list of the lines from the original citation file?
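
    If that reading is right, one way to build such a list might look like
    the sketch below. The helper name load_lines and the file name
    citations.txt are hypothetical, and the temporary file exists only so
    the example can run on its own.

```python
import os
import tempfile


def load_lines(path):
    """Return the file's lines as a list of strings, newlines included."""
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()


# Throwaway demo file; "citations.txt" is a hypothetical name.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "citations.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(
        "1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray\n"
        "irradiated\n"
        "bacteriophage T2. J. Bacteriol. 87:1330-1338.\n"
    )

# 'lines' can now be fed to the loop quoted above.
lines = load_lines(demo_path)
print(len(lines))
```

    The quoted loop strips each line itself, so the trailing newlines that
    readlines() keeps should not matter.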
     
    John Salerno, May 10, 2007
    #14
