Regexp issue . . .

Discussion in 'Perl' started by MichaelC, Nov 25, 2003.

  1. MichaelC

    MichaelC Guest

    Hi all. I am having a particularly difficult time with a perl script that I
    am writing. The problem area is a place where I need to strip some newlines
    out of a file.

    My source data is text which is in paragraph form, but has line breaks
    within the paragraphs. I need to do as much processing as possible in order
    to minimise the amount of manual changes that I have to make.

    Sample text is as follows:

    "This document is intended to give you an
    overview of DG as well as highlight some of
    the features. This is a brought to your handheld using DG."
    With DG you can view and edit word processing and spreadsheet files on
    your handheld. Simple push-button synchronization of
    the handheld with the desktop will maintain the most up-to-date
    version of a file on both the desktop and handheld.

    I want these to be parsed as follows:

    "This document is intended to give you an overview of DG as well as
    highlight some of the features. This is a brought to your handheld using
    DG." With DG you can view and edit word processing and spreadsheet files on
    your handheld. Simple push-button synchronization of the handheld with the
    desktop will maintain the most up-to-date version of a file on both the
    desktop and handheld.

    --

    One way that I thought might work is to catch all lines that begin upper
    case, prepend them with a line break, strip the trailing break, then trap
    all lines that start lower case and dump them as-is. Repeat this until no
    matches are made on the lower case test, then clean up all those extra line
    breaks.

    I came up with this . . . but all it seems to do is strip all newlines out.

    while( <infl> ) {

    my $x = $_;
    if ( $x =~ ?^[^a-z]? ) { $x =~ s!(.*)\n!\n\1 ! }
    else { $x =~ s!(.*)\n!\1 ! }
    print outfl $x;
    }
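One possible explanation for the behaviour described above, offered as a sketch rather than a confirmed diagnosis: in Perl, a pattern delimited by question marks is the match-once operator, which succeeds only once until reset() is called. Under that reading, the if branch fires for the first matching line only, and every later line falls through to the else branch, which strips its newline. The helper name count_matches and the sample lines are hypothetical. The explicit m?...? spelling is used because newer Perl releases require the leading m.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how many lines each delimiter style matches.
sub count_matches {
    my @lines = @_;
    reset();    # re-arm match-once patterns so repeated calls behave the same
    my ( $once, $every ) = ( 0, 0 );
    for my $line (@lines) {
        $once++  if $line =~ m?^[^a-z]?;    # match-once form: succeeds one time only
        $every++ if $line =~ m!^[^a-z]!;    # ordinary match: tested on every line
    }
    return ( $once, $every );
}

my ( $once, $every ) = count_matches( "Alpha\n", "beta\n", "Gamma\n", "Delta\n" );
print "m?? matched $once line(s), m!! matched $every line(s)\n";
```

With these four sample lines the match-once form counts 1 line and the ordinary form counts 3, which suggests that switching the delimiter to something other than ? would be a reasonable first thing to try.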

    Any help would be greatly appreciated.

    Michael
    MichaelC, Nov 25, 2003
    #1

  2. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    "MichaelC" <> wrote in
    news:d9Dwb.492453$9l5.241927@pd7tw2no:

    > Hi all. I am having a particularly difficult time with a perl script
    > that I am writing. The problem area is a place where I need to strip
    > some newlines out of a file.
    >
    > My source data is text which is in paragraph form, but has line breaks
    > within the paragraphs. I need to do as much processing as possible in
    > order to minimise the amount of manual changes that I have to make.


    You don't say what you mean by "paragraph form". If you're using that
    term in the usual sense, then you mean that the paragraphs have double
    newlines between them. Is that so? If so, Perl can read paragraph-at-a-
    time for you:

    $/ = '';
    $paragraph = <>;
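A minimal sketch of paragraph mode, assuming input whose paragraphs really are separated by blank lines. The helper name read_paragraphs and the sample text are hypothetical, and an in-memory filehandle stands in for a real file so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: splits text into paragraphs using paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die "Cannot open in-memory handle: $!";
    local $/ = '';    # paragraph mode: a record ends at one or more blank lines
    my @paragraphs = <$fh>;
    close($fh);
    return @paragraphs;
}

my @paras = read_paragraphs("first line\nsecond line\n\n\nnext paragraph\n");
print scalar(@paras), " paragraphs read\n";
```

Using local keeps the change to $/ confined to the helper, so line-at-a-time reads elsewhere in a script are unaffected.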

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP8NO2mPeouIeTNHoEQKl7wCgwhaYGGLKl2VuQu4P7cXtQv9C8ZQAn0K0
    9YlaoVGjDaBonogRTFfOnn5h
    =h9Av
    -----END PGP SIGNATURE-----
    Eric J. Roode, Nov 25, 2003
    #2

  3. MichaelC

    MichaelC Guest

    "Eric J. Roode" <> wrote in message
    news:Xns943E4EE1E1E8Dsdn.comcast@216.196.97.136...
    > -----BEGIN PGP SIGNED MESSAGE-----
    > Hash: SHA1
    >
    > "MichaelC" <> wrote in
    > news:d9Dwb.492453$9l5.241927@pd7tw2no:
    >
    > > Hi all. I am having a particularly difficult time with a perl script
    > > that I am writing. The problem area is a place where I need to strip
    > > some newlines out of a file.
    > >
    > > My source data is text which is in paragraph form, but has line breaks
    > > within the paragraphs. I need to do as much processing as possible in
    > > order to minimise the amount of manual changes that I have to make.

    >
    > You don't say what you mean by "paragraph form". If you're using that
    > term in the usual sense, then you mean that the paragraphs have double
    > newlines between them. Is that so? If so, Perl can read paragraph-at-a-
    > time for you:
    >
    > $/ = '';
    > $paragraph = <>;
    >


    Sorry, I thought that I had defined my problem in
    enough detail. My problem is that the text that I am
    processing does NOT have double line breaks
    between paragraphs, and the text has been presented
    wrapped to 72 character width. I do not have access
    to the original, as it was lost. That is the reason for
    my current problem.
    That said, statistically, in the text that I am processing,
    the vast majority of lines that start with the set [A-Z"]
    will start a new paragraph. The converse is also true,
    in that lines that start [a-z,.!?] are definitely part of a
    logical paragraph. In that sense, I am not using the
    term "paragraph" in the way that you normally assume.

    As an object example, the explanation above is a reasonable simulation of
    the problem that I am facing. Logistically, the manually broken text is two
    paragraphs with no extra line breaks between them. I neither require nor do
    I desire double line breaks between paragraphs; what I do need, though, is
    each paragraph on a single line with a single line break at the end, and
    ONLY there.

    For example, I need to strip all but two line breaks out of the example that
    I have provided, so that the text is contiguous from "Sorry, I" to "current
    problem." and from "That said, " to "normally assume." After some thought,
    I found a solution:

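    #!/usr/bin/perl
    use strict;
    use warnings;

    open(my $infl,  '<', 'in.txt')  or die "Cannot open in.txt: $!";
    open(my $outfl, '>', 'out.txt') or die "Cannot open out.txt: $!";

    while ( my $x = <$infl> ) {
        # A line starting with a capital letter or a quote begins a new paragraph.
        if ( $x =~ m!^[A-Z"]! ) { print $outfl "\n"; }
        # Replace the trailing newline with a space to join wrapped lines.
        $x =~ s!(^.+)\n!$1 !;

        print $outfl $x;
    }

    close($infl);
    close($outfl);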

    Thanks,

    Michael
    MichaelC, Nov 26, 2003
    #3
  4. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    "MichaelC" <> wrote in
    news:s5Vwb.496786$pl3.155625@pd7tw3no:

    > Sorry, I thought that I had defined my problem in
    > enough detail.


    I would say not. :)

    > My problem is that the text that I am
    > processing does NOT have double line breaks
    > between paragraphs, and the text has been presented
    > wrapped to 72 character width. I do not have access
    > to the original, as it was lost. That is the reason for
    > my current problem.
    > That said, statistically, in the text that I am processing,
    > the vast majority of lines that start with the set [A-Z"]
    > will start a new paragraph. The converse is also true,
    > in that lines that start [a-z,.!?] are definitely part of a
    > logical paragraph. In that sense, I am not using the
    > term "paragraph" in the way that you normally assume.


    It sounds like you want to remove all newlines, except where the newline
    is followed by an uppercase character. Is that correct?

    If so, I'd suggest reading the entire file into memory, and doing a
    simple substitution on it:

    $/ = undef;
    $content = <FILE>;
    $content =~ s/\n(?![[:upper:]])//g;
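A hedged variant of the substitution above, not taken from the thread: it replaces each removed newline with a space so that the last word of one line and the first word of the next stay separate, treats a leading double quote as a paragraph start (following the [A-Z"] set mentioned earlier), and keeps the final newline. The helper name unwrap and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: joins wrapped lines within a paragraph using a space.
sub unwrap {
    my ($content) = @_;
    # Turn a newline into a space unless the next line starts with an
    # uppercase letter or a double quote, or the newline ends the text.
    $content =~ s/\n(?![[:upper:]"]|\z)/ /g;
    return $content;
}

print unwrap("Sorry, I thought\nthat I had.\nThat said,\nstatistically.\n");
```

The same caveat applies as to the original heuristic: a wrapped line that happens to begin with a capital letter would still be treated as a new paragraph and would need manual correction.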

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP8SeSmPeouIeTNHoEQKoVQCfdSokT7bnrjmUOkqt4NVFOnp9A48An3t1
    xj9Z1HMNOPOnq8PJ6NJF1KvR
    =1T1p
    -----END PGP SIGNATURE-----
    Eric J. Roode, Nov 26, 2003
    #4