Printing only a portion of a matched regex -- newbie question

Discussion in 'Perl Misc' started by DIAMOND Mark R., Aug 9, 2004.

  1. My apologies to begin with. I am a relatively new and infrequent user of
    Perl.

    I have a series of html files with contact information for doctors. The
    files have enormous amounts of other stuff in them including script, image
    links and so on.
    But the names all appear between a particular <span ...> tag and a </b> tag,
    with words like "level7Name" or "level2Contact" (the quotes are in the
    tag) marking the particular spans.
    Line breaks don't seem to follow any particular pattern. The two structures
    <span ... level.Name> .... nametoprint</b> and the equivalent for the
    contact address are quite distinct without any strange embedding of the two.

    What I'd like to do is print out the names, and the contact information, but
    I've obviously gone wrong somewhere. I couldn't work out whether I should or
    should not have a global at the end of the s///, but in either case, I still
    have a problem. Any help would be very much appreciated.

    $/ = ".\n";
    $doctorlistfile = "c:\\tmp\\doctors.tmp";
    open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
    while(<>) {
    s/<span +class=\"level[0-9]Name\"><b>([^<]*)<\/b>/ $1 /;
    print DOCTORLISTFILE $1;
    s/<span +class=\"level[0-9]Contact\"><b>([^<]*)<\/b>/ $1 /;
    print DOCTORLISTFILE $1;
    }

    --
    Mark R. Diamond
    DIAMOND Mark R., Aug 9, 2004
    #1

  2. I should have added that I have searched the NG on Google Groups, but part
    of the problem is that I'm not quite sure what I should be searching for.
    Searching for "print only match OR matching" pointed me to solutions which
    printed only *lines* with an appropriate match.

    mark
    DIAMOND Mark R., Aug 9, 2004
    #2

  3. Thanks, Brian. You are quite right. I just want to match, not change. And I
    do want those newlines. But it only prints the first instance of a name. I
    have made two slight changes. The first so that the print is conditional,
    the second because I realised that the tag that marks the end of the name or
    contact is not always the same, so I have checked for the beginning of the
    tag only in the following.

    $/ = ".\n";
    $doctorlistfile = "c:\\tmp\\doctors.tmp";
    open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
    while(<>) {
    print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Name"><b>([^<]*)</;
    print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Contact"><b>([^<]*)</;
    }

    but as I say, only a single name (the first correct match) is extracted from
    the file.

    Another question to which I am unsure of the answer is whether the second
    appearance of $1 is correct, or whether the indices of the $ increase
    throughout the loop rather than just within each regex; i.e. is the first
    match in the second regex actually called $2 ?

    Cheers.

    --
    Mark R. Diamond


    "Brian Kell" <> wrote in message
    news:eek:...
    > $/ = ".\n";
    > $doctorlistfile = "c:\\tmp\\doctors.tmp";
    > open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
    > while(<>) {
    > s/<span +class=\"level[0-9]Name\"><b>([^<]*)<\/b>/ $1 /;
    > print DOCTORLISTFILE $1;
    > s/<span +class=\"level[0-9]Contact\"><b>([^<]*)<\/b>/ $1 /;
    > print DOCTORLISTFILE $1;
    > }
    >
    > It looks like you're close. You probably just want to use m// instead of
    > s///, though, since you're only trying to match, not actually do a
    > substitution, right? (And you probably want to print a newline after each
    > of those, right?)
    >
    > So something like this might work:
    >
    > m/<span +class="level[0-9]Name"><b>([^<]*)<\/b>/;
    > print DOCTORLISTFILE "$1\n";
    >
    > If that doesn't work, what does it print instead?
    >
    > Brian
    DIAMOND Mark R., Aug 9, 2004
    #3
  4. DIAMOND Mark R. wrote:

    > $/ = ".\n";
    > while(<>) {


    If your file does not have any lines that end with a period, then
    the entire file will be read in by <>, and the code inside the while{}
    block will be executed only once. Try
    print "$. = '$_'\n";
    as a debugging aid.

    > print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Name"><b>([^<]*)</;
    > print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Contact"><b>([^<]*)</;


    > Another question to which I am unsure of the answer is whether the second
    > appearance of $1 is correct


    In each regex, $1 corresponds to the first set of capturing parentheses in
    that regex. The presence of any other regex in the file does not change this.
    -Joe
    Joe Smith, Aug 9, 2004
    #4
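
Joe's point about the input record separator can be checked with a small experiment. The following is only a sketch: the sample text and the helper name count_records are hypothetical, and it reads from an in-memory filehandle rather than from the real HTML files.

```perl
use strict;
use warnings;

# Hypothetical sample input: three lines, none of which ends with a period.
my $html = "<p>first line\n<p>second line\n<p>third line\n";

# Count how many records readline returns for a given record separator.
# count_records is an illustrative helper, not part of the original script.
sub count_records {
    my ($text, $separator) = @_;
    local $/ = $separator;    # localised so the change does not leak out
    open(my $fh, '<', \$text) or die "Can't open in-memory file: $!";
    my $count = 0;
    while (defined(my $record = <$fh>)) {
        $count++;
    }
    close($fh);
    return $count;
}

my $by_line   = count_records($html, "\n");     # default-style separator
my $by_period = count_records($html, ".\n");    # separator used in the thread

print "separator \\n:  $by_line record(s)\n";
print "separator .\\n: $by_period record(s)\n";
```

With ".\n" as the separator and no line ending in a period, the whole input arrives as a single record, so a loop body containing a plain m// test runs only once.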
  5. "DIAMOND Mark R." <> wrote in message
    news:cf70lr$ea2$...
    > Thanks, Brian. You are quite right. I just want to match, not change. And I
    > do want those newlines. But it only prints the first instance of a name. I
    > have made two slight changes. The first so that the print is conditional,
    > the second because I realised that the tag that marks the end of the name or
    > contact is not always the same, so I have checked for the beginning of the
    > tag only in the following.
    >
    > $/ = ".\n";

    this looks doubtful in light of your first post, where you said line
    breaks follow no particular pattern. skip it

    > $doctorlistfile = "c:\\tmp\\doctors.tmp";
    > open(DOCTORLISTFILE, "> $doctorlistfile" ) || die "Can't open $doctorlistfile \n";
    > while(<>) {
    > print DOCTORLISTFILE "$1\n" if m/<span +class="level[0-9]Name"><b>([^<]*)</;


    you were almost there.
    change the if to a while and add a /g:
    print DOCTORLISTFILE "$1\n"
    while m/<span +class="level[0-9]Name"><b>([^<]*)</g;
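
A self-contained sketch of the while-plus-/g idiom suggested above, run against a hypothetical string rather than the real files (the markup and the doctor names are invented for illustration). It also shows the list-context form of /g, which is an alternative not mentioned in the thread.

```perl
use strict;
use warnings;

# Hypothetical markup, loosely modelled on the description in the thread.
my $html = join "\n",
    '<span class="level7Name"><b>Dr A. Example</b></span>',
    '<span class="level7Contact"><b>1 Sample St</b></span>',
    '<span  class="level2Name"><b>Dr B. Sample</b></span>';

my @names;
# In scalar context, /g resumes where the previous match ended, so the
# loop body runs once per match and stops when no further match is found.
while ($html =~ m/<span +class="level[0-9]Name"><b>([^<]*)</g) {
    push @names, $1;
}

# In list context, /g returns every captured substring in one go.
my @contacts = $html =~ m/<span +class="level[0-9]Contact"><b>([^<]*)</g;

print "name: $_\n"    for @names;
print "contact: $_\n" for @contacts;
```

Without /g, the same while loop would restart from the beginning of the string on every iteration and never terminate once a match exists, which is why the two changes go together.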

    >
    > but as I say, only a single name (the first correct match) is extracted from
    > the file.


    consistent with your $/, probably

    >
    > Another question to which I am unsure of the answer is whether the second
    > appearance of $1 is correct, or whether the indices of the $ increase
    > throughout the loop rather than just within each regex; i.e. is the first
    > match in the second regex actually called $2 ?


    each regex resets the $n variables

    gnari
    gnari, Aug 9, 2004
    #5
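
One caveat worth illustrating about the capture variables: the numbering starts again at $1 for every regex, but Perl only updates $1 when a match succeeds. After a failed match, $1 still holds the value from the last successful one. The sketch below uses a hypothetical record containing a name but no contact; the array names are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical record: it contains a name span but no contact span.
my $record = '<span class="level7Name"><b>Dr A. Example</b>';

my @unconditional;
my @conditional;

# First match succeeds and sets $1.
$record =~ m/<span +class="level[0-9]Name"><b>([^<]*)</;
push @unconditional, $1;

# Second match fails, so $1 is left as it was: the name is recorded twice.
$record =~ m/<span +class="level[0-9]Contact"><b>([^<]*)</;
push @unconditional, $1;

# Using $1 only when the match succeeded avoids the stale value.
for my $kind (qw(Name Contact)) {
    push @conditional, $1
        if $record =~ m/<span +class="level[0-9]$kind"><b>([^<]*)</;
}

print "unconditional: @unconditional\n";
print "conditional:   @conditional\n";
```

This is one reason the conditional form (print ... if m/.../) from the third post is safer than matching first and printing $1 unconditionally afterwards.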
  6. Re: Printing only a portion of a matched regex -- Thanks

    Many thanks to all. I have solved my problem and learned quite a bit.

    --
    Mark R. Diamond
    DIAMOND Mark R., Aug 10, 2004
    #6
