Newbie Text Processing Question

Discussion in 'Python' started by gshepherd281@earthlink.net, Oct 5, 2005.

  1. Guest

    Hi,

    I'm a total newbie to Python so any and all advice is greatly
    appreciated.

    I'm trying to use regular expressions to process text in an SGML file
    but only in one section.

    So the input would look like this:

    <ch-part no="I"><title>RESEARCH GUIDE
    <sec-main no="1.01"><title>content
    <para>content

    <sec-main no="2.01"><title>content
    <para>content


    <ch-part no="II"><title>FORMS
    <sec-main no="3.01"><title>content

    <sec-sub1 no="1"><title>content
    <para>content

    <sec-sub2 no="1"><title>content
    <para>content


    and the output like this:

    <ch-part no="I"><title>RESEARCH GUIDE
    <sec-main no="1.01"><title>content
    <biblio>
    <para>content
    </biblio>

    <sec-main no="2.01"><title>content
    <biblio>
    <para>content
    </biblio>

    <ch-part no="II"><title>FORMS
    <sec-main no="3.01"><title>content

    <sec-sub1 no="1"><title>content
    <para>content

    <sec-sub2 no="1"><title>content
    <para>content


    But no matter what I try I end up changing the entire file rather than
    just one part.

    Here's what I've come up with so far but I can't think of anything
    else.

    ***

    import os, re
    setpath = raw_input("Enter the path where the program should run: ")
    print

    for root, dirs, files in os.walk(setpath):
        fname = files
        for fname in files:
            inputFile = file(os.path.join(root,fname), 'r')
            line = inputFile.read()
            inputFile.close()


            chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)

            while 1:
                if chpart_pattern.search(line):
                    line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)", r"<sec-main no=\1><title>\2\n<biblio>", line)
                    outputFile = file(os.path.join(root,fname), 'w')
                    outputFile.write(line)
                    outputFile.close()
                    break

                if chpart_pattern.search(line) is None:
                    print 'none'
                    break

    Thanks,

    Greg
     
    , Oct 5, 2005
    #1

  2. James Stroud Guest

    You can edit a file in place, but that does not help with what you are doing.
    As soon as you insert the first "<biblio>", you've shifted everything
    downstream by those 8 bytes. Since the bytes map to physically located blocks
    on a physical drive, you will have to rewrite those blocks. If it is a big
    file, you can do something conceptually similar to piping, where the original
    file is read in line by line and a new file is created:

    afile = open("somefile.xml")
    newfile = open("somenewfile.xml", "w")
    for aline in afile:
        if tests_positive(aline):
            newfile.write(make_the_prelude(aline))
            newfile.write(aline)
            newfile.write(make_the_afterlude(aline))
        else:
            newfile.write(aline)
    afile.close()
    newfile.close()
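
    The loop above leaves tests_positive and the two make_the_* helpers
    undefined. A minimal runnable sketch of the same idea for the sample in
    the first post might look like the following. It assumes that every tag
    starts its own line and that a blank line ends a run of <para> lines; the
    names add_biblio, rewrite_file and the ".tmp" suffix are made up for this
    illustration and are not from the thread.

```python
import os
import re

# Assumption: a chapter line looks like the sample, e.g.
# <ch-part no="I"><title>RESEARCH GUIDE
CH_PART = re.compile(r'<ch-part\b[^>]*><title>(.*)')


def add_biblio(lines, wanted="RESEARCH"):
    """Yield lines, wrapping runs of <para> lines in <biblio>...</biblio>,
    but only inside chapters whose title starts with `wanted`."""
    in_target = False  # inside a chapter we want to change?
    in_biblio = False  # is a <biblio> currently open?
    for line in lines:
        # A blank line or any tag other than <para> ends the current run.
        ends_run = not line.strip() or (
            line.startswith("<") and not line.startswith("<para"))
        if in_biblio and ends_run:
            yield "</biblio>\n"
            in_biblio = False
        match = CH_PART.match(line)
        if match:
            # Remember which chapter we are in; nothing else is global.
            in_target = match.group(1).upper().startswith(wanted)
        elif in_target and line.startswith("<para") and not in_biblio:
            yield "<biblio>\n"
            in_biblio = True
        yield line
    if in_biblio:
        # The file ended while a run was still open.
        yield "</biblio>\n"


def rewrite_file(path):
    """Stream the file through add_biblio into a new file, then swap it in."""
    tmp = path + ".tmp"  # hypothetical temporary name
    with open(path) as src, open(tmp, "w") as dst:
        dst.writelines(add_biblio(src))
    os.replace(tmp, path)
```

    Because the chapter is tracked line by line, the change stays inside the
    RESEARCH chapter, whereas a single re.sub over the whole text rewrites
    every <sec-main> in the file. The sketch expects each line to end with a
    newline.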

    James

    On Tuesday 04 October 2005 20:13, Gregory Piñero wrote:
    > That's how Python works. You read in the whole file, edit it, and write it
    > back out. As far as I know there's no way to edit a file "in place" which
    > I'm assuming is what you're asking?


    --
    James Stroud
    UCLA-DOE Institute for Genomics and Proteomics
    Box 951570
    Los Angeles, CA 90095

    http://www.jamesstroud.com/
     
    James Stroud, Oct 5, 2005
    #2

  3. Mike Meyer Guest

    writes:
    > I'm a total newbie to Python so any and all advice is greatly
    > appreciated.


    Well, I've got some for you.

    > I'm trying to use regular expressions to process text in an SGML file
    > but only in one section.


    This is generally a bad idea. SGML family languages aren't easy to
    parse - even the ones that were designed to be easy to parse - and
    generally require very complex regular expressions to get right. It may
    be that your SGML data can be parsed by the re you use, but there
    are almost certainly valid SGML documents that your parser will not
    properly parse.

    In general, it's better to use a parser for the language in question.

    > So the input would look like this:
    >
    > <ch-part no="I"><title>RESEARCH GUIDE
    > <sec-main no="1.01"><title>content
    > <para>content
    >
    > <sec-main no="2.01"><title>content
    > <para>content
    >
    >
    > <ch-part no="II"><title>FORMS
    > <sec-main no="3.01"><title>content
    >
    > <sec-sub1 no="1"><title>content
    > <para>content
    >
    > <sec-sub2 no="1"><title>content
    > <para>content



    This is funny-looking SGML. Are the end tags really optional for
    all the tag types?

    > But no matter what I try I end up changing the entire file rather than
    > just one part.


    Others have explained why you can't do that, so I'll skip it.

    > Here's what I've come up with so far but I can't think of anything
    > else.
    >
    > ***
    >
    > import os, re
    > setpath = raw_input("Enter the path where the program should run: ")
    > print
    >
    > for root, dirs, files in os.walk(setpath):
    >     fname = files
    >     for fname in files:
    >         inputFile = file(os.path.join(root,fname), 'r')
    >         line = inputFile.read()
    >         inputFile.close()
    >
    >
    >         chpart_pattern = re.compile(r'<ch-part no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)


    This makes a number of assumptions that are invalid about SGML in
    general, but may be valid for your sample text - how attributes are
    quoted, the lack of line breaks, which can be added without changing
    the content, and the format of the "no" attribute.

    >         while 1:
    >             if chpart_pattern.search(line):
    >                 line = re.sub(r"<sec-main no=(\"[0-9]*.[0-9]*\")><title>(.*)", r"<sec-main no=\1><title>\2\n<biblio>", line)


    Ditto.

    Here's an sgmllib solution that does what you do above, except that
    it writes the result to standard output:

    #!/usr/bin/env python

    import sys
    from sgmllib import SGMLParser

    datain = """
    <ch-part no="I"><title>RESEARCH GUIDE
    <sec-main no="1.01"><title>content
    <para>content

    <sec-main no="2.01"><title>content
    <para>content


    <ch-part no="II"><title>FORMS
    <sec-main no="3.01"><title>content

    <sec-sub1 no="1"><title>content
    <para>content

    <sec-sub2 no="1"><title>content
    <para>content
    """

    class Parser(SGMLParser):

        def __init__(self):
            # install the handlers with funny names
            setattr(self, "start_ch-part", self.handle_ch_part)

            # And start with chapter 0
            self.ch_num = 0

            SGMLParser.__init__(self)

        def format_attributes(self, attributes):
            return ['%s="%s"' % pair for pair in attributes]

        def unknown_starttag(self, tag, attributes):
            taglist = self.format_attributes(attributes)
            taglist.insert(0, tag)
            sys.stdout.write('<%s>' % ' '.join(taglist))

        def handle_data(self, data):
            sys.stdout.write(data)

        def handle_ch_part(self, attributes):
            """This should be called start_ch-part, but, well, you know."""

            self.unknown_starttag('ch-part', attributes)
            for name, value in attributes:
                if name == 'no':
                    self.ch_num = value

        def start_para(self, attributes):
            if self.ch_num == 'I':
                sys.stdout.write('<biblio>\n')
            self.unknown_starttag('para', attributes)


    parser = Parser()
    parser.feed(datain)
    parser.close()


    sgmllib isn't a very good SGML parser - it was written to support
    htmllib, and really only handles that subset of sgml well. In
    particular, it doesn't really understand DTDs, so can't handle the
    missing end tags in your example. You may be able to work around that.

    If you can coerce this to XML, then the xml tools in the standard
    library will work well. For HTML, I like BeautifulSoup, but that's
    mostly because it deals with all the crud on the net that is passed
    off as HTML. For SGML - well, I don't have a good answer. Last time I
    had to deal with real SGML, I used a C parser that spat out a parse
    tree that could be parsed properly.
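
    To illustrate the XML route mentioned above: if the data were first
    converted to well-formed XML, something along these lines could wrap the
    paragraphs using xml.etree.ElementTree from the standard library. This is
    only a sketch. The <doc> root element, the explicit end tags, and the
    nesting of <para> inside <sec-main> are assumptions about what the
    converted file would look like; XML_SAMPLE and wrap_paras are names
    invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML version of the sample; the real conversion may differ.
XML_SAMPLE = """<doc>
<ch-part no="I"><title>RESEARCH GUIDE</title>
<sec-main no="1.01"><title>content</title>
<para>content</para>
</sec-main>
</ch-part>
<ch-part no="II"><title>FORMS</title>
<sec-main no="3.01"><title>content</title>
<para>content</para>
</sec-main>
</ch-part>
</doc>"""


def wrap_paras(root, wanted="RESEARCH"):
    """Move the <para> children of each <sec-main> into a new <biblio>,
    but only in chapters whose title starts with `wanted`."""
    for chapter in root.iter("ch-part"):
        title = chapter.findtext("title", default="")
        if not title.upper().startswith(wanted):
            continue  # leave every other chapter untouched
        for section in chapter.findall("sec-main"):
            paras = section.findall("para")
            if not paras:
                continue
            # Put <biblio> where the first <para> used to be.
            index = list(section).index(paras[0])
            biblio = ET.Element("biblio")
            for para in paras:
                section.remove(para)
                biblio.append(para)
            section.insert(index, biblio)


root = ET.fromstring(XML_SAMPLE)
wrap_paras(root)
result = ET.tostring(root, encoding="unicode")
```

    Working on a parsed tree means the chapter test and the edit apply to
    elements rather than to text offsets, so attribute quoting and line
    breaks in the source no longer matter.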

    <mike
    --
    Mike Meyer http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Oct 5, 2005
    #3
  4. Fredrik Lundh

    Gregory Piñero wrote:

    >That's how Python works. You read in the whole file, edit it, and write it
    > back out.


    that's how file systems work. if file systems generally supported insert
    operations, Python would of course support that feature.
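
    A small sketch of the difference, using a throwaway temporary file: a
    write at an offset replaces the bytes that are already there, so an
    "insert" has to be emulated by rewriting everything after the insertion
    point. The helper names overwrite_at, insert_at and demo are invented for
    this illustration.

```python
import os
import tempfile


def overwrite_at(path, offset, data):
    """Write data at offset; the bytes already there are replaced, not shifted."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def insert_at(path, offset, data):
    """Emulate an insert by rewriting everything from offset onward."""
    with open(path, "r+b") as f:
        f.seek(offset)
        tail = f.read()        # everything that has to move downstream
        f.seek(offset)
        f.write(data + tail)   # new bytes first, then the old tail


def demo():
    """Apply both helpers to a fresh copy of the same one-line file."""
    results = []
    for func in (overwrite_at, insert_at):
        fd, path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(b"<para>content\n")
            func(path, 0, b"<biblio>")
            with open(path, "rb") as f:
                results.append(f.read())
        finally:
            os.remove(path)
    return results
```

    Here demo() returns the file contents after each call: the overwrite
    clobbers the first 8 bytes of the line, while the emulated insert keeps
    them by moving the whole tail.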

    </F>
     
    Fredrik Lundh, Oct 5, 2005
    #4
  5. Dennis Lee Bieber

    On Wed, 5 Oct 2005 07:46:49 +0200, "Fredrik Lundh"
    declaimed the following in comp.lang.python:

    > Gregory Piñero wrote:
    >
    > >That's how Python works. You read in the whole file, edit it, and write it
    > > back out.

    >
    > that's how file systems work. if file systems generally supported insert
    > operations, Python would of course support that feature.
    >

    My college system's default for editor files was "keyed"... Each
    line was independent, and the key was the line number (including a
    decimal part for inserted lines).

    1.000 first line
    1.500 inserted line
    2.000 last line

    The machine had three "native" file formats... consecutive (what most
    would consider a regular binary/stream [written from start to end]
    file), keyed (ISAM type -- also used by the FORTRAN runtime for "random"
    access by record number), and random (fixed size contiguous disk
    allocation, with NO structure assumed -- all access was by offset from
    start of file allocation).

    Of course, that strange system also maintained separate read/write
    pointers on files, so one could open "update" mode -- where one had to
    read a record before writing (over) the record. No seeks needed.
    "Scratch" required write before read. But the I/O did not have to be in
    lockstep, you could read three records, write one, then read the fourth,
    write the second...
    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
     
    Dennis Lee Bieber, Oct 5, 2005
    #5