[ActivePython 2.5.1.1] Why does Python not return first line?

Gilles Ganault

Hello

I'm stuck at why Python doesn't return the first line in this simple
regex:

===========
import re

response = "<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>"

re_address = re.compile('<span>Address :</span></td>.+?<td>(.+?)</td>', re.I | re.S | re.M)

address = re_address.search(response)
if address:
    address = address.group(1).strip()
    print "address is %s" % address
else:
    print "address not found"
===========
C:\test.py
London, NW8 9AY<br />
===========

Could this be due to the non-printable characters like TAB or ENTER?
FWIW, I think that the original web page I'm trying to parse is from a
*nix host.

Thanks for any hint.
 
Gilles Ganault

I'm stuck at why Python doesn't return the first line in this simple
regex

Found it: Python does extract the token, but displaying it requires
removing hidden chars:

=====
import re

response = "<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>"

re_address = re.compile('<span>Address :</span></td>.+?<td>(.+?)</td>', re.I | re.S | re.M)

address = re_address.search(response)
if address:
    address = address.group(1).strip()

    #Important! Remove the tabs, carriage returns and <br /> tag before printing.
    for item in ["\t", "\r", " <br />"]:
        address = address.replace(item, "")

    print "address is %s" % address
else:
    print "address not found"
=====

HTH,
 
John Machin

Found it: Python does extract the token, but displaying it requires
removing hidden chars:

=====
response = "<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>"

re_address = re.compile('<span>Address :</span></td>.+?<td>(.+?)</td>', re.I | re.S | re.M)

address = re_address.search(response)
if address:
        address = address.group(1).strip()

When in doubt, use the repr() function (2.X) or the ascii() function
(3.X); it will show you unambiguously exactly what you have in a
string; in this case:

'3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY'
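
The repr() advice can be sketched roughly as follows. This is only an illustration, written for Python 3 (where print is a function), and the string is what the regex in this thread would capture after strip(), assuming the wrapped sample line joins with a single space:

```python
# The captured address after .strip(), assuming the thread's sample data.
address = "3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY"

# A plain print(address) lets the terminal act on \r and \t, which can hide text.
# repr() turns those control characters into visible escape sequences instead.
shown = repr(address)
print(shown)
```

Printing the repr() makes the stray carriage return and tabs visible, so it is clear the first line was captured all along.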
        #Important!
        for item in ["\t","\r"," <br />"]:
                address = address.replace(item,"")

        print "address is %s" % address

and the result is:

3 Abbey Road, St Johns WoodLondon, NW8 9AY

WoodLondon ??

Consider the possibility that whether the webpage originated on *x or
not, the author inserted that "<br />" with beneficial intent i.e. not
just to annoy you. You may wish to replace it with something instead
of discarding it.

If you really want the address to look tidy, you could do something
like this:

def norm_space(s):
    # Collapse each run of whitespace (spaces, tabs, CRs) to a single space.
    return ' '.join(s.split())

tidy = ", ".join([norm_space(x) for x in address.replace('<br />', ',').strip(' ,').split(',')])

Perhaps the "<br />" has even more significance (line break?) than a
comma ... in which case you should split the address into lines first,
and apply the tidy process to each line.
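
A minimal sketch of that "split into lines first" idea, in Python 3. The helper name address_lines and the decision to accept any `<br>` variant are assumptions made for illustration, not something from the original post:

```python
import re

def norm_space(s):
    # Collapse each run of whitespace (spaces, tabs, CRs) to a single space.
    return " ".join(s.split())

def address_lines(raw):
    # Hypothetical helper: treat each <br> variant as a line break,
    # then tidy every resulting line separately.
    parts = re.split(r"<br\s*/?>", raw, flags=re.I)
    lines = [norm_space(p).strip(" ,") for p in parts]
    return [line for line in lines if line]

raw = "3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY"
lines = address_lines(raw)
print(lines)
```

Keeping the lines in a list leaves the choice of separator (comma, newline, something else) to whatever code displays the address.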

HTH,
John
 
Dennis Lee Bieber

Actually, the problem is that the only newlines you have on there are Mac OS
Classic/Commodore newlines. Windows new lines date back to typewriters.

Teletype -- all manual typewriters I learned on had the new-line
mechanism built into the "carriage return" lever... In truth, the
resistance was that the new-line(s) would activate before the carriage
moved; but to the user, it was basically one action -- slap the lever
hard enough to slide the carriage over.

Teletypes, OTOH, really did use one character to advance the platen
by a line, and a second to move the print-head to the left. (and may
have needed "rub-out" characters to act as timing delays while the
print-head moved)


And I'm pretty sure my CBM Amigas used <lf> for end-of-line. My
TRS-80s, OTOH, used <cr> for end-of-line.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
Falcolas

address = re_address.search(response)
if address:
    address = address.group(1).strip()

    #Important!
    for item in ["\t","\r"," <br />"]:
        address = address.replace(item,"")

As you found, your script works just fine; it's just that during
terminal output the \r performs a carriage return, moving the cursor
back to the start of the line, so the text after it overwrites what
was printed before it.

FWIW, I've rarely seen a \r by itself, even in Windows (where it's
usually \r\n). Unix generally just outputs the \n, so my guess is that
some other process which created the output removed newline
characters, but didn't account for the carriage return characters
first.

Wiping out the \r characters as you did will solve your display
issues, though any other code should read right past them.
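
The overwrite effect can be imitated with a rough simulation. This is only a sketch in Python 3: render_line is a made-up helper, the 8-column tab stops are an assumption, and real terminals differ in the details:

```python
def render_line(text, tab=8):
    # Roughly simulate how a terminal shows one line of output:
    # a carriage return moves the cursor to column 0 without erasing,
    # a tab jumps to the next tab stop, anything else overwrites a cell.
    cells = []
    col = 0
    for ch in text:
        if ch == "\r":
            col = 0
        elif ch == "\t":
            col = (col // tab + 1) * tab
        else:
            while len(cells) <= col:
                cells.append(" ")
            cells[col] = ch
            col += 1
    return "".join(cells)

shown = render_line("hello world\rHELLO")
print(shown)
```

The string itself still holds both parts; only the on-screen rendering loses the characters that get overwritten.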

~G
 
Steven D'Aprano

FWIW, I've rarely seen a \r by itself, even in Windows (where it's
usually \r\n). Unix generally just outputs the \n, so my guess is that
some other process which created the output removed newline characters,
but didn't account for the carriage return characters first.


\r is the line terminator for classic Mac. (I think OS X uses \n, but
presumably Apple applications are smart enough to use either.)

I also remember a software package that allowed you to choose between
\r\n and \n\r when exporting data to text files. It's been some years since
I've used it -- by memory it was a custom EDI application for a rather
large Australian hardware company. (Presumably their developers couldn't
remember which came first, the \r or the \n, so they made it optional.)

The Unicode standard specifies that all of the following should be
considered line terminators:

LF: Line Feed, U+000A
CR: Carriage Return, U+000D
CR+LF: CR followed by LF, U+000D followed by U+000A
NEL: Next Line, U+0085
FF: Form Feed, U+000C
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029

http://en.wikipedia.org/wiki/Newline#Unicode

so presumably if you're getting data from systems other than Windows
or Unix, you could find any of these. Aren't standards wonderful? There
are so many to choose from.
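
For what it's worth, here is a small Python 3 sketch of coping with those terminators. str.splitlines() on a Python 3 string treats all of the characters listed above as line boundaries (plus a few others); the to_lf helper is a hypothetical name and only normalizes the CR-based variants:

```python
import re

# One sample string using LF, CR, CR+LF, NEL, FF, LS and PS as separators.
sample = "a\nb\rc\r\nd\x85e\x0cf\u2028g\u2029h"
pieces = sample.splitlines()
print(pieces)

def to_lf(text):
    # Hypothetical helper: rewrite CR+LF, LF+CR and lone CR as a single LF.
    return re.sub(r"\r\n|\n\r|\r", "\n", text)

normalized = to_lf("x\r\ny\rz\n\rw")
print(repr(normalized))
```

Note that a sequence such as LF followed by CR+LF is ambiguous, so a normalizer like this has to guess; here it would produce two line breaks.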
 
