Text parsing via regex

Robocop · Dec 8, 2008

I'm having a little text parsing problem that I think would be really
quick to troubleshoot for someone more versed in Python and regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters, so
adding in whitespace is necessary). So I immediately came up with
something along the lines of:

import re

string = ("a bunch of nonsense that could be really long, or really "
          "short depending on the situation")
r = re.compile(r".{50}")
m = r.match(string)

Then I started to realize that I didn't know how to do exactly what I
wanted. At this point I wanted to find a way to simply use something
like:

parsed_1, parsed_2,...parsed_n = m.groups()

However, I'm having several problems. I know that the playskool regular
expression I wrote above will only parse every 50 characters, and will
blindly cut words in half if the parsed string doesn't end with
whitespace. I'm relatively new to regexes and I don't know how to
have it take that into account, or even what type of logic I would
need to fill in the extra whitespace to make the string the proper
length when avoiding cutting words up. So that's problem #1. Problem
#2 is that because the string is of arbitrary length, I never know how
many parsed strings I'll have, and thus do not immediately know how
many variables need to be created to accompany them. It's easy enough
with each pass of the function to find how many I will have by doing:

mag = len(string)
upper_lim = mag // 50 + 1

But I'm not sure how to declare and set them to my parsed strings.
Now problem #1 isn't as pressing; I can technically get away with
cutting up the words, I'd just prefer not to. The most pressing
problem right now is #2. Any help or suggestions would be great;
anything to get me thinking differently is helpful.

r0g · Dec 8, 2008

Hi Robocop,

What do you mean by "parses some arbitrarily long string every 50
characters"? What does your source data look like? Can you give us an
example of a) it and b) what a match would look like?

I think you will get good mileage out of using '\b' to match word
boundaries, and you may be better off regexing your string into a list
and then padding it with whitespace after the fact, but I can't say for
sure. Please clarify.

Paul McGuire · Dec 8, 2008

Robocop said:
I'm having a little text parsing problem that I think would be really
quick to troubleshoot for someone more versed in Python and regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words

Are you just wrapping text? If so, then use the textwrap module.

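import textwrap

source_string = ("a bunch of nonsense that could be really long, or "
                 "really short depending on the situation")

# fill() returns one newline-joined string; wrap() returns a list of lines
print(textwrap.fill(source_string, 50))
print(textwrap.wrap(source_string, 50))

# length of each wrapped line, then each line padded to exactly 50 characters
print(list(map(len, textwrap.wrap(source_string, 50))))
pad50 = lambda s : (s+ " "*50)[:50]
print('|\n'.join(map(pad50, textwrap.wrap(source_string, 50))))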

Prints:

a bunch of nonsense that could be really long, or
really short depending on the situation
['a bunch of nonsense that could be really long, or',
'really short depending on the situation']
[49, 39]
a bunch of nonsense that could be really long, or |
really short depending on the situation

-- Paul

I V · Dec 8, 2008

Regexps may not be the solution here. You could consider the textwrap
module ( http://docs.python.org/library/textwrap.html ), although that
will only split your text into strings up to 50 characters long, rather
than padding with whitespace to exactly 50 characters.

If you really need the strings to be exactly 50 characters long (and, are
you sure you do?), try:

# Split the input up into separate words
words = input_string.split()

groups = []
current_string = ''
for word in words:
    if not current_string:
        # First word of a line: no leading space needed.
        current_string = word
    elif len(current_string) + 1 + len(word) <= 50:
        # If adding a space followed by the current word
        # wouldn't take us over 50 chars, add the word.
        current_string += ' ' + word
    else:
        # Pad the string with spaces, and add it to our
        # list of strings
        groups.append(current_string.ljust(50))
        current_string = word

# Don't forget the last, partially filled line.
if current_string:
    groups.append(current_string.ljust(50))

Whenever you find yourself thinking "I don't know how many variables I
need," the answer is almost always that you need one variable, which is a
list. In the code above, the 50-char-long strings will all get put in the
list called "groups".
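
A minimal sketch of the "one variable, which is a list" idea, assuming the standard-library textwrap module is acceptable for the word-aware splitting. The function name split_fixed_width and its width parameter are made up for illustration; they do not come from the posts above.

```python
import textwrap

def split_fixed_width(text, width=50):
    """Return a list of lines, each padded with spaces to exactly `width` chars.

    Hypothetical helper: textwrap.wrap keeps words whole, and
    str.ljust pads each short line on the right.
    """
    return [line.ljust(width) for line in textwrap.wrap(text, width)]

source = ("a bunch of nonsense that could be really long, or "
          "really short depending on the situation")
groups = split_fixed_width(source)

# No parsed_1 ... parsed_n needed: the list knows how many pieces it has.
for index, group in enumerate(groups):
    print(index, repr(group))
```

The number of pieces is simply len(groups), and each piece is reachable as groups[0], groups[1], and so on.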

Shane Geiger · Dec 8, 2008

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/148061

from functools import reduce  # reduce is no longer a builtin in Python 3

def wrap(text, width):
    """
    A word-wrap function that preserves existing line breaks
    and most spaces in the text. Expects that existing line
    breaks are posix newlines (\n).
    """
    return reduce(lambda line, word, width=width: '%s%s%s' %
                  (line,
                   ' \n'[(len(line)-line.rfind('\n')-1
                         + len(word.split('\n',1)[0]
                              ) >= width)],
                   word),
                  text.split(' ')
                 )

# 2 very long lines separated by a blank line
msg = """Arthur: "The Lady of the Lake, her arm clad in the purest \
shimmering samite, held aloft Excalibur from the bosom of the water, \
signifying by Divine Providence that I, Arthur, was to carry \
Excalibur. That is why I am your king!"

Dennis: "Listen. Strange women lying in ponds distributing swords is \
no basis for a system of government. Supreme executive power derives \
from a mandate from the masses, not from some farcical aquatic \
ceremony!\""""

# example: make it fit in 40 columns
print(wrap(msg,40))

# result is below
"""
Arthur: "The Lady of the Lake, her arm
"""

--
Shane Geiger
IT Director
National Council on Economic Education
(e-mail address removed) | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

Vlastimil Brom · Dec 8, 2008

2008/12/8 Robocop said:
I'm having a little text parsing problem that I think would be really
quick to troubleshoot for someone more versed in Python and regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters,
...

Hi, I'm not sure I understand the task completely, but maybe some of
the variants below using re may help (depending on what should be done
further with the resulting text segments).
In the first two possibilities the resulting lines are 50 characters
long + 1 for "\n"; 49 could be used instead if needed.

import re

input_txt = """I'm having a little text parsing problem that i think
would be really
quick to troubleshoot for someone more versed in python and Regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters, so
adding in white spaces is necessary). So i immediately came up with
something along the lines of:"""

# re.sub using a function
# print(re.sub(r"(?s).{1,50}\b", lambda m: m.group().ljust(50) + "\n", input_txt))

# adjusting the matches via finditer
# for m in re.finditer(r"(?s).{1,50}\b", input_txt):
#     print(m.group().ljust(50))

# adjusting the matched parts in findall
print([chunk.ljust(50) for chunk in re.findall(r"(?s).{1,50}\b", input_txt)])

hth,
vbr

MRAB · Dec 8, 2008

Paul said:
Are you just wrapping text? If so, then use the textwrap module.

Instead of:

pad50 = lambda s : (s+ " "*50)[:50]

you could use:

pad50 = lambda s: s.ljust(50)
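
The two spellings agree for anything textwrap.wrap(..., 50) returns, since those lines are never longer than 50 characters. A small sketch of where they differ, using the made-up names pad_slice and pad_ljust: the slice also truncates over-long input, while ljust leaves it alone.

```python
pad_slice = lambda s: (s + " " * 50)[:50]   # pads, and truncates anything longer
pad_ljust = lambda s: s.ljust(50)           # pads only, never truncates

short = "really short"
long_line = "x" * 60

# Same result for input that already fits in 50 characters.
print(pad_slice(short) == pad_ljust(short))

# Different lengths once the input is longer than 50 characters.
print(len(pad_slice(long_line)), len(pad_ljust(long_line)))
```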

MRAB · Dec 8, 2008

Vlastimil said:
# re.sub using a function
# print(re.sub(r"(?s).{1,50}\b", lambda m: m.group().ljust(50) + "\n", input_txt))

I also thought of r"(.{1,50}\b)", but then I realised that there's a
subtle problem: it says that the captured text should end on a word
boundary, when, in fact, we just don't want it to split within a word.
It would still be acceptable if it split between 2 non-word characters.
Aargh!
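
One possible way to express "don't split within a word" directly, offered as a sketch rather than as what anyone in the thread actually used: replace \b with a negative lookahead that only rejects a cut with a word character on both sides. The sample text and the width of 10 are invented to keep the output short.

```python
import re

text = "spam eggs). tofu"
width = 10

# Ends every chunk on a word boundary, so ")." is pushed to the next chunk.
boundary_chunks = re.findall(r"(?s).{1,%d}\b" % width, text)

# Only forbids a cut that has a word character on both sides of it.
# (?<=\w)(?=\w) describes "inside a word"; (?!...) rejects that position.
no_split_chunks = re.findall(r"(?s).{1,%d}(?!(?<=\w)(?=\w))" % width, text)

print(boundary_chunks)   # ['spam eggs', '). tofu']
print(no_split_chunks)   # ['spam eggs)', '. tofu']
```

Neither pattern copes with a single word longer than the width; findall may silently drop characters from such a word, so that case would need separate handling.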

Robocop · Dec 8, 2008

Wow! Thanks for all the input, it looks like the textwrap module will
work great for my needs. And thanks for the regex help, everyone.
Also, I was thinking of using a list, but I haven't used them much in
Python. Is there anything in Python that is equivalent to push_back in
C++ for vectors? As in, could I just initialize a list, and then
push_back values into it as I need them? Thanks again!

Robert Kern · Dec 8, 2008

Robocop said:
Wow! Thanks for all the input, it looks like the textwrap module will
work great for my needs. And thanks for the regex help, everyone.
Also, I was thinking of using a list, but I haven't used them much in
Python. Is there anything in Python that is equivalent to push_back in
C++ for vectors?

list.append()

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

rdmurray · Dec 10, 2008

list.append()

Fun with lists:

>>> l = list()
>>> l.append('a')
>>> l
['a']
>>> l.extend(['b', 'c'])
>>> l
['a', 'b', 'c']
>>> l[1:] = [1, 2, 3]
>>> l
['a', 1, 2, 3]
>>> l.pop()
3
>>> l
['a', 1, 2]
>>> l[:0] = ['z']
>>> l
['z', 'a', 1, 2]

--RDM
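
Tying this back to problem #2 from the first post, here is a sketch of a push_back-style loop built on list.append. It uses the plain every-50-characters cut from the original attempt, so words may be split; the names chunks and width are only illustrative.

```python
text = ("a bunch of nonsense that could be really long, or "
        "really short depending on the situation")
width = 50

chunks = []                                # start with an empty list
for start in range(0, len(text), width):
    piece = text[start:start + width]      # may cut a word in half
    chunks.append(piece.ljust(width))      # append plays the role of push_back

print(len(chunks))
print([len(chunk) for chunk in chunks])
```

The list grows as needed, so there is no need to work out the number of pieces in advance.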
