Chris Newman
I am working on a script to process a large number of old electoral records.
There are about 100,000 records in all, but here is a representative sample
(BTW, hd = household duties):
ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
Note that the first names are in the same sequence as the occupations. An
occupation may consist of one or two words, e.g. 'hd' or 'tractor driver'. The
last of these sample records has 3 'Marshalls' (Margaret, Charles and Herbert),
though other records include up to six family members. In all cases there is
a pattern:
1 person . . . occupation is immediately followed by a line return
(naturally)
2 people . . . first occupation is followed by an '&', last occupation by a
line return
3 or more people . . . each occupation before the second last is followed by
a comma, and the last two occupations follow the 2-people pattern above
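The pattern above might be sketched in Perl roughly as follows. This is only a sketch under assumptions that go beyond the three sample records: every first name starts with a capital letter, every occupation starts with a lowercase letter, and the surname contains no comma. `parse_record` is a hypothetical helper name, and the `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record into [surname, first name, occupation] triples.
# Assumes names are capitalised and occupations begin with a lowercase letter.
sub parse_record {
    my ($line) = @_;

    # Surname up to the first comma, then the shortest run of names that is
    # followed by whitespace and a lowercase word, then the occupations.
    my ($surname, $names, $occs) =
        $line =~ /^([^,]+),\s*(.*?)\s+([a-z].*?)\s*$/
        or return;

    # Both lists use the same separators: a comma or an ampersand.
    my @names = split /\s*[,&]\s*/, $names;
    my @occs  = split /\s*[,&]\s*/, $occs;

    # Pair names and occupations by position; a missing occupation becomes ''.
    return map { [ $surname, $names[$_], $occs[$_] // '' ] } 0 .. $#names;
}

my @sample = (
    'ALLISON, Winifred hd',
    'BRACKENREG, Helen & James hd & lands officer',
    'MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver',
);

for my $line (@sample) {
    printf qq("%s","%s","%s"\n), @$_ for parse_record($line);
}
```

Splitting the names and the occupations separately sidesteps the question of which separator comes next, since the two lists only need to line up by position.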
My initial thoughts:
Use a global regex that would step through and match the next occupation, but
it has not proved that easy. I need a way to move the 'matching point' forward
to an ampersand, comma or line return depending on context. I would appreciate
it if anyone could provide some insights into whether REs can provide this
level of control, or point me to a more appropriate solution.
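For what it is worth, the idea of moving the matching point forward could be sketched with a scalar-context `m//g` match, which resumes at `pos()` where the previous match ended, combined with the `\G` anchor and the `/c` modifier. This is a sketch only: `occupations` is a hypothetical helper name, and it assumes that the first lowercase word on the line is the start of the occupations.

```perl
use strict;
use warnings;

# Hypothetical helper: step through a record and collect its occupations.
# Assumes nothing before the occupations starts with a lowercase letter.
sub occupations {
    my ($line) = @_;
    my @found;

    # Move the matching point to just after the first whitespace character
    # that is followed by a lowercase letter (the start of the occupations).
    $line =~ /\s(?=[a-z])/g or return;

    # Each scalar-context /g match starts at \G, i.e. where the last one ended.
    # An occupation is one or more lowercase words, ended by ',', '&' or end of line.
    # /c keeps pos() in place when the final attempt fails.
    while ($line =~ /\G\s*([a-z]+(?:\s+[a-z]+)*)\s*(?:[,&]|$)/gc) {
        push @found, $1;
    }
    return @found;
}

my @occs = occupations('MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver');
print join('|', @occs), "\n";
```

Because the separator is consumed as part of each match, the next iteration starts at the following occupation regardless of whether a comma or an ampersand came before it.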
Here is the relevant code snippet:
# preceding code deals with last names, addresses etc. That part works well.
@matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record
foreach $FirstName (@matches) {
    (m/(\s[a-z]+(.*?)(&|,|$))/g); # fails to match anything but the first occupation
    $Occupation = $1; # should store the next matching occupation with each successive loop
    print("\"$FirstName\",\"$Occupation\"\n");
}