splitting a long string into a list


ronrsr

I have a single long string - I'd like to split it into a list of
unique keywords. Sadly, the database wasn't designed to do this, so I
must do this in Python - I'm having some trouble using the .split()
function, it doesn't seem to do what I want it to - any ideas?

thanks very much for your help.

r-sr-


longstring = 'Agricultural subsidies; Foreign aidAgriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
Childhood Development, Birth Defects; Toxic ChemicalsAntibiotics,
AnimalsAgricultural Subsidies, Global TradeAgricultural
SubsidiesBiodiversityCitizen ActivismCommunity
GardensCooperativesDietingAgriculture, CottonAgriculture, Global
TradePesticides, MonsantoAgriculture, SeedCoffee, HungerPollution,
Water, FeedlotsFood PricesAgriculture, WorkersAnimal Feed, Corn,
PesticidesAquacultureChemical
WarfareCompostDebtConsumerismFearPesticides, US, Childhood Development,
Birth DefectsCorporate Reform, Personhood (Dem. Book)Corporate Reform,
Personhood, Farming (Dem. Book)Crime Rates, Legislation,
EducationDebt, Credit CardsDemocracyPopulation, WorldIncomeDemocracy,
Corporate Personhood, Porter Township (Dem. Book)Disaster
ReliefDwellings, SlumsEconomics, MexicoEconomy, LocalEducation,
ProtestsEndangered Habitat, RainforestEndangered SpeciesEndangered
Species, Extinctionantibiotics, livestockAgricultural subsidies;
Foreign aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
Toxic Chemicals;Antibiotics, Animals;Agricultural Subsidies, Global
Trade;Agricultural Subsidies;Biodiversity;Citizen Activism;Community
Gardens;Cooperatives;Dieting;Agriculture, Cotton;Agriculture, Global
Trade;Pesticides, Monsanto;Agriculture, Seed;Coffee, Hunger;Pollution,
Water, Feedlots;Food Prices;Agriculture, Workers;Animal Feed, Corn,
Pesticides;Aquaculture;Chemical
Warfare;Compost;Debt;Consumerism;Fear;Pesticides, US, Childhood
Development, Birth Defects;Corporate Reform, Personhood (Dem.
Book);Corporate Reform, Personhood, Farming (Dem. Book);Crime Rates,
Legislation, Education;Debt, Credit Cards;Democracy;Population,
World;Income;Democracy, Corporate Personhood, Porter Township (Dem.
Book);Disaster Relief;Dwellings, Slums;Economics, Mexico;Economy,
Local;Education, Protests;Endangered Habitat, Rainforest;Endangered
Species;Endangered Species, Extinction;antibiotics,
livestock;Pesticides, Water;Environment, Environmentalist;Food, Hunger,
Agriculture, Aid, World, Development;Agriculture, Cotton
Trade;Agriculture, Cotton, Africa;Environment, Energy;Fair Trade (Dem.
Book);Farmland, Sprawl;Fast Food, Globalization, Mapping;depression,
mental illness, mood disorders;Economic Democracy, Corporate
Personhood;Brazil, citizen activism, hope, inspiration, labor
issues;citizen activism, advice, hope;Pharmaceuticals, Medicine,
Drugs;Community Investing;Environment, Consumer Waste Reduction,
Consumer Behavior and Taxes;Hunger, US, Poverty;FERTILITY,
Women;Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development, Birth Defects; Toxic Chemicals;Antibiotics,
Animals;Agricultural Subsidies, Global Trade;Agricultural
Subsidies;Biodiversity;Citizen Activism;Community
Gardens;Cooperatives;Dieting;Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
Toxic Chemicals;Antibiotics, Animals;Agricultural Subsidies, Global
Trade;Agricultural Subsidies;Biodiversity;Citizen Activism;Community
Gardens;Cooperatives;Dieting;Agriculture, Cotton;Agriculture, Global
Trade;Pesticides, Monsanto;Agriculture, Seed;Coffee, Hunger;Pollution,
Water, Feedlots;Food Prices;Agriculture, Workers;Animal Feed, Corn,
Pesticides;Aquaculture;Chemical
Warfare;Compost;Debt;Consumerism;Fear;Pesticides, US, Childhood
Development, Birth Defects;Corporate Reform, Personhood (Dem.
Book);Corporate Reform, Personhood, Farming (Dem. Book);Crime Rates,
Legislation, Education;Debt, Credit Cards;'
 

Robert Kern

ronrsr said:
I have a single long string - I'd like to split it into a list of
unique keywords. Sadly, the database wasn't designed to do this, so I
must do this in Python - I'm having some trouble using the .split()
function, it doesn't seem to do what I want it to - any ideas?

Did you follow the recommendations given to you the last time you asked this
question? What did you try? What results do you want to get?

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

ronrsr

still having a heckuva time with this.

here's where it stands - the split function doesn't seem to work the way
I expect it to.


longkw1,type(longkw): Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
<type 'list'> 1

longkw.replace(',',';')

Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development


kw = longkw.split("; ,") #kw is now a list of len 1

kw,typekw= ['Agricultural subsidies; Foreign aid;Agriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
Childhood Development, Birth Defects; Toxic Chemicals;Antibiotics,
Animals;Agricultural Subsidies


what I would like is to break the string into a list of the delimited
words, but have had no luck doing that - I thought split would do that,
but it doesn't.

bests,

-rsr-
 

Tim Roberts

ronrsr said:
I have a single long string - I'd like to split it into a list of
unique keywords. Sadly, the database wasn't designed to do this, so I
must do this in Python - I'm having some trouble using the .split()
function, it doesn't seem to do what I want it to - any ideas?

thanks very much for your help.

r-sr-


longstring = 'Agricultural subsidies; Foreign aidAgriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
Childhood Development, Birth Defects; Toxic ChemicalsAntibiotics,
AnimalsAgricultural Subsidies, Global TradeAgricultural
SubsidiesBiodiversityCitizen ActivismCommunity...

What do you want out of this? It looks like there are several levels
crammed together here. At first blush, it looks like topics separated by
"; ", so this should get you started:

topics = longstring.split("; ")
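The posted data mixes "; " and ";" as separators, so one possible variation is to split on the bare semicolon and strip each piece afterwards. A minimal sketch, assuming Python 3 and a shortened stand-in for the real string (the names `sample`, `loose` and `topics` here are illustrative only):

```python
# Shortened stand-in for the poster's data; note the inconsistent
# spacing after ';' and the trailing ';' at the end.
sample = "Agricultural subsidies; Foreign aid;Agriculture; Organic Agriculture;"

# Splitting on "; " misses the separators that have no trailing space.
loose = sample.split("; ")

# Splitting on ";" alone catches every separator; strip() removes the
# leftover spaces and the filter drops the empty piece after the final ';'.
topics = [piece.strip() for piece in sample.split(";") if piece.strip()]

print(loose)   # some pieces are still glued together by ';'
print(topics)  # one clean entry per topic
```

Whether the commas inside a topic should also count as separators is a separate question that only the poster can answer.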
 

John Machin

ronrsr said:
I have a single long string - I'd like to split it into a list of
unique keywords. Sadly, the database wasn't designed to do this, so I
must do this in Python - I'm having some trouble using the .split()
function, it doesn't seem to do what I want it to - any ideas?

thanks very much for your help.

r-sr-


longstring = 'Agricultural subsidies; Foreign aidAgriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
[snip most of VERY long string]
Book);Corporate Reform, Personhood, Farming (Dem. Book);Crime Rates,
Legislation, Education;Debt, Credit Cards;'


Hi ronster,

As far as I recall, without digging in the archives:

We would probably agree (if shown the schema) that the database wasn't
designed. However it seems to have changed. Last time you asked, it
was at least queryable and producing rows, each containing one column
(a string of structure unknown to us and not divulged by you). You were
given extensive advice: how to use split(), plus some questions to
answer about the data e.g. the significance (if any) of semicolon
versus comma. You were also asked about the SQL that was used. You were
asked to explain what you meant by "keywords". All of those questions
were asked so that we could understand your problem, and help you.
Since then, nothing.

Now you have what appears to be something like your previous results
stripped of newlines and smashed together (are the newlines of no
significance at all?), and you appear to be presenting it as a new
problem.

What's going on?

Regards,
John
 

Peter Otten

ronrsr said:
still having a heckuva time with this.

You don't seem to get it.

here's where it stands - the split function doesn't seem to work the way
I expect it to.


longkw1,type(longkw): Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
<type 'list'> 1

longkw.replace(',',';')
'eat, drink; man, woman'

Aha, Python doesn't replace in place, it creates a new string instead.
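A minimal sketch of that behaviour, assuming Python 3 and an illustrative string named `sample`:

```python
sample = "eat, drink; man, woman"

# str.replace() returns a NEW string; it does not touch the original.
replaced = sample.replace(",", ";")

print(replaced)  # eat; drink; man; woman
print(sample)    # eat, drink; man, woman  (unchanged)
```

To keep the result, it has to be bound to a name, for example `sample = sample.replace(",", ";")`.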
Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development


kw = longkw.split("; ,") #kw is now a list of len 1

>>> sample = "eat+-drink+man-woman"
>>> sample.split("+-")
['eat', 'drink+man-woman']
>>> sample.split("+")
['eat', '-drink', 'man-woman']

Aha, Python interprets the complete split() argument as the delimiter, not
each of its characters.

Do you think you can combine these two findings to make your code work? You
will have to replace() first and then split().
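One way the two findings might be combined, sketched for Python 3 on a shortened stand-in for the real data (the names `normalized` and `kw`, and the choice to treat commas and semicolons as equivalent separators, are assumptions rather than something the thread settled):

```python
# Shortened stand-in for the poster's string.
longkw = "Agricultural subsidies; Foreign aid;Agriculture; Pesticides, US, Birth Defects;"

# Step 1: replace() returns a new string, so keep the result.
normalized = longkw.replace(",", ";")

# Step 2: split on the single delimiter that is left, then tidy up
# surrounding spaces and drop the empty piece after the trailing ';'.
kw = [part.strip() for part in normalized.split(";") if part.strip()]

print(kw)
# ['Agricultural subsidies', 'Foreign aid', 'Agriculture',
#  'Pesticides', 'US', 'Birth Defects']
```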

Peter
 

Cameron Walsh

ronrsr said:
still having a heckuva time with this.

here's where it stands - the split function doesn't seem to work the way
I expect it to.


longkw1,type(longkw): Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
<type 'list'> 1

longkw.replace(',',';')

Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development

Here you have discovered that string.replace() returns a string and does
NOT modify the original string. Try this for clarification:
'DAWWIJFWA;;;;;;dwadw;djwkajdw'





kw = longkw.split("; ,") #kw is now a list of len 1

Yes, because it is trying to split longkw wherever it finds the whole
string "; ," and NOT wherever it finds ";" or " " or ",". This has been
stated before by NickV, Duncan Booth, Fredrik Lundh and Paul McGuire
amongst others. You will need to do either:

a.)

# First split on every semicolon
a = longkw.split(";")
b = []
# Then split those results on whitespace
# (the default action for string.split())
for item in a:
    # extend (not append) keeps b a flat list of strings
    b.extend(item.split())
# Then split on commas
kw = []
for item in b:
    kw.extend(item.split(","))

or b.)

# First replace commas with spaces
longkw = longkw.replace(",", " ")
# Then replace semicolons with spaces
longkw = longkw.replace(";", " ")
# Then split on white space, (default args)
kw = longkw.split()


Note that we did:
longkw = longkw.replace(",", " ")
and not just:
longkw.replace(",", " ")


You will find that method A may give empty strings as some elements of
kw. If so, use method b.
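Since the original request was for a list of unique keywords, here is a hedged sketch of one further option: splitting on either delimiter in a single pass with `re.split` and then removing duplicates. It assumes Python 3.7+ (for ordered dicts) and assumes that a "keyword" is whatever sits between semicolons or commas, so multi-word entries such as "Birth Defects" stay together, unlike method b. The names `pieces`, `cleaned` and `unique_keywords` are illustrative.

```python
import re

# Shortened stand-in for the poster's string, with one repeated keyword.
longkw = "Pesticides, US, Birth Defects; Toxic Chemicals;Antibiotics, Animals;Pesticides, Water;"

# A character class splits on EITHER ';' or ',', unlike split("; ,"),
# which looks for that whole three-character string.
pieces = re.split(r"[;,]", longkw)

# Strip surrounding whitespace and drop empty pieces.
cleaned = [p.strip() for p in pieces if p.strip()]

# dict.fromkeys keeps the first occurrence of each keyword, in order.
unique_keywords = list(dict.fromkeys(cleaned))

print(unique_keywords)
# ['Pesticides', 'US', 'Birth Defects', 'Toxic Chemicals',
#  'Antibiotics', 'Animals', 'Water']
```

Note that this comparison is case-sensitive, so "antibiotics" and "Antibiotics" would count as different keywords unless the pieces are normalised first.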


Finally, if you have further problems, please please do the following:

1.) Provide your input data clearly, exactly as you have it.
2.) Show exactly what you want the output to be, including any special
cases.
3.) If something doesn't work the way you expect it to, tell us how you
expect it to work so we know what you mean by "doesn't work how I expect
it to"
4.) Read all the replies carefully and if you don't understand the
reply, ask for clarification.
5.) Read the help functions carefully - what the input parameters have
to be and what the return value will be, and whether or not it changes
the parameters or original object. Python strings are NOT mutable, so
any functions that operate on strings return the result as a new
string and leave the original string intact.

I really hope this helps,

Cameron.
 

Frederic Rentsch

ronrsr said:
still having a heckuva time with this.

here's where it stands - the split function doesn't seem to work the way
I expect it to.


longkw1,type(longkw): Agricultural subsidies; Foreign
aid;Agriculture; Sustainable Agriculture - Support; Organic
Agriculture; Pesticides, US, Childhood Development, Birth Defects;
<type 'list'> 1

longkw.replace(',',';')

Agricultural subsidies; Foreign aid;Agriculture; Sustainable
Agriculture - Support; Organic Agriculture; Pesticides, US, Childhood
Development


kw = longkw.split("; ,") #kw is now a list of len 1

kw,typekw= ['Agricultural subsidies; Foreign aid;Agriculture;
Sustainable Agriculture - Support; Organic Agriculture; Pesticides, US,
Childhood Development, Birth Defects; Toxic Chemicals;Antibiotics,
Animals;Agricultural Subsidies


what I would like is to break the string into a list of the delimited
words, but have had no luck doing that - I thought split would do that,
but it doesn't.

bests,

-rsr-



Agricultural subsidies
Foreign aidAgriculture
Sustainable Agriculture - Support
Organic Agriculture

.... etc.

To get rid of the leading space on some lines simply add
corresponding replacements. SE does any number of substitutions in one
pass. Defining them is a simple matter of writing them up in one single
string from which the translator object is made:

Agricultural subsidies
Foreign aidAgriculture
Sustainable Agriculture - Support
Organic Agriculture


Regards

Frederic
 
