Byte Offsets of Tokens, Ngrams and Sentences?

Muhammad Adeel

Hi,

Does anyone know how to tokenize a string in Python so that it returns
the tokens along with their byte offsets? I also need a sentence
splitter that returns sentences with their byte offsets and, finally,
n-grams with their byte offsets.

Input:
This is a string.

Output:
This 0
is 5
a 8
string. 10


Thanks
 
Gabriel Genellina

Like this?

py> import re
py> s = "This is a string."
py> for g in re.finditer(r"\S+", s):
...     print(g.group(), g.start())
...
This 0
is 5
a 8
string. 10
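
Note that g.start() is a character index, which equals the byte offset only for pure-ASCII text. If the input may contain non-ASCII characters, one possible sketch is to encode the prefix before each match and count its bytes. The function name tokens_with_byte_offsets and the UTF-8 default are assumptions made for illustration, not something from the original post.

```python
import re

def tokens_with_byte_offsets(text, encoding="utf-8"):
    """Yield (token, byte_offset) pairs for whitespace-separated tokens.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    for m in re.finditer(r"\S+", text):
        # m.start() counts characters; encoding the prefix counts bytes.
        yield m.group(), len(text[:m.start()].encode(encoding))

for token, offset in tokens_with_byte_offsets("This is a string."):
    print(token, offset)
```

Encoding the prefix for every token is quadratic in the worst case, so for long inputs it may be better to keep a running byte count instead.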
 
Muhammad Adeel

Hi,

Thanks. Can you please tell me how to do this for n-grams and
sentences as well?
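
One way the same re.finditer idea might be extended to n-grams and sentences is sketched below. The function names are hypothetical, the offsets are character offsets (identical to byte offsets only for ASCII input), and the sentence splitter is deliberately naive: it treats any run ending in ".", "!" or "?" as a sentence, so abbreviations such as "Dr." will be split incorrectly.

```python
import re

def ngrams_with_offsets(text, n):
    """Return (ngram, start_offset) for every run of n consecutive tokens."""
    tokens = list(re.finditer(r"\S+", text))
    result = []
    for i in range(len(tokens) - n + 1):
        start = tokens[i].start()
        end = tokens[i + n - 1].end()
        # Slice the original text so the spacing between tokens is preserved.
        result.append((text[start:end], start))
    return result

def sentences_with_offsets(text):
    """Return (sentence, start_offset) using a naive punctuation rule."""
    pattern = r"\S[^.!?]*[.!?]*"
    # rstrip() drops trailing whitespace without changing the start offset.
    return [(m.group().rstrip(), m.start()) for m in re.finditer(pattern, text)]

print(ngrams_with_offsets("This is a string.", 2))
print(sentences_with_offsets("This is a string. Here is another one!"))
```

For real text, a trained sentence tokenizer is likely to be more reliable than this regular expression.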
 
