python - Strange behaviour with nltk sentence tokenizer and special characters -
I get some strange behaviour when using the NLTK sentence tokenizer for German text.
Example code:

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität. Tolles Teil."):
    print sent
This fails with the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
while:

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität."):
    print sent

works perfectly.
Answer:
Caution: make sure you pass a Unicode string to the tokenizer. It may be necessary to decode the string first, for example with s.decode("utf8").
text = "Super Qualität. Tolles Teil."
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize(text.decode('utf8')):
    print sent
works like magic.
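The cause can be reproduced without NLTK at all: Punkt mixes the input with Unicode strings internally, so passing a plain byte string forces Python 2's implicit ASCII decode, which fails on the first byte of "ä" (0xc3). A minimal sketch of that failure mode, runnable on Python 3 where the encode step makes the byte string explicit (the sample string is taken from the question; the exact byte position depends on the input):

```python
# A UTF-8 byte string, as the question's Python 2 code effectively passed.
raw = u"Super Qualität. Tolles Teil.".encode("utf8")

try:
    # This is what Python 2 attempted implicitly when Punkt combined
    # the byte string with its internal unicode data.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("fails as in the traceback:", exc)

# The fix from the answer: decode explicitly, then tokenize the
# resulting unicode string.
text = raw.decode("utf8")
print(type(text).__name__)
```

The same reasoning explains why NLTK 3 on Python 3 does not need the explicit decode: str literals there are already Unicode.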