Contents
- 2. Text Normalization Every NLP task needs to do text normalization: Segmenting/tokenizing words in running text Normalizing
- 3. How many words? I do uh main- mainly business data processing Fragments, filled pauses Seuss’s cat
- 4. Рыбак рыбака видит издалека. ("A fisherman spots a fisherman from afar.") Рыбак and рыбака are one lemma but different wordforms. What is the difference?
- 5. How many words? they lay back on the San Francisco grass and looked at the stars
- 6. Он не мог не ответить на это письмо. ("He could not help answering this letter.") How many word types and tokens?
- 7. How many words? N = number of tokens V = vocabulary = set of types |V|
- 8. Simple Tokenization in UNIX (Inspired by Ken Church’s UNIX for Poets.) Given a text file, output
- 9. The first step: tokenizing tr -sc 'A-Za-z' '\n' THE SONNETS by William Shakespeare From fairest creatures
- 10. The second step: sorting tr -sc 'A-Za-z' '\n' A A A A A A A A
- 11. More counting Merging upper and lower case tr 'A-Z' 'a-z' Sorting the counts tr 'A-Z' 'a-z'
- 12. Issues in Tokenization Finland’s capital → Finland? Finlands? Finland’s? what’re, I’m, isn’t → what are,
- 13. Tokenization: language issues French L'ensemble → one token or two? L? L'? Le?
- 14. What language-specific problems can arise during tokenization?
- 15. Tokenization: language issues Chinese and Japanese no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。 莎拉波娃 现在 居住 在 美国
- 16. What features of Japanese complicate text processing even further?
- 17. Word Tokenization in Chinese Also called Word Segmentation Chinese words are composed of characters Characters are
- 18. Maximum Matching Word Segmentation Algorithm Given a wordlist of Chinese and a string, start a pointer
- 19. Max-match segmentation illustration Thecatinthehat Thetabledownthere Doesn’t generally work in English! But works astonishingly well in Chinese
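The tr / sort / uniq word-counting pipeline sketched on slides 8–11 can be written out end to end. A minimal sketch; the sample file name `sonnets.txt` and its contents are assumptions standing in for the Shakespeare text on the slides:

```shell
# Tiny stand-in for the Shakespeare text used on the slides
# (file name sonnets.txt is hypothetical)
printf 'THE SONNETS by William Shakespeare\nFrom fairest creatures we desire increase\nThe riper should by time decease\n' > sonnets.txt

# Step 1: tokenize — squeeze every run of non-letters into a single
# newline, giving one word per line
tr -sc 'A-Za-z' '\n' < sonnets.txt |
# Step 2: merge upper and lower case so THE and the count as one type
tr 'A-Z' 'a-z' |
# Step 3: sort, collapse repeats with counts, most frequent types first
sort | uniq -c | sort -rn
```

On this sample, "the" and "by" each occur twice and every other type once, so they head the list.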
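The maximum-matching segmenter from slides 18–19 can be sketched in a few lines of Python. The toy wordlist below is an assumption, chosen to reproduce the Thetabledownthere failure case from slide 19:

```python
def max_match(text, wordlist):
    """Greedy maximum matching: at each position take the longest
    dictionary word that starts there; if nothing matches, emit a
    single character and advance (so the loop always terminates)."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in wordlist or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Hypothetical toy wordlist illustrating why max-match fails on English:
words = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", words))  # → ['theta', 'bled', 'own', 'there']
```

The greedy choice of "theta" over "the" derails the whole segmentation; on Chinese, where words are mostly one or two characters, the same strategy works remarkably well, which is the point of slide 19.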