- Turkish Lessons for Beginners - Learn Turkish Grammar [FREE COURSE]
- Turkishle:
- Turkish Grammar
- Turkish Grammar Videos
Notes
- Turkish has evidentiality; Evidentiality - Wikipedia
- İki parça et ve mangal kömürü çalarken yakalanmış. ("He was reportedly caught stealing two pieces of meat and barbecue charcoal." The -mış in yakalanmış marks hearsay or inference rather than direct observation.)
- Verb tables don't really make sense; see for example: Turkish verb "selamlamak" conjugated
- this could be improved by baking the compositionality into the UI (see the sketch after this list)
- Etymology: Inherited from Ottoman Turkish سلاملامق (selamlamaḳ, "to salute"), from سلام (selam), from Arabic سَلَام (salām), verbal noun of سَلِمَ (salima, "to be safe, to be well"); morphologically selam + -la + -mak. See: Arabic سَلَام.
- very nice feature of Verbix
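To make the compositionality point concrete, here is a toy sketch of my own (not how Verbix works) that assembles Turkish past-tense forms from a stem plus suffixes, using four-way vowel harmony:

VOWELS = "aıoueiöü"

# Four-way harmony: the past-tense vowel depends on the stem's last vowel.
HARMONY = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
           "e": "i", "i": "i", "ö": "ü", "ü": "ü"}

def last_vowel(stem):
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {stem!r}")

def past_tense(stem, person=""):
    # -dı/-di/-du/-dü; consonant assimilation (-tı after voiceless) omitted
    return stem + "d" + HARMONY[last_vowel(stem)] + person

print(past_tense("selamla", "m"))  # selamladım, "I greeted"
print(past_tense("gel", "m"))      # geldim, "I came"

A UI built on rules like these could display each form as stem + suffix chain instead of a flat table.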
Interesting
- Impact of Tokenization on Language Models: An Analysis for Turkish (annotations therein)
Tokenizers
I tried out a couple of different tokenizers, because tokenization is a particularly interesting topic for Turkish given its rich inflectional morphology: the language is highly agglutinative, so a single word like evlerinizden (ev + ler + iniz + den, "from your houses") stacks several morphemes.
- mertcobanov/turkish-wordpiece-tokenizer
- ctoraman/hate-speech-berturk - the tokenizer from this Turkish BERT model for hate speech detection
- this is from Cagri Toraman, who coauthored Impact of Tokenization on Language Models: An Analysis for Turkish
Side-by-side comparison
>>> word = "Bilgi sınırsızdır, sonsuz gibi bir şey."  # "Knowledge is limitless, something like infinity."
>>> from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer
>>> word_tokenizer = WordTokenizer()
>>> word_tokenizer.tokenize(word)
('Bilgi', 'sınırsızdır', ',', 'sonsuz', 'gibi', 'bir', 'şey', '.')
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("ctoraman/hate-speech-berturk")
>>> tokenizer.tokenize(word)
['bilgi', 'sınırsız', '##dır', ',', 'sonsuz', 'gibi', 'bir', 'sey', '.']
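The ## prefix marks WordPiece continuation pieces. Hugging Face tokenizers can undo the split with convert_tokens_to_string; the output below is what I'd expect given the pieces above, not a fresh run:
>>> tokenizer.convert_tokens_to_string(tokenizer.tokenize(word))
'bilgi sınırsızdır , sonsuz gibi bir sey .'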
I'm not sure this was a fair comparison: I used the WordTokenizer class from trtokenizer.tr_tokenizer, and it doesn't seem to aim at breaking words into subwords; as the name suggests, it tokenizes at the word level. A fairer side-by-side would feed its word-level output through the subword tokenizer, as sketched below.
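Assuming the word_tokenizer and tokenizer objects from the transcript above, and inferring the splits from the sentence-level run:
>>> [tokenizer.tokenize(w) for w in word_tokenizer.tokenize(word)]
[['bilgi'], ['sınırsız', '##dır'], [','], ['sonsuz'], ['gibi'], ['bir'], ['sey'], ['.']]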
Some more nice examples using ctoraman/hate-speech-berturk
>>> tokenizer.tokenize("Ben topa vurdum")  # "I hit the ball"
['ben', 'topa', 'vurdu', '##m']
>>> tokenizer.tokenize("soğudum")  # "I got cold"
['so', '##gu', '##dum']
>>> tokenizer.tokenize("gökyüzü")  # "sky"
['go', '##ky', '##uzu']
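The stripped diacritics in the last two outputs come from the checkpoint's BERT-style normalizer, which lowercases and strips accents before WordPiece runs. Assuming the checkpoint loads as a fast tokenizer (so backend_tokenizer is available), the normalization step can be inspected in isolation; the output shown is what accent stripping should produce:
>>> tokenizer.backend_tokenizer.normalizer.normalize_str("gökyüzü")
'gokyuzu'
Note that ı survives normalization (as in 'sınırsız' earlier) because it is a base letter, not a letter plus a combining accent.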