Notes

  • Turkish has evidentiality; see Evidentiality - Wikipedia
    • İki parça et ve mangal kömürü çalarken yakalanmış. (“[Reportedly,] he was caught while stealing two pieces of meat and barbecue charcoal”; the -mış suffix signals hearsay or inference rather than something the speaker witnessed.)
  • Verb tables don’t really make sense; see, for example: Turkish verb ‘selamlamak’ conjugated
    • this could be improved by baking the compositionality into the UI (see the sketch after this list)
    • Etymology: Inherited from Ottoman Turkish سلاملامق (selamlamaḳ, “to salute”), from سلام (selam), from Arabic سَلَام (salām), verbal noun of سَلِمَ (salima, “to be safe, to be well”), morphologically selam + -la + -mak. See: Arabic ‘سَلَام’.
    • very nice feature of Verbix
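
A minimal sketch of what “baking the compositionality into the UI” could look like (my own illustration, not anything Verbix or Wiktionary actually does): show each conjugated form next to its morpheme breakdown instead of as an opaque table cell. The suffixes below are hard-coded for the vowel-final, back-vowel stem selamla- in the simple past; a real implementation would need vowel harmony and consonant assimilation rules.

PAST_SUFFIXES = {
    "ben": "dım", "sen": "dın", "o": "dı",
    "biz": "dık", "siz": "dınız", "onlar": "dılar",
}

def conjugation_rows(stem):
    """Yield (pronoun, surface form, morpheme breakdown) rows for the simple past."""
    for pronoun, suffix in PAST_SUFFIXES.items():
        yield pronoun, stem + suffix, f"{stem}- + -{suffix}"

for pronoun, form, pieces in conjugation_rows("selamla"):
    print(f"{pronoun:>6}  {form:<13} {pieces}")
# e.g.  ben  selamladım   selamla- + -dım  ("I greeted / saluted")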

Interesting

Tokenizers

I tried out a couple of different tokenizers because tokenization is a particularly interesting topic for Turkish given its rich inflectional morphology: the language is highly agglutinative, so a single word like evlerinizden (“from your houses”) decomposes into ev + ler + iniz + den.

  1. mertcobanov/turkish-wordpiece-tokenizer
  2. ctoraman/hate-speech-berturk - the tokenizer from this Turkish BERT model for hate speech detection

Side-by-side comparison

>>> word = "Bilgi sınırsızdır, sonsuz gibi bir şey."
>>> from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer
>>> word_tokenizer = WordTokenizer()
>>> word_tokenizer.tokenize(word)
('Bilgi', 'sınırsızdır', ',', 'sonsuz', 'gibi', 'bir', 'şey', '.')
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("ctoraman/hate-speech-berturk")
>>> tokenizer.tokenize(word)
['bilgi', 'sınırsız', '##dır', ',', 'sonsuz', 'gibi', 'bir', 'sey', '.']

I’m not sure this was a fair comparison: the WordTokenizer class from trtokenizer.tr_tokenizer probably doesn’t aim to break words down into subwords at all; it is, after all, called a word tokenizer, and the output above looks like it simply splits on whitespace and punctuation.
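
A quick, hedged way to check this (reusing the word_tokenizer and tokenizer objects from the session above; the test word and the expected behaviours in the comments are my own assumptions) is to feed both a single, heavily inflected word and see whether either splits below the word level:

long_word = "evlerinizden"  # ev + ler + iniz + den, "from your houses"
print(word_tokenizer.tokenize(long_word))  # trtokenizer: presumably kept as one token
print(tokenizer.tokenize(long_word))       # BERTurk WordPiece: presumably several ##-prefixed pieces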

Some more nice examples using ctoraman/hate-speech-berturk

>>> tokenizer.tokenize("Ben topa vurdum")
['ben', 'topa', 'vurdu', '##m']
>>> tokenizer.tokenize("soğudum")
['so', '##gu', '##dum']
>>> tokenizer.tokenize("gökyĂŒzĂŒ")
['go', '##ky', '##uzu']
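
These last outputs suggest the BERTurk tokenizer lowercases and strips the Turkish-specific characters (ğ → g, ü → u, ş → s). A hedged sketch of how to probe that: BertTokenizerFast accepts do_lower_case and strip_accents arguments, so reloading with them disabled should keep the original characters, assuming ö/ü/ğ actually made it into the WordPiece vocabulary.

from transformers import AutoTokenizer

# Reload without lowercasing/accent stripping; whether this gives nicer splits
# depends on whether the vocab contains the Turkish-specific characters at all.
raw_tokenizer = AutoTokenizer.from_pretrained(
    "ctoraman/hate-speech-berturk",
    do_lower_case=False,
    strip_accents=False,
)
print(raw_tokenizer.tokenize("gökyüzü"))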