Title: Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Authors: Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max LĂŒbbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla, Pavel Denisov, Nicolo' Brandizzi, Qasid Saleem, Anirban Bhowmick, Lennard Helmer, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Alex Jude, Lalith Manjunath, Samuel Weinbach, Carolin Penke, Oleg Filatov, Shima Asaadi, Fabio Barth, Rafet Sifa, Fabian KĂŒch, Andreas Herten, RenĂ© JĂ€kel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr
Published: 30th September 2024 (Monday) @ 16:05:38
Link: http://arxiv.org/abs/2410.03730v2
Abstract
We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
Basically contemporary with "EuroLLM: Multilingual Language Models for Europe":
Unlike the previously mentioned efforts, we specifically address 24 official European languages and focus on ensuring that a large fraction of the training data is composed of non-English data, representing a major step towards European LLMs. Concurrent to our work, EuroLLM (Martins et al., 2024), a 1.7B decoder-only LLM that follows the same spirit as our undertaking by addressing all 24 European languages, and Salamandra, covering 32 languages, have been presented.
Submitted: 30 Sep 2024 (v1), roughly concurrent with EuroLLM's submission.
In multilingual natural language processing (NLP), it is crucial to train balanced multilingual tokenizers (Petrov et al., 2023; Ali et al., 2024) to avoid increased training and inference costs, as well as higher latency, for non-English queries. A balanced tokenizer also keeps text from being fragmented so heavily that long-range dependencies no longer fit within the limited context window (Vaswani et al., 2017). Therefore, we developed a custom multilingual tokenizer, closely following Ali et al. (2024), that is optimized for all 24 official European languages. It aims to reduce excessive text fragmentation, a phenomenon termed high "fertility", which refers to the average number of tokens generated per word.
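As an illustration of this kind of tokenizer work, here is a minimal sketch of training a byte-level BPE tokenizer on multilingual text with the Hugging Face tokenizers library; the corpus paths, vocabulary size, and special tokens are placeholder assumptions and not the paper's actual settings (the paper states only that it closely follows Ali et al., 2024).

```python
# Minimal sketch: training a byte-level BPE tokenizer on multilingual text
# with the Hugging Face `tokenizers` library. File paths, vocab size, and
# special tokens are illustrative placeholders, not the paper's settings.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=250_000,                       # placeholder: large multilingual vocab
    special_tokens=["<s>", "</s>", "<pad>"],  # placeholder special tokens
    show_progress=True,
)

# One plain-text corpus file per language (placeholder paths and language codes).
corpus_files = [f"corpus/{lang}.txt" for lang in ["de", "fr", "it", "es", "pl", "en"]]
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("multilingual-bpe.json")
```

Balancing the amount of text drawn from each language in such a training corpus is what keeps per-language fertility comparable.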
Fertility (F) is defined as the ratio of the total number of tokens (T) to the total number of words (W) in a text:

F = T / W
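A small sketch of how fertility could be measured for a given tokenizer follows; whitespace splitting is used as a simple stand-in for word counting, since the exact word-segmentation procedure is not specified in this excerpt.

```python
# Sketch: computing tokenizer fertility F = T / W over a list of texts.
# Whitespace splitting is an assumption standing in for proper word segmentation.
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, texts: list[str]) -> float:
    total_tokens = 0  # T: total number of tokens produced
    total_words = 0   # W: total number of words in the texts
    for text in texts:
        total_tokens += len(tokenizer.encode(text).ids)
        total_words += len(text.split())
    return total_tokens / total_words

# Placeholder tokenizer file and example sentences.
tok = Tokenizer.from_file("multilingual-bpe.json")
print(fertility(tok, ["Das ist ein Beispielsatz.", "Ceci est une phrase d'exemple."]))
```

Lower values mean less fragmentation; comparing this number across the 24 languages is one way to judge how balanced a tokenizer is.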
Our model is a 7B-parameter transformer-based decoder-only model; Table 7 provides a complete overview of its architecture. Our architectural choices are derived from internal ablation studies and findings from related work. The models have a sequence length of 4096 tokens and employ rotary positional embeddings (Su et al., 2024), which are also used to train state-of-the-art models (Dubey et al., 2024). To accelerate inference and reduce memory requirements, we employ grouped-query attention (Ainslie et al., 2023).
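To make these choices concrete, here is a hedged sketch of a comparable decoder-only configuration using Hugging Face transformers' LlamaConfig, which supports rotary embeddings and grouped-query attention. Only the 4096-token sequence length comes from the text above; the vocabulary size, hidden size, layer count, and head counts are illustrative placeholders, since Table 7 is not reproduced here.

```python
# Sketch of a 7B-class decoder-only configuration with RoPE and grouped-query
# attention, expressed via Hugging Face's LlamaConfig. All dimensions except
# the 4096-token context length are illustrative placeholders, NOT Table 7.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=250_000,            # placeholder: large multilingual vocabulary
    hidden_size=4096,              # placeholder, typical for ~7B models
    intermediate_size=11008,       # placeholder MLP width
    num_hidden_layers=32,          # placeholder depth
    num_attention_heads=32,        # placeholder query heads
    num_key_value_heads=8,         # grouped-query attention: fewer KV heads than query heads
    max_position_embeddings=4096,  # sequence length stated in the paper
)
print(config)
```

Setting num_key_value_heads lower than num_attention_heads is what shrinks the KV cache and speeds up inference under grouped-query attention.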
These and other design decisions were guided by medium-scale ablation runs for various training-related hyperparameters, i.e., Chinchilla-optimal (Hoffmann et al., 2022) training of a 2.6B-parameter model.
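For scale, "Chinchilla-optimal" is commonly read as roughly 20 training tokens per parameter (Hoffmann et al., 2022), which for a 2.6B-parameter ablation model would imply on the order of 20 × 2.6×10^9 ≈ 5.2×10^10, i.e., about 52B training tokens; the exact token budget used for these runs is not stated in this excerpt.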