- The pitfalls of next-token prediction - looks great; found after seeing Vaishnavh's reply in this thread
- How Context Affects Language Models' Factual Predictions
- Language Models as Knowledge Bases
- Explainability for Large Language Models A Survey
- Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback
- Information-Theoretic Probing for Linguistic Structure
- UL2 Unifying Language Learning Paradigms
- Are Sixteen Heads Really Better than One
- BERT Rediscovers the Classical NLP Pipeline
- What Does BERT Look At An Analysis of BERT's Attention
Reinforcement Learning from Human Feedback for LLMs moved to the Reinforcement Learning note
Tokenization
- SuperBPE Space Travel for Language Models
- Byte Latent Transformer Patches Scale Better Than Tokens
- CANINE Pre-training an Efficient Tokenization-Free Encoder for Language Representation
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
- Arithmetic coding for data compression
- Byte Pair Encoding is Suboptimal for Language Model Pretraining
- Subword Regularization Improving Neural Network Translation Models with Multiple Subword Candidates
- SentencePiece A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- Neural Machine Translation of Rare Words with Subword Units
- Zero-Shot Tokenizer Transfer
- Fishing for Magikarp Automatically Detecting Under-trained Tokens in Large Language Models
- SolidGoldMagikarp (plus, prompt generation) - LessWrong
- Byte-Pair Encoding tokenization - Hugging Face NLP Course didactic resource 🤗
- didactic resources from LLMs-from-scratch by Sebastian Raschka
- Byte Pair Encoding (BPE) Tokenizer From Scratch: ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
- Comparing Various Byte Pair Encoding (BPE) Implementations: ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb
- Very slow for inputs like "a" * 100000 - Issue #195 on the openai/tiktoken repo
- discusses interesting algorithmic considerations of tokenization, for example as raised in this comment about a different rank merge algorithm
- see also JTokkit - A Java tokenizer library designed for use with OpenAI models
Regex can be relevant, for example in the implementation of pretokenizers, e.g. in tiktoken; see the sketch below.
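To make concrete what the didactic resources above walk through, here is a minimal sketch of BPE training on top of the GPT-2 pretokenizer pattern (the pattern is the one published in openai/gpt-2's encoder.py; real tokenizers operate on bytes and use much faster merge algorithms, as the tiktoken issue above discusses):

```python
from collections import Counter

# GPT-2's pretokenizer pattern (from openai/gpt-2 encoder.py); the
# third-party `regex` module is needed for the \p{} character classes.
import regex

GPT2_PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def merge_word(w, pair, merged):
    """Replace every occurrence of `pair` in the symbol tuple `w`."""
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
            out.append(merged); i += 2
        else:
            out.append(w[i]); i += 1
    return tuple(out)

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE trainer: pretokenize, then greedily merge the most
    frequent adjacent symbol pair, num_merges times."""
    words = [tuple(w) for w in GPT2_PAT.findall(corpus)]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge_word(w, best, best[0] + best[1]) for w in words]
    return merges

print(train_bpe("low lower lowest low low", 3))
```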
Notable (L)LMs
- Manus (the use cases are interesting)
- Liquid Foundation Models Our First Series of Generative AI Models
- Large Language Diffusion Models
- Introducing Mercury, the first commercial-scale diffusion large language model - Inception Labs - Mercury (diffusion-based LLM)
- OmniParser for Pure Vision Based GUI Agent and OmniParser v2
- Llamas 🦙
- Introducing deep research
- DeepSeek:
- Pangea A Fully Open Multilingual Multimodal LLM for 39 Languages
- HyperCLOVA X Technical Report
- Smarter, Better, Faster, Longer A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference - ModernBERT from Jeremy Howard and co
- EuroBERT Scaling Multilingual Encoders for European Languages
- They Said It Couldn't Be Done - Pleias 1.0 models
- EuroLLM Multilingual Language Models for Europe
- Teuken-7B-Base & Teuken-7B-Instruct Towards European LLMs
- SaulLM-7B A pioneering Large Language Model for Law
- Tower An Open Multilingual Large Language Model for Translation-Related Tasks - Tower
- A Paradigm Shift in Machine Translation Boosting Translation Performance of Large Language Models - ALMA
- Contrastive Preference Optimization Pushing the Boundaries of LLM Performance in Machine Translation - ALMA-R
- Gemma Open Models Based on Gemini Research and Technology
- PaLM 2 Technical Report
- PaLM Scaling Language Modeling with Pathways
- Aya Model An Instruction Finetuned Open-Access Multilingual Language Model - skim - 118 pages
- Command R & Command R+ from Cohere - no paper per se, but there are blog posts
- Flamingo a Visual Language Model for Few-Shot Learning - Flamingo
- Efficient Training of Language Models to Fill in the Middle - Fill-in-the-Middle (FIM)
- Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism - Megatron
- Apple Intelligence Foundation Language Models - Apple Foundation Models (AFM)
- Phi models:
- Phi-4-Mini Technical Report Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
- microsoft/Phi-4-multimodal-instruct · Hugging Face - technical report is in the HF repo as a PDF… and seemingly nowhere else 🤦 - OK, they put up an arXiv paper
- Phi-4 Technical Report
- Phi-3 Technical Report A Highly Capable Language Model Locally on Your Phone
- Phi-2 The surprising power of small language models (post; no technical paper)
- Textbooks Are All You Need - Phi-1
- Alpaca A Strong, Replicable Instruction-Following Model
- Vicuna An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality - LMSYS Org
- Finetuned Language Models Are Zero-Shot Learners - FLAN
- Language Models are Few-Shot Learners - GPT-3
- Language Models are Unsupervised Multitask Learners - GPT-2
- Improving Language Understanding by Generative Pre-Training - GPT
- BART Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- RoBERTa A Robustly Optimized BERT Pretraining Approach
- XLNet Generalized Autoregressive Pretraining for Language Understanding
- SpanBERT Improving Pre-training by Representing and Predicting Spans
- BERT Pre-training of Deep Bidirectional Transformers for Language Understanding
- Pythia A Suite for Analyzing Large Language Models Across Training and Scaling
- OPT Open Pre-trained Transformer Language Models
- A Neural Probabilistic Language Model
Instruction Tuning & Supervised Fine-tuning
- Finetuned Language Models Are Zero-Shot Learners - FLAN - I think this is the seminal-ish paper on instruction tuning
- Instruction Tuning for Large Language Models A Survey
- Reflection-Tuning Data Recycling Improves LLM Instruction-Tuning
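To make the mechanics concrete: a minimal sketch of how SFT examples are typically built, using the Alpaca-style prompt template with the prompt tokens masked out of the loss. The tokenizer here is a stand-in for any HF-style tokenizer, and -100 is PyTorch's cross-entropy ignore index:

```python
# Sketch: Alpaca-style instruction formatting with prompt tokens masked
# out of the loss. `tokenizer` is a stand-in for a real tokenizer object
# exposing encode() and eos_token_id (an assumed, HF-like interface).

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(tokenizer, instruction: str, response: str):
    prompt_ids = tokenizer.encode(ALPACA_TEMPLATE.format(instruction=instruction))
    response_ids = tokenizer.encode(response) + [tokenizer.eos_token_id]
    input_ids = prompt_ids + response_ids
    # Only the response tokens contribute to the loss; -100 is ignored
    # by PyTorch's cross-entropy.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```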
Retrieval Augmented Generation (RAG)
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks - seminal
- Stanford CS25: V3 I Retrieval Augmented Language Models (lecture; December 5, 2023) - Douwe Kiela introduces the topic, surveys recent literature on retrieval-augmented language models, and finishes with some of the main open questions
- OpenScholar Synthesizing Scientific Literature with Retrieval-augmented LMs
- Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
- Into the Unknown Unknowns Engaged Human Learning through Participation in Language Model Agent Conversations
- See also Google Learn About and Deep Research
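For reference, the core RAG loop is tiny. A minimal sketch assuming stand-in `embed` and `generate` functions (everything here except numpy is a placeholder name, not any particular library's API):

```python
# Minimal RAG sketch: embed documents, retrieve top-k by cosine
# similarity, and prepend them to the prompt.
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

def rag_answer(question, docs, embed, generate, k=3):
    doc_vecs = np.stack([embed(d) for d in docs])
    top = retrieve(embed(question), doc_vecs, k)
    context = "\n\n".join(docs[i] for i in top)
    prompt = (
        f"Answer using the context below.\n\nContext:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```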
Reasoning & Adaptive Computation Time
- Reasoning best practices - OpenAI Platform - clipped on 2025-02-14 given this tweet from @OpenAIDevs
- Scaling up Test-Time Compute with Latent Reasoning A Recurrent Depth Approach
- Adaptive Computation Time for Recurrent Neural Networks
- LLM Post-Training A Deep Dive into Reasoning Large Language Models
See also the reasoning tag
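One simple way to spend extra test-time compute is self-consistency: sample several reasoning chains at temperature > 0 and majority-vote the final answers. A minimal sketch, with `sample_chain` and `extract_answer` as stand-ins for an LLM call and an answer parser:

```python
from collections import Counter

def self_consistency(prompt, sample_chain, extract_answer, n=16):
    """Sample n reasoning chains and majority-vote their final answers.
    sample_chain/extract_answer are stand-ins, not a specific API."""
    answers = [extract_answer(sample_chain(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```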
Alignment
- Statistical Rejection Sampling Improves Preference Optimization
- See papers under Reinforcement Learning
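The rejection-sampling idea in the RSO paper builds on a simple baseline worth having in mind: best-of-N sampling against a reward model. A hedged sketch (`policy_sample` and `reward` are stand-ins, and this is the plain best-of-N baseline, not the paper's actual statistical acceptance rule):

```python
def best_of_n(prompt, policy_sample, reward, n=8):
    """Sample n candidate responses from the policy and keep the one the
    reward model prefers; a simple rejection-sampling-style baseline."""
    candidates = [policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```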
Chain of Thought
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
- Language Models are Multilingual Chain-of-Thought Reasoners
- Chain-of-Thought Prompting for Speech Translation
Chain-of-Thought (CoT) prompting induces language models to perform reasoning and leverages in-context learning. CoT was introduced (AFAIK) in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, in the context of arithmetic reasoning, i.e. wordy numeracy problems like:
> Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
> A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
> Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
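For concreteness, a few-shot CoT prompt is just worked exemplar(s) like the one above concatenated in front of the new question; a minimal sketch using exactly that example:

```python
# Assembling a few-shot chain-of-thought prompt from worked Q/A pairs;
# the model is expected to continue with its own step-by-step rationale
# before the final answer.
EXEMPLARS = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.",
)]

def cot_prompt(question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nA:"

print(cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
))
```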
Agents
- BALROG Benchmarking Agentic LLM and VLM Reasoning On Games
- Evaluating Language Model Agency through Negotiations
Attacks on and Defences for (L)LMs
- Stealing Part of a Production Language Model
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- Stealing User Prompts from Mixture of Experts
- The Instruction Hierarchy Training LLMs to Prioritize Privileged Instructions
- Preserving Privacy in Large Language Models A Survey on Current Threats and Solutions - from Michele Miranda
- Defeating Prompt Injections by Design
Watermarking of (Large) Language Models
- A Watermark for Large Language Models
- Watermarks in the Sand Impossibility of Strong Watermarking for Generative Models
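The scheme in the first paper (Kirchenbauer et al.) is simple enough to sketch: seed a PRNG from the previous token, mark a fraction γ of the vocabulary "green", bias green-token logits by δ at generation time, and detect by z-testing the green-token count. A simplified sketch (the real scheme hashes with a secret key rather than seeding directly; vocab size and constants are illustrative):

```python
import numpy as np

VOCAB, GAMMA, DELTA = 50_000, 0.25, 2.0  # illustrative values

def green_ids(prev_token: int) -> np.ndarray:
    rng = np.random.default_rng(prev_token)  # keyed by previous token
    return rng.permutation(VOCAB)[: int(GAMMA * VOCAB)]

def bias_logits(logits: np.ndarray, prev_token: int) -> np.ndarray:
    biased = logits.copy()
    biased[green_ids(prev_token)] += DELTA   # "soft" watermark
    return biased

def detect(tokens: list[int]) -> float:
    """z-score of the green-token count; large z => likely watermarked."""
    hits = sum(t in set(green_ids(p)) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA))
```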
Scaling Laws
- Chinchilla's Death
- Distillation Scaling Laws
- Go smol or go home - Harm de Vries
- Scaling Laws for Neural Language Models - Kaplan scaling laws
- Training Compute-Optimal Large Language Models - Chinchilla scaling laws
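A sketch of the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β with the fitted constants reported by Hoffmann et al., plus the standard C ≈ 6ND compute approximation behind the ~20-tokens-per-parameter rule of thumb:

```python
# Fitted constants reported in the Chinchilla paper (Hoffmann et al.).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_flops(n_params: float, n_tokens: float) -> float:
    """Standard forward+backward approximation C ~= 6*N*D."""
    return 6 * n_params * n_tokens

# Rule of thumb from the paper: a compute-optimal model sees roughly
# 20 tokens per parameter, e.g. a 70B model wants ~1.4T training tokens.
print(loss(70e9, 1.4e12), f"{compute_flops(70e9, 1.4e12):.2e} FLOPs")
```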
Evaluation and Leaderboards
- NoLiMa: NoLiMa Long-Context Evaluation Beyond Literal Matching
- EnigmaEval: EnigmaEval A Benchmark of Long Multimodal Reasoning Challenges - EnigmaEval Leaderboard
- Humanity's Last Exam: Humanity's Last Exam
- GAIA: GAIA a benchmark for General AI Assistants
- Open-LLM-Leaderboard From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
- From Crowdsourced Data to High-Quality Benchmarks Arena-Hard and BenchBuilder Pipeline
- Measuring Massive Multitask Language Understanding
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots: https://lmarena.ai/?leaderboard
- Hugging Face mirror of the live leaderboard created and maintained at https://lmarena.ai/leaderboard - "Please link to the original URL for citation purposes: https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard"
- MT-Bench Browser: https://huggingface.co/spaces/lmsys/mt-bench
- MT-Bench-101 A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- INCLUDE Evaluating Multilingual Language Understanding with Regional Knowledge
- ROCStories and the Story Cloze Test - commonsense reasoning framework for evaluating story understanding. Requires a system to choose the correct ending to a four-sentence story
- Chatbot Arena LLM Leaderboard
- Language Model Comparison - disaggregates into quality, speed and cost
- A Survey on Evaluation of Large Language Models
- Connecting the Dots Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- See also WebDev Arena: web.lmarena.ai - AI Battle to build the best website!
- Open LLM Leaderboard Hugging Face - Comparing Large Language Models in an open and reproducible way
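Since several entries above lean on LLM-as-a-judge, here is a minimal pairwise-judging sketch in the spirit of MT-Bench, running both answer orderings to reduce the position bias the paper documents (`judge` is a stand-in for a strong-model call; the template is illustrative, not the paper's exact prompt):

```python
JUDGE_TEMPLATE = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A', 'B', or 'tie'.\n\n"
    "Question: {q}\n\nAnswer A: {a}\n\nAnswer B: {b}\n\nVerdict:"
)

def pairwise_judge(judge, q, ans1, ans2):
    """Judge both orderings; keep the verdict only if it is consistent,
    which mitigates the position bias reported in the MT-Bench paper."""
    v1 = judge(JUDGE_TEMPLATE.format(q=q, a=ans1, b=ans2)).strip()
    v2 = judge(JUDGE_TEMPLATE.format(q=q, a=ans2, b=ans1)).strip()
    flipped = {"A": "B", "B": "A", "tie": "tie"}.get(v2, "tie")
    return v1 if v1 == flipped else "tie"
```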
Local LLMs
- openwebui
- ollama
- lm studio - I love the UI
- TinyChat Large Language Model on the Edge
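For quick local experiments, a minimal sketch of querying a model served by ollama over its REST API (default port 11434; the model name is whatever you have pulled, e.g. via `ollama pull llama3`):

```python
import json
import urllib.request

def ask_local(prompt: str, model: str = "llama3") -> str:
    """POST to ollama's /api/generate endpoint and return the response
    text; endpoint and fields per the ollama API docs."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(
            {"model": model, "prompt": prompt, "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

print(ask_local("Why is the sky blue?"))
```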
See also
- Large Concept Models Language Modeling in a Sentence Representation Space
- How is LLaMa.cpp possible? - Thread by @karpathy
Related Notes in Obsidian
See: