đŸȘŽ Anil's Garden

        • Assembly
        • AWK
        • Bash - Notes
        • Bash - Resources
        • Bash - Snippets
        • C
        • C++
        • Carbon
        • Dart
        • Erlang
        • Go
        • Haskell
        • Java
        • JavaScript
        • Lua
        • Perl
        • Python - Best Practices
        • Python - Internals
        • Python - Notes
        • Python - Resources
        • R
        • Rust
        • Scala
        • Swift, SwiftUI and Developing for macOS
        • TOML
        • TypeScript
        • WASM (WebAssembly)
        • YAML
        • Zig
        • zsh
      • Algorithms and Data Structures
      • Arch Linux
      • Asynchronous Programming & Concurrency
      • Build Systems
      • Compilers, Interpreters and Binaries
      • Compression, Encoding, Codecs, Text Encodings and Communication
      • Computer Architecture
      • Computer Science
      • Conda
      • Copilot (GitHub Copilot)
      • cron
      • Cryptography and Cybersecurity
      • CUDA
      • Databases and Data Interchange
      • Debugging
      • Development Containers
      • DevOps and MLOps
      • Distributed Computing, Distributed and Multi-GPU Training
      • Documentation (Maintaining Docs)
      • Email and SMTP
      • Fuzzing and Fuzzers
      • Git - Notes
      • Git - Resources
      • GitHub
      • Globbing
      • Graphs
      • Hugging Face
      • Machine Learning Engineering (Implementation Best Practices)
      • Make
      • MLX
      • Networking and Computer Networks
      • Operating Systems (OS), Kernels, Linux and Unix
      • PyTorch - Functions
      • PyTorch - Notes
      • PyTorch - Resources
      • Questions
      • Regex
      • Reverse Engineering
      • Software Development
      • Software Licences (Licenses) and Licensing
      • tmux
      • Vim
      • VSCode
          • abstract
          • Blighty - Etymology, Origin & Meaning
          • renminbi
          • anaphora
          • clitic
          • evidentiality
          • realis
          • selection
          • Italian
          • Japanese
          • Latin
          • Mandarin
          • Russian
          • Turkish
        • Languages of the World
        • Linguistics
        • Linguistics Glossary
        • Phonetics vs Phonemics (Phonology)
        • Writing Systems
        • A Line in the Sand - Britain, France and the Struggle that Shaped the Middle East
        • Advertising
        • Ancient History, Classics, Classical Literature and Theology
        • Blender
        • Bluesky
        • Books
        • Bracket City, Crosswords
        • Candide
        • Chess
        • Cinema (Film; Movies) and Television (TV)
        • Codebase Visualiser
        • Coding Projects for Development
        • Commercial LLMs (inc APIs)
        • Core Dumped (channel)
        • Creative Coding
        • Creative Coding Crafts Space (C3S)
        • CS Memes and Culture
        • D3 Health Dashboard
        • Darknet Diaries
        • Data Analysis and Visualisation (Data Viz)
        • Design
        • Diabetes
        • Digital Garden
        • DIY and Construction
        • DNS Server (Domain Name System Server)
        • Dreams from My Father
        • Economics and Finance
        • Edinburgh Guide
        • Education
        • Electoral Systems
        • F1
        • Figma
        • Finance and Trading
        • Fitness
        • Flags of the World
        • Flights
        • Fonts
        • Food
        • Free Speech
        • Goodreads
        • Healthcare, Biomedical, Medicine
        • History
        • Home Server
        • Homebrew
        • Housing and Rents
        • Immich
        • Investing
        • Istanbul Guide
        • Journocoders
        • Kagi
        • Kids
        • Law and Justice
        • London Guide
        • MacBook and macOS
        • MacBook Setup Checklist
        • Mental Anchors
        • Metamorphoses - A New Play
        • Model Context Protocol
        • Music
        • Music Theory
        • Music Understanding and Analysis, and Spotify Fun
        • NotebookLM and Automated Podcasting
        • Obsidian
        • Obsidian - Installing Plugins Manually
        • Obsidian Clone or Note-taking App
        • Online Safety Act (UK)
        • OSINT
        • Overview of Company Valuation Methods
        • Palettes
        • Pareto Efficiency
        • Pegasus - How a Spy in Your Pocket Threatens the End of Privacy, Dignity, and Democracy - Laurent Richard, Sandrine Rigaud
        • Photography
        • Printing, Stamps and Heraldry
        • Privacy - Staying Secure Online
        • PyTorch's Transformer and Multi-Head Attention Implementation
        • Reading
        • Reading with a Motive vs Reading
        • Retro Tech
        • Semantic Querying of Obsidian
        • Small Web
        • Spaced Repetition Learning
        • Speech LLM-based Language Learning
        • Streaming, Twitch, YouTube, Videography
        • The Artist - Lucy Steeds
        • The Panama Papers - Breaking the Story of How the Rich and Powerful Hide Their Money
        • The Secret Barrister - Stories of the Law and How It's Broken
        • Time Tracking App - Single User, Native Swift
        • UK Law and Justice Podcast Recommendations (Perplexity)
        • UTM
        • Vibe Coding and Agents
        • Volts, Watts, Amps
        • Web Browsers
        • Web Development and Building a Website
        • Wordle-bot
        • YouTube Automated Uploader
        • Base64 Encoding
        • Bilinear Interpolation
        • ChatML
        • Connectionist Temporal Classification
        • Content Addressability
        • Cosine Similarity vs Pearson Moment Correlation Coefficient
        • Decaying Learning Rate Exponentially when Scaling Batch Size and Base Learning Rate
        • Differential Privacy in Machine Learning and Stats Lectures
        • EinOps
        • Exiting Early from Nested Functions - Case Study with Epoch and Batch-wise Training Loops
        • Expectation Maximisation Algorithm
        • Fisher Information
        • Generating from LLMs
        • Gibberlink
        • Gram Matrix and Linear Regression
        • Graphs Spectral Clustering
        • Hidden Markov Models
        • How many iterations will a training run last?
        • Kalman Filtering
        • Learning Rate Warmup
        • Multiclass vs multilabel classification
        • RSA Encryption-Decryption Identity Proof via Euler's Theorem
        • Sampling for Text Generation, Nucleus Sampling (top-$p$), the need for top-$k$ and Beam Search
        • Singular Value Decomposition
        • Typing for PyTorch
        • Vector Projection
        • Vector Quantization
        • Weight Initialisation
        • What are the differences between a digital signature, a MAC and a hash?
        • Whitening, sharpening & smoothing
        • "My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community
        • "Why Should I Trust You?": Explaining the Predictions of Any Classifier
        • $\infty$-former: Infinite Memory Transformer
        • $\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
        • $\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
        • $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
        • 2 OLMo 2 Furious
        • 100,000 Podcasts: A Spoken English Document Corpus
        • A Bayesian approach to translators' reliability assessment
        • A Bayesian Perspective on Generalization and Stochastic Gradient Descent
        • A Brief Overview of Unsupervised Neural Speech Representation Learning
        • A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
        • A Call for Clarity in Reporting BLEU Scores
        • A Causal Bayesian Networks Viewpoint on Fairness
        • A Closer Look at Few-shot Classification
        • A Closer Look at Spatiotemporal Convolutions for Action Recognition
        • A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
        • A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
        • A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
        • A Comprehensive Survey of Machine Translation Approaches
        • A Comprehensive Survey on Long Context Language Modeling
        • A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
        • A Convergence Theory for Deep Learning via Over-Parameterization
        • A Cookbook of Self-Supervised Learning
        • A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
        • A Cross-Language Perspective On Speech Information Rate
        • A Diagnostic Study of Explainability Techniques for Text Classification
        • A firm foundation for private data analysis
        • A Generalized EigenGame with Extensions to Multiview Representation Learning
        • A guide to convolution arithmetic for deep learning
        • A halo model approach for mock catalogs of time-variable strong gravitational lenses
        • A Kernel-Based View of Language Model Fine-Tuning
        • A Large-Scale Evaluation of Speech Foundation Models
        • A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
        • A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
        • A Mathematical Theory of Communication
        • A method to convert neural signals into sound sequences
        • A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops
        • A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
        • A Neural Algorithm of Artistic Style
        • A Neural Probabilistic Language Model
        • A new algorithm for data compression
        • A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
        • A practical tutorial on Variational Bayes
        • A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech
        • A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
        • A Primer on Bayesian Neural Networks: Review and Debates
        • A Primer on Causal Analysis
        • A Probabilistic Neuro-symbolic Layer for Algebraic Constraint Satisfaction
        • A Review of Deep Learning Techniques for Speech Processing
        • A Review of Sparse Expert Models in Deep Learning
        • A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
        • A Simple Framework for Contrastive Learning of Visual Representations
        • A Suite for Acoustic Language Model Evaluation
        • A Survey of Large Language Models
        • A Survey of Mamba
        • A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
        • A Survey of Visual Transformers
        • A Survey on Evaluation of Large Language Models
        • A Survey on In-context Learning
        • A Survey on Language Models for Code
        • A Survey on Large Language Models for Code Generation
        • A Survey on LLM-as-a-Judge
        • A Survey on Multimodal Large Language Models
        • A Survey on Neural Speech Synthesis
        • A Survey on Retrieval-Augmented Text Generation for Large Language Models
        • A Survey on Speech Large Language Models
        • A Survey on Subgraph Counting: Concepts, Algorithms and Applications to Network Motifs and Graphlets
        • A Tutorial on Fisher Information
        • A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
        • A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
        • A unified architecture for natural language processing: deep neural networks with multitask learning
        • A unified view of entropy-regularized Markov decision processes
        • A Universal Law of Robustness via Isoperimetry
        • A Vulnerability in Implementations of SHA-3, SHAKE, EdDSA, and Other NIST-Approved Algorithms
        • A Watermark for Large Language Models
        • Accelerating Large Language Model Decoding with Speculative Sampling
        • Accelerating t-SNE using Tree-Based Algorithms
        • Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
        • Acoustic BPE for Speech Generation with Discrete Tokens
        • Active Data Curation Effectively Distills Large-Scale Multimodal Models
        • Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
        • Adam-mini: Use Fewer Learning Rates To Gain More
        • Adam: A Method for Stochastic Optimization
        • Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
        • Adapting Language Models to Compress Contexts
        • Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference
        • Adaptive Computation Time for Recurrent Neural Networks
        • Adaptive deconvolutional networks for mid and high level feature learning
        • Adaptive Machine Translation with Large Language Models
        • Adaptive Prototype Learning and Allocation for Few-Shot Segmentation
        • Adaptive Retrieval-Augmented Generation for Conversational Systems
        • Adaptive Semiparametric Language Models
        • Adaptively Sparse Transformers
        • AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
        • AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
        • AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
        • AdaSpeech: Adaptive Text to Speech for Custom Voice
        • AdaSplash: Adaptive Sparse Flash Attention
        • AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
        • Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
        • Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
        • Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize
        • Adversarial Attacks and Defences: A Survey
        • Adversarial Feature Learning
        • Adversarial NLI: A New Benchmark for Natural Language Understanding
        • AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages
        • Agent Skill Acquisition for Large Language Models via CycleQD
        • AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
        • AI and Memory Wall
        • AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
        • AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline
        • AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
        • ALBA : Reinforcement Learning for Video Object Segmentation
        • ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
        • Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists
        • Alice's Adventures in a Differentiable Wonderland -- Volume I, A Tour of the Land
        • Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
        • Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
        • AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
        • Aligning Speech to Languages to Enhance Code-switching Speech Recognition
        • Aligning to Adults Is Easy, Aligning to Children Is Hard: A Study of Linguistic Alignment in Dialogue Systems
        • Alpaca: A Strong, Replicable Instruction-Following Model
        • An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition
        • An Analysis of Energy Consumption and Carbon Footprints of Cryptocurrencies and Possible Solutions
        • An Attention Free Transformer
        • An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
        • An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training
        • An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
        • An Empirical Exploration of Curriculum Learning for Neural Machine Translation
        • An Empirical Study of Mamba-based Language Models
        • An Empirical Study of Translation Hypothesis Ensembling with Large Language Models
        • An Emulator for Fine-Tuning Large Language Models using Small Language Models
        • An End-to-End Transformer Model for 3D Object Detection
        • An engine not a camera: Measuring performative power of online search
        • An Evolved Universal Transformer Memory
        • An Explanation of In-context Learning as Implicit Bayesian Inference
        • An Exploration of Neural Sequence-to-Sequence Architectures for Automatic Post-Editing
        • An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
        • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
        • An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech
        • An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition
        • An introduction to graph theory
        • An Introduction to Variational Autoencoders
        • An Introduction to Vision-Language Modeling
        • Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
        • Analyzing Context Contributions in LLM-based Machine Translation
        • Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing
        • AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
        • Apollo: An Exploration of Video Understanding in Large Multimodal Models
        • Apple Intelligence Foundation Language Models
        • Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods
        • Architectures of Topological Deep Learning: A Survey on Topological Neural Networks
        • Are aligned neural networks adversarially aligned?
        • Are All Good Word Vector Spaces Isomorphic?
        • Are discrete units necessary for Spoken Language Modeling?
        • Are Sixteen Heads Really Better than One?
        • Are We Done with MMLU?
        • Areas of Attention for Image Captioning
        • Arithmetic coding for data compression
        • Artificial Kuramoto Oscillatory Neurons
        • ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training
        • Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
        • Associative Embedding: End-to-End Learning for Joint Detection and Grouping
        • AST: Audio Spectrogram Transformer
        • Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability
        • Attention as a Guide for Simultaneous Speech Translation
        • Attention Is All You Need
        • Attention-Based Models for Speech Recognition
        • Audio Editing with Non-Rigid Text Prompts
        • Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
        • Audio-Language Models for Audio-Centric Tasks: A survey
        • Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
        • AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
        • AudioGen: Textually Guided Audio Generation
        • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
        • AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
        • AudioLM: a Language Modeling Approach to Audio Generation
        • AudioPaLM: A Large Language Model That Can Speak and Listen
        • AudioX: Diffusion Transformer for Anything-to-Audio Generation
        • Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling
        • Augmented Language Models: a Survey
        • Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
        • Auto-Encoding Variational Bayes
        • Autoregressive Image Generation using Residual Quantization
        • AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
        • Avocodo: Generative Adversarial Network for Artifact-free Vocoder
        • Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
        • Aya Vision: Advancing the Frontier of Multilingual Multimodality
        • BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
        • Bag of Tricks for Efficient Text Classification
        • Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
        • Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis
        • BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
        • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
        • BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
        • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
        • Bayesian Learning for Neural Networks: an algorithmic survey
        • Bayesian Measures of Model Complexity and Fit
        • Benchmarking Attacks on Learning with Errors
        • BERT Learns to Teach: Knowledge Distillation with Meta Learning
        • BERT Rediscovers the Classical NLP Pipeline
        • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
        • BERTScore: Evaluating Text Generation with BERT
        • BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
        • Better & Faster Large Language Models via Multi-token Prediction
        • Better Instruction-Following Through Minimum Bayes Risk
        • Better speech synthesis through scaling
        • Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
        • Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
        • Beyond Left and Right: The Role of System Trust in COVID-19 Attitudes and Behaviors
        • Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
        • Beyond Text Compression: Evaluating Tokenizers Across Scales
        • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
        • Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
        • Big Bird: Transformers for Longer Sequences
        • Big Self-Supervised Models are Strong Semi-Supervised Learners
        • Big Transfer (BiT): General Visual Representation Learning
        • BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
        • Billion-scale semi-supervised learning for image classification
        • BLAB: Brutally Long Audio Bench
        • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
        • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
        • Blockwise Parallel Decoding for Deep Autoregressive Models
        • Boltzmann Exploration Done Right
        • Boosting Distributed Training Performance of the Unpadded BERT Model
        • Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning
        • Bootstrap your own latent: A new approach to self-supervised Learning
        • Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
        • Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
        • BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
        • Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
        • Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition
        • Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo)
        • Building Bridges between Regression, Clustering, and Classification
        • Building Machine Translation Systems for the Next Thousand Languages
        • Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings
        • BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems
        • ByT5 model for massively multilingual grapheme-to-phoneme conversion
        • Byte Latent Transformer: Patches Scale Better Than Tokens
        • Byte Pair Encoding is Suboptimal for Language Model Pretraining
        • Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
        • Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
        • Can Automatic Metrics Assess High-Quality Translations?
        • Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
        • Can language models learn from explanations in context?
        • Can Large Language Models Reason and Plan?
        • Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
        • Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks
        • Can Whisper Perform Speech-Based In-Context Learning?
        • Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
        • CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
        • Canonical Capsules: Self-Supervised Capsules in Canonical Pose
        • Careless Whisper: Speech-to-Text Hallucination Harms
        • Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?
        • CASPER: A Large Scale Spontaneous Speech Dataset
        • CAT: Content-Adaptive Image Tokenization
        • Categorical Reparameterization with Gumbel-Softmax
        • Causal inference with Bayes rule
        • Causal Reasoning for Algorithmic Fairness
        • CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
        • CDXFormer: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory
        • Cem Mil Podcasts: A Spoken Portuguese Document Corpus For Multi-modal, Multi-lingual and Multi-Dialect Information Access Research
        • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
        • Chain-of-Thought Prompting for Speech Translation
        • Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
        • Character-Aware Neural Language Models
        • Character-level Convolutional Networks for Text Classification
        • Character-Level Language Modeling with Deeper Self-Attention
        • Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
        • ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks
        • ChatMusician: Understanding and Generating Music Intrinsically with LLM
        • ChipNeMo: Domain-Adapted LLMs for Chip Design
        • CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition
        • CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
        • Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
        • Clotho: An Audio Captioning Dataset
        • CMU's IWSLT 2024 Simultaneous Speech Translation System
        • Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
        • CoCa: Contrastive Captioners are Image-Text Foundation Models
        • Cockpit: A Practical Debugging Tool for the Training of Deep Neural Networks
        • Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
        • CodeRAG-Bench: Can Retrieval Augment Code Generation?
        • CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
        • Coding Theorems for a Discrete Source With a Fidelity Criterion
        • Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner
        • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
        • CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
        • COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
        • COMET: A Neural Framework for MT Evaluation
        • CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task
        • Common Voice: A Massively-Multilingual Speech Corpus
        • CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
        • Compact Speech Translation Models via Discrete Speech Units Pretraining
        • Comparative layer-wise analysis of self-supervised speech models
        • Comparing Discrete and Continuous Space LLMs for Speech Recognition
        • Competence-based Curriculum Learning for Neural Machine Translation
        • Compositional Entailment Learning for Hyperbolic Vision-Language Models
        • Computational Optimal Transport
        • Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
        • Condita: A state machine like architecture for multimodal task bots
        • Conditional Image Generation with PixelCNN Decoders
        • Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
        • Confidence-Aware Scheduled Sampling for Neural Machine Translation
        • Confident Adaptive Language Modeling
        • Conformal Prediction for Natural Language Processing: A Survey
        • Conformer: Convolution-augmented Transformer for Speech Recognition
        • Connecting Speech Encoder and Large Language Model for ASR
        • Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
        • Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
        • ConSeC: Word Sense Disambiguation as Continuous Sense Comprehension
        • Consent in Crisis: The Rapid Decline of the AI Data Commons
        • Context Encoders: Feature Learning by Inpainting
        • Context Encoding for Semantic Segmentation
        • Context-aware Neural Machine Translation for English-Japanese Business Scene Dialogues
        • ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
        • Continuous Audio Language Models
        • Continuous Learning from Human Post-Edits for Neural Machine Translation
        • Continuous Speech Tokenizer in Text To Speech
        • Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
        • Contrastive language and vision learning of general fashion concepts
        • Contrastive Language-Image Pre-training for the Italian Language
        • Contrastive Learning with Hard Negative Samples
        • Contrastive Multiview Coding
        • Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words
        • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
        • Contrastive Representation Learning: A Framework and Review
        • Controllable Speech Representation Learning Via Voice Conversion and AIC Loss
        • Controlling Neural Networks with Rule Representations
        • ConvMLP: Hierarchical Convolutional MLPs for Vision
        • CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
        • CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving
        • CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought
        • Counterfactual Fairness
        • Counterfactual harm
        • Counterfactual Reasoning and Learning Systems
        • CoVoST 2 and Massively Multilingual Speech-to-Text Translation
        • CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
        • CroissantLLM: A Truly Bilingual French-English Language Model
        • CroMo: Cross-Modal Learning for Monocular Depth Estimation
        • Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
        • Cross-lingual Language Model Pretraining
        • Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training
        • Cross-task weakly supervised learning from instructional videos
        • Cryptanalytic Extraction of Neural Network Models
        • CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages
        • CTC-based Compression for Direct Speech Translation
        • CTCBERT: Advancing Hidden-unit BERT with CTC Objectives
        • Current Limitations of Language Models: What You Need is Retrieval
        • CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
        • Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
        • DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech
        • DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
        • DASB - Discrete Audio and Speech Benchmark
        • DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
        • DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
        • Data Augmentation Approaches in Natural Language Processing: A Survey
        • Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
        • Data Efficient Reflow for Few Step Audio Generation
        • Data Selection for Language Models via Importance Resampling
        • data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
        • Dataset Distillation: A Comprehensive Review
        • DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
        • DeBERTa: Decoding-enhanced BERT with Disentangled Attention
        • DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
        • Decoding speech perception from non-invasive brain recordings
        • Decoupled Weight Decay Regularization
        • Deep Biaffine Attention for Neural Dependency Parsing
        • Deep Clustering for Unsupervised Learning of Visual Features
        • Deep contextualized word representations
        • Deep Ensemble as a Gaussian Process Approximate Posterior
        • Deep Ensembles: A Loss Landscape Perspective
        • Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond
        • Deep Learning with Differential Privacy
        • Deep Mask Memory Network with Semantic Dependency and Context Moment for Aspect Level Sentiment Classification
        • Deep Neural Networks and Tabular Data: A Survey
        • Deep reinforcement learning from human preferences
        • Deep Residual Learning for Image Recognition
        • Deep Voice 2: Multi-Speaker Neural Text-to-Speech
        • Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
        • Deep Voice: Real-time Neural Text-to-Speech
        • DeepGaze II: Reading fixations from deep features trained on object recognition
        • DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
        • DeepMMSE: A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation
        • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
        • DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
        • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
        • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
        • DeepSeek-V3 Technical Report
        • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
        • DeepSpace: Dynamic Spatial and Source Cue Based Source Separation for Dialog Enhancement
        • Defeating Prompt Injections by Design
        • Deformable DETR: Deformable Transformers for End-to-End Object Detection
        • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
        • DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
        • DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021
        • DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
        • DEMix Layers: Disentangling Domains for Modular Language Modeling
        • Dense Associative Memory for Pattern Recognition
        • Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
        • DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
        • DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
        • Depthwise Convolution is All You Need for Learning Multiple Visual Domains
        • Describing Multimedia Content using Attention-based Encoder--Decoder Networks
        • Designing and Interpreting Probes with Control Tasks
        • DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
        • DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
        • Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement
        • DETRs with Collaborative Hybrid Assignments Training
        • DeVAn: Dense Video Annotation for Video-Language Models
        • Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
        • Did Translation Models Get More Robust Without Anyone Even Noticing?
        • Difference-Masking: Choosing What to Mask in Continued Pretraining
        • Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche
        • Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
        • Direct Preference Optimization: Your Language Model is Secretly a Reward Model
        • Direct speech-to-speech translation with a sequence-to-sequence model
        • Direct speech-to-speech translation with discrete units
        • Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
        • Discovery of Unstable Singularities
        • Discrete Audio Tokens: More Than a Survey!
        • Discrete Latent Structure in Neural Networks
        • DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
        • Disentangling Textual and Acoustic Features of Neural Speech Representations
        • Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
        • DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
        • Distillation Scaling Laws
        • Distilling the Knowledge in a Neural Network
        • Distributed Representations of Words and Phrases and their Compositionality
        • Distribution Fields for Tracking
        • Distributional term representations: an experimental comparison
        • Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization
        • DM-Codec: Distilling Multimodal Representations for Speech Tokenization
        • DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization
        • dMel: Speech Tokenization made Simple
        • DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
        • DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors
        • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
        • Do Context-Aware Translation Models Pay the Right Attention?
        • Do Multi-Sense Embeddings Improve Natural Language Understanding?
        • DOCE: Finding the Sweet Spot for Execution-Based Code Generation
        • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
        • Does Simultaneous Speech Translation need Simultaneous Models?
        • Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
        • Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation
        • Don't Decay the Learning Rate, Increase the Batch Size
        • Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation
        • Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
        • Don't Read Too Much into It: Adaptive Computation for Open-Domain Question Answering
        • DoWhy: An End-to-End Library for Causal Inference
        • DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
        • DRAW: A Recurrent Neural Network For Image Generation
        • Dropout: A Simple Way to Prevent Neural Networks from Overfitting
        • DTrOCR: Decoder-only Transformer for Optical Character Recognition
        • Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
        • Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
        • E-Branchformer: Branchformer with Enhanced merging for speech recognition
        • Ecco: An Open Source Library for the Explainability of Transformer Language Models
        • Effective Approaches to Attention-based Neural Machine Translation
        • Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform
        • Efficient Compression of Multitask Multilingual Speech Models
        • Efficient Estimation of Word Representations in Vector Space
        • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
        • Efficient Memory Management for Large Language Model Serving with PagedAttention
        • Efficient Methods for Natural Language Processing: A Survey
        • Efficient Neural Audio Synthesis
        • Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space
        • Efficient Parallel Audio Generation using Group Masked Language Modeling
        • Efficient Pre-training for Localized Instruction Generation of Videos
        • Efficient Representation Learning via Adaptive Context Pooling
        • Efficient softmax approximation for GPUs
        • Efficient Stagewise Pretraining via Progressive Subnetworks
        • Efficient Tool Use with Chain-of-Abstraction Reasoning
        • Efficient Training of Language Models to Fill in the Middle
        • Efficient Transformers: A Survey
        • Efficient Visual Pretraining with Contrastive Detection
        • Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
        • Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
        • Efficiently Programming Large Language Models using SGLang
        • Efficiently Scaling Transformer Inference
        • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
        • Elucidating the Design Space of Diffusion-Based Generative Models
        • Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric
        • Emergent and Predictable Memorization in Large Language Models
        • Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
        • Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
        • Emerging Properties in Self-Supervised Vision Transformers
        • Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
        • EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
        • EMMeTT: Efficient Multimodal Machine Translation Training
        • EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
        • Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
        • EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
        • EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
        • Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features
        • Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
        • Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
        • Encoding of speech in convolutional layers and the brain stem based on language experience
        • Encoding sound in the cochlea: from receptor potential to afferent discharge
        • End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
        • End-to-End Dense Video Captioning with Parallel Decoding
        • End-to-End Learning of Visual Representations from Uncurated Instructional Videos
        • End-to-End Object Detection with Transformers
        • End-to-End Simultaneous Speech Translation with Differentiable Segmentation
        • End-to-End Speech Recognition: A Survey
        • End-to-End Speech-to-Text Translation: A Survey
        • End-to-end Temporal Action Detection with Transformer
        • End-to-End Text-Dependent Speaker Verification
        • Energy and Policy Considerations for Deep Learning in NLP
        • Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation
        • Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
        • EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
        • Enriching Word Vectors with Subword Information
        • eP-ALM: Efficient Perceptual Augmentation of Language Models
        • Epitran: Precision G2P for Many Languages
        • Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
        • Error detecting and error correcting codes
        • ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech
        • ESPnet-SpeechLM: An Open Speech Language Model Toolkit
        • ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
        • ESPnet-ST: All-in-One Speech Translation Toolkit
        • ESPnet: End-to-End Speech Processing Toolkit
        • Estimating the Completeness of Discrete Speech Units
        • Estimating Training Data Influence by Tracing Gradient Descent
        • Estimating Worst-Case Frontier Risks of Open-Weight LLMs
        • Estimation of Non-Normalized Statistical Models by Score Matching
        • ETC: Encoding Long and Structured Inputs in Transformers
        • Euclidean Embedding of Co-occurrence Data
        • EuroBERT: Scaling Multilingual Encoders for European Languages
        • EuroLLM-9B: Technical Report
        • EuroLLM: Multilingual Language Models for Europe
        • Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
        • Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates
        • Evaluating deep learning architectures for speech emotion recognition
        • Evaluating Frontier Models for Dangerous Capabilities
        • Evaluating Language Model Agency through Negotiations
        • Evaluating language models as risk scores
        • Evaluating Large Language Models Trained on Code
        • Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
        • Evaluating the Stability of Embedding-based Word Similarities
        • Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
        • Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
        • Evasion Attacks against Machine Learning at Test Time
        • EVE: Explainable Vector Based Embedding Technique Using Wikipedia
        • Evolution through Large Models
        • Explainability for Large Language Models: A Survey
        • Explainability for Speech Models: On the Challenges of Acoustic Feature Selection
        • Explainability Via Causal Self-Talk
        • Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
        • Exploiting Similarities among Languages for Machine Translation
        • Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning
        • Exploration on HuBERT with Multiple Resolutions
        • Exploring Simple Siamese Representation Learning
        • Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
        • Exploring the Benefits of Tokenization of Discrete Acoustic Units
        • Exploring the Limits of Language Modeling
        • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
        • EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
        • Extracting Training Data from Diffusion Models
        • Extracting Training Data from Large Language Models
        • Extraction of Salient Sentences from Labelled Documents
        • Extreme Masking for Learning Instance and Distributed Visual Representations
        • F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
        • Facebook AI WMT21 News Translation Task Submission
        • fairseq S2T: Fast Speech-to-Text Modeling with fairseq
        • Faith and Fate: Limits of Transformers on Compositionality
        • Falcon2-11B Technical Report
        • Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
        • Fast and Simplex: 2-Simplicial Attention in Triton
        • Fast and Vectorizable Alternative to Binary Search in O(1) Applicable to a Wide Domain of Sorted Arrays of Floating Point Numbers
        • Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
        • Fast Inference from Transformers via Speculative Decoding
        • Fast Model Editing at Scale
        • Fast Transformer Decoding: One Write-Head is All You Need
        • FastPitch: Parallel Text-to-speech with Pitch Prediction
        • FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
        • FastSpeech: Fast, Robust and Controllable Text to Speech
        • Fauno: The Italian Large Language Model that will leave you senza parole!
        • Federated Learning: Strategies for Improving Communication Efficiency
        • Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning
        • Fermat Factorization in the Wild
        • FEVER: a large-scale dataset for Fact Extraction and VERification
        • Few-Shot Keyword Spotting in Any Language
        • Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond
        • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
        • Fine-tuning Language Models for Factuality
        • Finetuned Language Models Are Zero-Shot Learners
        • Finstreder: Simple and fast Spoken Language Understanding with Finite State Transducers using modern Speech-to-Text models
        • Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
        • Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
        • Flamingo: a Visual Language Model for Few-Shot Learning
        • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
        • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
        • FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
        • FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
        • Flow Matching for Generative Modeling
        • Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
        • Flying and swimming animals cruise at a Strouhal number tuned for high power efficiency
        • FNet: Mixing Tokens with Fourier Transforms
        • Focal Loss for Dense Object Detection
        • Focal Modulation Networks
        • Focal Modulation Networks for Interpretable Sound Classification
        • FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
        • Following the Human Thread in Social Navigation
        • Formal Limitations on the Measurement of Mutual Information
        • Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis
        • Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization
        • Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
        • From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
        • From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
        • From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
        • From Handcrafted Features to LLMs: A Brief Survey for Machine Translation Quality Estimation
        • From Recognition to Cognition: Visual Commonsense Reasoning
        • From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition
        • From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
        • From Sparse to Soft Mixtures of Experts
        • From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
        • Full Parameter Fine-tuning for Large Language Models with Limited Resources
        • Fully Character-Level Neural Machine Translation without Explicit Segmentation
        • Fully Convolutional Networks for Semantic Segmentation
        • FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
        • FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
        • Fundamentals of Grammatology
        • GAIA: a benchmark for General AI Assistants
        • GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
        • GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
        • Gaussian Mixture Latent Vector Grammars
        • GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models
        • Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
        • Gemini: A Family of Highly Capable Multimodal Models
        • Gemma 2: Improving Open Language Models at a Practical Size
        • Gemma: Open Models Based on Gemini Research and Technology
        • Gender Bias in Contextualized Word Embeddings
        • Gender Bias in Coreference Resolution
        • Generalization Ability of MOS Prediction Networks
        • Generalization in diffusion models arises from geometry-adaptive harmonic representations
        • Generalization through Memorization: Nearest Neighbor Language Models
        • Generalized Shape Metrics on Neural Representations
        • Generating Diverse High-Fidelity Images with VQ-VAE-2
        • Generating Long Sequences with Sparse Transformers
        • Generative Adversarial Networks
        • Generative Models: What do they know? Do they know things? Let's find out!
        • Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
        • Generative Spoken Dialogue Language Modeling
        • Generative Spoken Language Modeling from Raw Audio
        • Generator Matching: Generative modeling with arbitrary Markov processes
        • Genie: Generative Interactive Environments
        • Geographic Adaptation of Pretrained Language Models
        • Geographic and Geopolitical Biases of Language Models
        • Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
        • GFlowNet Foundations
        • GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
        • Git Re-Basin: Merging Models modulo Permutation Symmetries
        • Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models
        • GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
        • Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
        • Globally Normalized Transition-Based Neural Networks
        • GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge
        • Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
        • Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
        • Glow: Generative Flow with Invertible 1x1 Convolutions
        • GLU Variants Improve Transformer
        • GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
        • Goku: Flow Based Video Generative Foundation Models
        • Good Night at 4 pm?! Time Expressions in Different Cultures
        • Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
        • Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
        • Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
        • Gorilla: Large Language Model Connected with Massive APIs
        • GPT-4 Technical Report
        • gpt-oss-120b & gpt-oss-20b Model Card
        • GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
        • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
        • Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
        • Gradient Descent Converges to Minimizers
        • Granary: Speech Recognition and Translation Dataset in 25 European Languages
        • Grandmaster-Level Chess Without Search
        • Graph Pre-training for AMR Parsing and Generation
        • Grapheme-to-Phoneme Models for (Almost) Any Language
        • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
        • Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
        • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
        • Group Normalization
        • Group Robust Preference Optimization in Reward-free RLHF
        • GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
        • Guess What I Think: Streamlined EEG-to-Image Generation with Latent Diffusion Models
        • Guiding a Diffusion Model with a Bad Version of Itself
        • HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
        • HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
        • Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users
        • HellaSwag: Can a Machine Really Finish Your Sentence?
        • HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
        • HGRN2: Gated Linear RNNs with State Expansion
        • Hi-Fi Multi-Speaker English TTS Dataset
        • Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
        • Hierarchical nucleation in deep neural networks
        • HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
        • HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec
        • HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features
        • HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
        • HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
        • High Fidelity Neural Audio Compression
        • High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
        • High-Fidelity Audio Compression with Improved RVQGAN
        • High-Fidelity Simultaneous Speech-To-Speech Translation
        • High-speed high-security signatures
        • HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation
        • Highly accurate protein structure prediction with AlphaFold
        • Highway Networks
        • Holistic Evaluation of Language Models
        • Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval
        • Houdini: Fooling Deep Structured Prediction Models
        • How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
        • How (not) to do Phonological Typology: The Case of Pitch-Accent
        • How Context Affects Language Models' Factual Predictions
        • How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
        • How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations
        • How Does Batch Normalization Help Optimization?
        • How Effective are State Space Models for Machine Translation?
        • How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings
        • How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
        • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
        • How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation
        • How many degrees of freedom do we need to train deep networks: a loss landscape perspective
        • How Much Knowledge Can You Pack Into the Parameters of a Language Model?
        • How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
        • How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
        • How to represent part-whole hierarchies in a neural network
        • How to Train Your Energy-Based Models
        • How transferable are features in deep neural networks?
        • How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
        • How well can VMEC predict the initial saturation of external kink modes in near circular tokamaks and $l=2$ stellarators?
        • HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
        • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
        • Human Action Localization with Sparse Spatial Supervision
        • Human-in-the-Loop Causal Discovery under Latent Confounding using Ancestral GFlowNets
        • Humanity's Last Exam
        • Hungry Hungry Hippos: Towards Language Modeling with State Space Models
        • Hyena Hierarchy: Towards Larger Convolutional Language Models
        • HyperAttention: Long-context Attention in Near-Linear Time
        • Hyperbolic Active Learning for Semantic Segmentation under Domain Shift
        • Hyperbolic Deep Neural Networks: A Survey
        • Hyperbolic Geometry
        • Hyperbolic Learning with Multimodal Large Language Models
        • Hyperbolic Neural Networks
        ‱ Hyperbolic Self-Paced Learning for Self-Supervised Skeleton-based Action Representations
        • HyperCLOVA X Technical Report
        • Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design
        • HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
        • I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
        • Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
        • ILLUME: Rationalizing Vision-Language Models through Human Interactions
        • Im2Text: Describing Images Using 1 Million Captioned Photographs
        • Image and Video Tokenization with Binary Spherical Quantization
        • Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
        • ImageBind: One Embedding Space To Bind Them All
        • ImageNet Large Scale Visual Recognition Challenge
        • Imitation Learning as $f$-Divergence Minimization
        • Impact of Tokenization on Language Models: An Analysis for Turkish
        • Implicit Generation and Generalization in Energy-Based Models
        • Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation
        • Improved Baselines with Momentum Contrastive Learning
        • Improved Baselines with Visual Instruction Tuning
        • Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
        • Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback
        • Improving language models by retrieving from trillions of tokens
        • Improving Language Understanding by Generative Pre-Training
        • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
        • Improving Neural Language Models with a Continuous Cache
        • Improving Neural Machine Translation Models with Monolingual Data
        • Improving neural networks by preventing co-adaptation of feature detectors
        • Improving Personalized Explanation Generation through Visualization
        • Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
        • Improving Word Representations via Global Context and Multiple Word Prototypes
        • Improving Zero-Shot Translation by Disentangling Positional Information
        • Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning
        • In Defense of Grid Features for Visual Question Answering
        • INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
        • Inferring and Executing Programs for Visual Reasoning
        • InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization
        • InfoNCE: Identifying the Gap Between Theory and Practice
        • Information Theory and Statistics: an overview
        • Information-Theoretic Probing for Linguistic Structure
        • InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
        • Inseq: An Interpretability Toolkit for Sequence Generation Models
        • Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
        • Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
        • Instruction Tuning for Large Language Models: A Survey
        • InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
        • Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
        • INTELLECT-1 Technical Report
        • Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
        • InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
        • Interpolating Compressed Parameter Subspaces
        • Interpretable Convolutional Filters with SincNet
        • Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings
        • Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
        • Intriguing properties of neural networks
        • Intrinsic dimension of data representations in deep neural networks
        • Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
        • Intrusive And Non-Intrusive Perceptual Speech Quality Assessment Using A Convolutional Neural Network
        • Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
        • Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting
        • Investigating Backtranslation in Neural Machine Translation
        • Investigating Decoder-only Large Language Models for Speech-to-text Translation
        • Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
        • Investigating Multilingual NMT Representations at Scale
        • Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
        • Is Context Helpful for Chat Translation Evaluation?
        • Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning
        • Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
        • Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
        • Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
        • Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis
        • Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
        • Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
        • iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
        • It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
        • ITALIC: An Italian Intent Classification Dataset
        • ITU-T coders for wideband, superwideband, and fullband speech communication [Series Editorial]
        • Jamba: A Hybrid Transformer-Mamba Language Model
        • Jasper: An End-to-End Convolutional Neural Acoustic Model
        • JetFormer: An Autoregressive Generative Model of Raw Images and Text
        • Johnson-Lindenstrauss Lemma, Linear and Nonlinear Random Projections, Random Fourier Features, and Random Kitchen Sinks: Tutorial and Survey
        • Joint-task Self-supervised Learning for Temporal Correspondence
        • JOREK3D: An extension of the JOREK nonlinear MHD code to stellarators
        • JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment
        • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
        ‱ 'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
        • KAN: Kolmogorov-Arnold Networks
        • Kimi-Audio Technical Report
        • KIT's Multilingual Speech Translation System for IWSLT 2023
        • kNN For Whisper And Its Effect On Bias And Speaker Adaptation
        • Knowledge Conflicts for LLMs: A Survey
        • Knowledge distillation: A good teacher is patient and consistent
        • Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges
        • LAION-5B: An open large-scale dataset for training next generation image-text models
        • LaMP: When Large Language Models Meet Personalization
        • Language agents achieve superhuman synthesis of scientific knowledge
        • Language Agnostic Speech Embeddings for Emotion Classification
        • Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
        • Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
        • Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
        • Language Model Can Listen While Speaking
        • Language Modeling with Deep Transformers
        • Language Modeling with Gated Convolutional Networks
        • Language Models are Few-Shot Learners
        • Language Models are Multilingual Chain-of-Thought Reasoners
        • Language Models are Realistic Tabular Data Generators
        • Language Models are Unsupervised Multitask Learners
        • Language Models as Knowledge Bases?
        • Language Models Represent Space and Time
        • Language Models: A Guide for the Perplexed
        • Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition
        • LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
        • Laplace Redux -- Effortless Bayesian Deep Learning
        • Large Associative Memory Problem in Neurobiology and Machine Learning
        • Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
        • Large Batch Training of Convolutional Networks
        • Large Concept Models: Language Modeling in a Sentence Representation Space
        • Large Language Diffusion Models
        • Large Language Model Influence on Diagnostic Reasoning A Randomized Clinical Trial
        • Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences
        • Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
        • Large Language Models Are State-of-the-Art Evaluators of Translation Quality
        • Large Language Models As Evolution Strategies
        • Large Language Models for Compiler Optimization
        • Large Language Models for Data Annotation: A Survey
        • Large Language Models: A Survey
        • Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
        • Large-Scale Automatic Audiobook Creation
        • Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
        • Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
        • Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification
        • Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling
        • Layer Normalization
        • LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
        • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
        • Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition
        • Learnability and the Vapnik-Chervonenkis dimension
        • Learned feature representations are biased by complexity, learning order, position, and more
        • Learning a similarity metric discriminatively, with application to face verification
        • Learning Action Changes by Measuring Verb-Adverb Textual Relationships
        • Learning and Evaluating General Linguistic Intelligence
        • Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
        • Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
        • Learning Correspondence from the Cycle-Consistency of Time
        • Learning Differentially Private Recurrent Language Models
        • Learning Filterbanks from Raw Speech for Phone Recognition
        • Learning Interactive Real-World Simulators
        • Learning Language-Specific Layers for Multilingual Machine Translation
        • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
        • Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
        • Learning Source Disentanglement in Neural Audio Codec
        • Learning Sparse Neural Networks through $L_0$ Regularization
        • Learning Speaker Representations with Mutual Information
        • Learning Temporal Dynamics from Cycles in Narrated Video
        • Learning Temporal Sentence Grounding From Narrated EgoVideos
        • Learning the Predictability of the Future
        • Learning to Compress Prompts with Gist Tokens
        • Learning to Generate Reviews and Discovering Sentiment
        • Learning to Merge Word Senses
        • Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
        • Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
        • Learning to summarize from human feedback
        • Learning Transferable Visual Models From Natural Language Supervision
        • Learning with Fenchel-Young Losses
        • Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
        • LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
        • Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
        • Leveraging Audio-Only Data for Text-Queried Target Sound Extraction
        • Leveraging Content and Acoustic Representations for Speech Emotion Recognition
        • Leveraging Gloss Knowledge in Neural Word Sense Disambiguation by Hierarchical Co-Attention
        • Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation
        • Libri-Light: A Benchmark for ASR with Limited or No Supervision
        • Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
        ‱ Librispeech: An ASR corpus based on public domain audio books
        • LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
        • LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
        • Lifting the Curse of Multilinguality by Pre-training Modular Transformers
        • Lightweight and Efficient Spoken Language Identification of Long-form Audio
        • Lightweight Audio Segmentation for Long-form Speech Translation
        • LIMO: Less is More for Reasoning
        • Linear Connectivity Reveals Generalization Strategies
        • Linear-time Minimum Bayes Risk Decoding with Reference Aggregation
        • Linformer: Self-Attention with Linear Complexity
        • Linguini: A benchmark for language-agnostic linguistic reasoning
        • Linguistic Regularities in Sparse and Explicit Word Representations
        • Liquid Time-constant Networks
        • Liquid: Language Models are Scalable Multi-modal Generators
        • Listen, Think, and Understand
        • LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
        • Listenable Maps for Audio Classifiers
        • LiT: Zero-Shot Transfer with Locked-image text Tuning
        • LL3M: Large Language 3D Modelers
        • Llama 2: Open Foundation and Fine-Tuned Chat Models
        • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
        • Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens
        • LLaMA-Omni: Seamless Speech Interaction with Large Language Models
        • LLaMA: Open and Efficient Foundation Language Models
        • Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
        • LLaSM: Large Language and Speech Model
        • LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
        • LLaVA-OneVision: Easy Visual Task Transfer
        • LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
        • LLM Post-Training: A Deep Dive into Reasoning Large Language Models
        • LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
        • LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History
        • LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
        • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
        • LLM4Eval: Large Language Model for Evaluation in IR
        • LM-Polygraph: Uncertainty Estimation for Language Models
        • LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models
        • Localizing Objects with Self-Supervised Transformers and no Labels
        • LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
        • Locating and Editing Factual Associations in GPT
        • Logits of API-Protected LLMs Leak Proprietary Information
        • Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
        • Long-Context Generalization with Sparse Attention
        • Long-Context Language Modeling with Parallel Context Encoding
        • Longformer: The Long-Document Transformer
        • LongNet: Scaling Transformers to 1,000,000,000 Tokens
        • LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
        • Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation
        • Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
        • LoRA: Low-Rank Adaptation of Large Language Models
        • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
        • Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
        • Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
        • Lost in the Middle: How Language Models Use Long Contexts
        • LRS3-TED: a large-scale dataset for visual speech recognition
        • LSSED: a large-scale dataset and benchmark for speech emotion recognition
        • Lumiere: A Space-Time Diffusion Model for Video Generation
        • LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models
        • M-Prometheus: A Suite of Open Multilingual LLM Judges
        • Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
        • Making AI Forget You: Data Deletion in Machine Learning
        • Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models
        • Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game
        • Making Pre-trained Language Models Better Few-shot Learners
        • Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
        • Mamba in Speech: Towards an Alternative to Self-Attention
        • Mamba: Linear-Time Sequence Modeling with Selective State Spaces
        • Many-Shot In-Context Learning
        • MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
        • Marian: Fast Neural Machine Translation in C++
        • MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
        • Mask-Predict: Parallel Decoding of Conditional Masked Language Models
        • Masked Autoencoders Are Scalable Vision Learners
        • Masked Autoencoders that Listen
        • MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
        • MaskGIT: Masked Generative Image Transformer
        • MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
        • Massively Multilingual Neural Grapheme-to-Phoneme Conversion
        • Massively Multilingual Neural Machine Translation
        • Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
        • Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
        • Matrix Decomposition and Applications
        • Matryoshka Diffusion Models
        • Matryoshka Quantization
        • Matryoshka Representation Learning
        • MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
        • MAWPS: A Math Word Problem Repository
        • MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
        • Measuring and Increasing Context Usage in Context-Aware Machine Translation
        • Measuring Massive Multitask Language Understanding
        • Measuring the Effects of Data Parallelism on Neural Network Training
        • Measuring the Intrinsic Dimension of Objective Landscapes
        • Measuring the Mixing of Contextual Information in the Transformer
        • MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
        • MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf
        • MEG-MASC: a high-quality magneto-encephalography dataset for evaluating natural speech processing
        • MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
        • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
        • MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
        • Membership Inference Attacks on Machine Learning: A Survey
        • MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory
        • Memory Layers at Scale
        • Memory Performance Attacks: Denial of Memory Service in {Multi-Core} Systems
        • MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
        • MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond
        • MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
        • MERLOT: Multimodal Neural Script Knowledge Models
        • Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
        • Meta-Learning Online Adaptation of Language Models
        • Meta-Transformer: A Unified Framework for Multimodal Learning
        • METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
        • MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
        • MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
        • MEXMA: Token-level objectives improve sentence representations
        • MFPP: Morphological Fragmental Perturbation Pyramid for Black-Box Model Explanations
        • mGeNTE: A Multilingual Resource for Gender-Neutral Language and Translation
        • mHuBERT-147: A Compact Multilingual HuBERT Model
        • Microsoft COCO: Common Objects in Context
        • MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
        • Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
        • Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
        • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
        • MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
        • Minimum Bayes-Risk Decoding for Statistical Machine Translation
        • MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
        • MIO: A Foundation Model on Multimodal Tokens
        • Mistral 7B
        • Mixed Precision Training
        • Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
        • Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings
        • Mixtral of Experts
        • Mixture-of-Experts Graph Transformers for Interpretable Particle Collision Detection
        • ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
        • ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
        • MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
        • MLP-Mixer: An all-MLP Architecture for Vision
        • MLS: A Large-Scale Multilingual Dataset for Speech Research
        • MM-LLMs: Recent Advances in MultiModal Large Language Models
        • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
        • MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
        • MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
        • MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
        • MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
        • Model Editing with Canonical Examples
        • Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
        • Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
        • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
        • Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation
        • Modelling low-resource accents without accent-specific TTS frontend
        • Modelling of saturated external MHD instabilities in tokamaks: a comparison of 3D free boundary equilibria and nonlinear stability calculations
        • Modular Deep Learning
        • Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
        • ModuleFormer: Modularity Emerges from Mixture-of-Experts
        • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
        • Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
        • Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
        • Momentum Contrast for Unsupervised Visual Representation Learning
        • Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods
        • MoonCast: High-Quality Zero-Shot Podcast Generation
        • More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
        • MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
        • Moshi: a speech-text foundation model for real-time dialogue
        • MOSNet: Deep Learning based Objective Assessment for Voice Conversion
        • MouSi: Poly-Visual-Expert Vision-Language Models
        • Movie Gen: A Cast of Media Foundation Models
        • MovieNet: A Holistic Dataset for Movie Understanding
        • mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
        • mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
        • mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
        • MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
        • mSLAM: Massively multilingual joint pre-training for speech and text
        • MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
        • MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
        • MSTS: A Multimodal Safety Test Suite for Vision-Language Models
        • MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
        • MuLan: A Joint Embedding of Music Audio and Natural Language
        • Multi-Prototype Vector-Space Models of Word Meaning
        • Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
        • Multi-Scale Context Aggregation by Dilated Convolutions
        • Multi-sense embeddings through a word sense disambiguation process
        • Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
        • Multi-task self-supervised learning for Robust Speech Recognition
        • Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models
        • Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts
        • Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language
        • Multilingual Speech Models for Automatic Speech Recognition Exhibit Gender Performance Gaps
        • Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
        • Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
        • Multimodal Few-Shot Learning with Frozen Language Models
        • Multimodal Machine Learning: A Survey and Taxonomy
        • Multimodal Neural Databases
        • Multiple Importance Sampling ELBO and Deep Ensembles of Variational Approximations
        • Multiple Object Recognition with Visual Attention
        • Multitask Prompted Training Enables Zero-Shot Task Generalization
        • Muon is Scalable for LLM Training
        • Muon Optimizer Accelerates Grokking
        • Music Transformer
        • MusicLM: Generating Music From Text
        • MuST-C: A multilingual corpus for end-to-end speech translation
        • MuST-C: a Multilingual Speech Translation Corpus
        • MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
        • Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
        • Natural language guidance of high-fidelity text-to-speech with synthetic annotations
        • Natural Language Processing (almost) from Scratch
        • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
        • NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
        • NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
        • NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
        • Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics
        • NBDT: Neural-Backed Decision Trees
        • Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs
        • Needle In A Multimodal Haystack
        • Network In Network
        • Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
        • Neural Collaborative Filtering
        • Neural Combinatorial Optimization with Reinforcement Learning
        • Neural Discrete Representation Learning
        • Neural Grapheme-to-Phoneme Conversion with Pre-trained Grapheme Models
        • Neural Language Model Pruning for Automatic Speech Recognition
        • Neural Linguistic Steganography
        • Neural Machine Translation by Jointly Learning to Align and Translate
        • Neural Machine Translation of Rare Words with Subword Units
        • Neural Machine Translation: A Review and Survey
        • Neural Machine Translation: Challenges, Progress and Future
        • Neural Motifs: Scene Graph Parsing with Global Context
        • Neural Network Acceptability Judgments
        • Neural Networks are Decision Trees
        • Neural Networks Fail to Learn Periodic Functions and How to Fix It
        • Neural Sequence Learning Models for Word Sense Disambiguation
        • Neural Speech Synthesis with Transformer Network
        • Neural Voice Cloning with a Few Samples
        • Neural Word Embedding as Implicit Matrix Factorization
        • NeuralDEM - Real-time Simulation of Industrial Particulate Flows
        • Neurosymbolic AI -- Why, What, and How
        • NeurST: Neural Speech Translation Toolkit
        • Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
        • No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
        • No Language Left Behind: Scaling Human-Centered Machine Translation
        • Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
        • NoLiMa: Long-Context Evaluation Beyond Literal Matching
        • Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
        • Non-Autoregressive Neural Machine Translation
        • Non-Exchangeable Conformal Language Generation with Nearest Neighbors
        • Non-Exchangeable Conformal Risk Control
        • Non-intrusive Speech Quality Assessment Using Neural Networks
        • Nonlinear Dimensionality Reduction by Locally Linear Embedding
        • Nonlinear MHD modeling of soft $ÎČ$ limits in W7-AS
        • Nonlinear MHD simulations of external kinks in quasi-axisymmetric stellarators using an axisymmetric external rotational transform approximation
        • Normalization Techniques in Training DNNs: Methodology, Analysis and Application
        • Not Just a Black Box: Learning Important Features Through Propagating Activation Differences
        • Nougat: Neural Optical Understanding for Academic Documents
        • Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
        • Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers
        • NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
        • NVLM: Open Frontier-Class Multimodal LLMs
        • OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
        • OLMo: Accelerating the Science of Language Models
        • OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
        • OmniParser for Pure Vision Based GUI Agent
        • On Compositions of Transformations in Contrastive Self-Supervised Learning
        • On Divergence Measures for Training GFlowNets
        • On Information and Sufficiency
        • On Instruction-Finetuning Neural Machine Translation Models
        • On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
        • On Layer Normalization in the Transformer Architecture
        • On the cyclic nature of perception in vision versus audition
        • On the difficulty of training Recurrent Neural Networks
        • On the Effectiveness of Acoustic BPE in Decoder-Only TTS
        • On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
        • On the Fundamental Impossibility of Hallucination Control in Large Language Models
        • On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
        • On the Integration of Optical Flow and Action Recognition
        • On The Landscape of Spoken Language Models: A Comprehensive Survey
        • On the Limitations of Compute Thresholds as a Governance Strategy
        • On the Measure of Intelligence
        • On the Number of Linear Regions of Deep Neural Networks
        • On the Opportunities and Risks of Foundation Models
        • On the Out-of-distribution Generalization of Probabilistic Image Modelling
        • On the Representation Collapse of Sparse Mixture of Experts
        • One Mind, Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models
        • One ruler to measure them all: Benchmarking multilingual long-context language models
        • One TTS Alignment To Rule Them All
        • One Wide Feedforward is All You Need
        • ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
        • One-Shot Open Affordance Learning with Foundation Models
        • One-To-Many Multilingual End-to-end Speech Translation
        • OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
        • OneLLM: One Framework to Align All Modalities with Language
        • Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
        • Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
        • Open-Source Conversational AI with SpeechBrain 1.0
        • OpenAssistant Conversations -- Democratizing Large Language Model Alignment
        • OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
        • OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
        • OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
        • OpenVoice: Versatile Instant Voice Cloning
        • OPT: Open Pre-trained Transformer Language Models
        • Optical Flow with Semantic Segmentation and Localized Layers
        • Optimal Bounds for Open Addressing Without Reordering
        • Optimization Methods for Large-Scale Machine Learning
        • OpusLM: A Family of Open Unified Speech Language Models
        • Otter: A Multi-Modal Model with In-Context Instruction Tuning
        • Our data, ourselves: privacy via distributed noise generation
        • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
        • Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation
        • Overcoming catastrophic forgetting in neural networks
        • Ovis: Structural Embedding Alignment for Multimodal Large Language Model
        • OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
        • OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
        • OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
        • OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
        • P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
        • PaLI: A Jointly-Scaled Multilingual Language-Image Model
        • PaliGemma 2: A Family of Versatile VLMs for Transfer
        • PaliGemma: A versatile 3B VLM for transfer
        • PaLM 2 Technical Report
        • PaLM: Scaling Language Modeling with Pathways
        • PALO: A Polyglot Large Multimodal Model for 5B People
        • Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
        ‱ Parakeet: A natural sounding, conversational text-to-speech model
        • Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
        • Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
        • Parallel Scheduled Sampling
        • Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
        • Parallel Tacotron: Non-Autoregressive and Controllable TTS
        • Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
        • Parallel WaveNet: Fast High-Fidelity Speech Synthesis
        • Parameter-efficient fine-tuning of large-scale pre-trained language models
        • Parameter-Efficient Transfer Learning for NLP
        • Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation
        • Parsing with Compositional Vector Grammars
        • PaSS: Parallel Speculative Sampling
        • Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
        • Pay Attention to MLPs
        • PCFGs Can Do Better: Inducing Probabilistic Context-Free Grammars with Many Symbols
        • Pengi: An Audio Language Model for Audio Tasks
        • Perceiver IO: A General Architecture for Structured Inputs & Outputs
        • Perceiver: General Perception with Iterative Attention
        • Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs
        • Perceptual Losses for Real-Time Style Transfer and Super-Resolution
        • Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines
        • Phase behavior of Cacio and Pepe sauce
        • Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
        • Phi-4 Technical Report
        • Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
        • Phonetic Analysis of Self-supervised Representations of English Speech
        • Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors
        • Pitfalls and Outlooks in Using COMET
        • PIXAR: Auto-Regressive Language Modeling in Pixel Space
        • PLACEHOLDER hertz-dev - Standard Intelligence
        • Playing Atari with Deep Reinforcement Learning
        • Playing Language Game with LLMs Leads to Jailbreaking
        • Poisoning Language Models During Instruction Tuning
        • Poisoning Web-Scale Training Datasets is Practical
        • PolyLM: An Open Source Polyglot Large Language Model
        • PolyVoice: Language Models for Speech to Speech Translation
        • Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
        • Practical recommendations for gradient-based training of deep architectures
        • Prediction and Entropy of Printed English
        • Prefix-Tuning: Optimizing Continuous Prompts for Generation
        • Preliminary WMT24 Ranking of General MT Systems and LLMs
        • Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
        • Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
        • Preserving Privacy in Large Language Models: A Survey on Current Threats and Solutions
        • Prime Collective Communications Library -- Technical Report
        • Principles of Visual Tokens for Efficient Video Understanding
        • Probabilistic Artificial Intelligence
        • Probabilistic encryption & how to play mental poker keeping secret all partial information
        • Probing the phonetic and phonological knowledge of tones in Mandarin TTS models
        • Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
        • Progress Report: Towards European LLMs
        • Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
        • Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
        • Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models
        • Prompting Large Language Models with Speech Recognition Abilities
        • Prompting with Phonemes: Enhancing LLM Multilinguality for non-Latin Script Languages
        • Property Neurons in Self-Supervised Speech Transformers
        • Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
        • Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases
        • Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
        • Proximal Policy Optimization Algorithms
        • PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
        • Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
        • Pushing the Limits of Zero-shot End-to-End Speech Translation
        • Pyramid Feature Attention Network for Saliency detection
        • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
        • Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
        • Qualitatively characterizing neural network optimization problems
        • Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
        • Quality-Aware Decoding for Neural Machine Translation
        • Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
        • Quantifying Memorization Across Neural Language Models
        • Quantifying the Plausibility of Context Reliance in Neural Machine Translation
        • Quantifying the Uniqueness and Divisiveness of Presidential Discourse
        • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
        • Qwen Technical Report
        • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
        • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
        • Qwen2 Technical Report
        • Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
        • Qwen2.5 Technical Report
        • Qwen3 Technical Report
        • Randomized Approximation of the Gram Matrix: Exact Computation and Probabilistic Bounds
        • Re-ranking Person Re-identification with k-reciprocal Encoding
        • Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
        • Reading Digits in Natural Images with Unsupervised Feature Learning
        • Real Time Speech Enhancement in the Waveform Domain
        • ReALM: Reference Resolution As Language Modeling
        • Recent Advances in Direct Speech-to-text Translation
        • Recent Advances in Discrete Speech Tokens: A Review
        • Recent Advances in Speech Language Models: A Survey
        • Recent Developments on ESPnet Toolkit Boosted by Conformer
        • RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
        • Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors
        • Recurrent Memory Transformer
        • Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
        • Reducing Activation Recomputation in Large Transformer Models
        • Reducing the Dimensionality of Data with Neural Networks
        • Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
        • Reformer: The Efficient Transformer
        • Reframing Human-AI Collaboration for Generating Free-Text Explanations
        • Regularized Evolution for Image Classifier Architecture Search
        • Reinforcement Learning: An Overview
        • Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
        • Relative representations enable zero-shot latent space communication
        • Replacing the do-calculus with Bayes rule
        • Representation Learning with Contrastive Predictive Coding
        • Representational dissimilarity metric spaces for stochastic neural networks
        • Representational similarity analysis – connecting the branches of systems neuroscience
        • Representations of language in a model of visually grounded speech signal
        • Representing Speech Through Autoregressive Prediction of Cochlear Tokens
        • Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
        • Reranking Laws for Language Generation: A Communication-Theoretic Perspective
        • ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
        • Residual Contrastive Learning for Image Reconstruction: Learning Transferable Representations from Noisy Images
        • Retentive Network: A Successor to Transformer for Large Language Models
        • Rethinking and Improving Multi-task Learning for End-to-end Speech Translation
        • Rethinking Attention with Performers
        • Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
        • Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective
        • Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
        • Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
        • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
        • Revisiting Acoustic Features for Robust ASR
        • Revisiting Feature Prediction for Learning Visual Representations from Video
        • Revisiting minimum description length complexity in overparameterized models
        • Revisiting Model Stitching to Compare Neural Representations
        • Revisiting Over-Smoothness in Text to Speech
        • Revisiting Self-Distillation
        • Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
        • Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
        • ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
        • Rho-1: Not All Tokens Are What You Need
        • Risks from Learned Optimization in Advanced Machine Learning Systems
        • RoBERTa: A Robustly Optimized BERT Pretraining Approach
        • Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS
        • Robust Speech Recognition via Large-Scale Weak Supervision
        • Robustness May Be at Odds with Accuracy
        • RoFormer: Enhanced Transformer with Rotary Position Embedding
        • Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts
        • RULER: What's the Real Context Size of Your Long-Context Language Models?
        • RWKV: Reinventing RNNs for the Transformer Era
        • S2ORC: The Semantic Scholar Open Research Corpus
        • SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
        • SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
        • SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
        • SALMONN: Towards Generic Hearing Abilities for Large Language Models
        • Sample Efficient Adaptive Text-to-Speech
        • SaulLM-7B: A pioneering Large Language Model for Law
        • SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
        • Scalable Diffusion Models with Transformers
        • Scalable Expectation Estimation with Subtractive Mixture Models
        • Scalable-Softmax Is Superior for Attention
        • Scaling Analysis of Interleaved Speech-Text Language Models
        • Scaling Instructable Agents Across Many Simulated Worlds
        • Scaling Language Models: Methods, Analysis & Insights from Training Gopher
        • Scaling Laws for Generative Mixed-Modal Language Models
        • Scaling Laws for Multilingual Neural Machine Translation
        • Scaling Laws for Neural Language Models
        • Scaling Laws for Reward Model Overoptimization
        • Scaling Laws for Transfer
        • Scaling Properties of Speech Language Models
        • Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
        • Scaling Speech Technology to 1,000+ Languages
        • Scaling Transformer to 1M tokens and beyond with RMT
        • Scaling Transformers for Low-Bitrate High-Quality Speech Coding
        • Scaling Up Influence Functions
        • Scaling Up Online Speech Recognition Using ConvNets
        • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
        • Scaling Vision with Sparse Mixture of Experts
        • Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
        • SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
        • Score-Based Generative Modeling through Stochastic Differential Equations
        • Seamless: Multilingual Expressive and Streaming Speech Translation
        • SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
        • SEANet: A Multi-modal Speech Enhancement Network
        • Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability
        • Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
        • Selective State Space Model for Monaural Speech Enhancement
        • Self-Alignment with Instruction Backtranslation
        • Self-Attention with Relative Position Representations
        • Self-Chained Image-Language Model for Video Localization and Question Answering
        • Self-critical Sequence Training for Image Captioning
        • Self-Instruct: Aligning Language Models with Self-Generated Instructions
        • Self-labelling via simultaneous clustering and representation learning
        • Self-Rewarding Language Models
        • Self-supervised Context-aware Style Representation for Expressive Speech Synthesis
        • Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
        • Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
        • Self-Supervised Learning of Pretext-Invariant Representations
        • Self-Supervised Speech Representation Learning: A Review
        • Self-Supervised Speech Representations are More Phonetic than Semantic
        • Self-supervised Video Object Segmentation by Motion Grouping
        • Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
        • Self-Taught Evaluators
        • SELM: Speech Enhancement Using Discrete Tokens and Language Models
        • SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
        • Sentence Length
        • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
        • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
        • Sequence Level Training with Recurrent Neural Networks
        • Sequence Transduction with Recurrent Neural Networks
        • Sequence-Level Knowledge Distillation
        • SGDR: Stochastic Gradient Descent with Warm Restarts
        • Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
        • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
        • Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models
        • Shortcut Learning in Deep Neural Networks
        • Shortformer: Better Language Modeling using Shorter Inputs
        • Should You Mask 15% in Masked Language Modeling?
        • SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
        • Sigmoid Loss for Language Image Pre-Training
        • Similarity of Neural Network Representations Revisited
        • Simple and Controllable Music Generation
        • Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
        • Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning
        • Simple, Scalable Adaptation for Neural Machine Translation
        • Simplifying Transformer Blocks
        • Skip-Thought Vectors
        • SLIC Superpixels Compared to State-of-the-Art Superpixel Methods
        • SliceGPT: Compress Large Language Models by Deleting Rows and Columns
        • SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
        • SLURP: A Spoken Language Understanding Resource Package
        • Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
        • Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
        • SNAC: Multi-Scale Neural Audio Codec
        • Snapshot Ensembles: Train 1, get M for free
        • Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
        • SODA: Story Oriented Dense Video Captioning Evaluation Framework
        • Soft Merging of Experts with Adaptive Routing
        • Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
        • softmax is not enough (for sharp out-of-distribution)
        • SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
        • SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
        • Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
        • SoundStorm: Efficient Parallel Audio Generation
        • SoundStream: An End-to-End Neural Audio Codec
        • Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
        • Space-Time Correspondence as a Contrastive Random Walk
        • SpanBERT: Improving Pre-training by Representing and Predicting Spans
        • Sparks of Artificial General Intelligence: Early experiments with GPT-4
        • Sparse and Continuous Attention Mechanisms
        • Sparse and Structured Hopfield Networks
        • Sparse Attention with Linear Units
        • Sparse Autoencoders Find Highly Interpretable Features in Language Models
        • Sparse Communication via Mixed Distributions
        • Sparse continuous distributions and Fenchel-Young losses
        • Sparse Sequence-to-Sequence Models
        • Sparse Text Generation
        • Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
        • Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
        • Speakers of different languages remember visual scenes differently
        • SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
        • SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
        • Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
        • Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
        • Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
        • Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
        • Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
        • Speech Translation with Large Language Models: An Industrial Practice
        • Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
        • Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
        • Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
        • Speech-to-Speech Translation For A Real-world Unwritten Language
        • SpeechAlign: Aligning Speech Generation to Human Preferences
        • SpeechBrain-MOABB: An open-source Python library for benchmarking deep neural networks applied to EEG signals
        • SpeechBrain: A General-Purpose Speech Toolkit
        • SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
        • SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation
        • SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
        • SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
        • SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
        • SpeechQE: Estimating the Quality of Direct Speech Translation
        • SpeechT: Findings of the First Mentorship in Speech Translation
        • SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
        • SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
        • SpeechVerse: A Large-scale Generalizable Audio Language Model
        • SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
        • Speed/accuracy trade-offs for modern convolutional object detectors
        • SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
        • SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
        • SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training
        • SpiRit-LM: Interleaved Spoken and Written Language Model
        • Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction
        • Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech
        • Spoken Language Modeling with Duration-Penalized Self-Supervised Units
        • Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
        • Spread Flows for Manifold Modelling
        • SQ-GAN: Semantic Image Communications Using Masked Vector Quantization
        • SQuId: Measuring Speech Naturalness in Many Languages
        • ST-LLM: Large Language Models Are Effective Temporal Learners
        • Stabilising and accelerating light gated recurrent units for automatic speech recognition
        • StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
        • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
        • Stacked Quantizers for Compositional Vector Compression
        • STAR: A Benchmark for Situated Reasoning in Real-World Videos
        • StarSpace: Embed All The Things!
        • State Spaces Aren't Enough: Machine Translation Needs Attention
        • Statistical Rejection Sampling Improves Preference Optimization
        • Stealing Part of a Production Language Model
        • Stealing User Prompts from Mixture of Experts
        • Steerable CNNs
        • Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning
        • StegaStamp: Invisible Hyperlinks in Physical Photographs
        • Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
        • Step-by-Step Diffusion: An Elementary Tutorial
        • STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing
        ‱ Stochastic Average Gradient: A Simple Empirical Investigation
        • Stochastic Neighbor Embedding
        • Stochastic Neighbor Embedding with Gaussian and Student-t Distributions: Tutorial and Survey
        • Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators
        • Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
        • StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
        • Structured Neural Summarization
        • Structured Pruning of Large Language Models
        • Structured Training for Neural Network Transition-Based Parsing
        • Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
        • Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
        • StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
        • Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
        • Super Tiny Language Models
        • SUPERB: Speech processing Universal PERformance Benchmark
        • SuperBPE: Space Travel for Language Models
        • SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
        • Supervised Contrastive Learning
        • Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
        • Surrogate Gradient Learning in Spiking Neural Networks
        • Survey of Automatic Metrics for Evaluating Machine Translation at the Document Level
        • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
        • Surveying the MLLM Landscape: A Meta-Review of Current Surveys
        • SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
        • SWEb: A Large Web Dataset for the Scandinavian Languages
        • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
        • SyllableLM: Learning Coarse Semantic Units for Speech Language Models
        • Symbolic Discovery of Optimization Algorithms
        • Synthetic DNA applications in information technology
        • T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
        • T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
        • T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
        • Tacotron: Towards End-to-End Speech Synthesis
        • Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
        • Taming Transformers for High-Resolution Image Synthesis
        • Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
        • Task Singular Vectors: Reducing Task Interference in Model Merging
        • Task Vectors are Cross-Modal
        • Task-aware Retrieval with Instructions
        • Task-Aware Unified Source Separation
        • TASTY: A Transformer based Approach to Space and Time complexity
        • Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training
        • TEARS: Textual Representations for Scrutable Recommendations
        • TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation
        • TED-LIUM: an Automatic Speech Recognition dedicated corpus
        • Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
        • Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
        • Text and Code Embeddings by Contrastive Pre-Training
        • Text-Free Prosody-Aware Generative Spoken Language Modeling
        • Textbooks Are All You Need
        • Textless Speech-to-Speech Translation on Real Data
        • Textually Pretrained Speech Language Models
        • Texygen: A Benchmarking Platform for Text Generation Models
        • TGIF: A New Dataset and Benchmark on Animated GIF Description
        • The "something something" video database for learning and evaluating visual common sense
        • The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
        • The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
        • The Algorithmic Foundations of Differential Privacy
        • The AMI Meeting Corpus
        • The Anatomy of a Large-Scale Hypertextual Web Search Engine
        • The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
        • The Biological Basis of Audition
        • The boundary of neural network trainability is fractal
        • The case for 4-bit precision: k-bit Inference Scaling Laws
        • The Causal-Neural Connection: Expressiveness, Learnability, and Inference
        • The challenge of realistic music generation: modelling raw audio at scale
        • The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
        • The Curious Case of Neural Text Degeneration
        • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
        • The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
        • The Defeat of the Winograd Schema Challenge
        • The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
        • The distributional hypothesis
        • The Elements of Differentiable Programming
        • The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation
        • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
        • The first collision for full SHA-1
        • The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
        • The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
        • The Forward-Forward Algorithm: Some Preliminary Investigations
        • The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
        • The Goldilocks zone: Towards better understanding of neural network loss landscapes
        • The Hardware Lottery
        • The Hungarian Method for the Assignment Problem
        • The Impact of Positional Encoding on Length Generalization in Transformers
        • The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics
        • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
        • The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
        • The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
        • The JOREK non-linear extended MHD code and applications to large-scale instabilities and their control in magnetically confined fusion plasmas
        • The Kinetics Human Action Video Dataset
        • The Leaderboard Illusion
        • The Llama 3 Herd of Models
        • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
        • The Marginal Value of Adaptive Gradient Methods in Machine Learning
        • The Matrix Calculus You Need For Deep Learning
        • The Metropolis-Hastings algorithm
        • The Modern Mathematics of Deep Learning
        • The Multimodal Universe: Enabling Large-Scale Machine Learning with 100TB of Astronomical Scientific Data
        • The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
        • The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
        • The Pile: An 800GB Dataset of Diverse Text for Language Modeling
        • The pitfalls of next-token prediction
        • The Power of Scale for Parameter-Efficient Prompt Tuning
        • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
        • The Relativity of Causal Knowledge
        • The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks
        • The Semantic Scholar Open Data Platform
        ‱ The semantics of the (so-called) clausal determiner nĂł in Akan (Kwa)
        • The Seven Tools of Causal Inference with Reflections on Machine Learning
        • The Spotify Podcast Dataset
        • The sun compass revisited
        • The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval
        • The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
        • The THUMOS Challenge on Action Recognition for Videos "in the Wild"
        • The Topological BERT: Transforming Attention into Topology for Natural Language Processing
        • The unreasonable effectiveness of few-shot learning for machine translation
        • The VoiceMOS Challenge 2022
        • The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
        • The Winograd schema challenge
        • The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling
        • The Zero Resource Speech Challenge 2019: TTS without T
        • The Zero Resource Speech Challenge 2021: Spoken language modelling
        • Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data
        • Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
        • Three models for the description of language
        • Time-Contrastive Networks: Self-Supervised Learning from Video
        • Tiny Pointers
        ‱ tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models
        • TinyLlama: An Open-Source Small Language Model
        • TinyLLaVA: A Framework of Small-scale Large Multimodal Models
        • Titans: Learning to Memorize at Test Time
        • TLDR: Extreme Summarization of Scientific Documents
        • TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
        • Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
        • Toolformer: Language Models Can Teach Themselves to Use Tools
        • TopoBenchmarkX: A Framework for Benchmarking Topological Deep Learning
        • Toward Joint Language Modeling for Speech Units and Text
        • Towards a definition of transcreation: a systematic literature review
        • Towards audio language modeling -- an overview
        • Towards Automatic Learning of Procedures from Web Instructional Videos
        • Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
        • Towards Causal Representation Learning
        • Towards Deep Learning Models Resistant to Adversarial Attacks
        • Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
        • Towards Expert-Level Medical Question Answering with Large Language Models
        • Towards Learning a Universal Non-Semantic Representation of Speech
        • Towards Measuring Fairness in AI: the Casual Conversations Dataset
        • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
        • Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR
        • Towards Robust Speech Representation Learning for Thousands of Languages
        • Towards Understanding Grokking: An Effective Theory of Representation Learning
        • Towards Understanding Sycophancy in Language Models
        • Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
        • Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
        • Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
        • Toxicity of the Commons: Curating Open-Source Pre-Training Data
        • Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning
        • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
        • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
        • Training Adaptive Computation for Open-Domain Question Answering with Computational Constraints
        • Training Compute-Optimal Large Language Models
        • Training data-efficient image transformers & distillation through attention
        • Training Deep Nets with Sublinear Memory Cost
        • Training language models to follow instructions with human feedback
        • Training Language Models with Memory Augmentation
        • Training Neural Networks from Scratch with Parallel Low-Rank Adapters
        • Training Verifiers to Solve Math Word Problems
        • Transcendence: Generative Models Can Outperform The Experts That Train Them
        • Transductive Active Learning: Theory and Applications
        • Transferable speech-to-text large language model alignment module
        • Transformation of Mean Opinion Scores to Avoid Misleading of Ranked based Statistical Techniques
        • Transformer Feed-Forward Layers Are Key-Value Memories
        • Transformer Networks for Trajectory Forecasting
        • Transformer-Squared: Self-adaptive LLMs
        • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
        • TransformerFAM: Feedback attention is working memory
        • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
        • Transformers learn in-context by gradient descent
        • Transformers need glasses! Information over-squashing in language tasks
        • Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
        • Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts
        ‱ Translation in the Hands of Many: Centering Lay Users in Machine Translation Interactions
        • Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
        • Translatotron 3: Speech to Speech Translation with Monolingual Data
        • Transparent and Scrutable Recommendations Using Natural Language User Profiles
        • TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
        • TruthfulQA: Measuring How Models Mimic Human Falsehoods
        • TS3-Codec: Transformer-Based Simple Streaming Single Codec
        • TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
        • Tulu 3: Pushing Frontiers in Open Language Model Post-Training
        • Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring
        • TVQA: Localized, Compositional Video Question Answering
        • Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
        • Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
        ‱ u-ÎŒP: The Unit-Scaled Maximal Update Parametrization
        • U-Net: Convolutional Networks for Biomedical Image Segmentation
        • UL2: Unifying Language Learning Paradigms
        • UltraFeedback: Boosting Language Models with Scaled AI Feedback
        • UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition
        • Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation
        • Uncovering Latent Style Factors for Expressive Speech Synthesis
        • Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
        • Understanding Black-box Predictions via Influence Functions
        • Understanding deep learning requires rethinking generalization
        • Understanding Intra-Class Knowledge Inside CNN
        • Understanding natural language
        • Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
        • Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation
        • UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner
        • UniAudio: An Audio Foundation Model Toward Universal Audio Generation
        • UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control
        • Unified Language Model Pre-training for Natural Language Understanding and Generation
        • Unified Speech-Text Pretraining for Spoken Dialog Modeling
        • Unified Video-Language Pre-training with Synchronized Audio
        • Unified Vision-Language Pre-Training for Image Captioning and VQA
        • Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
        • UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
        • Unitary Evolution Recurrent Neural Networks
        • UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
        • Universal Language Model Fine-tuning for Text Classification
        • Universal principles justify the existence of concept cells
        • Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
        • Universal Transformers
        • UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
        • Unlimiformer: Long-Range Transformers with Unlimited Length Input
        • Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
        • Unsupervised Cross-lingual Representation Learning at Scale
        • Unsupervised Deep Tracking
        • Unsupervised Dense Information Retrieval with Contrastive Learning
        • Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
        • Unsupervised Learning by Competing Hidden Units
        • Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
        • Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
        • Unsupervised Neural Machine Translation
        • Unsupervised Source Separation via Bayesian Inference in the Latent Domain
        • Unsupervised Translation of Programming Languages
        • Unsupervised Visual Representation Learning by Context Prediction
        • Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism
        • Unveiling the Role of Pretraining in Direct Speech Translation
        • URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors
        • Using Forced Alignment for Phonetics Research
        • Using the Output Embedding to Improve Language Models
        • UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
        • VALHALLA: Visual Hallucination for Machine Translation
        • VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
        • VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
        • Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
        • Variational Bayes: A report on approaches and applications
        • Variational Inference: A Review for Statisticians
        • VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
        • VCoder: Versatile Vision Encoders for Multimodal Large Language Models
        • Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
        • Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
        • VeLO: Training Versatile Learned Optimizers by Scaling Up
        • Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
        • Video as the New Language for Real-World Decision Making
        • Video Instruction Tuning With Synthetic Data
        • Video Swin Transformer
        • Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
        • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
        • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
        • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
        • VideoBERT: A Joint Model for Video and Language Representation Learning
        • VideoChat: Chat-Centric Video Understanding
        • VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
        • VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
        • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
        • VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
        • VideoPrism: A Foundational Visual Encoder for Video Understanding
        • VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
        • VIMA: General Robot Manipulation with Multimodal Prompts
        • VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
        • Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
        • Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
        • Vision Transformers Need Registers
        • Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
        • Vision-Speech Models: Teaching Speech Models to Converse about Images
        • ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric
        • Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
        • Visual Instruction Tuning
        • Visual Prompt Tuning
        • Visualizing and Understanding Convolutional Networks
        • Visualizing Data using t-SNE
        • Visualizing the Loss Landscape of Neural Nets
        • VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
        • VITA: Towards Open-Source Interactive Omni Multimodal LLM
        • Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
        • Voice Conversion With Just Nearest Neighbors
        • VoiceBench: Benchmarking LLM-Based Voice Assistants
        • Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
        • VoxCeleb2: Deep Speaker Recognition
        • VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis
        • VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
        • Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
        • W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
        • Wasserstein GAN
        • Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
        • Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
        • Watt For What: Rethinking Deep Learning's Energy-Performance Relationship
        • wav2letter++: The Fastest Open-source Speech Recognition System
        • Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
        • Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
        • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
        • wav2vec: Unsupervised Pre-training for Speech Recognition
        • WavChat: A Survey of Spoken Dialogue Models
        • Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
        • WaveGlow: A Flow-based Generative Network for Speech Synthesis
        • WaveNet: A Generative Model for Raw Audio
        • WavLLM: Towards Robust and Adaptive Speech Large Language Model
        • WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
        • WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
        • Weighted Voronoi Stippling
        • WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
        • What Are They Doing? Joint Audio-Speech Co-Reasoning
        • What Are Tools Anyway? A Survey from the Language Model Perspective
        • What Do Speech Foundation Models Not Learn About Speech?
        • What Does BERT Look At? An Analysis of BERT's Attention
        • What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning
        • What matters when building vision-language models?
        • What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
        • What Should Not Be Contrastive in Contrastive Learning
        • What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study
        • What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
        • What's In My Big Data?
        • When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion
        • When Do Neural Networks Outperform Kernel Methods?
        • When Does Translation Require Context? A Data-driven, Multilingual Exploration
        • When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
        • When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
        • When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
        • Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
        • Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
        • Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
        • WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
        • Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
        • Why Larger Language Models Do In-context Learning Differently?
        • Why should we add early exits to neural networks?
        • Why Warmup the Learning Rate? Underlying Mechanisms and Improvements
        • WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
        • WinoGrande: An Adversarial Winograd Schema Challenge at Scale
        • WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge
        • Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
        ‱ Word Embeddings through Hellinger PCA
        • Word Translation Without Parallel Data
        • Word-prosodic typology
        • word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method
        • WT5?! Training Text-to-Text Models to Explain their Predictions
        • X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
        • xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection
        • XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
        • XGBoost: A Scalable Tree Boosting System
        • xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
        ‱ XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation
        • XLNet: Generalized Autoregressive Pretraining for Language Understanding
        • XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
        • xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement
        • xLSTM: Extended Long Short-Term Memory
        • XNLI: Evaluating Cross-lingual Sentence Representations
        • XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech
        • xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
        • XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
        • xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
        • XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
        • Yet Another Algorithm for Pitch Tracking
        • Yi: Open Foundation Models by 01.AI
        • YODAS: Youtube-Oriented Dataset for Audio and Speech
        • YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
        • YuE: Scaling Open Foundation Models for Long-Form Music Generation
        • Zephyr: Direct Distillation of LM Alignment
        • Zero-shot Speech Translation
        • Zero-Shot Tokenizer Transfer
        • Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
        • Zoology: Measuring and Improving Recall in Efficient Language Models
        • ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
        • Aaron van den Oord
        • Abdelrahman Mohamed
        • Adam Polyak
        • Adel Moumen
        • Afra Alishahi
        • Agustinus Kristiadi
        • Akari Asai
        • Alan Jeffares
        • Aldo Lipani
        • Alec Radford
        • Aleksa Gordić
        • Alessio Devoto
        • Alex Graves
        • Alex H. Williams
        • Alex Krizhevsky
        • Alexander Kolesnikov
        • Alexander M. Rush
        • Alexandra Birch
        • Alexandre DĂ©fossez
        • Alexei A. Efros
        • Alexey Dosovitskiy
        • Alexis Conneau
        • Alicia Curth
        • AmĂ©lie Royer
        • AndrĂ© F. T. Martins
        • AndrĂ© Martins
        • Andrea Bacciu
        • Andrej Karpathy
        • Andrew K. Lampinen
        • Andrew Zisserman
        • Anil Batra
        • Anil Keshwani
        • Anna Rogers
        • AntĂłnio Farinhas
        • Antonio Vergari
        • Ari Holtzman
        • Armand Joulin
        • Artem Ploujnikov
        • Badr M. Abdullah
        • Barry Haddow
        • Beatrice Savoldi
        • Belen Alastruey
        • Ben Peters
        • Benjamin Minixhofer
        • Benjamin van Niekerk
        • Beomseok Lee
        • Boris Ginsburg
        • Bruno Martins
        • Cagri Toraman
        • Carla Bombi
        • Celestine Mendler-DĂŒnner
        • Cem Subakan
        • Christian Szegedy
        • Christopher D. Manning
    Japanese

    05 Oct 2025 · 1 min read

    • Is Japanese a Tonal Language? (ALTA Language Services) - does not seem like a reliable, well-referenced source, but still useful
    • Japanese is apparently a pitch-accent language
      • Pitch-accent language - Wikipedia
      • Pitch accent (intonation) - Wikipedia
      • What’s the difference between a tonal language and a pitch accent language?
      • Word-prosodic typology by Larry M. Hyman
      • How (Not) to Do Phonological Typology: The Case of Pitch-Accent by Larry M. Hyman
