🪮 Anil's Garden
CS
Languages
Assembly
AWK
Bash - Notes
Bash - Resources
Bash - Snippets
C
C++
Carbon
Dart
Erlang
Go
Haskell
Java
JavaScript
Lua
Perl
Python - Best Practices
Python - Internals
Python - Notes
Python - Resources
R
Rust
Scala
Swift, SwiftUI and Developing for macOS
TOML
TypeScript
WASM (WebAssembly)
YAML
Zig
zsh
Algorithms and Data Structures
Arch Linux
Asynchronous Programming & Concurrency
Build Systems
Compilers, Interpreters and Binaries
Compression, Encoding, Codecs, Text Encodings and Communication
Computer Architecture
Computer Science
Conda
Copilot (GitHub Copilot)
cron
Cryptography and Cybersecurity
CUDA
Databases and Data Interchange
Debugging
Development Containers
DevOps and MLOps
Distributed Computing, Distributed and Multi-GPU Training
Documentation (Maintaining Docs)
Email and SMTP
Fuzzing and Fuzzers
Git - Notes
Git - Resources
GitHub
Globbing
Graphs
Hugging Face
Machine Learning Engineering (Implementation Best Practices)
Make
MLX
Networking and Computer Networks
Operating Systems (OS), Kernels, Linux and Unix
PyTorch - Functions
PyTorch - Notes
PyTorch - Resources
Questions
Regex
Reverse Engineering
Software Development
Software Licences (Licenses) and Licensing
tmux
Vim
VSCode
Linguistics
Etymologies
abstract
Blighty - Etymology, Origin & Meaning
renminbi
Glossary Terms
anaphora
clitic
evidentiality
realis
selection
Languages
Italian
Japanese
Latin
Mandarin
Russian
Turkish
Languages of the World
Linguistics
Linguistics Glossary
Phonetics vs Phonemics (Phonology)
Writing Systems
Notes
A Line in the Sand - Britain, France and the Struggle that Shaped the Middle East
Advertising
Ancient History, Classics, Classical Literature and Theology
Blender
Bluesky
Books
Bracket City, Crosswords
Candide
Chess
Cinema (Film; Movies) and Television (TV)
Codebase Visualiser
Coding Projects for Development
Commercial LLMs (inc APIs)
Core Dumped (channel)
Creative Coding
Creative Coding Crafts Space (C3S)
CS Memes and Culture
D3 Health Dashboard
Darknet Diaries
Data Analysis and Visualisation (Data Viz)
Design
Diabetes
Digital Garden
DIY and Construction
DNS Server (Domain Name System Server)
Dreams from My Father
Economics and Finance
Edinburgh Guide
Education
Electoral Systems
F1
Figma
Finance and Trading
Fitness
Flags of the World
Flights
Fonts
Food
Free Speech
Goodreads
Healthcare, Biomedical, Medicine
History
Home Server
Homebrew
Housing and Rents
Immich
Investing
Istanbul Guide
Journocoders
Kagi
Kids
Law and Justice
London Guide
MacBook and macOS
MacBook Setup Checklist
Mental Anchors
Metamorphoses - A New Play
Model Context Protocol
Music
Music Theory
Music Understanding and Analysis, and Spotify Fun
NotebookLM and Automated Podcasting
Obsidian
Obsidian - Installing Plugins Manually
Obsidian Clone or Note-taking App
Online Safety Act (UK)
OSINT
Overview of Company Valuation Methods
Palettes
Pareto Efficiency
Pegasus - How a Spy in Your Pocket Threatens the End of Privacy, Dignity, and Democracy - Laurent Richard, Sandrine Rigaud
Photography
Printing, Stamps and Heraldry
Privacy - Staying Secure Online
PyTorch's Transformer and Multi-Head Attention Implementation
Reading
Reading with a Motive vs Reading
Retro Tech
Semantic Querying of Obsidian
Small Web
Spaced Repetition Learning
Speech LLM-based Language Learning
Streaming, Twitch, YouTube, Videography
The Artist - Lucy Steeds
The Panama Papers - Breaking the Story of How the Rich and Powerful Hide Their Money
The Secret Barrister - Stories of the Law and How It's Broken
Time Tracking App - Single User, Native Swift
UK Law and Justice Podcast Recommendations (Perplexity)
UTM
Vibe Coding and Agents
Volts, Watts, Amps
Web Browsers
Web Development and Building a Website
Wordle-bot
YouTube Automated Uploader
Notes - ML CS
Base64 Encoding
Bilinear Interpolation
ChatML
Connectionist Temporal Classification
Content Addressability
Cosine Similarity vs Pearson Moment Correlation Coefficient
Decaying Learning Rate Exponentially when Scaling Batch Size and Base Learning Rate
Differential Privacy in Machine Learning and Stats Lectures
EinOps
Exiting Early from Nested Functions - Case Study with Epoch and Batch-wise Training Loops
Expectation Maximisation Algorithm
Fisher Information
Generating from LLMs
Gibberlink
Gram Matrix and Linear Regression
Graphs Spectral Clustering
Hidden Markov Models
How many iterations will a training run last?
Kalman Filtering
Learning Rate Warmup
Multiclass vs multilabel classification
RSA Encryption-Decryption Identity Proof via Euler's Theorem
Sampling for Text Generation, Nucleus Sampling (top-$p$), the need for top-$k$ and Beam Search
Singular Value Decomposition
Typing for PyTorch
Vector Projection
Vector Quantization
Weight Initialisation
What are the differences between a digital signature, a MAC and a hash?
Whitening, sharpening & smoothing
Papers
"My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
$\infty$-former: Infinite Memory Transformer
$\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
2 OLMo 2 Furious
100,000 Podcasts: A Spoken English Document Corpus
A Bayesian approach to translators' reliability assessment
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
A Brief Overview of Unsupervised Neural Speech Representation Learning
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
A Call for Clarity in Reporting BLEU Scores
A Causal Bayesian Networks Viewpoint on Fairness
A Closer Look at Few-shot Classification
A Closer Look at Spatiotemporal Convolutions for Action Recognition
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
A Comprehensive Survey of Machine Translation Approaches
A Comprehensive Survey on Long Context Language Modeling
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
A Convergence Theory for Deep Learning via Over-Parameterization
A Cookbook of Self-Supervised Learning
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
A Cross-Language Perspective On Speech Information Rate
A Diagnostic Study of Explainability Techniques for Text Classification
A firm foundation for private data analysis
A Generalized EigenGame with Extensions to Multiview Representation Learning
A guide to convolution arithmetic for deep learning
A halo model approach for mock catalogs of time-variable strong gravitational lenses
A Kernel-Based View of Language Model Fine-Tuning
A Large-Scale Evaluation of Speech Foundation Models
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
A Mathematical Theory of Communication
A method to convert neural signals into sound sequences
A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
A Neural Algorithm of Artistic Style
A Neural Probabilistic Language Model
A new algorithm for data compression
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
A practical tutorial on Variational Bayes
A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
A Primer on Bayesian Neural Networks: Review and Debates
A Primer on Causal Analysis
A Probabilistic Neuro-symbolic Layer for Algebraic Constraint Satisfaction
A Review of Deep Learning Techniques for Speech Processing
A Review of Sparse Expert Models in Deep Learning
A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
A Simple Framework for Contrastive Learning of Visual Representations
A Suite for Acoustic Language Model Evaluation
A Survey of Large Language Models
A Survey of Mamba
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
A Survey of Visual Transformers
A Survey on Evaluation of Large Language Models
A Survey on In-context Learning
A Survey on Language Models for Code
A Survey on Large Language Models for Code Generation
A Survey on LLM-as-a-Judge
A Survey on Multimodal Large Language Models
A Survey on Neural Speech Synthesis
A Survey on Retrieval-Augmented Text Generation for Large Language Models
A Survey on Speech Large Language Models
A Survey on Subgraph Counting: Concepts, Algorithms and Applications to Network Motifs and Graphlets
A Tutorial on Fisher Information
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
A unified architecture for natural language processing: deep neural networks with multitask learning
A unified view of entropy-regularized Markov decision processes
A Universal Law of Robustness via Isoperimetry
A Vulnerability in Implementations of SHA-3, SHAKE, EdDSA, and Other NIST-Approved Algorithms
A Watermark for Large Language Models
Accelerating Large Language Model Decoding with Speculative Sampling
Accelerating t-SNE using Tree-Based Algorithms
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Acoustic BPE for Speech Generation with Discrete Tokens
Active Data Curation Effectively Distills Large-Scale Multimodal Models
Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
Adam-mini: Use Fewer Learning Rates To Gain More
Adam: A Method for Stochastic Optimization
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
Adapting Language Models to Compress Contexts
Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference
Adaptive Computation Time for Recurrent Neural Networks
Adaptive deconvolutional networks for mid and high level feature learning
Adaptive Machine Translation with Large Language Models
Adaptive Prototype Learning and Allocation for Few-Shot Segmentation
Adaptive Retrieval-Augmented Generation for Conversational Systems
Adaptive Semiparametric Language Models
Adaptively Sparse Transformers
AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaSplash: Adaptive Sparse Flash Attention
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize
Adversarial Attacks and Defences: A Survey
Adversarial Feature Learning
Adversarial NLI: A New Benchmark for Natural Language Understanding
AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages
Agent Skill Acquisition for Large Language Models via CycleQD
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AI and Memory Wall
AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
ALBA : Reinforcement Learning for Video Object Segmentation
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists
Alice's Adventures in a Differentiable Wonderland -- Volume I, A Tour of the Land
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
Aligning Speech to Languages to Enhance Code-switching Speech Recognition
Aligning to Adults Is Easy, Aligning to Children Is Hard: A Study of Linguistic Alignment in Dialogue Systems
Alpaca: A Strong, Replicable Instruction-Following Model
An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition
An Analysis of Energy Consumption and Carbon Footprints of Cryptocurrencies and Possible Solutions
An Attention Free Transformer
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
An Empirical Exploration of Curriculum Learning for Neural Machine Translation
An Empirical Study of Mamba-based Language Models
An Empirical Study of Translation Hypothesis Ensembling with Large Language Models
An Emulator for Fine-Tuning Large Language Models using Small Language Models
An End-to-End Transformer Model for 3D Object Detection
An engine not a camera: Measuring performative power of online search
An Evolved Universal Transformer Memory
An Explanation of In-context Learning as Implicit Bayesian Inference
An Exploration of Neural Sequence-to-Sequence Architectures for Automatic Post-Editing
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech
An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition
An introduction to graph theory
An Introduction to Variational Autoencoders
An Introduction to Vision-Language Modeling
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Analyzing Context Contributions in LLM-based Machine Translation
Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Apple Intelligence Foundation Language Models
Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods
Architectures of Topological Deep Learning: A Survey on Topological Neural Networks
Are aligned neural networks adversarially aligned?
Are All Good Word Vector Spaces Isomorphic?
Are discrete units necessary for Spoken Language Modeling?
Are Sixteen Heads Really Better than One?
Are We Done with MMLU?
Areas of Attention for Image Captioning
Arithmetic coding for data compression
Artificial Kuramoto Oscillatory Neurons
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Associative Embedding: End-to-End Learning for Joint Detection and Grouping
AST: Audio Spectrogram Transformer
Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability
Attention as a Guide for Simultaneous Speech Translation
Attention Is All You Need
Attention-Based Models for Speech Recognition
Audio Editing with Non-Rigid Text Prompts
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Audio-Language Models for Audio-Centric Tasks: A survey
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
AudioGen: Textually Guided Audio Generation
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
AudioLM: a Language Modeling Approach to Audio Generation
AudioPaLM: A Large Language Model That Can Speak and Listen
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling
Augmented Language Models: a Survey
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Auto-Encoding Variational Bayes
Autoregressive Image Generation using Residual Quantization
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Avocodo: Generative Adversarial Network for Artifact-free Vocoder
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Aya Vision: Advancing the Frontier of Multilingual Multimodality
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Bag of Tricks for Efficient Text Classification
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Bayesian Learning for Neural Networks: an algorithmic survey
Bayesian Measures of Model Complexity and Fit
Benchmarking Attacks on Learning with Errors
BERT Learns to Teach: Knowledge Distillation with Meta Learning
BERT Rediscovers the Classical NLP Pipeline
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERTScore: Evaluating Text Generation with BERT
BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
Better & Faster Large Language Models via Multi-token Prediction
Better Instruction-Following Through Minimum Bayes Risk
Better speech synthesis through scaling
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
Beyond Left and Right: The Role of System Trust in COVID-19 Attitudes and Behaviors
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
Beyond Text Compression: Evaluating Tokenizers Across Scales
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
Big Bird: Transformers for Longer Sequences
Big Self-Supervised Models are Strong Semi-Supervised Learners
Big Transfer (BiT): General Visual Representation Learning
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Billion-scale semi-supervised learning for image classification
BLAB: Brutally Long Audio Bench
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Blockwise Parallel Decoding for Deep Autoregressive Models
Boltzmann Exploration Done Right
Boosting Distributed Training Performance of the Unpadded BERT Model
Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning
Bootstrap your own latent: A new approach to self-supervised Learning
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition
Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo)
Building Bridges between Regression, Clustering, and Classification
Building Machine Translation Systems for the Next Thousand Languages
Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings
BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems
ByT5 model for massively multilingual grapheme-to-phoneme conversion
Byte Latent Transformer: Patches Scale Better Than Tokens
Byte Pair Encoding is Suboptimal for Language Model Pretraining
Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
Can Automatic Metrics Assess High-Quality Translations?
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Can language models learn from explanations in context?
Can Large Language Models Reason and Plan?
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks
Can Whisper Perform Speech-Based In-Context Learning?
Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Canonical Capsules: Self-Supervised Capsules in Canonical Pose
Careless Whisper: Speech-to-Text Hallucination Harms
Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?
CASPER: A Large Scale Spontaneous Speech Dataset
CAT: Content-Adaptive Image Tokenization
Categorical Reparameterization with Gumbel-Softmax
Causal inference with Bayes rule
Causal Reasoning for Algorithmic Fairness
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
CDXFormer: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory
Cem Mil Podcasts: A Spoken Portuguese Document Corpus For Multi-modal, Multi-lingual and Multi-Dialect Information Access Research
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting for Speech Translation
Chain-of-Thought Training for Open E2E Spoken Dialogue Systems
Character-Aware Neural Language Models
Character-level Convolutional Networks for Text Classification
Character-Level Language Modeling with Deeper Self-Attention
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks
ChatMusician: Understanding and Generating Music Intrinsically with LLM
ChipNeMo: Domain-Adapted LLMs for Chip Design
CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
Clotho: An Audio Captioning Dataset
CMU's IWSLT 2024 Simultaneous Speech Translation System
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
CoCa: Contrastive Captioners are Image-Text Foundation Models
Cockpit: A Practical Debugging Tool for the Training of Deep Neural Networks
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
CodeRAG-Bench: Can Retrieval Augment Code Generation?
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
Coding Theorems for a Discrete Source With a Fidelity Criterion
Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
COMET: A Neural Framework for MT Evaluation
CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task
Common Voice: A Massively-Multilingual Speech Corpus
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Compact Speech Translation Models via Discrete Speech Units Pretraining
Comparative layer-wise analysis of self-supervised speech models
Comparing Discrete and Continuous Space LLMs for Speech Recognition
Competence-based Curriculum Learning for Neural Machine Translation
Compositional Entailment Learning for Hyperbolic Vision-Language Models
Computational Optimal Transport
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Condita: A state machine like architecture for multimodal task bots
Conditional Image Generation with PixelCNN Decoders
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Confidence-Aware Scheduled Sampling for Neural Machine Translation
Confident Adaptive Language Modeling
Conformal Prediction for Natural Language Processing: A Survey
Conformer: Convolution-augmented Transformer for Speech Recognition
Connecting Speech Encoder and Large Language Model for ASR
Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
ConSeC: Word Sense Disambiguation as Continuous Sense Comprehension
Consent in Crisis: The Rapid Decline of the AI Data Commons
Context Encoders: Feature Learning by Inpainting
Context Encoding for Semantic Segmentation
Context-aware Neural Machine Translation for English-Japanese Business Scene Dialogues
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Continuous Audio Language Models
Continuous Learning from Human Post-Edits for Neural Machine Translation
Continuous Speech Tokenizer in Text To Speech
Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
Contrastive language and vision learning of general fashion concepts
Contrastive Language-Image Pre-training for the Italian Language
Contrastive Learning with Hard Negative Samples
Contrastive Multiview Coding
Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words
Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
Contrastive Representation Learning: A Framework and Review
Controllable Speech Representation Learning Via Voice Conversion and AIC Loss
Controlling Neural Networks with Rule Representations
ConvMLP: Hierarchical Convolutional MLPs for Vision
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving
CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought
Counterfactual Fairness
Counterfactual harm
Counterfactual Reasoning and Learning Systems
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
CroissantLLM: A Truly Bilingual French-English Language Model
CroMo: Cross-Modal Learning for Monocular Depth Estimation
Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
Cross-lingual Language Model Pretraining
Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training
Cross-task weakly supervised learning from instructional videos
Cryptanalytic Extraction of Neural Network Models
CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages
CTC-based Compression for Direct Speech Translation
CTCBERT: Advancing Hidden-unit BERT with CTC Objectives
Current Limitations of Language Models: What You Need is Retrieval
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech
DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
DASB - Discrete Audio and Speech Benchmark
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
Data Augmentation Approaches in Natural Language Processing: A Survey
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
Data Efficient Reflow for Few Step Audio Generation
Data Selection for Language Models via Importance Resampling
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Dataset Distillation: A Comprehensive Review
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
Decoding speech perception from non-invasive brain recordings
Decoupled Weight Decay Regularization
Deep Biaffine Attention for Neural Dependency Parsing
Deep Clustering for Unsupervised Learning of Visual Features
Deep contextualized word representations
Deep Ensemble as a Gaussian Process Approximate Posterior
Deep Ensembles: A Loss Landscape Perspective
Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond
Deep Learning with Differential Privacy
Deep Mask Memory Network with Semantic Dependency and Context Moment for Aspect Level Sentiment Classification
Deep Neural Networks and Tabular Data: A Survey
Deep reinforcement learning from human preferences
Deep Residual Learning for Image Recognition
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Deep Voice: Real-time Neural Text-to-Speech
DeepGaze II: Reading fixations from deep features trained on object recognition
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
DeepMMSE: A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V3 Technical Report
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSpace: Dynamic Spatial and Source Cue Based Source Separation for Dialog Enhancement
Defeating Prompt Injections by Design
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021
DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
DEMix Layers: Disentangling Domains for Modular Language Modeling
Dense Associative Memory for Pattern Recognition
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
Depthwise Convolution is All You Need for Learning Multiple Visual Domains
Describing Multimedia Content using Attention-based Encoder--Decoder Networks
Designing and Interpreting Probes with Control Tasks
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement
DETRs with Collaborative Hybrid Assignments Training
DeVAn: Dense Video Annotation for Video-Language Models
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Did Translation Models Get More Robust Without Anyone Even Noticing?
Difference-Masking: Choosing What to Mask in Continued Pretraining
Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Direct speech-to-speech translation with a sequence-to-sequence model
Direct speech-to-speech translation with discrete units
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
Discovery of Unstable Singularities
Discrete Audio Tokens: More Than a Survey!
Discrete Latent Structure in Neural Networks
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
Disentangling Textual and Acoustic Features of Neural Speech Representations
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
Distillation Scaling Laws
Distilling the Knowledge in a Neural Network
Distributed Representations of Words and Phrases and their Compositionality
Distribution Fields for Tracking
Distributional term representations: an experimental comparison
Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization
dMel: Speech Tokenization made Simple
DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Do Context-Aware Translation Models Pay the Right Attention?
Do Multi-Sense Embeddings Improve Natural Language Understanding?
DOCE: Finding the Sweet Spot for Execution-Based Code Generation
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Does Simultaneous Speech Translation need Simultaneous Models?
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation
Don't Decay the Learning Rate, Increase the Batch Size
Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
Don't Read Too Much into It: Adaptive Computation for Open-Domain Question Answering
DoWhy: An End-to-End Library for Causal Inference
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
DRAW: A Recurrent Neural Network For Image Generation
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
DTrOCR: Decoder-only Transformer for Optical Character Recognition
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
E-Branchformer: Branchformer with Enhanced merging for speech recognition
Ecco: An Open Source Library for the Explainability of Transformer Language Models
Effective Approaches to Attention-based Neural Machine Translation
Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform
Efficient Compression of Multitask Multilingual Speech Models
Efficient Estimation of Word Representations in Vector Space
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Efficient Memory Management for Large Language Model Serving with PagedAttention
Efficient Methods for Natural Language Processing: A Survey
Efficient Neural Audio Synthesis
Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space
Efficient Parallel Audio Generation using Group Masked Language Modeling
Efficient Pre-training for Localized Instruction Generation of Videos
Efficient Representation Learning via Adaptive Context Pooling
Efficient softmax approximation for GPUs
Efficient Stagewise Pretraining via Progressive Subnetworks
Efficient Tool Use with Chain-of-Abstraction Reasoning
Efficient Training of Language Models to Fill in the Middle
Efficient Transformers: A Survey
Efficient Visual Pretraining with Contrastive Detection
Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Efficiently Programming Large Language Models using SGLang
Efficiently Scaling Transformer Inference
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Elucidating the Design Space of Diffusion-Based Generative Models
Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric
Emergent and Predictable Memorization in Large Language Models
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Emerging Properties in Self-Supervised Vision Transformers
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
EMMeTT: Efficient Multimodal Machine Translation Training
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
Encoding of speech in convolutional layers and the brain stem based on language experience
Encoding sound in the cochlea: from receptor potential to afferent discharge
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
End-to-End Dense Video Captioning with Parallel Decoding
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
End-to-End Object Detection with Transformers
End-to-End Simultaneous Speech Translation with Differentiable Segmentation
End-to-End Speech Recognition: A Survey
End-to-End Speech-to-Text Translation: A Survey
End-to-end Temporal Action Detection with Transformer
End-to-End Text-Dependent Speaker Verification
Energy and Policy Considerations for Deep Learning in NLP
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
Enriching Word Vectors with Subword Information
eP-ALM: Efficient Perceptual Augmentation of Language Models
Epitran: Precision G2P for Many Languages
Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
Error detecting and error correcting codes
ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
ESPnet-ST: All-in-One Speech Translation Toolkit
ESPnet: End-to-End Speech Processing Toolkit
Estimating the Completeness of Discrete Speech Units
Estimating Training Data Influence by Tracing Gradient Descent
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
Estimation of Non-Normalized Statistical Models by Score Matching
ETC: Encoding Long and Structured Inputs in Transformers
Euclidean Embedding of Co-occurrence Data
EuroBERT: Scaling Multilingual Encoders for European Languages
EuroLLM-9B: Technical Report
EuroLLM: Multilingual Language Models for Europe
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates
Evaluating deep learning architectures for speech emotion recognition
Evaluating Frontier Models for Dangerous Capabilities
Evaluating Language Model Agency through Negotiations
Evaluating language models as risk scores
Evaluating Large Language Models Trained on Code
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
Evaluating the Stability of Embedding-based Word Similarities
Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
Evasion Attacks against Machine Learning at Test Time
EVE: Explainable Vector Based Embedding Technique Using Wikipedia
Evolution through Large Models
Explainability for Large Language Models: A Survey
Explainability for Speech Models: On the Challenges of Acoustic Feature Selection
Explainability Via Causal Self-Talk
Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Exploiting Similarities among Languages for Machine Translation
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning
Exploration on HuBERT with Multiple Resolutions
Exploring Simple Siamese Representation Learning
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
Exploring the Benefits of Tokenization of Discrete Acoustic Units
Exploring the Limits of Language Modeling
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
Extracting Training Data from Diffusion Models
Extracting Training Data from Large Language Models
Extraction of Salient Sentences from Labelled Documents
Extreme Masking for Learning Instance and Distributed Visual Representations
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Facebook AI WMT21 News Translation Task Submission
fairseq S2T: Fast Speech-to-Text Modeling with fairseq
Faith and Fate: Limits of Transformers on Compositionality
Falcon2-11B Technical Report
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Fast and Simplex: 2-Simplicial Attention in Triton
Fast and Vectorizable Alternative to Binary Search in O(1) Applicable to a Wide Domain of Sorted Arrays of Floating Point Numbers
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Fast Inference from Transformers via Speculative Decoding
Fast Model Editing at Scale
Fast Transformer Decoding: One Write-Head is All You Need
FastPitch: Parallel Text-to-speech with Pitch Prediction
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech: Fast, Robust and Controllable Text to Speech
Fauno: The Italian Large Language Model that will leave you senza parole!
Federated Learning: Strategies for Improving Communication Efficiency
Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning
Fermat Factorization in the Wild
FEVER: a large-scale dataset for Fact Extraction and VERification
Few-Shot Keyword Spotting in Any Language
Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Fine-tuning Language Models for Factuality
Finetuned Language Models Are Zero-Shot Learners
Finstreder: Simple and fast Spoken Language Understanding with Finite State Transducers using modern Speech-to-Text models
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Flamingo: a Visual Language Model for Few-Shot Learning
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Flow Matching for Generative Modeling
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Flying and swimming animals cruise at a Strouhal number tuned for high power efficiency
FNet: Mixing Tokens with Fourier Transforms
Focal Loss for Dense Object Detection
Focal Modulation Networks
Focal Modulation Networks for Interpretable Sound Classification
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Following the Human Thread in Social Navigation
Formal Limitations on the Measurement of Mutual Information
Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis
Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
From Handcrafted Features to LLMs: A Brief Survey for Machine Translation Quality Estimation
From Recognition to Cognition: Visual Commonsense Reasoning
From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
From Sparse to Soft Mixtures of Experts
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Full Parameter Fine-tuning for Large Language Models with Limited Resources
Fully Character-Level Neural Machine Translation without Explicit Segmentation
Fully Convolutional Networks for Semantic Segmentation
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
Fundamentals of Grammatology
GAIA: a benchmark for General AI Assistants
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Gaussian Mixture Latent Vector Grammars
GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini: A Family of Highly Capable Multimodal Models
Gemma 2: Improving Open Language Models at a Practical Size
Gemma: Open Models Based on Gemini Research and Technology
Gender Bias in Contextualized Word Embeddings
Gender Bias in Coreference Resolution
Generalization Ability of MOS Prediction Networks
Generalization in diffusion models arises from geometry-adaptive harmonic representations
Generalization through Memorization: Nearest Neighbor Language Models
Generalized Shape Metrics on Neural Representations
Generating Diverse High-Fidelity Images with VQ-VAE-2
Generating Long Sequences with Sparse Transformers
Generative Adversarial Networks
Generative Models: What do they know? Do they know things? Let's find out!
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Generative Spoken Dialogue Language Modeling
Generative Spoken Language Modeling from Raw Audio
Generator Matching: Generative modeling with arbitrary Markov processes
Genie: Generative Interactive Environments
Geographic Adaptation of Pretrained Language Models
Geographic and Geopolitical Biases of Language Models
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
GFlowNet Foundations
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
Git Re-Basin: Merging Models modulo Permutation Symmetries
Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Globally Normalized Transition-Based Neural Networks
GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Glow: Generative Flow with Invertible 1x1 Convolutions
GLU Variants Improve Transformer
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Goku: Flow Based Video Generative Foundation Models
Good Night at 4 pm?! Time Expressions in Different Cultures
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Gorilla: Large Language Model Connected with Massive APIs
GPT-4 Technical Report
gpt-oss-120b & gpt-oss-20b Model Card
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Gradient Descent Converges to Minimizers
Granary: Speech Recognition and Translation Dataset in 25 European Languages
Grandmaster-Level Chess Without Search
Graph Pre-training for AMR Parsing and Generation
Grapheme-to-Phoneme Models for (Almost) Any Language
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Group Normalization
Group Robust Preference Optimization in Reward-free RLHF
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Guess What I Think: Streamlined EEG-to-Image Generation with Latent Diffusion Models
Guiding a Diffusion Model with a Bad Version of Itself
HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users
HellaSwag: Can a Machine Really Finish Your Sentence?
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
HGRN2: Gated Linear RNNs with State Expansion
Hi-Fi Multi-Speaker English TTS Dataset
Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
Hierarchical nucleation in deep neural networks
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec
HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
High Fidelity Neural Audio Compression
High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
High-Fidelity Audio Compression with Improved RVQGAN
High-Fidelity Simultaneous Speech-To-Speech Translation
High-speed high-security signatures
HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation
Highly accurate protein structure prediction with AlphaFold
Highway Networks
Holistic Evaluation of Language Models
Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval
Houdini: Fooling Deep Structured Prediction Models
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
How (not) to do Phonological Typology: The Case of Pitch-Accent
How Context Affects Language Models' Factual Predictions
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations
How Does Batch Normalization Help Optimization?
How Effective are State Space Models for Machine Translation?
How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation
How many degrees of freedom do we need to train deep networks: a loss landscape perspective
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
How to represent part-whole hierarchies in a neural network
How to Train Your Energy-Based Models
How transferable are features in deep neural networks?
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
How well can VMEC predict the initial saturation of external kink modes in near circular tokamaks and $l=2$ stellarators?
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Human Action Localization with Sparse Spatial Supervision
Human-in-the-Loop Causal Discovery under Latent Confounding using Ancestral GFlowNets
Humanity's Last Exam
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Hyena Hierarchy: Towards Larger Convolutional Language Models
HyperAttention: Long-context Attention in Near-Linear Time
Hyperbolic Active Learning for Semantic Segmentation under Domain Shift
Hyperbolic Deep Neural Networks: A Survey
Hyperbolic Geometry
Hyperbolic Learning with Multimodal Large Language Models
Hyperbolic Neural Networks
HYperbolic Self-Paced Learning for Self-Supervised Skeleton-based Action Representations
HyperCLOVA X Technical Report
Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Im2Text: Describing Images Using 1 Million Captioned Photographs
Image and Video Tokenization with Binary Spherical Quantization
Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
ImageBind: One Embedding Space To Bind Them All
ImageNet Large Scale Visual Recognition Challenge
Imitation Learning as $f$-Divergence Minimization
Impact of Tokenization on Language Models: An Analysis for Turkish
Implicit Generation and Generalization in Energy-Based Models
Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation
Improved Baselines with Momentum Contrastive Learning
Improved Baselines with Visual Instruction Tuning
Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback
Improving language models by retrieving from trillions of tokens
Improving Language Understanding by Generative Pre-Training
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
Improving Neural Language Models with a Continuous Cache
Improving Neural Machine Translation Models with Monolingual Data
Improving neural networks by preventing co-adaptation of feature detectors
Improving Personalized Explanation Generation through Visualization
Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
Improving Word Representations via Global Context and Multiple Word Prototypes
Improving Zero-Shot Translation by Disentangling Positional Information
Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning
In Defense of Grid Features for Visual Question Answering
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Inferring and Executing Programs for Visual Reasoning
InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization
InfoNCE: Identifying the Gap Between Theory and Practice
Information Theory and Statistics: an overview
Information-Theoretic Probing for Linguistic Structure
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Inseq: An Interpretability Toolkit for Sequence Generation Models
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Instruction Tuning for Large Language Models: A Survey
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
INTELLECT-1 Technical Report
Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Interpolating Compressed Parameter Subspaces
Interpretable Convolutional Filters with SincNet
Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings
Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
Intriguing properties of neural networks
Intrinsic dimension of data representations in deep neural networks
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Intrusive And Non-Intrusive Perceptual Speech Quality Assessment Using A Convolutional Neural Network
Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting
Investigating Backtranslation in Neural Machine Translation
Investigating Decoder-only Large Language Models for Speech-to-text Translation
Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
Investigating Multilingual NMT Representations at Scale
Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
Is Context Helpful for Chat Translation Evaluation?
Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning
Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
ITALIC: An Italian Intent Classification Dataset
ITU-T coders for wideband, superwideband, and fullband speech communication [Series Editorial]
Jamba: A Hybrid Transformer-Mamba Language Model
Jasper: An End-to-End Convolutional Neural Acoustic Model
JetFormer: An Autoregressive Generative Model of Raw Images and Text
Johnson-Lindenstrauss Lemma, Linear and Nonlinear Random Projections, Random Fourier Features, and Random Kitchen Sinks: Tutorial and Survey
Joint-task Self-supervised Learning for Temporal Correspondence
JOREK3D: An extension of the JOREK nonlinear MHD code to stellarators
JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
KAN: Kolmogorov-Arnold Networks
Kimi-Audio Technical Report
KIT's Multilingual Speech Translation System for IWSLT 2023
kNN For Whisper And Its Effect On Bias And Speaker Adaptation
Knowledge Conflicts for LLMs: A Survey
Knowledge distillation: A good teacher is patient and consistent
Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges
LAION-5B: An open large-scale dataset for training next generation image-text models
LaMP: When Large Language Models Meet Personalization
Language agents achieve superhuman synthesis of scientific knowledge
Language Agnostic Speech Embeddings for Emotion Classification
Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Language Model Can Listen While Speaking
Language Modeling with Deep Transformers
Language Modeling with Gated Convolutional Networks
Language Models are Few-Shot Learners
Language Models are Multilingual Chain-of-Thought Reasoners
Language Models are Realistic Tabular Data Generators
Language Models are Unsupervised Multitask Learners
Language Models as Knowledge Bases?
Language Models Represent Space and Time
Language Models: A Guide for the Perplexed
Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Laplace Redux -- Effortless Bayesian Deep Learning
Large Associative Memory Problem in Neurobiology and Machine Learning
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Large Batch Training of Convolutional Networks
Large Concept Models: Language Modeling in a Sentence Representation Space
Large Language Diffusion Models
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial
Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences
Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Large Language Models As Evolution Strategies
Large Language Models for Compiler Optimization
Large Language Models for Data Annotation: A Survey
Large Language Models: A Survey
Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
Large-Scale Automatic Audiobook Creation
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification
Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling
Layer Normalization
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition
Learnability and the Vapnik-Chervonenkis dimension
Learned feature representations are biased by complexity, learning order, position, and more
Learning a similarity metric discriminatively, with application to face verification
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
Learning and Evaluating General Linguistic Intelligence
Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Learning Correspondence from the Cycle-Consistency of Time
Learning Differentially Private Recurrent Language Models
Learning Filterbanks from Raw Speech for Phone Recognition
Learning Interactive Real-World Simulators
Learning Language-Specific Layers for Multilingual Machine Translation
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Learning Source Disentanglement in Neural Audio Codec
Learning Sparse Neural Networks through $L_0$ Regularization
Learning Speaker Representations with Mutual Information
Learning Temporal Dynamics from Cycles in Narrated Video
Learning Temporal Sentence Grounding From Narrated EgoVideos
Learning the Predictability of the Future
Learning to Compress Prompts with Gist Tokens
Learning to Generate Reviews and Discovering Sentiment
Learning to Merge Word Senses
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
Learning to summarize from human feedback
Learning Transferable Visual Models From Natural Language Supervision
Learning with Fenchel-Young Losses
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Leveraging Audio-Only Data for Text-Queried Target Sound Extraction
Leveraging Content and Acoustic Representations for Speech Emotion Recognition
Leveraging Gloss Knowledge in Neural Word Sense Disambiguation by Hierarchical Co-Attention
Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation
Libri-Light: A Benchmark for ASR with Limited or No Supervision
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Librispeech: An ASR corpus based on public domain audio books
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
Lifting the Curse of Multilinguality by Pre-training Modular Transformers
Lightweight and Efficient Spoken Language Identification of Long-form Audio
Lightweight Audio Segmentation for Long-form Speech Translation
LIMO: Less is More for Reasoning
Linear Connectivity Reveals Generalization Strategies
Linear-time Minimum Bayes Risk Decoding with Reference Aggregation
Linformer: Self-Attention with Linear Complexity
Linguini: A benchmark for language-agnostic linguistic reasoning
Linguistic Regularities in Sparse and Explicit Word Representations
Liquid Time-constant Networks
Liquid: Language Models are Scalable Multi-modal Generators
Listen, Think, and Understand
LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Listenable Maps for Audio Classifiers
LiT: Zero-Shot Transfer with Locked-image text Tuning
LL3M: Large Language 3D Modelers
Llama 2: Open Foundation and Fine-Tuned Chat Models
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
LLaMA: Open and Efficient Foundation Language Models
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
LLaSM: Large Language and Speech Model
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM4Eval: Large Language Model for Evaluation in IR
LM-Polygraph: Uncertainty Estimation for Language Models
LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models
Localizing Objects with Self-Supervised Transformers and no Labels
LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
Locating and Editing Factual Associations in GPT
Logits of API-Protected LLMs Leak Proprietary Information
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
Long-Context Generalization with Sparse Attention
Long-Context Language Modeling with Parallel Context Encoding
Longformer: The Long-Document Transformer
LongNet: Scaling Transformers to 1,000,000,000 Tokens
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
LoRA: Low-Rank Adaptation of Large Language Models
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
Lost in the Middle: How Language Models Use Long Contexts
LRS3-TED: a large-scale dataset for visual speech recognition
LSSED: a large-scale dataset and benchmark for speech emotion recognition
Lumiere: A Space-Time Diffusion Model for Video Generation
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models
M-Prometheus: A Suite of Open Multilingual LLM Judges
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Making AI Forget You: Data Deletion in Machine Learning
Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models
Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game
Making Pre-trained Language Models Better Few-shot Learners
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Mamba in Speech: Towards an Alternative to Self-Attention
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Many-Shot In-Context Learning
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Marian: Fast Neural Machine Translation in C++
MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
Mask-Predict: Parallel Decoding of Conditional Masked Language Models
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders that Listen
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
MaskGIT: Masked Generative Image Transformer
MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
Massively Multilingual Neural Grapheme-to-Phoneme Conversion
Massively Multilingual Neural Machine Translation
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
Matrix Decomposition and Applications
Matryoshka Diffusion Models
Matryoshka Quantization
Matryoshka Representation Learning
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
MAWPS: A Math Word Problem Repository
MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Measuring and Increasing Context Usage in Context-Aware Machine Translation
Measuring Massive Multitask Language Understanding
Measuring the Effects of Data Parallelism on Neural Network Training
Measuring the Intrinsic Dimension of Objective Landscapes
Measuring the Mixing of Contextual Information in the Transformer
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf
MEG-MASC: a high-quality magneto-encephalography dataset for evaluating natural speech processing
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Membership Inference Attacks on Machine Learning: A Survey
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory
Memory Layers at Scale
Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems
MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
MERLOT: Multimodal Neural Script Knowledge Models
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Meta-Learning Online Adaptation of Language Models
Meta-Transformer: A Unified Framework for Multimodal Learning
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
MEXMA: Token-level objectives improve sentence representations
MFPP: Morphological Fragmental Perturbation Pyramid for Black-Box Model Explanations
mGeNTE: A Multilingual Resource for Gender-Neutral Language and Translation
mHuBERT-147: A Compact Multilingual HuBERT Model
Microsoft COCO: Common Objects in Context
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Minimum Bayes-Risk Decoding for Statistical Machine Translation
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
MIO: A Foundation Model on Multimodal Tokens
Mistral 7B
Mixed Precision Training
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings
Mixtral of Experts
Mixture-of-Experts Graph Transformers for Interpretable Particle Collision Detection
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
MLP-Mixer: An all-MLP Architecture for Vision
MLS: A Large-Scale Multilingual Dataset for Speech Research
MM-LLMs: Recent Advances in MultiModal Large Language Models
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Model Editing with Canonical Examples
Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation
Modelling low-resource accents without accent-specific TTS frontend
Modelling of saturated external MHD instabilities in tokamaks: a comparison of 3D free boundary equilibria and nonlinear stability calculations
Modular Deep Learning
Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
ModuleFormer: Modularity Emerges from Mixture-of-Experts
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Momentum Contrast for Unsupervised Visual Representation Learning
Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods
MoonCast: High-Quality Zero-Shot Podcast Generation
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Moshi: a speech-text foundation model for real-time dialogue
MOSNet: Deep Learning based Objective Assessment for Voice Conversion
MouSi: Poly-Visual-Expert Vision-Language Models
Movie Gen: A Cast of Media Foundation Models
MovieNet: A Holistic Dataset for Movie Understanding
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
mSLAM: Massively multilingual joint pre-training for speech and text
MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
MuLan: A Joint Embedding of Music Audio and Natural Language
Multi-Prototype Vector-Space Models of Word Meaning
Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
Multi-Scale Context Aggregation by Dilated Convolutions
Multi-sense embeddings through a word sense disambiguation process
Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
Multi-task self-supervised learning for Robust Speech Recognition
Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models
Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts
Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language
Multilingual Speech Models for Automatic Speech Recognition Exhibit Gender Performance Gaps
Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Multimodal Few-Shot Learning with Frozen Language Models
Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Neural Databases
Multiple Importance Sampling ELBO and Deep Ensembles of Variational Approximations
Multiple Object Recognition with Visual Attention
Multitask Prompted Training Enables Zero-Shot Task Generalization
Muon is Scalable for LLM Training
Muon Optimizer Accelerates Grokking
Music Transformer
MusicLM: Generating Music From Text
MuST-C: A multilingual corpus for end-to-end speech translation
MuST-C: a Multilingual Speech Translation Corpus
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Natural language guidance of high-fidelity text-to-speech with synthetic annotations
Natural Language Processing (almost) from Scratch
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics
NBDT: Neural-Backed Decision Trees
Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs
Needle In A Multimodal Haystack
Network In Network
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Neural Collaborative Filtering
Neural Combinatorial Optimization with Reinforcement Learning
Neural Discrete Representation Learning
Neural Grapheme-to-Phoneme Conversion with Pre-trained Grapheme Models
Neural Language Model Pruning for Automatic Speech Recognition
Neural Linguistic Steganography
Neural Machine Translation by Jointly Learning to Align and Translate
Neural Machine Translation of Rare Words with Subword Units
Neural Machine Translation: A Review and Survey
Neural Machine Translation: Challenges, Progress and Future
Neural Motifs: Scene Graph Parsing with Global Context
Neural Network Acceptability Judgments
Neural Networks are Decision Trees
Neural Networks Fail to Learn Periodic Functions and How to Fix It
Neural Sequence Learning Models for Word Sense Disambiguation
Neural Speech Synthesis with Transformer Network
Neural Voice Cloning with a Few Samples
Neural Word Embedding as Implicit Matrix Factorization
NeuralDEM - Real-time Simulation of Industrial Particulate Flows
Neurosymbolic AI -- Why, What, and How
NeurST: Neural Speech Translation Toolkit
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
No Language Left Behind: Scaling Human-Centered Machine Translation
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
NoLiMa: Long-Context Evaluation Beyond Literal Matching
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Non-Autoregressive Neural Machine Translation
Non-Exchangeable Conformal Language Generation with Nearest Neighbors
Non-Exchangeable Conformal Risk Control
Non-intrusive Speech Quality Assessment Using Neural Networks
Nonlinear Dimensionality Reduction by Locally Linear Embedding
Nonlinear MHD modeling of soft ÎČ limits in W7-AS
Nonlinear MHD simulations of external kinks in quasi-axisymmetric stellarators using an axisymmetric external rotational transform approximation
Normalization Techniques in Training DNNs: Methodology, Analysis and Application
Not Just a Black Box: Learning Important Features Through Propagating Activation Differences
Nougat: Neural Optical Understanding for Academic Documents
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers
NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
NVLM: Open Frontier-Class Multimodal LLMs
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
OLMo: Accelerating the Science of Language Models
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
OmniParser for Pure Vision Based GUI Agent
On Compositions of Transformations in Contrastive Self-Supervised Learning
On Divergence Measures for Training GFlowNets
On Information and Sufficiency
On Instruction-Finetuning Neural Machine Translation Models
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
On Layer Normalization in the Transformer Architecture
On the cyclic nature of perception in vision versus audition
On the difficulty of training Recurrent Neural Networks
On the Effectiveness of Acoustic BPE in Decoder-Only TTS
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
On the Fundamental Impossibility of Hallucination Control in Large Language Models
On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
On the Integration of Optical Flow and Action Recognition
On The Landscape of Spoken Language Models: A Comprehensive Survey
On the Limitations of Compute Thresholds as a Governance Strategy
On the Measure of Intelligence
On the Number of Linear Regions of Deep Neural Networks
On the Opportunities and Risks of Foundation Models
On the Out-of-distribution Generalization of Probabilistic Image Modelling
On the Representation Collapse of Sparse Mixture of Experts
One Mind, Many Tongues: A Deep Dive into Language-Agnostic Knowledge Neurons in Large Language Models
One ruler to measure them all: Benchmarking multilingual long-context language models
One TTS Alignment To Rule Them All
One Wide Feedforward is All You Need
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
One-Shot Open Affordance Learning with Foundation Models
One-To-Many Multilingual End-to-end Speech Translation
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
OneLLM: One Framework to Align All Modalities with Language
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
Open-Source Conversational AI with SpeechBrain 1.0
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
OpenVoice: Versatile Instant Voice Cloning
OPT: Open Pre-trained Transformer Language Models
Optical Flow with Semantic Segmentation and Localized Layers
Optimal Bounds for Open Addressing Without Reordering
Optimization Methods for Large-Scale Machine Learning
OpusLM: A Family of Open Unified Speech Language Models
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Our data, ourselves: privacy via distributed noise generation
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation
Overcoming catastrophic forgetting in neural networks
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting
PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaliGemma 2: A Family of Versatile VLMs for Transfer
PaliGemma: A versatile 3B VLM for transfer
PaLM 2 Technical Report
PaLM: Scaling Language Modeling with Pathways
PALO: A Polyglot Large Multimodal Model for 5B People
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Parakeet A natural sounding, conversational text-to-speech model
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
Parallel Scheduled Sampling
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
Parallel Tacotron: Non-Autoregressive and Controllable TTS
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Parameter-efficient fine-tuning of large-scale pre-trained language models
Parameter-Efficient Transfer Learning for NLP
Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation
Parsing with Compositional Vector Grammars
PaSS: Parallel Speculative Sampling
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Pay Attention to MLPs
PCFGs Can Do Better: Inducing Probabilistic Context-Free Grammars with Many Symbols
Pengi: An Audio Language Model for Audio Tasks
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Perceiver: General Perception with Iterative Attention
Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines
Phase behavior of Cacio and Pepe sauce
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-4 Technical Report
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phonetic Analysis of Self-supervised Representations of English Speech
Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors
Pitfalls and Outlooks in Using COMET
PIXAR: Auto-Regressive Language Modeling in Pixel Space
PLACEHOLDER hertz-dev - Standard Intelligence
Playing Atari with Deep Reinforcement Learning
Playing Language Game with LLMs Leads to Jailbreaking
Poisoning Language Models During Instruction Tuning
Poisoning Web-Scale Training Datasets is Practical
PolyLM: An Open Source Polyglot Large Language Model
PolyVoice: Language Models for Speech to Speech Translation
Position: Categorical Deep Learning is an Algebraic Theory of All Architectures
Practical recommendations for gradient-based training of deep architectures
Prediction and Entropy of Printed English
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Preliminary WMT24 Ranking of General MT Systems and LLMs
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
Preserving Privacy in Large Language Models: A Survey on Current Threats and Solutions
Prime Collective Communications Library -- Technical Report
Principles of Visual Tokens for Efficient Video Understanding
Probabilistic Artificial Intelligence
Probabilistic encryption & how to play mental poker keeping secret all partial information
Probing the phonetic and phonological knowledge of tones in Mandarin TTS models
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Progress Report: Towards European LLMs
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models
Prompting Large Language Models with Speech Recognition Abilities
Prompting with Phonemes: Enhancing LLM Multilinguality for non-Latin Script Languages
Property Neurons in Self-Supervised Speech Transformers
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases
Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
Proximal Policy Optimization Algorithms
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Pushing the Limits of Zero-shot End-to-End Speech Translation
Pyramid Feature Attention Network for Saliency detection
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
Qualitatively characterizing neural network optimization problems
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Quality-Aware Decoding for Neural Machine Translation
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
Quantifying Memorization Across Neural Language Models
Quantifying the Plausibility of Context Reliance in Neural Machine Translation
Quantifying the Uniqueness and Divisiveness of Presidential Discourse
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Qwen Technical Report
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Qwen2 Technical Report
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Qwen2.5 Technical Report
Qwen3 Technical Report
Randomized Approximation of the Gram Matrix: Exact Computation and Probabilistic Bounds
Re-ranking Person Re-identification with k-reciprocal Encoding
Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
Reading Digits in Natural Images with Unsupervised Feature Learning
Real Time Speech Enhancement in the Waveform Domain
ReALM: Reference Resolution As Language Modeling
Recent Advances in Direct Speech-to-text Translation
Recent Advances in Discrete Speech Tokens: A Review
Recent Advances in Speech Language Models: A Survey
Recent Developments on ESPnet Toolkit Boosted by Conformer
RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors
Recurrent Memory Transformer
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Reducing Activation Recomputation in Large Transformer Models
Reducing the Dimensionality of Data with Neural Networks
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
Reformer: The Efficient Transformer
Reframing Human-AI Collaboration for Generating Free-Text Explanations
Regularized Evolution for Image Classifier Architecture Search
Reinforcement Learning: An Overview
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Relative representations enable zero-shot latent space communication
Replacing the do-calculus with Bayes rule
Representation Learning with Contrastive Predictive Coding
Representational dissimilarity metric spaces for stochastic neural networks
Representational similarity analysis – connecting the branches of systems neuroscience
Representations of language in a model of visually grounded speech signal
Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Reranking Laws for Language Generation: A Communication-Theoretic Perspective
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
Residual Contrastive Learning for Image Reconstruction: Learning Transferable Representations from Noisy Images
Retentive Network: A Successor to Transformer for Large Language Models
Rethinking and Improving Multi-task Learning for End-to-end Speech Translation
Rethinking Attention with Performers
Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Revisiting Acoustic Features for Robust ASR
Revisiting Feature Prediction for Learning Visual Representations from Video
Revisiting minimum description length complexity in overparameterized models
Revisiting Model Stitching to Compare Neural Representations
Revisiting Over-Smoothness in Text to Speech
Revisiting Self-Distillation
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Rho-1: Not All Tokens Are What You Need
Risks from Learned Optimization in Advanced Machine Learning Systems
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS
Robust Speech Recognition via Large-Scale Weak Supervision
Robustness May Be at Odds with Accuracy
RoFormer: Enhanced Transformer with Rotary Position Embedding
Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts
RULER: What's the Real Context Size of Your Long-Context Language Models?
RWKV: Reinventing RNNs for the Transformer Era
S2ORC: The Semantic Scholar Open Research Corpus
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Sample Efficient Adaptive Text-to-Speech
SaulLM-7B: A pioneering Large Language Model for Law
SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
Scalable Diffusion Models with Transformers
Scalable Expectation Estimation with Subtractive Mixture Models
Scalable-Softmax Is Superior for Attention
Scaling Analysis of Interleaved Speech-Text Language Models
Scaling Instructable Agents Across Many Simulated Worlds
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Scaling Laws for Generative Mixed-Modal Language Models
Scaling Laws for Multilingual Neural Machine Translation
Scaling Laws for Neural Language Models
Scaling Laws for Reward Model Overoptimization
Scaling Laws for Transfer
Scaling Properties of Speech Language Models
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Scaling Speech Technology to 1,000+ Languages
Scaling Transformer to 1M tokens and beyond with RMT
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Scaling Up Influence Functions
Scaling Up Online Speech Recognition Using ConvNets
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Scaling Vision with Sparse Mixture of Experts
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
Score-Based Generative Modeling through Stochastic Differential Equations
Seamless: Multilingual Expressive and Streaming Speech Translation
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
SEANet: A Multi-modal Speech Enhancement Network
Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Selective State Space Model for Monaural Speech Enhancement
Self-Alignment with Instruction Backtranslation
Self-Attention with Relative Position Representations
Self-Chained Image-Language Model for Video Localization and Question Answering
Self-critical Sequence Training for Image Captioning
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Self-labelling via simultaneous clustering and representation learning
Self-Rewarding Language Models
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Self-Supervised Learning of Pretext-Invariant Representations
Self-Supervised Speech Representation Learning: A Review
Self-Supervised Speech Representations are More Phonetic than Semantic
Self-supervised Video Object Segmentation by Motion Grouping
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Self-Taught Evaluators
SELM: Speech Enhancement Using Discrete Tokens and Language Models
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Sentence Length
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Sequence Level Training with Recurrent Neural Networks
Sequence Transduction with Recurrent Neural Networks
Sequence-Level Knowledge Distillation
SGDR: Stochastic Gradient Descent with Warm Restarts
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models
Shortcut Learning in Deep Neural Networks
Shortformer: Better Language Modeling using Shorter Inputs
Should You Mask 15% in Masked Language Modeling?
SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
Sigmoid Loss for Language Image Pre-Training
Similarity of Neural Network Representations Revisited
Simple and Controllable Music Generation
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning
Simple, Scalable Adaptation for Neural Machine Translation
Simplifying Transformer Blocks
Skip-Thought Vectors
SLIC Superpixels Compared to State-of-the-Art Superpixel Methods
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
SLURP: A Spoken Language Understanding Resource Package
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
SNAC: Multi-Scale Neural Audio Codec
Snapshot Ensembles: Train 1, get M for free
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
SODA: Story Oriented Dense Video Captioning Evaluation Framework
Soft Merging of Experts with Adaptive Routing
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
softmax is not enough (for sharp out-of-distribution)
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
SoundStorm: Efficient Parallel Audio Generation
SoundStream: An End-to-End Neural Audio Codec
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Space-Time Correspondence as a Contrastive Random Walk
SpanBERT: Improving Pre-training by Representing and Predicting Spans
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sparse and Continuous Attention Mechanisms
Sparse and Structured Hopfield Networks
Sparse Attention with Linear Units
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse Communication via Mixed Distributions
Sparse continuous distributions and Fenchel-Young losses
Sparse Sequence-to-Sequence Models
Sparse Text Generation
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Speakers of different languages remember visual scenes differently
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Speech Translation with Large Language Models: An Industrial Practice
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Speech-to-Speech Translation For A Real-world Unwritten Language
SpeechAlign: Aligning Speech Generation to Human Preferences
SpeechBrain-MOABB: An open-source Python library for benchmarking deep neural networks applied to EEG signals
SpeechBrain: A General-Purpose Speech Toolkit
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation
SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
SpeechQE: Estimating the Quality of Direct Speech Translation
SpeechT: Findings of the First Mentorship in Speech Translation
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
SpeechVerse: A Large-scale Generalizable Audio Language Model
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Speed/accuracy trade-offs for modern convolutional object detectors
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training
SpiRit-LM: Interleaved Spoken and Written Language Model
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction
Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech
Spoken Language Modeling with Duration-Penalized Self-Supervised Units
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
Spread Flows for Manifold Modelling
SQ-GAN: Semantic Image Communications Using Masked Vector Quantization
SQuId: Measuring Speech Naturalness in Many Languages
ST-LLM: Large Language Models Are Effective Temporal Learners
Stabilising and accelerating light gated recurrent units for automatic speech recognition
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
Stacked Quantizers for Compositional Vector Compression
STAR: A Benchmark for Situated Reasoning in Real-World Videos
StarSpace: Embed All The Things!
State Spaces Aren't Enough: Machine Translation Needs Attention
Statistical Rejection Sampling Improves Preference Optimization
Stealing Part of a Production Language Model
Stealing User Prompts from Mixture of Experts
Steerable CNNs
Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning
StegaStamp: Invisible Hyperlinks in Physical Photographs
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Step-by-Step Diffusion: An Elementary Tutorial
STLight: a Fully Convolutional Approach for Efficient Predictive Learning by Spatio-Temporal joint Processing
Stochastic Average Gradient: A Simple Empirical Investigation
Stochastic Neighbor Embedding
Stochastic Neighbor Embedding with Gaussian and Student-t Distributions: Tutorial and Survey
Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators
Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
Structured Neural Summarization
Structured Pruning of Large Language Models
Structured Training for Neural Network Transition-Based Parsing
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Super Tiny Language Models
SUPERB: Speech processing Universal PERformance Benchmark
SuperBPE: Space Travel for Language Models
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Supervised Contrastive Learning
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Surrogate Gradient Learning in Spiking Neural Networks
Survey of Automatic Metrics for Evaluating Machine Translation at the Document Level
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
Surveying the MLLM Landscape: A Meta-Review of Current Surveys
SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
SWEb: A Large Web Dataset for the Scandinavian Languages
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Symbolic Discovery of Optimization Algorithms
Synthetic DNA applications in information technology
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
Tacotron: Towards End-to-End Speech Synthesis
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
Taming Transformers for High-Resolution Image Synthesis
Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
Task Singular Vectors: Reducing Task Interference in Model Merging
Task Vectors are Cross-Modal
Task-aware Retrieval with Instructions
Task-Aware Unified Source Separation
TASTY: A Transformer based Approach to Space and Time complexity
Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training
TEARS: Textual Representations for Scrutable Recommendations
TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation
TED-LIUM: an Automatic Speech Recognition dedicated corpus
Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Text and Code Embeddings by Contrastive Pre-Training
Text-Free Prosody-Aware Generative Spoken Language Modeling
Textbooks Are All You Need
Textless Speech-to-Speech Translation on Real Data
Textually Pretrained Speech Language Models
Texygen: A Benchmarking Platform for Text Generation Models
TGIF: A New Dataset and Benchmark on Animated GIF Description
The "something something" video database for learning and evaluating visual common sense
The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The Algorithmic Foundations of Differential Privacy
The AMI Meeting Corpus
The Anatomy of a Large-Scale Hypertextual Web Search Engine
The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
The Biological Basis of Audition
The boundary of neural network trainability is fractal
The case for 4-bit precision: k-bit Inference Scaling Laws
The Causal-Neural Connection: Expressiveness, Learnability, and Inference
The challenge of realistic music generation: modelling raw audio at scale
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
The Curious Case of Neural Text Degeneration
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
The Defeat of the Winograd Schema Challenge
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
The distributional hypothesis
The Elements of Differentiable Programming
The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
The first collision for full SHA-1
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
The Forward-Forward Algorithm: Some Preliminary Investigations
The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
The Goldilocks zone: Towards better understanding of neural network loss landscapes
The Hardware Lottery
The Hungarian Method for the Assignment Problem
The Impact of Positional Encoding on Length Generalization in Transformers
The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
The JOREK non-linear extended MHD code and applications to large-scale instabilities and their control in magnetically confined fusion plasmas
The Kinetics Human Action Video Dataset
The Leaderboard Illusion
The Llama 3 Herd of Models
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
The Marginal Value of Adaptive Gradient Methods in Machine Learning
The Matrix Calculus You Need For Deep Learning
The Metropolis-Hastings algorithm
The Modern Mathematics of Deep Learning
The Multimodal Universe: Enabling Large-Scale Machine Learning with 100TB of Astronomical Scientific Data
The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The pitfalls of next-token prediction
The Power of Scale for Parameter-Efficient Prompt Tuning
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
The Relativity of Causal Knowledge
The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks
The Semantic Scholar Open Data Platform
The semantics of the (so-called) clausal determiner nó in Akan (Kwa)
The Seven Tools of Causal Inference with Reflections on Machine Learning
The Spotify Podcast Dataset
The sun compass revisited
The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
The THUMOS Challenge on Action Recognition for Videos "in the Wild"
The Topological BERT: Transforming Attention into Topology for Natural Language Processing
The unreasonable effectiveness of few-shot learning for machine translation
The VoiceMOS Challenge 2022
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
The Winograd schema challenge
The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling
The Zero Resource Speech Challenge 2019: TTS without T
The Zero Resource Speech Challenge 2021: Spoken language modelling
Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Three models for the description of language
Time-Contrastive Networks: Self-Supervised Learning from Video
Tiny Pointers
tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models
TinyLlama: An Open-Source Small Language Model
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Titans: Learning to Memorize at Test Time
TLDR: Extreme Summarization of Scientific Documents
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Toolformer: Language Models Can Teach Themselves to Use Tools
TopoBenchmarkX: A Framework for Benchmarking Topological Deep Learning
Toward Joint Language Modeling for Speech Units and Text
Towards a definition of transcreation: a systematic literature review
Towards audio language modeling -- an overview
Towards Automatic Learning of Procedures from Web Instructional Videos
Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
Towards Causal Representation Learning
Towards Deep Learning Models Resistant to Adversarial Attacks
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Towards Expert-Level Medical Question Answering with Large Language Models
Towards Learning a Universal Non-Semantic Representation of Speech
Towards Measuring Fairness in AI: the Casual Conversations Dataset
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR
Towards Robust Speech Representation Learning for Thousands of Languages
Towards Understanding Grokking: An Effective Theory of Representation Learning
Towards Understanding Sycophancy in Language Models
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Training Adaptive Computation for Open-Domain Question Answering with Computational Constraints
Training Compute-Optimal Large Language Models
Training data-efficient image transformers & distillation through attention
Training Deep Nets with Sublinear Memory Cost
Training language models to follow instructions with human feedback
Training Language Models with Memory Augmentation
Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Training Verifiers to Solve Math Word Problems
Transcendence: Generative Models Can Outperform The Experts That Train Them
Transductive Active Learning: Theory and Applications
Transferable speech-to-text large language model alignment module
Transformation of Mean Opinion Scores to Avoid Misleading of Ranked based Statistical Techniques
Transformer Feed-Forward Layers Are Key-Value Memories
Transformer Networks for Trajectory Forecasting
Transformer-Squared: Self-adaptive LLMs
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
TransformerFAM: Feedback attention is working memory
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers learn in-context by gradient descent
Transformers need glasses! Information over-squashing in language tasks
Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts
Translation in the Hands of Many: Centering Lay Users in Machine Translation Interactions
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
Translatotron 3: Speech to Speech Translation with Monolingual Data
Transparent and Scrutable Recommendations Using Natural Language User Profiles
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
TruthfulQA: Measuring How Models Mimic Human Falsehoods
TS3-Codec: Transformer-Based Simple Streaming Single Codec
TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring
TVQA: Localized, Compositional Video Question Answering
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
u-ÎŒP: The Unit-Scaled Maximal Update Parametrization
U-Net: Convolutional Networks for Biomedical Image Segmentation
UL2: Unifying Language Learning Paradigms
UltraFeedback: Boosting Language Models with Scaled AI Feedback
UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition
Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation
Uncovering Latent Style Factors for Expressive Speech Synthesis
Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
Understanding Black-box Predictions via Influence Functions
Understanding deep learning requires rethinking generalization
Understanding Intra-Class Knowledge Inside CNN
Understanding natural language
Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation
UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control
Unified Language Model Pre-training for Natural Language Understanding and Generation
Unified Speech-Text Pretraining for Spoken Dialog Modeling
Unified Video-Language Pre-training with Synchronized Audio
Unified Vision-Language Pre-Training for Image Captioning and VQA
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
Unitary Evolution Recurrent Neural Networks
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
Universal Language Model Fine-tuning for Text Classification
Universal principles justify the existence of concept cells
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
Universal Transformers
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
Unlimiformer: Long-Range Transformers with Unlimited Length Input
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Unsupervised Cross-lingual Representation Learning at Scale
Unsupervised Deep Tracking
Unsupervised Dense Information Retrieval with Contrastive Learning
Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
Unsupervised Learning by Competing Hidden Units
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
Unsupervised Neural Machine Translation
Unsupervised Source Separation via Bayesian Inference in the Latent Domain
Unsupervised Translation of Programming Languages
Unsupervised Visual Representation Learning by Context Prediction
Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism
Unveiling the Role of Pretraining in Direct Speech Translation
URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors
Using Forced Alignment for Phonetics Research
Using the Output Embedding to Improve Language Models
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
VALHALLA: Visual Hallucination for Machine Translation
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
Variational Bayes: A report on approaches and applications
Variational Inference: A Review for Statisticians
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
VeLO: Training Versatile Learned Optimizers by Scaling Up
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Video as the New Language for Real-World Decision Making
Video Instruction Tuning With Synthetic Data
Video Swin Transformer
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
VideoBERT: A Joint Model for Video and Language Representation Learning
VideoChat: Chat-Centric Video Understanding
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
VIMA: General Robot Manipulation with Multimodal Prompts
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Vision Transformers Need Registers
Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
Vision-Speech Models: Teaching Speech Models to Converse about Images
ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Visual Instruction Tuning
Visual Prompt Tuning
Visualizing and Understanding Convolutional Networks
Visualizing Data using t-SNE
Visualizing the Loss Landscape of Neural Nets
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
Voice Conversion With Just Nearest Neighbors
VoiceBench: Benchmarking LLM-Based Voice Assistants
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
VoxCeleb2: Deep Speaker Recognition
VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Wasserstein GAN
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
Watt For What: Rethinking Deep Learning's Energy-Performance Relationship
wav2letter++: The Fastest Open-source Speech Recognition System
Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec: Unsupervised Pre-training for Speech Recognition
WavChat: A Survey of Spoken Dialogue Models
Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
WaveGlow: A Flow-based Generative Network for Speech Synthesis
WaveNet: A Generative Model for Raw Audio
WavLLM: Towards Robust and Adaptive Speech Large Language Model
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Weighted Voronoi Stippling
WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
What Are They Doing? Joint Audio-Speech Co-Reasoning
What Are Tools Anyway? A Survey from the Language Model Perspective
What Do Speech Foundation Models Not Learn About Speech?
What Does BERT Look At? An Analysis of BERT's Attention
What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning
What matters when building vision-language models?
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
What Should Not Be Contrastive in Contrastive Learning
What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
What's In My Big Data?
When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion
When Do Neural Networks Outperform Kernel Methods?
When Does Translation Require Context? A Data-driven, Multilingual Exploration
When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
Why Larger Language Models Do In-context Learning Differently?
Why should we add early exits to neural networks?
Why Warmup the Learning Rate? Underlying Mechanisms and Improvements
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge
Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
Word Embeddings through Hellinger PCA
Word Translation Without Parallel Data
Word-prosodic typology
word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method
WT5?! Training Text-to-Text Models to Explain their Predictions
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
XGBoost: A Scalable Tree Boosting System
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement
xLSTM: Extended Long Short-Term Memory
XNLI: Evaluating Cross-lingual Sentence Representations
XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
Yet Another Algorithm for Pitch Tracking
Yi: Open Foundation Models by 01.AI
YODAS: Youtube-Oriented Dataset for Audio and Speech
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Zephyr: Direct Distillation of LM Alignment
Zero-shot Speech Translation
Zero-Shot Tokenizer Transfer
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
Zoology: Measuring and Improving Recall in Efficient Language Models
ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
People
Aaron van den Oord
Abdelrahman Mohamed
Adam Polyak
Adel Moumen
Afra Alishahi
Agustinus Kristiadi
Akari Asai
Alan Jeffares
Aldo Lipani
Alec Radford
Aleksa Gordić
Alessio Devoto
Alex Graves
Alex H. Williams
Alex Krizhevsky
Alexander Kolesnikov
Alexander M. Rush
Alexandra Birch
Alexandre Défossez
Alexei A. Efros
Alexey Dosovitskiy
Alexis Conneau
Alicia Curth
Amélie Royer
André F. T. Martins
André Martins
Andrea Bacciu
Andrej Karpathy
Andrew K. Lampinen
Andrew Zisserman
Anil Batra
Anil Keshwani
Anna Rogers
AntĂłnio Farinhas
Antonio Vergari
Ari Holtzman
Armand Joulin
Artem Ploujnikov
Badr M. Abdullah
Barry Haddow
Beatrice Savoldi
Belen Alastruey
Ben Peters
Benjamin Minixhofer
Benjamin van Niekerk
Beomseok Lee
Boris Ginsburg
Bruno Martins
Cagri Toraman
Carla Bombi
Celestine Mendler-DĂŒnner
Cem Subakan
Christian Szegedy
Christopher D. Manning
Chrysoula Zerva
Chung-Ming Chien
Claude E. Shannon
Cynthia Dwork
Daniele Venturi
Dario Amodei
David Duvenaud
David Ha
David R. Mortensen
David Silver
Dennis Fucci
Diederik P. Kingma
Dietrich Klakow
Donato Crisostomi
Dong Zhang
Douwe Kiela
Duarte M. Alves
Edoardo Debenedetti
Edoardo Maria Ponti
Edouard Grave
Edward Grefenstette
Ekaterina Shutova
Eliezer de Souza da Silva
Emanuele Rodolà
Emine Yilmaz
Emmanouil Zaranis
Emmanuel Dupoux
Essam Sleiman
Eugene Kharitonov
Fabio Galasso
Fabrizio Silvestri
Felix Kreuk
Ferenc Huszår
Francesco Cariaggi
Francesco Paissan
Frank Keller
Gabriel Synnaeve
Gabriele Sarti
Gautier Izacard
Geoffrey Hinton
Gergely Neu
Giuseppe Attanasio
Graham K. Taylor
Graham Neubig
Grzegorz Chrupała
Guillaume Lample
H. W. Kuhn
Haibin Wu
Hao Tang
Haytham M Fayek
Hector J. Levesque
Herman Kamper
Holger Schwenk
Hosein Mohebbi
Hossein A. Rahmani
Hugo Pitorro
Hung-Yi Lee
Ian Goodfellow
Ian J. Goodfellow
Ilya Feige
Ilya Sutskever
Ishan Misra
Itai Gat
Jade Copet
James Allen
James Chapman
Jan Leike
Jan Niehues
Jarod Duret
Jason Li
Javier Iranzo-SĂĄnchez
Jay Alammar
Jean-Baptiste Alayrac
Jeremy Howard
Jonas HĂŒbotter
José G. C. de Souza
José Pombal
Joshua Ainslie
Judea Pearl
Julia Kempe
Julian D Parker
JĂŒrgen A. Schmidhuber
Kai-Wei Chang
Karen Livescu
Kevin Flanagan
Kevin Murphy
Kohei Saijo
Kshitij Ambilduke
Kushal Lakhotia
Kyunghyun Cho
Larry M. Hyman
Laura Ruis
Laura Sevilla-Lara
Laurens van der Maaten
Laurent Besacier
Laurent Mazaré
Lianmin Zheng
Lilian Weng
Luca Della Libera
Luca Franco
Luca Soldaini
Lucas Beyer
Luisa Bentivogli
Ɓukasz Kaiser
Luke Zettlemoyer
Maarten Sap
Marc Stevens
Marcely Zanon Boito
Marco Gaido
Marco Tagliasacchi
Marcos Treviso
Marcus Rohrbach
Maria Antoniak
Maria Sofia Bucarelli
Mark Mazumder
Martijn Bartelds
Mathilde Caron
Matteo Negri
Matthew D Zeiler
Matthias Gerstgrasser
Mauro Cettolo
Max Bartolo
Max Welling
Michael Hassid
Michele Miranda
Mihaela van der Schaar
Miles Cranmer
Miljan Martic
Mirco Ravanelli
Moritz Böhle
Nathan Lambert
Neil Zeghidour
Nicholas Carlini
Nils Reimers
Nina Miolane
Nuno M. Guerreiro
Oleksii Hrinchuk
Onur Mutlu
Oriol Vinyals
Paolo Mandica
Pasquale Minervini
Patrick Fernandes
Paul Christiano
Paul Röttger
Paul-Ambroise Duquenne
Pavlo Vasylenko
Petar Veličković
Peter Holderrieth
Pierre-Carl Langlais
Pieter Abbeel
Pooneh Mousavi
Quoc Le
Quoc V. Le
Rafael Rafailov
RamĂłn Fernandez Astudillo
Razvan Pascanu
Ricardo Rei
Rico Sennrich
Rob Fergus
Roberto Navigli
Rohan Ramasamy
Ronan Collobert
Rongjie Huang
Rowan Zellers
Ruoming Pang
Salah Zaiem
Samuel R. Bowman
Sander Land
Sanyuan Chen
Sara Papi
Saul Santos
Sebastian Raschka
Sebastian Riedel
Sebastian Ruder
Sergey Ioffe
Shane Legg
Shay B. Cohen
Shayne Longpre
Shinji Watanabe
Shital Shah
Shreyank N Gowda
Siddhant Arora
Simon Willison
Simone Conia
Simone Scardapane
Sonal Sannigrahi
Stanislav Fort
Steven McDonagh
Taku Kudo
Tal Remez
Tatsunori B. Hashimoto
Telmo Pessoa Pires
Thomas Palmeira Ferraz
Tim Dettmers
Tim RocktĂ€schel
Titouan Parcollet
Tom B. Brown
Tsz Kin Lam
Tu-Anh Nguyen
Vadim Borisov
Vaishnavh Nagarajan
Vijay Janapa Reddi
Vivek Iyer
Vlad Niculae
Wei-Ning Hsu
Wojciech Zaremba
Xin Zhang
Xinyue Hao
Xipeng Qiu
Xubo Liu
Yair Lakretz
Yann LeCun
Yifan Peng
Yonatan Belinkov
Yoshua Bengio
Yossi Adi
ZalĂĄn Borsos
Posts
An Evolutionary Perspective on Language
Animal Navigation Systems
Bayes: Conjugate Inference
CPC: Representation Learning with Contrastive Predictive Coding
Four Early Lessons from Working on Machine Learning Projects
Generalized Linear Models and the Exponential Family
Graphs: Community Structure
Graphs: Motifs, Graphlets and Structural Roles in Networks
Jabri, Owens and Efros (2020) Space-Time Correspondence as a Contrastive Random Walk
LSTMs + Grammar as a Foreign Language
Mean, Median and Mode as Representatives
Self-Supervised Visual Representation Learning
Some Information Theory
The Hierarchical Softmax
The Probability Distributions
The Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Research
Conferences
ICASSP 2025
2025 IEEE International Conference on Acoustics, Speech, and Signal Processing - Celebrating Signal Processing
Author kit instructions - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
Important Dates - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
Publishing and Paper Presentation Options - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
ICASSP 2026
Call for Papers
ICLR 2024
2024 Conference
Blogposts Track ICLR 2024: Announcing Accepted Blogposts - ICLR Blog
ICLR 2024 Outstanding Paper Awards - ICLR Blog
ICLR 2024 Papers
ICLR 2024 Test of Time Award - ICLR Blog
ICLR2024 Papers - a Hugging Face Space by ICLR2024
ICLR 2025
2025 Dates and Deadlines
NeurIPS 2024
Announcing the NeurIPS 2024 Test of Time Paper Awards - NeurIPS Blog
Dynamic Sparsity in Machine Learning NeurIPS 2024 Tutorial
NeurIPS 2024 Call for Papers
ACAIN 2025 - Advanced Course & Symposium on Artificial Intelligence and Neuroscience
Conferences
I Can't Believe It's Not Better Initiative - ICLR Workshop 2025 - Call for Papers
ICLR
ICTIR 2024
International Conference on the Theory of Information Retrieval (ICTIR) - SIGIR
Interspeech (International Speech Communication Association)
Interspeech 2025 - Call for Papers
Interspeech 2025 - Challenges
Interspeech 2025 - Home
NLP4DH - NLP4DH & IWCLUL 2023
SIGdial - Special Interest Group on Discourse and Dialogue
SIGIR 2024
Dataset Cards
Buckeye Corpus Information
DoReCo - Homepage
HuggingFaceM4/the_cauldron · Datasets at Hugging Face
iisys-hof/HUI-Audio-Corpus-German: The official repository for the HUI-Audio-Corpus-German. The corresponding paper is in the process of publication. The repository can be used to automatically recreate the dataset and to add more speakers to the processing pipeline.
imdatceleste/m-ailabs-dataset: This is the M-AILABS Speech Dataset
Multilingual Spoken Words Dataset | MLCommons Datasets
OpenMIC-2018
People's Speech Dataset | MLCommons Datasets
PleIAs/common_corpus · Datasets at Hugging Face
RecipeNLG
RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
The LJ Speech Dataset
TIMIT Acoustic-Phonetic Continuous Speech Corpus - Linguistic Data Consortium
VCTK
Language Models
Language Models
Language Models - Evaluation and Leaderboards
Language Models - Notes
Language Models - PEFT
Schools
ELIAS-ELLIS-VISMAC Winter School 2025 | elias-ai
ELLIS Winter School on Foundation Models - Amsterdam 2024
LxMLS 2024
Speech and Audio
Speech and Audio
Speech and Audio - Formats and Encodings
Speech and Audio - Formulae and Code Snippets
Speech and Audio - Glossary
Speech and Audio - Rolodex - Papers, Models and Releases
Speech and Audio - Signal Processing
Speech and Audio - Tokenizers (Tokenisers)
Speech and Audio - Tools
AI and Society
Bayesian Neural Networks
Causal Inference
Datasets
Diffusion Models
Efficient Machine Learning
Embeddings
Energy Based Models
eXplainability
Flow Networks
Gaussian Processes
Generative Adversarial Networks
Grapheme to Phoneme (G2P) Transcription Engines and Models
Hardware
Information Retrieval
Information Theory
ISO Standards
Language Identification
Llamas 🩙
Machine Translation
Multimodality
Music
Natural Language Inference
Neuroscience
Optimisation
Optimisation - Loss Functions
Recommendation Systems
Reinforcement Learning
Robotics
Safety and Fairness
Statistical Learning Theory
Theoretical Deep Learning (Theory, Fundamentals, Seminal)
Variational Autoencoders
Variational Inference
Vision
Winograd and WinoGrande
Word Sense Disambiguation
Resources
Jobs, Careers, Companies
Kernels 🌿 & Support Vector Machines
LaTeX and Math Typesetting
Machine Learning
Math
Natural Language Processing
Neural Networks
Physics
Read
Signal Processing
Statistics and Probability
Unsorted
Talks
Talk Series
Conversational AI Reading Group
Launchpad
Monthly online Linguistique Informatique, Formelle et de Terrain (LIFT) Seminar
Analysing & Summarizing Movies via Turning Point Identification in Screenplays - Frank Keller
Designing efficient and modular neural networks - Simone Scardapane
Discrete Audio Tokens for Multimodal LLMs - Mirco Ravanelli
Efficient Transformers - Ɓukasz Kaiser
Hurdles at WMT - Keeping up with the MT progress - Tom Kocmi
Improving Universal Access to Modern Speech Technology - Martijn Bartelds
Latin
05 Oct 2025
Latin Case - Department of Classics
Articles
Latin in The Matrix - Temet Nosce