rsrch space

Stream of my favorite papers and links.


[**Text-to-3D using Gaussian Splatting**

2023-11-19

](http://arxiv.org/abs/2309.16585v3)[**Fine-tuning Language Models for Factuality**

2023-11-15

](http://arxiv.org/abs/2311.08401v1)[**MemGPT: Towards LLMs as Operating Systems**

2023-11-15

](http://arxiv.org/abs/2310.08560v1)[**Beyond Memorization: Violating Privacy Via Inference with Large Language Models**

2023-11-15

](http://arxiv.org/abs/2310.07298v1)[**The Transient Nature of Emergent In-Context Learning in Transformers**

2023-11-15

](http://arxiv.org/abs/2311.08360v1)[**Levels of AGI: Operationalizing Progress on the Path to AGI**

2023-11-13

](http://arxiv.org/abs/2311.02462v1)[**Fast and forward stable randomized algorithms for linear least-squares problems**

2023-11-13

](http://arxiv.org/abs/2311.04362v1)[**JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models**

2023-11-13

](http://arxiv.org/abs/2311.05997v1)[**Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V**

2023-11-11

](http://arxiv.org/abs/2310.11441v2)[**Transformers as Recognizers of Formal Languages: A Survey on Expressivity**

2023-11-10

](http://arxiv.org/abs/2311.00208v1)[**NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities**

2023-11-09

](http://arxiv.org/abs/2311.01454v1)[**OtterHD: A High-Resolution Multi-modality Model**

2023-11-09

](http://arxiv.org/abs/2311.04219v1)[**S-LoRA: Serving Thousands of Concurrent LoRA Adapters**

2023-11-09

](http://arxiv.org/abs/2311.03285v2)[**GLaMM: Pixel Grounding Large Multimodal Model**

2023-11-09

](http://arxiv.org/abs/2311.03356v1)[**The Linear Representation Hypothesis and the Geometry of Large Language Models**

2023-11-09

](http://arxiv.org/abs/2311.03658v1)[**Do LLMs exhibit human-like response biases? A case study in survey design**

2023-11-09

](http://arxiv.org/abs/2311.04076v1)[**GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling**

2023-11-07

](http://arxiv.org/abs/2311.01927v1)[**Learning to Compress Prompts with Gist Tokens**

2023-11-07

](http://arxiv.org/abs/2304.08467v2)[**CogVLM: Visual Expert for Pretrained Language Models**

2023-11-07

](http://arxiv.org/abs/2311.03079v1)[**Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models**

2023-11-06

](http://arxiv.org/abs/2307.14430v1)[**Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models**

2023-11-05

](http://arxiv.org/abs/2311.00871v1)[**The Matrix Calculus You Need For Deep Learning**

2023-11-04

](http://arxiv.org/abs/1802.01528v3)[**Efficient LLM Inference on CPUs**

2023-11-03

](http://arxiv.org/abs/2311.00502v1)[**What Algorithms can Transformers Learn? A Study in Length Generalization**

2023-11-01

](http://arxiv.org/abs/2310.16028v1)[**Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection**

2023-11-01

](http://arxiv.org/abs/2201.10474v2)[**Brain decoding: toward real-time reconstruction of visual perception**

2023-11-01

](http://arxiv.org/abs/2310.19812v1)[**The Impact of Depth and Width on Transformer Language Model Generalization**

2023-11-01

](http://arxiv.org/abs/2310.19956v1)[**CodeFusion: A Pre-trained Diffusion Model for Code Generation**

2023-10-30

](http://arxiv.org/abs/2310.17680v1)[**How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers**

2023-10-30

](http://arxiv.org/abs/2211.03495v1)[**ConvNets Match Vision Transformers at Scale**

2023-10-27

](http://arxiv.org/abs/2310.16764v1)[**QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models**

2023-10-26

](http://arxiv.org/abs/2310.16795v1)[**Detecting Pretraining Data from Large Language Models**

2023-10-26

](http://arxiv.org/abs/2310.16789v1)[**Matryoshka Diffusion Models**

2023-10-24

](http://arxiv.org/abs/2310.15111v1)[**AI for Mathematics: A Cognitive Science Perspective**

2023-10-24

](http://arxiv.org/abs/2310.13021v1)[**Towards Understanding Sycophancy in Language Models**

2023-10-23

](http://arxiv.org/abs/2310.13548v1)[**Large Language Models Cannot Self-Correct Reasoning Yet**

2023-10-22

](http://arxiv.org/abs/2310.01798v1)[**Chain-of-Verification Reduces Hallucination in Large Language Models**

2023-10-21

](http://arxiv.org/abs/2309.11495v2)[**Eureka: Human-Level Reward Design via Coding Large Language Models**

2023-10-20

](http://arxiv.org/abs/2310.12931v1)[**REPLUG: Retrieval-Augmented Black-Box Language Models**

2023-10-18

](http://arxiv.org/abs/2301.12652v4)[**The Efficiency Misnomer**

2023-10-17

](http://arxiv.org/abs/2110.12894v2)[**A Long Way to Go: Investigating Length Correlations in RLHF**

2023-10-17

](http://arxiv.org/abs/2310.03716v1)[**In-Context Pretraining: Language Modeling Beyond Document Boundaries**

2023-10-17

](http://arxiv.org/abs/2310.10638v1)[**TimeGPT-1**

2023-10-13

](http://arxiv.org/abs/2310.03589v1)[**Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks**

2023-10-13

](http://arxiv.org/abs/2310.02244v5)[**Large Language Models Are Zero-Shot Time Series Forecasters**

2023-10-13

](http://arxiv.org/abs/2310.07820v1)[**Text Embeddings Reveal (Almost) As Much As Text**

2023-10-13

](http://arxiv.org/abs/2310.06816v1)[**Segment Anything**

2023-10-11

](http://arxiv.org/abs/2304.02643v1)[**Mistral 7B**

2023-10-11

](http://arxiv.org/abs/2310.06825v1)[**Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models**

2023-10-09

](http://arxiv.org/abs/2310.04406v1)[**Ring Attention with Blockwise Transformers for Near-Infinite Context**

2023-10-08

](http://arxiv.org/abs/2310.01889v2)[**Decoding speech perception from non-invasive brain recordings**

2023-10-06

](http://arxiv.org/abs/2208.12266v2)[**Think before you speak: Training Language Models With Pause Tokens**

2023-10-05

](http://arxiv.org/abs/2310.02226v1)[**Large Language Models as Analogical Reasoners**

2023-10-05

](http://arxiv.org/abs/2310.01714v1)[**Language Models Represent Space and Time**

2023-10-04

](http://arxiv.org/abs/2310.02207v1)[**Efficient Streaming Language Models with Attention Sinks**

2023-10-02

](http://arxiv.org/abs/2309.17453v1)[**3D Gaussian Splatting for Real-Time Radiance Field Rendering**

2023-10-02

](http://arxiv.org/abs/2308.04079v1)[**The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)**

2023-10-02

](http://arxiv.org/abs/2309.17421v1)[**Directly Fine-Tuning Diffusion Models on Differentiable Rewards**

2023-10-02

](http://arxiv.org/abs/2309.17400v1)[**Vision Transformers Need Registers**

2023-09-30

](http://arxiv.org/abs/2309.16588v1)[**Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**

2023-09-30

](http://arxiv.org/abs/2308.12966v2)[**Orca: Progressive Learning from Complex Explanation Traces of GPT-4**

2023-09-29

](http://arxiv.org/abs/2306.02707v1)[**Jointly Training Large Autoregressive Multimodal Models**

2023-09-28

](http://arxiv.org/abs/2309.15564v1)[**OCR-free Document Understanding Transformer**

2023-09-26

](http://arxiv.org/abs/2111.15664v5)[**Language Modeling Is Compression**

2023-09-20

](http://arxiv.org/abs/2309.10668v1)[**Efficient Memory Management for Large Language Model Serving with PagedAttention**

2023-09-13

](http://arxiv.org/abs/2309.06180v1)[**Uncovering mesa-optimization algorithms in Transformers**

2023-09-13

](http://arxiv.org/abs/2309.05858v1)[**MADLAD-400: A Multilingual And Document-Level Large Audited Dataset**

2023-09-12

](http://arxiv.org/abs/2309.04662v1)[**Nougat: Neural Optical Understanding for Academic Documents**

2023-09-12

](http://arxiv.org/abs/2308.13418v1)[**Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning**

2023-09-12

](http://arxiv.org/abs/2309.05444v1)[**Textbooks Are All You Need II: phi-1.5 technical report**

2023-09-12

](http://arxiv.org/abs/2309.05463v1)[**Are Emergent Abilities in Large Language Models just In-Context Learning?**

2023-09-10

](http://arxiv.org/abs/2309.01809v1)[**Large Language Models as Optimizers**

2023-09-08

](http://arxiv.org/abs/2309.03409v1)[**Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs**

2023-09-07

](http://arxiv.org/abs/2308.11914v2)[**Robust fine-tuning of zero-shot models**

2023-09-05

](http://arxiv.org/abs/2109.01903v3)[**Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time**

2023-09-04

](http://arxiv.org/abs/2305.17118v2)[**YaRN: Efficient Context Window Extension of Large Language Models**

2023-09-04

](http://arxiv.org/abs/2309.00071v1)[**RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback**

2023-09-04

](http://arxiv.org/abs/2309.00267v1)[**Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models**

2023-09-03

](http://arxiv.org/abs/2308.15022v1)[**Accelerating Large Language Model Decoding with Speculative Sampling**

2023-09-01

](http://arxiv.org/abs/2302.01318v1)[**Fast Inference from Transformers via Speculative Decoding**

2023-09-01

](http://arxiv.org/abs/2211.17192v2)[**Blockwise Parallel Decoding for Deep Autoregressive Models**

2023-09-01

](http://arxiv.org/abs/1811.03115v1)[**Accelerating LLM Inference with Staged Speculative Decoding**

2023-09-01

](http://arxiv.org/abs/2308.04623v1)[**DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining**

2023-08-31

](http://arxiv.org/abs/2305.10429v2)[**LLMatic: Neural Architecture Search via Large Language Models and Quality-Diversity Optimization**

2023-08-30

](http://arxiv.org/abs/2306.01102v2)[**Level Generation Through Large Language Models**

2023-08-30

](http://arxiv.org/abs/2302.05817v2)[**Gzip versus bag-of-words for text classification**

2023-08-30

](http://arxiv.org/abs/2307.15002v5)[**Challenges and Applications of Large Language Models**

2023-08-30

](http://arxiv.org/abs/2307.10169v1)[**Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges**

2023-08-30

](http://arxiv.org/abs/2103.11251v2)[**Catala: A Programming Language for the Law**

2023-08-27

](http://arxiv.org/abs/2103.03198v2)[**Graph of Thoughts: Solving Elaborate Problems with Large Language Models**

2023-08-25

](http://arxiv.org/abs/2308.09687v2)[**Sigmoid Loss for Language Image Pre-Training**

2023-08-23

](http://arxiv.org/abs/2303.15343v3)[**Jewish Problems**

2023-08-22

](http://arxiv.org/abs/1110.1556v2)[**OctoPack: Instruction Tuning Code Large Language Models**

2023-08-20

](http://arxiv.org/abs/2308.07124v1)[**Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification**

2023-08-20

](http://arxiv.org/abs/2308.07921v1)[**TeCH: Text-guided Reconstruction of Lifelike Clothed Humans**

2023-08-20

](http://arxiv.org/abs/2308.08545v1)[**Towards a Unified View of Parameter-Efficient Transfer Learning**

2023-08-12

](http://arxiv.org/abs/2110.04366v3)[**Generative Agents: Interactive Simulacra of Human Behavior**

2023-08-10

](http://arxiv.org/abs/2304.03442v2)[**Simple synthetic data reduces sycophancy in large language models**

2023-08-10

](http://arxiv.org/abs/2308.03958v1)[**The Hydra Effect: Emergent Self-repair in Language Model Computations**

2023-08-08

](http://arxiv.org/abs/2307.15771v1)[**A Practical Deep Learning-Based Acoustic Side Channel Attack on Keyboards**

2023-08-05

](http://arxiv.org/abs/2308.01074v1)[**From Sparse to Soft Mixtures of Experts**

2023-08-05

](http://arxiv.org/abs/2308.00951v1)[**AlpaGasus: Training A Better Alpaca with Fewer Data**

2023-08-03

](http://arxiv.org/abs/2307.08701v1)[**Universal and Transferable Adversarial Attacks on Aligned Language Models**

2023-07-28

](http://arxiv.org/abs/2307.15043v1)[**The First Room-Temperature Ambient-Pressure Superconductor**

2023-07-26

](http://arxiv.org/abs/2307.12008v1)[**How is ChatGPT’s behavior changing over time?**

2023-07-19

](http://arxiv.org/abs/2307.09009v1)[**Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration**

2023-07-18

](http://arxiv.org/abs/2307.05300v2)[**HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models**

2023-07-18

](http://arxiv.org/abs/2307.06949v1)[**Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning**

2023-07-18

](http://arxiv.org/abs/2205.05638v2)[**Learning to Retrieve In-Context Examples for Large Language Models**

2023-07-17

](http://arxiv.org/abs/2307.07164v1)[**Large Language Models as General Pattern Machines**

2023-07-17

](http://arxiv.org/abs/2307.04721v1)[**Provably Faster Gradient Descent via Long Steps**

2023-07-14

](http://arxiv.org/abs/2307.06324v2)[**Acceleration via Fractal Learning Rate Schedules**

2023-07-14

](http://arxiv.org/abs/2103.01338v2)[**ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT**

2023-07-14

](http://arxiv.org/abs/2004.12832v2)[**Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution**

2023-07-14

](http://arxiv.org/abs/2307.06304v1)[**Stack More Layers Differently: High-Rank Training Through Low-Rank Updates**

2023-07-14

](http://arxiv.org/abs/2307.05695v1)[**Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes**

2023-07-13

](http://arxiv.org/abs/2305.02301v2)[**Towards Automated Circuit Discovery for Mechanistic Interpretability**

2023-07-13

](http://arxiv.org/abs/2304.14997v2)[**Less is More: Parameter-Free Text Classification with Gzip**

2023-07-13

](http://arxiv.org/abs/2212.09410v1)[**SqueezeLLM: Dense-and-Sparse Quantization**

2023-07-12

](http://arxiv.org/abs/2306.07629v1)[**Large Language Models Can Be Easily Distracted by Irrelevant Context**

2023-07-12

](http://arxiv.org/abs/2302.00093v3)[**Stay on topic with Classifier-Free Guidance**

2023-07-12

](http://arxiv.org/abs/2306.17806v1)[**Pen and Paper Exercises in Machine Learning**

2023-07-12

](http://arxiv.org/abs/2206.13446v1)[**Direct Preference Optimization: Your Language Model is Secretly a Reward Model**

2023-07-11

](http://arxiv.org/abs/2305.18290v1)[**Teaching Arithmetic to Small Transformers**

2023-07-10

](http://arxiv.org/abs/2307.03381v1)[**The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only**

2023-07-07

](http://arxiv.org/abs/2306.01116v1)[**Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability**

2023-07-07

](http://arxiv.org/abs/2305.10266v1)[**Semantic Tokenizer for Enhanced Natural Language Processing**

2023-07-07

](http://arxiv.org/abs/2304.12404v1)[**Conditioning Predictive Models: Risks and Strategies**

2023-07-07

](http://arxiv.org/abs/2302.00805v2)[**Lost in the Middle: How Language Models Use Long Contexts**

2023-07-07

](http://arxiv.org/abs/2307.03172v1)[**LongNet: Scaling Transformers to 1,000,000,000 Tokens**

2023-07-07

](http://arxiv.org/abs/2307.02486v1)[**SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis**

2023-07-07

](http://arxiv.org/abs/2307.01952v1)[**Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs**

2023-07-04

](http://arxiv.org/abs/1603.09320v4)[**MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers**

2023-07-01

](http://arxiv.org/abs/2305.07185v2)[**Extending Context Window of Large Language Models via Positional Interpolation**

2023-06-28

](http://arxiv.org/abs/2306.15595v1)[**InRank: Incremental Low-Rank Learning**

2023-06-27

](http://arxiv.org/abs/2306.11250v1)[**Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok**

2023-06-27

](http://arxiv.org/abs/2306.13253v1)[**Textbooks Are All You Need**

2023-06-22

](http://arxiv.org/abs/2306.11644v1)[**SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking**

2023-06-22

](http://arxiv.org/abs/2306.05426v2)[**Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture**

2023-06-17

](http://arxiv.org/abs/2301.08243v3)[**Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models**

2023-06-17

](http://arxiv.org/abs/2306.08997v1)[**Freaky Leaky SMS: Extracting User Locations by Analyzing SMS Timings**

2023-06-14

](http://arxiv.org/abs/2306.07695v1)[**The Curse of Recursion: Training on Generated Data Makes Models Forget**

2023-06-14

](http://arxiv.org/abs/2305.17493v2)[**Is Parallel Programming Hard, And, If So, What Can You Do About It? (Release v2023.06.11a)**

2023-06-13

](http://arxiv.org/abs/1701.00854v6)[**Benchmarking Neural Network Training Algorithms**

2023-06-13

](http://arxiv.org/abs/2306.07179v1)[**Tracking Everything Everywhere All at Once**

2023-06-10

](http://arxiv.org/abs/2306.05422v1)[**Simple and Controllable Music Generation**

2023-06-10

](http://arxiv.org/abs/2306.05284v1)[**LEACE: Perfect linear concept erasure in closed form**

2023-06-07

](http://arxiv.org/abs/2306.03819v1)[**A Succinct Summary of Reinforcement Learning**

2023-06-06

](http://arxiv.org/abs/2301.01379v1)[**ImageBind: One Embedding Space To Bind Them All**

2023-06-06

](http://arxiv.org/abs/2305.05665v2)[**Capabilities of GPT-4 on Medical Challenge Problems**

2023-06-06

](http://arxiv.org/abs/2303.13375v2)[**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**

2023-06-06

](http://arxiv.org/abs/2304.03768v1)[**Training Language Models with Language Feedback at Scale**

2023-06-06

](http://arxiv.org/abs/2303.16755v2)[**MAGVLT: Masked Generative Vision-and-Language Transformer**

2023-06-06

](http://arxiv.org/abs/2303.12208v1)[**Optimizing Memory Mapping Using Deep Reinforcement Learning**

2023-06-06

](http://arxiv.org/abs/2305.07440v1)[**Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements**

2023-06-06

](http://arxiv.org/abs/2305.03695v2)[**GLM: General Language Model Pretraining with Autoregressive Blank Infilling**

2023-06-06

](http://arxiv.org/abs/2103.10360v2)[**Knowledge Graphs**

2023-06-06

](http://arxiv.org/abs/2003.02320v6)[**SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression**

2023-06-06

](http://arxiv.org/abs/2306.03078v1)[**MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks**

2023-06-06

](http://arxiv.org/abs/2303.16839v2)[**Visual Instruction Tuning**

2023-06-06

](http://arxiv.org/abs/2304.08485v1)[**Spatial-Language Attention Policies for Efficient Robot Learning**

2023-06-06

](http://arxiv.org/abs/2304.11235v1)[**Hyperbolic Image-Text Representations**

2023-06-06

](http://arxiv.org/abs/2304.09172v1)[**Answering Questions by Meta-Reasoning over Multiple Chains of Thought**

2023-06-06

](http://arxiv.org/abs/2304.13007v2)[**Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism**

2023-06-06

](http://arxiv.org/abs/2304.11414v1)[**Spreading vectors for similarity search**

2023-06-06

](http://arxiv.org/abs/1806.03198v3)[**Better Aligning Text-to-Image Models with Human Preference**

2023-06-06

](http://arxiv.org/abs/2303.14420v1)[**EVA-CLIP: Improved Training Techniques for CLIP at Scale**

2023-06-06

](http://arxiv.org/abs/2303.15389v1)[**Bayesian Optimization of Catalysts With In-context Learning**

2023-06-06

](http://arxiv.org/abs/2304.05341v1)[**Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models**

2023-06-06

](http://arxiv.org/abs/2304.12526v1)[**G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment**

2023-06-06

](http://arxiv.org/abs/2303.16634v3)[**LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction**

2023-06-06

](http://arxiv.org/abs/2304.08460v1)[**ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models**

2023-06-06

](http://arxiv.org/abs/2303.16421v1)[**Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?**

2023-06-06

](http://arxiv.org/abs/2303.18240v1)[**Training Large Language Models Efficiently with Sparsity and Dataflow**

2023-06-06

](http://arxiv.org/abs/2304.05511v1)[**Vision Transformers with Mixed-Resolution Tokenization**

2023-06-06

](http://arxiv.org/abs/2304.00287v2)[**SegGPT: Segmenting Everything In Context**

2023-06-06

](http://arxiv.org/abs/2304.03284v1)[**State Spaces Aren’t Enough: Machine Translation Needs Attention**

2023-06-06

](http://arxiv.org/abs/2304.12776v1)[**ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark**

2023-06-06

](http://arxiv.org/abs/2303.13648v1)[**Text-to-Image Diffusion Models are Zero-Shot Classifiers**

2023-06-06

](http://arxiv.org/abs/2303.15233v1)[**DETRs Beat YOLOs on Real-time Object Detection**

2023-06-06

](http://arxiv.org/abs/2304.08069v1)[**AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models**

2023-06-06

](http://arxiv.org/abs/2304.06364v1)[**FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization**

2023-06-06

](http://arxiv.org/abs/2303.14189v1)[**Self-Refine: Iterative Refinement with Self-Feedback**

2023-06-06

](http://arxiv.org/abs/2303.17651v2)[**Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks**

2023-06-06

](http://arxiv.org/abs/2304.12567v1)[**Reflexion: Language Agents with Verbal Reinforcement Learning**

2023-06-06

](http://arxiv.org/abs/2303.11366v2)[**Towards Agile Text Classifiers for Everyone**

2023-06-06

](http://arxiv.org/abs/2302.06541v1)[**DeiT III: Revenge of the ViT**

2023-06-06

](http://arxiv.org/abs/2204.07118v1)[**Emergent autonomous scientific research capabilities of large language models**

2023-06-06

](http://arxiv.org/abs/2304.05332v1)[**AugGPT: Leveraging ChatGPT for Text Data Augmentation**

2023-06-06

](http://arxiv.org/abs/2302.13007v3)[**Why think step by step? Reasoning emerges from the locality of experience**

2023-06-06

](http://arxiv.org/abs/2304.03843v2)[**Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks**

2023-06-06

](http://arxiv.org/abs/2204.07705v3)[**REFINER: Reasoning Feedback on Intermediate Representations**

2023-06-06

](http://arxiv.org/abs/2304.01904v1)[**OpenScene: 3D Scene Understanding with Open Vocabularies**

2023-06-06

](http://arxiv.org/abs/2211.15654v2)[**The Quantization Model of Neural Scaling**

2023-06-06

](http://arxiv.org/abs/2303.13506v1)[**Boosted Prompt Ensembles for Large Language Models**

2023-06-06

](http://arxiv.org/abs/2304.05970v1)[**Affordances from Human Videos as a Versatile Representation for Robotics**

2023-06-06

](http://arxiv.org/abs/2304.08488v1)[**Answering Questions Over Knowledge Graphs Using Logic Programming Along with Language Models**

2023-06-06

](http://arxiv.org/abs/2303.02206v1)[**BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**

2023-06-06

](http://arxiv.org/abs/2301.12597v2)[**Rethinking the Role of Token Retrieval in Multi-Vector Retrieval**

2023-06-06

](http://arxiv.org/abs/2304.01982v2)[**End-to-End Spatio-Temporal Action Localisation with Video Transformers**

2023-06-06

](http://arxiv.org/abs/2304.12160v1)[**Symbolic Knowledge Distillation: from General Language Models to Commonsense Models**

2023-06-06

](http://arxiv.org/abs/2110.07178v2)[**Learning in High Dimension Always Amounts to Extrapolation**

2023-06-06

](http://arxiv.org/abs/2110.09485v2)[**Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models**

2023-06-06

](http://arxiv.org/abs/2304.11657v1)[**An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion**

2023-06-06

](http://arxiv.org/abs/2208.01618v1)[**Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation**

2023-06-06

](http://arxiv.org/abs/2304.06600v1)[**Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations**

2023-06-06

](http://arxiv.org/abs/2304.11267v1)[**Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning**

2023-06-06

](http://arxiv.org/abs/2303.10512v1)[**Recurrent Memory Transformer**

2023-06-06

](http://arxiv.org/abs/2207.06881v2)[**Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention**

2023-06-06

](http://arxiv.org/abs/2303.15274v2)[**M2T: Masking Transformers Twice for Faster Decoding**

2023-06-06

](http://arxiv.org/abs/2304.07313v1)[**TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs**

2023-06-06

](http://arxiv.org/abs/2303.16434v1)[**AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators**

2023-06-06

](http://arxiv.org/abs/2303.16854v1)[**CoBIT: A Contrastive Bi-directional Image-Text Generation Model**

2023-06-06

](http://arxiv.org/abs/2303.13455v1)[**Low-code LLM: Visual Programming over LLMs**

2023-06-06

](http://arxiv.org/abs/2304.08103v2)[**Forward Thinking: Building Deep Random Forests**

2023-06-06

](http://arxiv.org/abs/1705.07366v1)[**Brainformers: Trading Simplicity for Efficiency**

2023-06-03

](http://arxiv.org/abs/2306.00008v1)[**AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration**

2023-06-03

](http://arxiv.org/abs/2306.00978v1)[**Training Verifiers to Solve Math Word Problems**

2023-06-02

](http://arxiv.org/abs/2110.14168v2)[**An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**

2023-06-01

](http://arxiv.org/abs/2010.11929v2)[**Scaling Data-Constrained Language Models**

2023-06-01

](http://arxiv.org/abs/2305.16264v2)[**Adam: A Method for Stochastic Optimization**

2023-05-31

](http://arxiv.org/abs/1412.6980v9)[**DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale**

2023-05-30

](http://arxiv.org/abs/2207.00032v1)[**Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors**

2023-05-30

](http://arxiv.org/abs/2305.18274v1)[**QLoRA: Efficient Finetuning of Quantized LLMs**

2023-05-29

](http://arxiv.org/abs/2305.14314v1)[**READ: Recurrent Adaptation of Large Transformers**

2023-05-29

](http://arxiv.org/abs/2305.15348v1)[**The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python**

2023-05-28

](http://arxiv.org/abs/2305.15507v1)[**SoundStorm: Efficient Parallel Audio Generation**

2023-05-28

](http://arxiv.org/abs/2305.09636v1)[**SLiC-HF: Sequence Likelihood Calibration with Human Feedback**

2023-05-27

](http://arxiv.org/abs/2305.10425v1)[**Voyager: An Open-Ended Embodied Agent with Large Language Models**

2023-05-27

](http://arxiv.org/abs/2305.16291v1)[**Gorilla: Large Language Model Connected with Massive APIs**

2023-05-25

](http://arxiv.org/abs/2305.15334v1)[**Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training**

2023-05-24

](http://arxiv.org/abs/2305.14342v1)[**Instruction Tuning with GPT-4**

2023-05-23

](http://arxiv.org/abs/2304.03277v1)[**Hot Pixels: Frequency, Power, and Temperature Attacks on GPUs and ARM SoCs**

2023-05-23

](http://arxiv.org/abs/2305.12784v1)[**Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold**

2023-05-23

](http://arxiv.org/abs/2305.10973v1)[**Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond**

2023-05-23

](http://arxiv.org/abs/2304.13712v2)[**Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability**

2023-05-23

](http://arxiv.org/abs/2305.08746v2)[**Towards Expert-Level Medical Question Answering with Large Language Models**

2023-05-23

](http://arxiv.org/abs/2305.09617v1)[**Tree of Thoughts: Deliberate Problem Solving with Large Language Models**

2023-05-23

](http://arxiv.org/abs/2305.10601v1)[**GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints**

2023-05-23

](http://arxiv.org/abs/2305.13245v1)[**Any-to-Any Generation via Composable Diffusion**

2023-05-22

](http://arxiv.org/abs/2305.11846v1)[**LIMA: Less Is More for Alignment**

2023-05-22

](http://arxiv.org/abs/2305.11206v1)[**How Does Generative Retrieval Scale to Millions of Passages?**

2023-05-22

](http://arxiv.org/abs/2305.11841v1)[**Learning to Compress Prompts with Gist Tokens**

2023-05-21

](http://arxiv.org/abs/2304.08467v1)[**Active Retrieval Augmented Generation**

2023-05-19

](http://arxiv.org/abs/2305.06983v1)[**Effective Theory of Transformers at Initialization**

2023-05-19

](http://arxiv.org/abs/2304.02034v1)[**Pre-Training to Learn in Context**

2023-05-17

](http://arxiv.org/abs/2305.09137v1)[**Symbol tuning improves in-context learning in language models**

2023-05-16

](http://arxiv.org/abs/2305.08298v1)[**MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers**

2023-05-15

](http://arxiv.org/abs/2305.07185v1)[**Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting**

2023-05-10

](http://arxiv.org/abs/2305.04388v1)[**Unlimiformer: Long-Range Transformers with Unlimited Length Input**

2023-05-09

](http://arxiv.org/abs/2305.01625v1)[**LoRA: Low-Rank Adaptation of Large Language Models**

2023-05-07

](http://arxiv.org/abs/2106.09685v2)[**Language Models are Multilingual Chain-of-Thought Reasoners**

2023-05-04

](http://arxiv.org/abs/2210.03057v1)[**WizardLM: Empowering Large Language Models to Follow Complex Instructions**

2023-05-04

](http://arxiv.org/abs/2304.12244v1)[**Quantifying Memorization Across Neural Language Models**

2023-05-01

](http://arxiv.org/abs/2202.07646v3)[**AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head**

2023-04-30

](http://arxiv.org/abs/2304.12995v1)[**Scaling Laws for Transfer**

2023-04-29

](http://arxiv.org/abs/2102.01293v1)[**Pretrain on just structure: Understanding linguistic inductive biases using transfer learning**

2023-04-28

](http://arxiv.org/abs/2304.13060v1)[**Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning**

2023-04-28

](http://arxiv.org/abs/2304.13653v1)[**Shortformer: Better Language Modeling using Shorter Inputs**

2023-04-27

](http://arxiv.org/abs/2012.15832v2)[**Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis**

2023-04-26

](http://arxiv.org/abs/2304.12317v1)[**A Cookbook of Self-Supervised Learning**

2023-04-25

](http://arxiv.org/abs/2304.12210v1)[**Scaling Transformer to 1M tokens and beyond with RMT**

2023-04-24

](http://arxiv.org/abs/2304.11062v1)[**DINOv2: Learning Robust Visual Features without Supervision**

2023-04-20

](http://arxiv.org/abs/2304.07193v1)[**Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models**

2023-04-19

](http://arxiv.org/abs/2205.10770v2)[**Pretraining Without Attention**

2023-04-19

](http://arxiv.org/abs/2212.10544v1)[**Synthetic Data from Diffusion Models Improves ImageNet Classification**

2023-04-18

](http://arxiv.org/abs/2304.08466v1)[**Sparks of Artificial General Intelligence: Early experiments with GPT-4**

2023-04-16

](http://arxiv.org/abs/2303.12712v1)[**GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers**

2023-04-15

](http://arxiv.org/abs/2210.17323v2)[**Layer Normalization**

2023-04-15

](http://arxiv.org/abs/1607.06450v1)[**Optimisation & Generalisation in Networks of Neurons**

2023-04-14

](http://arxiv.org/abs/2210.10101v1)[**Automatic Gradient Descent: Deep Learning without Hyperparameters**

2023-04-14

](http://arxiv.org/abs/2304.05187v1)[**ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation**

2023-04-14

](http://arxiv.org/abs/2304.05977v2)[**Segment Everything Everywhere All at Once**

2023-04-14

](http://arxiv.org/abs/2304.06718v1)[**Calibrated Chaos: Variance Between Runs of Neural Network Training is Harmless and Inevitable**

2023-04-13

](http://arxiv.org/abs/2304.01910v1)[**Teaching Large Language Models to Self-Debug**

2023-04-13

](http://arxiv.org/abs/2304.05128v1)[**Multimodal Analogical Reasoning over Knowledge Graphs**

2023-04-13

](http://arxiv.org/abs/2210.00312v4)[**Text-to-Table: A New Way of Information Extraction**

2023-04-11

](http://arxiv.org/abs/2109.02707v2)[**Generative Agents: Interactive Simulacra of Human Behavior**

2023-04-10

](http://arxiv.org/abs/2304.03442v1)[**Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark**

2023-04-09

](http://arxiv.org/abs/2304.03279v1)[**DeBERTa: Decoding-enhanced BERT with Disentangled Attention**

2023-04-08

](http://arxiv.org/abs/2006.03654v6)[**The Forward-Forward Algorithm: Some Preliminary Investigations**

2023-04-06

](http://arxiv.org/abs/2212.13345v1)[**ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators**

2023-04-05

](http://arxiv.org/abs/2003.10555v1)[**Twist Decoding: Diverse Generators Guide Each Other**

2023-04-04

](http://arxiv.org/abs/2205.09273v2)[**A Survey of Large Language Models**

2023-04-03

](http://arxiv.org/abs/2303.18223v1)[**Generative Adversarial Networks**

2023-04-02

](http://arxiv.org/abs/1406.2661v1)[**Deep contextualized word representations**

2023-04-02

](http://arxiv.org/abs/1802.05365v2)[**HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace**

2023-04-02

](http://arxiv.org/abs/2303.17580v1)[**The effectiveness of MAE pre-pretraining for billion-scale pretraining**

2023-04-02

](http://arxiv.org/abs/2303.13496v1)[**Natural Selection Favors AIs over Humans**

2023-04-02

](http://arxiv.org/abs/2303.16200v1)[**The case for 4-bit precision: k-bit Inference Scaling Laws**

2023-04-02

](http://arxiv.org/abs/2212.09720v2)[**The Power of Scale for Parameter-Efficient Prompt Tuning**

2023-04-02

](http://arxiv.org/abs/2104.08691v2)[**Formal Algorithms for Transformers**

2023-04-01

](http://arxiv.org/abs/2207.09238v1)[**BloombergGPT: A Large Language Model for Finance**

2023-03-31

](http://arxiv.org/abs/2303.17564v1)[**Efficient Training of Language Models to Fill in the Middle**

2023-03-30

](http://arxiv.org/abs/2207.14255v1)[**The Curious Case of Neural Text Degeneration**

2023-03-30

](http://arxiv.org/abs/1904.09751v2)[**Chain-of-Thought Prompting Elicits Reasoning in Large Language Models**

2023-03-30

](http://arxiv.org/abs/2201.11903v6)[**WebGPT: Browser-assisted question-answering with human feedback**

2023-03-30

](http://arxiv.org/abs/2112.09332v3)[**Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer**

2023-03-30

](http://arxiv.org/abs/1910.10683v3)[**Training language models to follow instructions with human feedback**

2023-03-30

](http://arxiv.org/abs/2203.02155v1)[**Finetuned Language Models Are Zero-Shot Learners**

2023-03-30

](http://arxiv.org/abs/2109.01652v5)[**Pretraining Language Models with Human Preferences**

2023-03-30

](http://arxiv.org/abs/2302.08582v1)[**Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer**

2023-03-30

](http://arxiv.org/abs/2203.03466v2)[**Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm**

2023-03-30

](http://arxiv.org/abs/2102.07350v1)[**Scaling Laws for Neural Language Models**

2023-03-30

](http://arxiv.org/abs/2001.08361v1)[**Competition-Level Code Generation with AlphaCode**

2023-03-30

](http://arxiv.org/abs/2203.07814v1)[**Training Compute-Optimal Large Language Models**

2023-03-30

](http://arxiv.org/abs/2203.15556v1)[**PaLM: Scaling Language Modeling with Pathways**

2023-03-30

](http://arxiv.org/abs/2204.02311v5)[**LaMDA: Language Models for Dialog Applications**

2023-03-30

](http://arxiv.org/abs/2201.08239v3)[**GLaM: Efficient Scaling of Language Models with Mixture-of-Experts**

2023-03-30

](http://arxiv.org/abs/2112.06905v2)[**Root Mean Square Layer Normalization**

2023-03-30

](http://arxiv.org/abs/1910.07467v1)[**ST-MoE: Designing Stable and Transferable Sparse Expert Models**

2023-03-30

](http://arxiv.org/abs/2202.08906v2)[**Constitutional AI: Harmlessness from AI Feedback**

2023-03-30

](http://arxiv.org/abs/2212.08073v1)[**Solving Quantitative Reasoning Problems with Language Models**

2023-03-30

](http://arxiv.org/abs/2206.14858v2)[**DeepNet: Scaling Transformers to 1,000 Layers**

2023-03-30

](http://arxiv.org/abs/2203.00555v1)[**Proximal Policy Optimization Algorithms**

2023-03-30

](http://arxiv.org/abs/1707.06347v2)[**Improving alignment of dialogue agents via targeted human judgements**

2023-03-30

](http://arxiv.org/abs/2209.14375v1)[**A data-driven approach for learning to control computers**

2023-03-30

](http://arxiv.org/abs/2202.08137v2)[**Toolformer: Language Models Can Teach Themselves to Use Tools**

2023-03-30

](http://arxiv.org/abs/2302.04761v1)[**Fast Transformer Decoding: One Write-Head is All You Need**

2023-03-30

](http://arxiv.org/abs/1911.02150v1)[**Language Models are Few-Shot Learners**

2023-03-30

](http://arxiv.org/abs/2005.14165v4)[**Attention Is All You Need**

2023-03-30

](http://arxiv.org/abs/1706.03762v5)[**Scaling Language Models: Methods, Analysis & Insights from Training Gopher**

2023-03-30

](http://arxiv.org/abs/2112.11446v2)[**Improving language models by retrieving from trillions of tokens**

2023-03-30

](http://arxiv.org/abs/2112.04426v3)[**Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model**

2023-03-30

](http://arxiv.org/abs/2201.11990v3)[**Flamingo: a Visual Language Model for Few-Shot Learning**

2023-03-30

](http://arxiv.org/abs/2204.14198v2)[**PaLI: A Jointly-Scaled Multilingual Language-Image Model**

2023-03-30

](http://arxiv.org/abs/2209.06794v2)[**A Generalist Agent**

2023-03-30

](http://arxiv.org/abs/2205.06175v3)[**A General Language Assistant as a Laboratory for Alignment**

2023-03-30

](http://arxiv.org/abs/2112.00861v3)[**Language Models (Mostly) Know What They Know**

2023-03-30

](http://arxiv.org/abs/2207.05221v4)[**Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity**

2023-03-30

](http://arxiv.org/abs/2101.03961v3)[**The Capacity for Moral Self-Correction in Large Language Models**

2023-03-30

](http://arxiv.org/abs/2302.07459v2)[**Fine-Tuning Language Models from Human Preferences**

2023-03-30

](http://arxiv.org/abs/1909.08593v2)[**BLOOM: A 176B-Parameter Open-Access Multilingual Language Model**

2023-03-30

](http://arxiv.org/abs/2211.05100v3)[**Galactica: A Large Language Model for Science**

2023-03-30

](http://arxiv.org/abs/2211.09085v1)[**OPT: Open Pre-trained Transformer Language Models**

2023-03-30

](http://arxiv.org/abs/2205.01068v4)[**GLM-130B: An Open Bilingual Pre-trained Model**

2023-03-30

](http://arxiv.org/abs/2210.02414v1)[**GPT-NeoX-20B: An Open-Source Autoregressive Language Model**

2023-03-30

](http://arxiv.org/abs/2204.06745v1)[**Unified Scaling Laws for Routed Language Models**

2023-03-30

](http://arxiv.org/abs/2202.01169v2)[**Efficient Large Scale Language Modeling with Mixtures of Experts**

2023-03-30

](http://arxiv.org/abs/2112.10684v2)[**Mixture-of-Experts with Expert Choice Routing**

2023-03-30

](http://arxiv.org/abs/2202.09368v2)[**Towards a Human-like Open-Domain Chatbot**

2023-03-30

](http://arxiv.org/abs/2001.09977v3)[**Self-attention Does Not Need Memory**

2023-03-30

](http://arxiv.org/abs/2112.05682v3)[**FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness**

2023-03-30

](http://arxiv.org/abs/2205.14135v2)[**muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems**

2023-03-30

](http://arxiv.org/abs/2205.10937v2)[**SemDeDup: Data-efficient learning at web-scale through semantic deduplication**

2023-03-30

](http://arxiv.org/abs/2303.09540v3)[**Hyena Hierarchy: Towards Larger Convolutional Language Models**

2023-03-28

](http://arxiv.org/abs/2302.10866v2)[**Scaling Expert Language Models with Unsupervised Domain Discovery**

2023-03-28

](http://arxiv.org/abs/2303.14177v1)[**ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks**

2023-03-28

](http://arxiv.org/abs/2303.15056v1)[**Toy Models of Superposition**

2023-03-26

](http://arxiv.org/abs/2209.10652v1)[**Superposition of many models into one**

2023-03-26

](http://arxiv.org/abs/1902.05522v2)[**Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer**

2023-03-26

](http://arxiv.org/abs/1701.06538v1)[**Polysemanticity and Capacity in Neural Networks**

2023-03-26

](http://arxiv.org/abs/2210.01892v2)[**Decoding by Linear Programming**

2023-03-26

](http://arxiv.org/abs/math/0502327v1)[**Learning Transferable Visual Models From Natural Language Supervision**

2023-03-26

](http://arxiv.org/abs/2103.00020v1)[**GPT-4 Technical Report**

2023-03-26

](http://arxiv.org/abs/2303.08774v2)[**Learning Models of Individual Behavior in Chess**

2023-03-25

](http://arxiv.org/abs/2008.10086v3)[**Optimizing Neural Networks with Kronecker-factored Approximate Curvature**

2023-03-25

](http://arxiv.org/abs/1503.05671v7)[**The alignment problem from a deep learning perspective**

2023-03-24

](http://arxiv.org/abs/2209.00626v4)[**Sequence to Sequence Learning with Neural Networks**

2023-03-24

](http://arxiv.org/abs/1409.3215v3)[**Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations**

2023-03-24

](http://arxiv.org/abs/1903.05895v2)[**Thinking Like Transformers**

2023-03-23

](http://arxiv.org/abs/2106.06981v2)[**Attention Approximates Sparse Distributed Memory**

2023-03-23

](http://arxiv.org/abs/2111.05498v2)[**Convex Optimization: Algorithms and Complexity**

2023-03-23

](http://arxiv.org/abs/1405.4980v2)[**On-Device Training Under 256KB Memory**

2023-03-23

](http://arxiv.org/abs/2206.15472v3)[**Simplified State Space Layers for Sequence Modeling**

2023-03-21

](http://arxiv.org/abs/2208.04933v3)[**Memorizing Transformers**

2023-03-20

](http://arxiv.org/abs/2203.08913v1)[**Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers**

2023-03-20

](http://arxiv.org/abs/2212.10559v2)[**CoLT5: Faster Long-Range Transformers with Conditional Computation**

2023-03-20

](http://arxiv.org/abs/2303.09752v1)[**Can Humans Do Less-Than-One-Shot Learning?**

2023-03-20

](http://arxiv.org/abs/2202.04670v1)[**Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads**

2023-03-19

](http://arxiv.org/abs/2202.07848v2)[**HiPPO: Recurrent Memory with Optimal Polynomial Projections**

2023-03-19

](http://arxiv.org/abs/2008.07669v2)[**Generating Long Sequences with Sparse Transformers**

2023-03-19

](http://arxiv.org/abs/1904.10509v1)[**Decoupled Context Processing for Context Augmented Language Modeling**

2023-03-18

](http://arxiv.org/abs/2210.05758v1)[**LLaMA: Open and Efficient Foundation Language Models**

2023-03-18

](http://arxiv.org/abs/2302.13971v1)[**GLU Variants Improve Transformer**

2023-03-17

](http://arxiv.org/abs/2002.05202v1)[**ART: Automatic multi-step reasoning and tool-use for large language models**

2023-03-17

](http://arxiv.org/abs/2303.09014v1)[**Erasing Concepts from Diffusion Models**

2023-03-17

](http://arxiv.org/abs/2303.07345v1)[**ViperGPT: Visual Inference via Python Execution for Reasoning**

2023-03-17

](http://arxiv.org/abs/2303.08128v1)[**On Calibration of Modern Neural Networks**

2023-03-15

](http://arxiv.org/abs/1706.04599v2)[**Meet in the Middle: A New Pre-training Paradigm**

2023-03-15

](http://arxiv.org/abs/2303.07295v1)[**Self-critiquing models for assisting human evaluators**

2023-03-15

](http://arxiv.org/abs/2206.05802v2)[**World of Bits: An Open-Domain Platform for Web-Based Agents**

2023-03-13

](https://proceedings.mlr.press/v70/shi17a/shi17a.pdf)[**Optimal Policies Tend to Seek Power**

2023-03-13

](http://arxiv.org/abs/1912.01683v10)[**Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation**

2023-03-12

](http://arxiv.org/abs/2108.12409v2)[**Towards Deep Learning Models Resistant to Adversarial Attacks**

2023-03-12

](http://arxiv.org/abs/1706.06083v4)[**RoFormer: Enhanced Transformer with Rotary Position Embedding**

2023-03-12

](http://arxiv.org/abs/2104.09864v4)[**Dissociating language and thought in large language models: a cognitive perspective**

2023-03-10

](http://arxiv.org/abs/2301.06627v1)[**Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models**

2023-03-10

](http://arxiv.org/abs/2303.04671v1)[**Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features**

2023-03-10

](http://arxiv.org/abs/2302.05441v1)[**Going Deeper with Convolutions**

2023-03-08

](http://arxiv.org/abs/1409.4842v1)[**Larger language models do in-context learning differently**

2023-03-08

](http://arxiv.org/abs/2303.03846v1)[**Can one hear the shape of a neural network?: Snooping the GPU via Magnetic Side Channel**

2023-03-07

](http://arxiv.org/abs/2109.07395v1)[**Prismer: A Vision-Language Model with An Ensemble of Experts**

2023-03-07

](http://arxiv.org/abs/2303.02506v1)[**PaLM-E: An Embodied Multimodal Language Model**

2023-03-07

](https://palm-e.github.io/assets/palm-e.pdf)[**Language Is Not All You Need: Aligning Perception with Language Models**

2023-03-05

](http://arxiv.org/abs/2302.14045v2)[**DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter**

2023-03-05

](http://arxiv.org/abs/1910.01108v4)[**Large Language Models are Zero-Shot Reasoners**

2023-03-05

](http://arxiv.org/abs/2205.11916v4)[**High-resolution image reconstruction with latent diffusion models from human brain activity**

2023-03-03

](https://www.biorxiv.org/content/10.1101/2022.11.18.517004v2.full.pdf)[**Language Models as Agent Models**

2023-03-03

](http://arxiv.org/abs/2212.01681v1)[**Least-to-Most Prompting Enables Complex Reasoning in Large Language Models**

2023-03-03

](http://arxiv.org/abs/2205.10625v2)[**Bayesian Model Selection, the Marginal Likelihood, and Generalization**

2023-03-01

](http://arxiv.org/abs/2202.11678v2)[**Adversarial Examples for Evaluating Reading Comprehension Systems**

2023-03-01

](http://arxiv.org/abs/1707.07328v1)[**On the Turing Completeness of Modern Neural Network Architectures**

2023-03-01

](http://arxiv.org/abs/1901.03429v1)[**Self-Consistency Improves Chain of Thought Reasoning in Language Models**

2023-02-26

](http://arxiv.org/abs/2203.11171v3)[**Emergent Abilities of Large Language Models**

2023-02-26

](http://arxiv.org/abs/2206.07682v2)[**Adding Conditional Control to Text-to-Image Diffusion Models**

2023-02-26

](http://arxiv.org/abs/2302.05543v1)[**Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection**

2023-02-26

](http://arxiv.org/abs/2004.07667v2)[**Recitation-Augmented Language Models**

2023-02-26

](http://arxiv.org/abs/2210.01296v2)[**Scaling Instruction-Finetuned Language Models**

2023-02-26

](http://arxiv.org/abs/2210.11416v5)[**Multitask Prompted Training Enables Zero-Shot Task Generalization**

2023-02-26

](http://arxiv.org/abs/2110.08207v3)[**What learning algorithm is in-context learning? Investigations with linear models**

2023-02-26

](http://arxiv.org/abs/2211.15661v2)[**Diffusion-LM Improves Controllable Text Generation**

2023-02-26

](http://arxiv.org/abs/2205.14217v1)[**Self-Instruct: Aligning Language Model with Self Generated Instructions**

2023-02-26

](http://arxiv.org/abs/2212.10560v1)[**Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints**

2023-02-26

](http://arxiv.org/abs/2212.05055v2)[**Efficiently Modeling Long Sequences with Structured State Spaces**

2023-02-26

](http://arxiv.org/abs/2111.00396v3)[**TALM: Tool Augmented Language Models**

2023-02-26

](http://arxiv.org/abs/2205.12255v1)[**Efficiently Scaling Transformer Inference**

2023-02-26

](http://arxiv.org/abs/2211.05102v1)[**Deduplicating Training Data Mitigates Privacy Risks in Language Models**

2023-02-26

](http://arxiv.org/abs/2202.06539v3)[**Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals**

2023-02-26

](http://arxiv.org/abs/2209.14958v1)[**Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?**

2023-02-26

](http://arxiv.org/abs/2202.12837v2)[**Large Language Models Encode Clinical Knowledge**

2023-02-26

](http://arxiv.org/abs/2212.13138v1)[**Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models**

2023-02-26

](http://arxiv.org/abs/2208.03306v1)[**Large Language Models Can Self-Improve**

2023-02-26

](http://arxiv.org/abs/2210.11610v2)[**Language Modelling with Pixels**

2023-02-26

](http://arxiv.org/abs/2207.06991v1)[**Holistic Evaluation of Language Models**

2023-02-26

](http://arxiv.org/abs/2211.09110v1)[**Data Distributional Properties Drive Emergent In-Context Learning in Transformers**

2023-02-26

](http://arxiv.org/abs/2205.05055v6)[**UL2: Unifying Language Learning Paradigms**

2023-02-26

](http://arxiv.org/abs/2205.05131v2)[**Transformer Memory as a Differentiable Search Index**

2023-02-26

](http://arxiv.org/abs/2202.06991v3)[**Transcending Scaling Laws with 0.1% Extra Compute**

2023-02-26

](http://arxiv.org/abs/2210.11399v2)[**RoBERTa: A Robustly Optimized BERT Pretraining Approach**

2023-02-25

](http://arxiv.org/abs/1907.11692v1)[**Revisiting Unreasonable Effectiveness of Data in Deep Learning Era**

2023-02-25

](http://arxiv.org/abs/1707.02968v2)[**Understanding deep learning requires rethinking generalization**

2023-02-25

](http://arxiv.org/abs/1611.03530v2)[**Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP**

2023-02-24

](http://arxiv.org/abs/2212.14024v2)[**Scaling Robot Learning with Semantically Imagined Experience**

2023-02-24

](http://arxiv.org/abs/2302.11550v1)[**Retrofitting Word Vectors to Semantic Lexicons**

2023-02-24

](http://arxiv.org/abs/1411.4166v4)[**Monarch: Expressive Structured Matrices for Efficient and Accurate Training**

2023-02-24

](http://arxiv.org/abs/2204.00595v1)[**Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism**

2023-02-24

](http://arxiv.org/abs/1909.08053v4)[**Generalized Out-of-Distribution Detection: A Survey**

2023-02-23

](http://arxiv.org/abs/2110.11334v2)[**Black-box Adversarial Attacks with Limited Queries and Information**

2023-02-23

](http://arxiv.org/abs/1804.08598v3)[**Meta-Learning in Neural Networks: A Survey**

2023-02-21

](http://arxiv.org/abs/2004.05439v2)[**DocPrompting: Generating Code by Retrieving the Docs**

2023-02-21

](http://arxiv.org/abs/2207.05987v3)[**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**

2023-02-21

](http://arxiv.org/abs/1810.04805v2)[**Self-supervised Learning: Generative or Contrastive**

2023-02-21

](http://arxiv.org/abs/2006.08218v5)[**Generating Wikipedia by Summarizing Long Sequences**

2023-02-21

](http://arxiv.org/abs/1801.10198v1)[**Scaling Vision Transformers to 22 Billion Parameters**

2023-02-20

](http://arxiv.org/abs/2302.05442v1)[**Extracting Training Data from Diffusion Models**

2023-02-20

](http://arxiv.org/abs/2301.13188v1)[**Risks from Learned Optimization in Advanced Machine Learning Systems**

2023-02-20

](http://arxiv.org/abs/1906.01820v3)[**HyperNetworks**

2023-02-20

](http://arxiv.org/abs/1609.09106v4)[**Transformers learn in-context by gradient descent**

2023-02-20

](http://arxiv.org/abs/2212.07677v1)[**Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback**

2023-02-20

](http://arxiv.org/abs/2204.05862v1)[**Progress measures for grokking via mechanistic interpretability**

2023-02-20

](http://arxiv.org/abs/2301.05217v2)[**Zero-Shot Text-to-Image Generation**

2023-02-19

](http://arxiv.org/abs/2102.12092v2)[**Dota 2 with Large Scale Deep Reinforcement Learning**

2023-02-19

](http://arxiv.org/abs/1912.06680v1)[**Evaluating Large Language Models Trained on Code**

2023-02-19

](http://arxiv.org/abs/2107.03374v2)[**Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction**

2023-02-18

](http://arxiv.org/abs/2105.06965v3)[**Multimodal Chain-of-Thought Reasoning in Language Models**

2023-02-18

](http://arxiv.org/abs/2302.00923v3)[**Implicit Representations of Meaning in Neural Language Models**

2023-02-18

](http://arxiv.org/abs/2106.00737v1)[**Symbolic Discovery of Optimization Algorithms**

2023-02-18

](http://arxiv.org/abs/2302.06675v1)[**Task-Specific Skill Localization in Fine-tuned Language Models**

2023-02-18

](http://arxiv.org/abs/2302.06600v1)[**Talking About Large Language Models**

2023-02-18

](http://arxiv.org/abs/2212.03551v5)[**Image-and-Language Understanding from Pixels Only**

2023-02-18

](http://arxiv.org/abs/2212.08045v1)[**Augmented Language Models: a Survey**

2023-02-18

](http://arxiv.org/abs/2302.07842v1)[**Beyond neural scaling laws: beating power law scaling via data pruning**

2023-02-18

](http://arxiv.org/abs/2206.14486v5)[**Discovering Latent Knowledge in Language Models Without Supervision**

2023-02-18

](http://arxiv.org/abs/2212.03827v1)[**Transformer models: an introduction and catalog**

2023-02-18

](http://arxiv.org/abs/2302.07730v1)[**Theory of Mind May Have Spontaneously Emerged in Large Language Models**

2023-02-18

](http://arxiv.org/abs/2302.02083v1)[**Unsolved Problems in ML Safety**

2023-02-18

](http://arxiv.org/abs/2109.13916v5)[**Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks**

2023-02-18

](http://arxiv.org/abs/2005.11401v4)[**Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets**

2023-02-18

](http://arxiv.org/abs/2201.02177v1)[**Large Language Models Are Human-Level Prompt Engineers**

2023-02-18

](http://arxiv.org/abs/2211.01910v1)[**Monolith: Real Time Recommendation System With Collisionless Embedding Table**

2023-02-18

](http://arxiv.org/abs/2209.07663v2)[**Discovering Language Model Behaviors with Model-Written Evaluations**

2023-02-18

](https://www.anthropic.com/model-written-evals.pdf)[**Adam: A Method for Stochastic Optimization**

2023-02-10

](https://arxiv.org/abs/1412.6980)[**A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity**

2023-02-10

](https://arxiv.org/abs/2302.04023v1)[**Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions**

2023-02-10

](https://arxiv.org/abs/2302.03764v1)[**Cramming: Training a Language Model on a Single GPU in One Day**

2023-02-10

](https://arxiv.org/abs/2212.14034v1)[**Mass-Editing Memory in a Transformer**

2023-02-08

](https://arxiv.org/abs/2210.07229v1)[**Large Language Models Can Be Easily Distracted by Irrelevant Context**

2023-02-08

](https://arxiv.org/abs/2302.00093v1)[**Machine Learning: The High-Interest Credit Card of Technical Debt**

2023-02-06

](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf)[**Human-Timescale Adaptation in an Open-Ended Task Space**

2023-02-06

](https://arxiv.org/abs/2301.07608v1)[**Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve**

2023-02-06

](https://arxiv.org/abs/2210.11618v1)[**Learning to summarize from human feedback**

2023-02-04

](https://arxiv.org/abs/2009.01325)[**Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models**

2023-02-03

](https://arxiv.org/abs/2206.04615)[**OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization**

2023-02-03

](https://arxiv.org/abs/2212.12017)[**A Watermark for Large Language Models**

2023-02-03

](https://arxiv.org/abs/2301.10226)[**Large Language Models Are Reasoning Teachers**

2023-02-03

](https://arxiv.org/abs/2212.10071)[**Teaching Small Language Models to Reason**

2023-02-03

](https://arxiv.org/abs/2212.08410)[**Downstream Datasets Make Surprisingly Good Pretraining Corpora**

2023-02-03

](https://arxiv.org/abs/2209.14389v1)[**One Model To Learn Them All**

2023-02-03

](https://arxiv.org/abs/1706.05137v1)[**Unifying Vision, Text, and Layout for Universal Document Processing**

2023-02-01

](https://arxiv.org/abs/2212.02623v2)[**Deep Double Descent: Where Bigger Models and More Data Hurt**

2023-01-27

](https://arxiv.org/abs/1912.02292v1)[**SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot**

2023-01-27

](https://arxiv.org/abs/2301.00774)[**MusicLM: Generating Music From Text**

2023-01-27

](https://arxiv.org/abs/2301.11325)[**Program Synthesis with Large Language Models**

2023-01-26

](https://arxiv.org/abs/2108.07732)[**Teaching Algorithmic Reasoning via In-context Learning**

2023-01-23

](https://arxiv.org/abs/2211.09066)[**Hungry Hungry Hippos: Towards Language Modeling with State Space Models**

2023-01-23

](https://arxiv.org/abs/2212.14052)[**Ask Me Anything: A simple strategy for prompting language models**

2023-01-21

](https://arxiv.org/abs/2210.02441)[**Precise Zero-Shot Dense Retrieval without Relevance Labels**

2023-01-17

](https://arxiv.org/abs/2212.10496)[**Mastering Diverse Domains through World Models**

2023-01-12

](https://arxiv.org/abs/2301.04104)[**Earlybird: Real-Time Search at Twitter**

2023-01-11

](http://notes.stephenholiday.com/Earlybird.pdf)[**Decision Transformer: Reinforcement Learning via Sequence Modeling**

2023-01-09

](https://arxiv.org/abs/2106.01345)[**Emergent Tool Use From Multi-Agent Autocurricula**

2023-01-09

](https://arxiv.org/abs/1909.07528)[**RT-1: Robotics Transformer for Real-World Control at Scale**

2023-01-09

](https://arxiv.org/abs/2212.06817)[**Birdwatch: Crowd Wisdom and Bridging Algorithms can Inform Understanding and Reduce the Spread of Misinformation**

2023-01-09

](https://arxiv.org/abs/2210.15723)[**Operationalizing Machine Learning: An Interview Study**

2023-01-09

](https://arxiv.org/abs/2209.09125)[**Semi-supervised Sequence Learning**

2023-01-09

](https://arxiv.org/abs/1511.01432)[**Large language models are not zero-shot communicators**

2023-01-09

](https://arxiv.org/abs/2210.14986)[**Show Your Work: Scratchpads for Intermediate Computation with Language Models**

2023-01-09

](https://arxiv.org/abs/2112.00114)[**ReAct: Synergizing Reasoning and Acting in Language Models**

2023-01-09

](https://arxiv.org/abs/2210.03629)[**A Succinct Summary of Reinforcement Learning**

2023-01-06

](https://arxiv.org/abs/2301.01379)[**A Mathematical Theory of Communication**

2022-12-26

](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)