- On the Fundamental Impossibility of Hallucination Control in Large Language Models
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- Stealing User Prompts from Mixture of Experts
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- Preserving Privacy in Large Language Models: A Survey on Current Threats and Solutions, by Michele Miranda
- Defeating Prompt Injections by Design
- Deep Learning with Differential Privacy
- Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
- Estimating Worst-Case Frontier Risks of Open-Weight LLMs (OpenAI safety paper for gpt-oss)
- Guardrail Removal and Malicious Fine-Tuning (MFT)
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- Look Into It - Uncensored AI, categorised under anti-refusal training in Estimating Worst-Case Frontier Risks of Open-Weight LLMs
- Domain-specific capability training
- Are aligned neural networks adversarially aligned?
- The Ai2 Safety Toolkit: Datasets and models for safe and responsible LLM development (Ai2)
- allenai/wildguard 🤗 model
- ⚠ Stealing Part of a Production Language Model
- Intriguing properties of neural networks
- Robustness May Be at Odds with Accuracy
- Causal Reasoning for Algorithmic Fairness
- A Causal Bayesian Networks Viewpoint on Fairness
- Counterfactual Fairness
- Causal Bayesian Networks: A flexible tool to enable fairer machine learning
- Machine Bias - ProPublica
- Google says sorry for racist auto-tag in photo app
Resources
- 10.4 Adversarial Examples, from Interpretable Machine Learning: A Guide for Making Black Box Models Explainable by Christoph Molnar
- Adversarial Attacks on Neural Networks Exploring the Fast Gradient Sign Method
- Adversarial attacks with FGSM (Fast Gradient Sign Method) - PyImageSearch
- Fairness (machine learning) - Wikipedia, which nicely sets out three group-fairness criteria:
- Independence: Independence of the sensitive characteristic and the predictors
- Separation: Conditional independence of the sensitive characteristic and the predictors conditioned on the target (GT)
- Sufficiency: Conditional independence of the sensitive characteristic and the target (GT) conditioned on the predictors
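The three criteria above can be checked empirically on a toy table of predictions. A minimal pure-Python sketch (the `rate` helper and the records are illustrative inventions; `A` is the sensitive attribute, `Y` the ground truth, `Yh` the prediction):

```python
def rate(rows, cond, event):
    """Empirical P(event | cond): filter rows matching cond, then average event."""
    sel = [r for r in rows if all(r[k] == v for k, v in cond.items())]
    return sum(event(r) for r in sel) / len(sel) if sel else None

# Toy records (made up for illustration).
rows = [
    {"A": 0, "Y": 1, "Yh": 1}, {"A": 0, "Y": 1, "Yh": 0},
    {"A": 0, "Y": 0, "Yh": 1}, {"A": 0, "Y": 0, "Yh": 0},
    {"A": 1, "Y": 1, "Yh": 1}, {"A": 1, "Y": 1, "Yh": 1},
    {"A": 1, "Y": 0, "Yh": 1}, {"A": 1, "Y": 0, "Yh": 0},
]

pos = lambda r: r["Yh"] == 1
truth = lambda r: r["Y"] == 1

# Independence: P(Yh=1 | A) equal across groups (demographic parity).
parity_gap = abs(rate(rows, {"A": 0}, pos) - rate(rows, {"A": 1}, pos))

# Separation: P(Yh=1 | A, Y=y) equal across groups for each y (here y=1, i.e. TPR).
tpr_gap = abs(rate(rows, {"A": 0, "Y": 1}, pos) - rate(rows, {"A": 1, "Y": 1}, pos))

# Sufficiency: P(Y=1 | A, Yh=yh) equal across groups for each yh (here yh=1, i.e. PPV).
ppv_gap = abs(rate(rows, {"A": 0, "Yh": 1}, truth) - rate(rows, {"A": 1, "Yh": 1}, truth))

print(parity_gap, tpr_gap, ppv_gap)
```

Each criterion reduces to comparing one conditional rate across groups; a gap of zero for every conditioning value is the (strict) criterion, and the three generally cannot all hold at once on the same classifier.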
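The FGSM resources above describe a one-step attack: x_adv = x + eps * sign(grad_x L(x, y)). A minimal sketch on a hand-rolled logistic model, with the analytic gradient standing in for autograd (the weights, input, and epsilon are made-up numbers for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, eps):
    """One FGSM step on logistic loss L = -log sigmoid(y * w.x), with y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # analytic gradient: dL/dx_i = -y * sigmoid(-margin) * w_i
    grad = [-y * sigmoid(-margin) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    # move each coordinate eps in the direction that increases the loss
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# illustrative model and a correctly classified input
w, x, y = [2.0, -1.0], [1.0, 1.0], 1
x_adv = fgsm(x, y, w, eps=0.6)
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
print(dot(w, x), dot(w, x_adv))  # score is positive before the attack, negative after
```

The sign function makes the perturbation an L-infinity ball step: every coordinate moves exactly eps, which is why a single gradient evaluation suffices.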
Articles
- Introducing LawZero, by Yoshua Bengio, signposting many useful resources and further reading
- Why I attack by Nicholas Carlini