- On the Fundamental Impossibility of Hallucination Control in Large Language Models
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- Stealing User Prompts from Mixture of Experts
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- Preserving Privacy in Large Language Models: A Survey on Current Threats and Solutions, by Michele Miranda
- Defeating Prompt Injections by Design
- Deep Learning with Differential Privacy
- Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
- Estimating Worst-Case Frontier Risks of Open-Weight LLMs (OpenAI safety paper for gpt-oss)
- Guardrail Removal and Malicious Fine-Tuning (MFT)
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- Look Into It - Uncensored AI, categorised under anti-refusal training in Estimating Worst-Case Frontier Risks of Open-Weight LLMs
- Domain-specific capability training
- Are aligned neural networks adversarially aligned?
- The Ai2 Safety Toolkit: Datasets and models for safe and responsible LLM development (Ai2)
- allenai/wildguard 🤗 model
- ⚠ Stealing Part of a Production Language Model
- Intriguing properties of neural networks
- Robustness May Be at Odds with Accuracy
- Causal Reasoning for Algorithmic Fairness
- A Causal Bayesian Networks Viewpoint on Fairness
- Counterfactual Fairness
- Causal Bayesian Networks: A flexible tool to enable fairer machine learning
- Machine Bias - ProPublica
- Google says sorry for racist auto-tag in photo app
Resources
- 10.4 Adversarial Examples, from Interpretable Machine Learning: A Guide for Making Black Box Models Explainable by Christoph Molnar
- Adversarial Attacks on Neural Networks Exploring the Fast Gradient Sign Method
- Adversarial attacks with FGSM (Fast Gradient Sign Method) - PyImageSearch
- Fairness (machine learning) - Wikipedia, which nicely sets out three group-fairness criteria:
- Independence: Independence of the sensitive characteristic and the predictors
- Separation: Conditional independence of the sensitive characteristic and the predictors conditioned on the target (GT)
- Sufficiency: Conditional independence of the sensitive characteristic and the target (GT) conditioned on the predictors
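The three criteria above can be checked empirically on a toy table of predictions. A minimal pure-Python sketch (the `rate` helper and the records are illustrative inventions; `A` is the sensitive attribute, `Y` the ground truth, `Yh` the prediction):

```python
def rate(rows, cond, event):
    """Empirical P(event | cond): filter rows matching cond, then average event."""
    sel = [r for r in rows if all(r[k] == v for k, v in cond.items())]
    return sum(event(r) for r in sel) / len(sel) if sel else None

# Toy records (made up for illustration).
rows = [
    {"A": 0, "Y": 1, "Yh": 1}, {"A": 0, "Y": 1, "Yh": 0},
    {"A": 0, "Y": 0, "Yh": 1}, {"A": 0, "Y": 0, "Yh": 0},
    {"A": 1, "Y": 1, "Yh": 1}, {"A": 1, "Y": 1, "Yh": 1},
    {"A": 1, "Y": 0, "Yh": 1}, {"A": 1, "Y": 0, "Yh": 0},
]

pos = lambda r: r["Yh"] == 1
truth = lambda r: r["Y"] == 1

# Independence: P(Yh=1 | A) equal across groups (demographic parity).
parity_gap = abs(rate(rows, {"A": 0}, pos) - rate(rows, {"A": 1}, pos))

# Separation: P(Yh=1 | A, Y=y) equal across groups for each y (here y=1, i.e. TPR).
tpr_gap = abs(rate(rows, {"A": 0, "Y": 1}, pos) - rate(rows, {"A": 1, "Y": 1}, pos))

# Sufficiency: P(Y=1 | A, Yh=yh) equal across groups for each yh (here yh=1, i.e. PPV).
ppv_gap = abs(rate(rows, {"A": 0, "Yh": 1}, truth) - rate(rows, {"A": 1, "Yh": 1}, truth))

print(parity_gap, tpr_gap, ppv_gap)
```

Each criterion reduces to comparing one conditional rate across groups; a gap of zero for every conditioning value is the (strict) criterion, and the three generally cannot all hold at once on the same classifier.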
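The FGSM resources above describe a one-step attack: x_adv = x + eps * sign(grad_x L(x, y)). A minimal sketch on a hand-rolled logistic model, with the analytic gradient standing in for autograd (the weights, input, and epsilon are made-up numbers for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, eps):
    """One FGSM step on logistic loss L = -log sigmoid(y * w.x), with y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # analytic gradient: dL/dx_i = -y * sigmoid(-margin) * w_i
    grad = [-y * sigmoid(-margin) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    # move each coordinate eps in the direction that increases the loss
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# illustrative model and a correctly classified input
w, x, y = [2.0, -1.0], [1.0, 1.0], 1
x_adv = fgsm(x, y, w, eps=0.6)
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
print(dot(w, x), dot(w, x_adv))  # score is positive before the attack, negative after
```

The sign function makes the perturbation an L-infinity ball step: every coordinate moves exactly eps, which is why a single gradient evaluation suffices.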
Articles
- Introducing LawZero, by Yoshua Bengio, signposting many useful resources and further reading
- Why I attack by Nicholas Carlini