14 Papers Accepted to NeurIPS 2024
Researchers from our department showcased their work at NeurIPS 2024, a leading conference that brings together experts in machine learning and related sciences to exchange ideas, foster collaboration, and advance interdisciplinary innovation.
Below are the abstracts of the accepted papers:
Mind the Gap: A Causal Perspective on Bias Amplification in Prediction & Decision-Making
Drago Plecko Columbia University, Elias Bareinboim Columbia University
Abstract:
As society increasingly relies on AI-based tools for decision-making in socially sensitive domains, investigating the fairness and equity of such automated systems has become a critical field of inquiry. Most of the literature in fair machine learning focuses on defining and achieving fairness criteria in the context of prediction, without explicitly focusing on how these predictions may be used later in the pipeline. For instance, if commonly used criteria, such as independence or sufficiency, are satisfied for a prediction score S used for binary classification, they need not be satisfied after an application of a simple thresholding operation on S (as commonly used in practice). In this paper, we take an important step toward addressing this issue for numerous statistical and causal notions of fairness. We introduce the notion of a margin complement, which measures how much a prediction score S changes due to a thresholding operation. We then demonstrate that the marginal difference in the optimal 0/1 predictor Ŷ between groups, written P(ŷ | x₁) − P(ŷ | x₀), can be causally decomposed into the influences of X on the L₂-optimal prediction score S and the influences of X on the margin complement M, along different causal pathways (direct, indirect, spurious). We then show that, under suitable causal assumptions, the influences of X on the prediction score S are equal to the influences of X on the true outcome Y. This yields a new decomposition of the disparity in the predictor Ŷ that allows us to disentangle causal differences inherited from the true outcome Y that exist in the real world from those introduced by the optimization procedure itself. This observation highlights the need for more regulatory oversight due to the potential for bias amplification, and to address this issue we introduce new notions of weak and strong business necessity, together with an algorithm for assessing whether these notions are satisfied. We apply our method to three real-world datasets and derive new insights on bias amplification in prediction and decision-making.
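In symbols, the decomposition described above can be pictured as follows; this rendering and the Δ notation are introduced here for illustration only, and the pathway-specific terms are defined precisely in the paper:

\[
P(\hat{y} \mid x_1) - P(\hat{y} \mid x_0) \;=\; \underbrace{\Delta^{\mathrm{dir}}_{S} + \Delta^{\mathrm{ind}}_{S} + \Delta^{\mathrm{spu}}_{S}}_{\text{influences of } X \text{ on the score } S} \;+\; \underbrace{\Delta^{\mathrm{dir}}_{M} + \Delta^{\mathrm{ind}}_{M} + \Delta^{\mathrm{spu}}_{M}}_{\text{influences of } X \text{ on the margin complement } M}
\]

Under the paper's causal assumptions, the score terms coincide with the corresponding influences of X on the true outcome Y, so any remaining disparity in Ŷ is attributable to the thresholding and optimization procedure rather than to the real world.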
Causal Imitation for Markov Decision Processes: a Partial Identification Approach
Kangrui Ruan Columbia University, Junzhe Zhang Columbia University, Xuan Di Columbia University, Elias Bareinboim Columbia University
Abstract:
Imitation learning enables an agent to learn from expert demonstrations when the performance measure is unknown and the reward signal is not specified. Standard imitation methods do not generally apply when the learner's and the expert's sensory capabilities are mismatched and the demonstrations are contaminated with unobserved confounding bias. To address these challenges, recent advancements in causal imitation learning have been pursued. However, these methods often require access to underlying causal structures that may not always be available, posing practical challenges. In this paper, we investigate robust imitation learning within the framework of canonical Markov Decision Processes (MDPs) using partial identification, allowing the agent to achieve expert performance even when the system dynamics are not uniquely determined from the confounded expert demonstrations. Specifically, we first theoretically demonstrate that when unobserved confounders (UCs) exist in an MDP, the learner is generally unable to imitate expert performance. We then explore imitation learning in partially identifiable settings, where either the transition distribution or the reward function is not identifiable from the available data and knowledge. Augmenting the celebrated GAIL method (Ho & Ermon, 2016), our analysis leads to two novel causal imitation algorithms that obtain effective policies guaranteed to achieve expert performance.
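One way to picture the partial-identification viewpoint, in notation introduced here rather than taken from the paper: the confounded demonstrations pin down only a set Ω of MDPs consistent with the data and causal knowledge, and a policy certifiably matches the expert if it does so in the worst case over that set:

\[
\pi^{*} \in \arg\max_{\pi}\; \min_{M \in \Omega}\; \big( V_{M}(\pi) - V_{M}(\pi_{E}) \big),
\]

where V_M(π) denotes the value of policy π in MDP M and π_E is the expert's policy; a non-negative optimum guarantees expert-level performance despite the non-identified transition distribution or reward.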
Partial Transportability for Domain Generalization
Kasra Jalaldoust Columbia University, Alexis Bellot Columbia University, Elias Bareinboim Columbia University
Abstract:
A fundamental task in AI is providing performance guarantees for predictions made in unseen domains. In practice, there can be substantial uncertainty about the distribution of new data, and corresponding variability in the performance of existing predictors. Building on the theory of partial identification and transportability, this paper introduces new results for bounding the value of a functional of the target distribution, such as the generalization error of a classifier, given data from source domains and assumptions about the data-generating mechanisms, encoded in causal diagrams. Our contribution is to provide the first general estimation technique for transportability problems, adapting existing parameterization schemes such as Neural Causal Models to encode the structural constraints necessary for cross-population inference. We demonstrate the expressiveness and consistency of this procedure and further propose a gradient-based optimization scheme for making scalable inferences in practice. Our results are corroborated with experiments.
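Concretely, the quantity being bounded can be written as follows (schematic notation introduced here for illustration):

\[
\min_{M \in \mathcal{M}(\mathcal{G},\,P^{1},\ldots,P^{K})} \mathbb{E}_{M}\big[\ell(f(X), Y)\big] \;\le\; \mathbb{E}_{P^{*}}\big[\ell(f(X), Y)\big] \;\le\; \max_{M \in \mathcal{M}(\mathcal{G},\,P^{1},\ldots,P^{K})} \mathbb{E}_{M}\big[\ell(f(X), Y)\big],
\]

where \mathcal{M}(\mathcal{G}, P^1, …, P^K) is the set of causal models compatible with the diagram \mathcal{G} and the K observed source distributions, P^* is the unknown target distribution, and \ell(f(X), Y) is the loss of the predictor f. The Neural Causal Model parameterization is what makes the search over this set tractable.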
Unified Covariate Adjustment for Causal Inference
Yonghan Jung Purdue University, Jin Tian Iowa State University, Elias Bareinboim Columbia University
Abstract:
Causal effect identification and estimation are two crucial tasks in causal inference. Although causal effect identification has been theoretically resolved, many existing estimators only address a subset of scenarios, known as the sequential back-door adjustment (SBD) (Pearl and Robins, 1995) or g-formula (Robins, 1986). Recent efforts to develop general-purpose estimators with broader coverage, incorporating the front-door adjustment (FD) (Pearl, 2000) and more, lack scalability due to the high computational cost of summing over high-dimensional variables. In this paper, we introduce a novel approach that achieves broad coverage of causal estimands beyond the SBD, incorporating various sum-product functionals like the FD, while maintaining scalability — the estimators run in time polynomial in the number of variables and samples. Specifically, we present the class of unified covariate adjustment (UCA), for which a scalable and doubly robust estimator is developed. In particular, we illustrate the expressiveness of UCA for a wide spectrum of causal estimands (e.g., SBD, FD, and more). We then develop an estimator that exhibits computational efficiency and double robustness. The scalability and robustness of the proposed framework are verified through simulations.
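For reference, the two classical adjustment functionals named above are given by the following standard formulas (textbook identities, not contributions of this paper); UCA is a broader class of sum-product functionals containing these and more, for which the proposed doubly robust estimator runs in polynomial time:

\[
\text{Back-door (with admissible set } Z\text{):}\quad P(y \mid do(x)) \;=\; \sum_{z} P(y \mid x, z)\, P(z),
\]
\[
\text{Front-door (with mediator } M\text{):}\quad P(y \mid do(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x').
\]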
Disentangled Representation Learning in Non-Markovian Causal Systems
Adam Li Columbia University, Yushu Pan Columbia University, Elias Bareinboim Columbia University
Abstract:
Considering various data modalities, such as images, videos, and text, humans perform causal reasoning using high-level causal variables rather than operating at the low, pixel level from which the data comes. In practice, most causal reasoning methods assume that the data is described at the same granularity as the underlying causal generative factors, an assumption that is often violated in various AI tasks. This mismatch translates into a lack of guarantees in tasks such as generative modeling, decision-making, fairness, and generalizability, to name a few. In this paper, we acknowledge this issue and study the problem of causal disentangled representation learning from a combination of data gathered from various heterogeneous domains and assumptions in the form of a latent causal graph. To the best of our knowledge, this is the first work to consider i) non-Markovian causal settings, where there may be unobserved confounding, ii) arbitrary distributions that arise from multiple domains, and iii) a relaxed version of disentanglement. Specifically, we introduce graphical criteria that allow for disentanglement under various conditions. Building on these results, we develop an algorithm that returns a causal disentanglement map, highlighting which latent variables can be disentangled given the combination of data and assumptions. The theory is corroborated by experiments.
The Fine-Grained Complexity of Gradient Computation for Training Large Language Models
Josh Alman Columbia University, Zhao Song Adobe Research
Abstract:
Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one needs to alternately run ‘forward’ computations and ‘backward’ computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as a gradient computation. In previous work [Alman and Song, NeurIPS 2023], it was proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis SETH is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training. This completely characterizes the fine-grained complexity of every step of LLM training.
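For readers outside the area, the forward computation referenced above is the standard softmax attention evaluation (the textbook definition, included here for context):

\[
\mathrm{Attn}(Q, K, V) \;=\; D^{-1} \exp\!\big(QK^{\top}\big)\, V, \qquad D \;=\; \mathrm{diag}\!\big(\exp\!\big(QK^{\top}\big)\, \mathbf{1}_{n}\big),
\]

where exp is applied entrywise, i.e., each row of QKᵀ is passed through a softmax before multiplying V. The backward step differentiates a training loss through this map, and the paper shows that its fine-grained complexity matches that of the forward step in every parameter regime.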
Metric Transforms and Low Rank Representations of Kernels for Fast Attention
Timothy Chu Independent Researcher, Josh Alman Columbia University, Gary L. Miller Carnegie Mellon University, Shyam Narayanan Citadel Securities, Mark Sellke Harvard University, Zhao Song Simons Institute for the Theory of Computing, UC Berkeley
Abstract:
We introduce a new linear-algebraic tool based on group representation theory, and use it to address three key problems in machine learning.
- Past researchers have proposed fast attention algorithms for LLMs by approximating or replacing softmax attention with other functions, such as low-degree polynomials. The key property of these functions is that, when applied entrywise to the matrix QKᵀ, the result is a low-rank matrix when Q and K are n × d matrices and n ≫ d. This suggests a natural question: what are all functions f with this property? If other such f exist and are quickly computable, they can be used in place of softmax for fast subquadratic attention algorithms. It was previously known that low-degree polynomials have this property. We prove that low-degree polynomials are the only piecewise continuous functions with this property. This suggests that low-rank fast attention only works for functions approximable by polynomials. Our work gives a converse to the polynomial method in algorithm design. (A short numerical illustration of the low-rank property appears after this list.)
- We prove the first full classification of all positive definite kernels that are functions of Manhattan or ℓ₁ distance. Our work generalizes, and also gives a new proof for, an existing theorem at the heart of kernel methods in machine learning: the classification of all positive definite kernels that are functions of Euclidean distance.
- The key problem in metric transforms, a mathematical theory used in geometry and machine learning, asks which functions transform pairwise distances in a metric space M into pairwise distances in a metric space N, for specified M and N. We prove the first full classification of functions that transform Manhattan distances to Manhattan distances. Our work generalizes the foundational work of Schoenberg, which fully classifies functions that transform Euclidean distances to Euclidean distances.
We additionally prove results about stable-rank preserving functions that are potentially useful in algorithmic design, and more. Our core tool for all our results is a new technique called the representation theory of the hyperrectangle.
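The low-rank phenomenon in the first bullet is easy to verify numerically. The following small sketch (illustrative only, not code from the paper) checks that applying a low-degree polynomial entrywise to QKᵀ keeps the rank bounded by a function of d alone, independent of n:

```python
# Illustrative check of the low-rank property: entrywise low-degree polynomials
# of Q @ K.T stay low rank when n >> d. Not from the paper; for intuition only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
A = Q @ K.T  # rank at most d

rank_A = np.linalg.matrix_rank(A)        # <= d
rank_A2 = np.linalg.matrix_rank(A ** 2)  # entrywise square: <= d * (d + 1) / 2
rank_A3 = np.linalg.matrix_rank(A ** 3)  # entrywise cube:   <= C(d + 2, 3)

print(f"n = {n}, d = {d}")
print(f"rank(A)    = {rank_A}")
print(f"rank(A**2) = {rank_A2}  (bound: {d * (d + 1) // 2})")
print(f"rank(A**3) = {rank_A3}  (bound: {d * (d + 1) * (d + 2) // 6})")
# A non-polynomial map such as np.exp(A) admits no such d-only bound, which is
# the dichotomy the paper's converse result makes precise.
```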
Statistical-Computational Trade-offs for Density Estimation
Anders Aamand University of Copenhagen, Alexandr Andoni Columbia University, Justin Chen MIT, Piotr Indyk MIT, Shyam Narayanan Citadel Securities, Sandeep Silwal UW-Madison, Haike Xu MIT
Abstract:
We study the density estimation problem defined as follows: given k distributions p_1, ..., p_k over a discrete domain [n], as well as a collection of samples chosen from a “query” distribution q over [n], output p_i that is “close” to q. Recently [1] gave the first and only known result that achieves sublinear bounds in both the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses O(n/log^c k) samples for some constant c > 0 and polynomial space, then the query time of the data structure must be at least k^{1−O(1)/log log k}, i.e., close to linear in the number of distributions k. This is a novel statistical-computational trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where q = p_i for some i, and when the distributions are flat (specifically, all distributions are uniform over half of the domain [n]). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.
Hypothesis Testing the Circuit Hypothesis in LLMs
Claudia Shi Columbia University, Nicolas Beltran Velez Columbia University, Achille Nazaret Columbia University, Carolina Zheng Columbia University, Adrià Garriga-Alonso FAR AI, Andrew Jesson Columbia University, Maggie Makar University of Michigan, Ann Arbor, David Blei Columbia University
Abstract:
Large language models (LLMs) demonstrate surprising capabilities, but we do not understand how they are implemented. One hypothesis suggests that these capabilities are primarily executed by small subnetworks within the LLM, known as circuits. But how can we evaluate this hypothesis? In this paper, we formalize a set of criteria that a circuit is hypothesized to meet and develop a suite of hypothesis tests to evaluate how well circuits satisfy them. The criteria focus on the extent to which the LLM’s behavior is preserved, the degree of localization of this behavior, and whether the circuit is minimal. We apply these tests to six circuits described in the research literature. We find that synthetic circuits — circuits that are hard-coded in the model — align with the idealized properties. Circuits discovered in Transformer models satisfy the criteria to varying degrees. To facilitate future empirical studies of circuits, we created the circuitry package, a wrapper around the TransformerLens library, which abstracts away lower-level manipulations of hooks and activations. The software is available at https://github.com/blei-lab/circuitry.
Treeffuser: probabilistic prediction via conditional diffusions with gradient-boosted trees
Nicolas Beltran Velez Columbia University, Alessandro A Grande Columbia University, Achille Nazaret Columbia University, Alp Kucukelbir Columbia University, David Blei Columbia University
Abstract:
Probabilistic prediction aims to compute predictive distributions rather than single point predictions. These distributions enable practitioners to quantify uncertainty, compute risk, and detect outliers. However, most probabilistic methods assume parametric responses, such as Gaussian or Poisson distributions. When these assumptions fail, such models lead to bad predictions and poorly calibrated uncertainty. In this paper, we propose Treeffuser, an easy-to-use method for probabilistic prediction on tabular data. The idea is to learn a conditional diffusion model where the score function is estimated using gradient-boosted trees. The conditional diffusion model makes Treeffuser flexible and non-parametric, while the gradient-boosted trees make it robust and easy to train on CPUs. Treeffuser learns well-calibrated predictive distributions and can handle a wide range of regression tasks—including those with multivariate, multimodal, and skewed responses. We study Treeffuser on synthetic and real data and show that it outperforms existing methods, providing better calibrated probabilistic predictions. We further demonstrate its versatility with an application to inventory allocation under uncertainty using sales data from Walmart. We implement Treeffuser in https://github.com/blei-lab/treeffuser.
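The recipe lends itself to a compact illustration. The sketch below (a minimal toy version written for this post, not the authors' implementation; the noise schedule, feature construction, and sampler are all assumptions) trains a gradient-boosted tree ensemble to estimate the diffusion score and then samples from the learned conditional distribution of y given x:

```python
# Toy sketch of the Treeffuser idea: a gradient-boosted tree ensemble estimates
# the score of a noised response, and annealed Langevin dynamics draws samples
# from p(y | x). Hyperparameters and schedule are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy data with a skewed, non-Gaussian response.
n = 2000
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + rng.gamma(shape=1.5, scale=0.3, size=n)

# Variance-exploding noise schedule (an assumption for this sketch).
sigmas = np.geomspace(0.01, 2.0, num=20)

# Fit trees to predict the injected noise eps from (x, noisy y, sigma);
# predicting eps recovers the score up to a factor of -1/sigma.
sig = rng.choice(sigmas, size=n)
eps = rng.standard_normal(n)
y_noisy = y + sig * eps
features = np.column_stack([X[:, 0], y_noisy, sig])
score_model = GradientBoostingRegressor(n_estimators=300, max_depth=3)
score_model.fit(features, eps)

def sample(x_new, steps_per_sigma=5, step_size=0.05):
    """Annealed Langevin sampling of y | x_new using the tree-based score."""
    y_t = sigmas[-1] * rng.standard_normal()
    for sigma in sigmas[::-1]:
        eta = step_size * sigma ** 2
        for _ in range(steps_per_sigma):
            feat = np.array([[x_new, y_t, sigma]])
            score = -score_model.predict(feat)[0] / sigma
            y_t = y_t + eta * score + np.sqrt(2 * eta) * rng.standard_normal()
    return y_t

draws = np.array([sample(0.5) for _ in range(200)])
print("predictive mean:", draws.mean(), " predictive std:", draws.std())
```

The empirical spread of the draws is the point of the method: the predictive distribution, not just a point estimate, comes out of the sampler. See the linked repository for the authors' implementation.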
Estimating the Hallucination Rate of Generative AI
Andrew Jesson Columbia University, Nicolas Beltran Velez Columbia University, Quentin Chu Columbia University, Sweta Karlekar Columbia University, Jannik Kossen University of Oxford, Yarin Gal University of Oxford, John Cunningham Columbia University, David Blei Columbia University
Abstract:
This paper presents a method for estimating the hallucination rate for in-context learning (ICL) with generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and a prediction question and asked to generate a response. One interpretation of ICL assumes that the CGM computes the posterior predictive of an unknown Bayesian model, which implicitly defines a joint distribution over observable datasets and latent mechanisms. This joint distribution factorizes into two components: the model prior over mechanisms and the model likelihood of datasets given a mechanism. With this perspective, we define a \textit{hallucination} as a generated response to the prediction question with low model likelihood given the mechanism. We develop a new method that takes an ICL problem and estimates the probability that a CGM will generate a hallucination. Our method only requires generating prediction questions and responses from the CGM and evaluating its response log probability. We empirically evaluate our method using large language models for synthetic regression and natural language ICL tasks.
EigenVI: score-based variational inference with orthogonal function expansions
Diana Cai Flatiron Institute, Chirag Modi Flatiron Institute, Charles Margossian Flatiron Institute, Robert Gower Flatiron Institute, David Blei Columbia University, Lawrence Saul Flatiron Institute
Abstract:
We develop EigenVI, an eigenvalue-based approach for black-box variational inference (BBVI). EigenVI constructs its variational approximations from orthogonal function expansions. For distributions over ℝ^D, the lowest order term in these expansions provides a Gaussian variational approximation, while higher-order terms provide a systematic way to model non-Gaussianity. These approximations are flexible enough to model complex distributions (multimodal, asymmetric), but they are simple enough that one can calculate their low-order moments and draw samples from them. EigenVI can also model other types of random variables (e.g., nonnegative, bounded) by constructing variational approximations from different families of orthogonal functions. Within these families, EigenVI computes the variational approximation that best matches the score function of the target distribution by minimizing a stochastic estimate of the Fisher divergence. Notably, this optimization reduces to solving a minimum eigenvalue problem, so that EigenVI effectively sidesteps the iterative gradient-based optimizations that are required for many other BBVI algorithms. (Gradient-based methods can be sensitive to learning rates, termination criteria, and other tunable hyperparameters.) We use EigenVI to approximate a variety of target distributions, including a benchmark suite of Bayesian models from posteriordb. On these distributions, we find that EigenVI is more accurate than existing methods for Gaussian BBVI.
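A schematic of the eigenvalue reduction mentioned above (notation introduced here; see the paper for the exact construction): writing the variational density as the square of an expansion in an orthonormal basis {φ_k}, and estimating the Fisher divergence to the target p from samples, the objective becomes a quadratic form in the coefficient vector α:

\[
q_{\alpha}(x) \;=\; \Big(\sum_{k} \alpha_{k}\, \phi_{k}(x)\Big)^{2}, \qquad \|\alpha\|_{2} = 1, \qquad \min_{\|\alpha\|_{2}=1} \widehat{D}_{F}\big(q_{\alpha}\,\big\|\,p\big) \;=\; \min_{\|\alpha\|_{2}=1} \alpha^{\top} \widehat{M}\, \alpha,
\]

which is minimized by the eigenvector of the data-dependent matrix \widehat{M} associated with its smallest eigenvalue, so no iterative gradient-based optimization is needed.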
Optimization-based Causal Estimation from Heterogeneous Environments
Mingzhang Yin University of Florida, Yixin Wang University of Michigan, David Blei Columbia University
Abstract:
This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments—and ones that exhibit sufficient heterogeneity—CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model and more accurate predictions under interventions.
Understanding Transformer Reasoning Capabilities via Graph Algorithms
Clayton Sanford Columbia University, Bahare Fatemi Google Research, Ethan Hall Google, Anton Tsitsulin Google Research, Mehran Kazemi Google DeepMind, Jonathan Halcrow Google Research, Bryan Perozzi Google Research, Vahab Mirrokni Google Research
Abstract:
Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their algorithmic reasoning capabilities in realistic parameter regimes is lacking. We investigate this question in terms of the network’s depth, width, and number of extra tokens for algorithm execution. Our novel representational hierarchy separates 9 algorithmic reasoning problems into classes solvable by transformers in different realistic parameter scaling regimes. We prove that logarithmic depth is necessary and sufficient for tasks like graph connectivity, while single-layer transformers with small embedding dimensions can solve contextual retrieval tasks. We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.