Context-Aware Reasoning On Parametric Knowledge
for Inferring Causal Variables
September 04, 2024
Scientific discovery catalyzes human intellectual advances, driven by the cycle of hypothesis generation, experimental design, evaluation, and assumption refinement. Central to this process is causal inference, which uncovers the mechanisms behind observed phenomena. While randomized experiments provide strong inferences, they are often infeasible due to ethical or practical constraints, and observational studies are instead prone to confounding or mediating biases. Identifying such backdoor paths is crucial but expensive, and it depends heavily on scientists’ domain knowledge to generate hypotheses. We introduce a novel benchmark in which the objective is to complete a partial causal graph, with varying difficulty levels and over 4,000 queries. We show the strong ability of LLMs to hypothesize the backdoor variables between a cause and its effect. Unlike simple memorization of fixed associations, our task requires the LLM to reason according to the context of the entire graph.
Scientific discovery has been key to humankind’s advances. It is a dynamic process revolving around inquiry and refinement. Scientists adhere to a process that involves formulating a hypothesis and then collecting pertinent data [1]. They then draw inferences from these experiments, modify the hypothesis, formulate sub-questions, and repeat the process until the research question is answered [2].
Central to scientific discovery is formulating hypotheses and identifying relevant variables that drive the underlying causal mechanisms of observed phenomena [3]. Randomized controlled trials are the gold standard for establishing causal relationships, but they are often infeasible due to ethical, financial, or logistical constraints [4]. In such cases, researchers rely on observational data, where a key challenge lies not only in analyzing relationships but in determining which variables should be observed and included in the analysis, particularly confounders or mediators that influence causal mechanisms underlying the outcomes [5], [6].
With the recent advancement of Large Language Models (LLMs), there has been a growing interest in using them for scientific discovery [7]–[9]. LLMs have demonstrated strong performance in internalizing knowledge [10], [11] and reasoning-based tasks [12], [13], including causal discovery, where they infer pairwise causal relationships based on variable semantics [2], [14]–[18].
Scientific reasoning is fundamentally context-driven; unlike simple factual retrieval, it requires adapting hypotheses based on new evidence and integrating knowledge across varying subpopulations. While recent work has explored the use of LLMs for causal discovery [2], [14]–[17], much of it assumes a fixed set of variables and focuses on identifying relationships among them. However, a critical and underexplored aspect is determining which variables should be considered in the first place. This demands flexible, context-sensitive reasoning to identify missing causal factors.
In our paper, we use the term reasoning operationally to describe the model’s ability to generate hypotheses or identify variables that complete partial causal graphs. Our usage follows previous work by [2] in causal discovery and LLM research, where “reasoning” often refers to generating plausible hypotheses or prioritizing potential candidates given partial structural information, rather than strict deductive logic.
To address this gap, we propose a novel task: given a partial causal graph with missing variables, the LLM is prompted to hypothesize what those variables might be, using the structure and known nodes as context. By systematically omitting different variables, we generate diverse test cases to evaluate the robustness of model reasoning. We further decompose the benchmark into subtasks, starting from baseline variable identification to more realistic, open-ended settings where multiple unobserved mediators exist between known treatments and outcomes.
Our task mirrors real-world scientific workflows, where identifying missing variables, especially confounders and mediators, is essential for valid causal inference. This typically demands costly, interdisciplinary effort. LLMs, trained on diverse knowledge sources, offer a scalable alternative. For example, in a stroke drug study, an LLM might suggest socioeconomic status as an unmeasured confounder. While recent works advocate using LLMs as co-pilots for causal tasks [19], [20], systematic evaluations are lacking. Our benchmark addresses this gap by assessing LLMs’ ability to infer missing causal variables across domains.
Our main contributions are: 1) We propose and formalize the novel task of LLM-assisted causal variable inference. 2) We propose a benchmark for inferring missing variables across diverse domains of causal graphs. 3) We design experimental tasks with different difficulty levels and knowledge assumptions, such as open-world and closed-world settings, the number of missing variables, etc. 4) Our benchmark allows for both grounded evaluations and a reproducible framework to benchmark LLMs’ capabilities in hypothesis generation.
Our work builds on the foundational framework of causality by [21]. Prior studies have explored extracting causal relationships from text [22]–[25] and using LLMs for causal reasoning [2], including commonsense [26], [27] and temporal causality [28], [29]. Recent efforts prompt LLMs with variable names to discover causal structures [2], [14]–[17]. Others integrate LLMs with deep structural causal models [30], [31], or focus on graph formatting [32], query design [33], and causal inference [34]. In contrast to prior work, we use LLMs to infer missing variables before data collection and evaluation, leveraging their pre-trained knowledge for this novel hypothesizing task.
Existing work tested hypothesis generation with LLMs in reasoning tasks or free-form scientific hypotheses from background knowledge provided in the context [8], [35]–[39]. In contrast, we consider the structured task of causal hypothesis generation, where the ground-truth variables are known and can be used for evaluation.
Context-sensitive reasoning in LLMs has been explored through prompt engineering [40]–[42], premise ordering manipulation [43], diagnostic analyses [44], and compositional reasoning evaluations [45], [46]. Unlike premise-based or linguistic evaluations, our setup requires reasoning over causal graph topology, using contextual cues under varying knowledge assumptions.
A causal relationship can be modeled via a Directed Acyclic Graph (DAG). A causal DAG represents relationships between a set of \(N\) variables \(\mathbf{V} = \{ v_1,...,v_N \}\). The variables are encoded in a graph \(\mathcal{G} = (\mathbf{V}, \mathbf{E})\), where \(\mathbf{E}\) is a set of directed edges between the nodes in \(\mathbf{V}\) such that no cycle is formed. Mathematically, it can be expressed as: \[\mathcal{G} = (\mathbf{V}, \mathbf{E}),\] \[\mathbf{E} = \{e_{i,j} \mid v_i, v_j \in \mathbf{V} , i \neq j \enspace \text{and} \: v_i \rightarrow v_j \}\] Each edge \(e_{i,j}\) denotes a causal relationship and the influence from \(v_i\) to \(v_j\), \(v_i \xrightarrow{e_{i,j}} v_j\). We define \(\mathbf{d}(v)\) as the degree of a node \(v\), representing the total number of edges connected to \(v\); \(\mathbf{d_{\text{in}}}(v)\) is the in-degree, the number of incoming edges to \(v\); and \(\mathbf{d_{\text{out}}}(v)\) is the out-degree, the number of outgoing edges from \(v\).
Source has no incoming edges; \(d_{\text{in}}(v) = 0\).
Sink has no outgoing edges; \(d_{\text{out}}(v) = 0\).
Treatment is characterized by nodes that are being intervened upon.
Outcome is characterized by nodes that are observed for interventions from the treatments.
Mediator has both incoming and outgoing edges (\(d_{\text{in}}(v) > 0\) and \(d_{\text{out}}(v) > 0\)), serving as an intermediary on the pathways between treatment and outcome.
Confounder influences both treatment and outcome, with edges directed towards both the treatment and outcome nodes (\(d_{\text{out}}(v) \geq 2\)); hence, \(v\) is a confounder if it is a parent of both \(v_i\) and \(v_j\).
Collider has at least two incoming edges (\(d_{\text{in}}(v) > 1\)); i.e., \(v\) is a collider if it is a child of both \(v_i\) and \(v_j\).
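To make the degree-based definitions above concrete, the following is a minimal sketch (ours, not the authors’ code) that classifies nodes by these roles using networkx; the edge list loosely mirrors the Cancer graph and is only illustrative.

```python
# Classify node roles in a causal DAG from in-/out-degrees (illustrative sketch).
import networkx as nx

G = nx.DiGraph([
    ("Pollution", "Cancer"),   # edges loosely mirroring the Cancer DAG
    ("Smoking", "Cancer"),
    ("Cancer", "Dyspnoea"),
    ("Cancer", "XRay"),
])

def node_roles(G: nx.DiGraph, v) -> list[str]:
    """Return the structural roles implied by the degrees of node v."""
    d_in, d_out = G.in_degree(v), G.out_degree(v)
    roles = []
    if d_in == 0:
        roles.append("source")
    if d_out == 0:
        roles.append("sink")
    if d_in > 0 and d_out > 0:
        roles.append("mediator")
    if d_out >= 2:
        roles.append("potential confounder")  # parent of at least two nodes
    if d_in > 1:
        roles.append("collider")
    return roles

for v in G.nodes:
    print(v, node_roles(G, v))
```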
Motivated by the challenge of discovering variables that block backdoor paths to ensure unbiased causal inference [47], in this work, we leverage language models to infer missing variables in a causal DAG. We assume that a part of the graph is already known, and the aim is to find additional variables that can be incorporated into the existing DAG to enhance the underlying causal mechanism.
Formally, we assume a partially known causal DAG, \(\mathcal{G}^* = (\mathrm{V}^*, \mathrm{E}^*)\), where \(\mathrm{V}^* \subseteq \mathbf{V}\) and \(\mathrm{E}^* \subseteq \mathbf{E}\). The objective is to identify the set of missing variables \(\mathrm{V}_{\text{missing}} = \mathbf{V} \setminus \mathrm{V}^*\), thereby expanding \(\mathcal{G}^*\) to \(\mathcal{G}\). We assume that all causal relationships (edges) among the variables in \(\mathrm{V}^*\) are known and correctly represented in \(\mathcal{G}^*\); i.e., \(\mathrm{E}^*\) is fully specified. Here, “missing” variables are not latent variables or variables hidden by measurement error, but known unknowns within the causal graph from the LLM’s perspective.
To systematically assess LLMs’ ability to infer missing causal variables, we construct a multi-stage benchmark with increasing levels of complexity. We begin with a controlled setting, where the model is provided with a partial causal DAG and a set of multiple-choice options to identify missing variables. Then, the task becomes open-ended, where LLMs hypothesize missing variables, simulating an open-world paradigm. Additionally, as the task escalates, we introduce more complexity by omitting additional nodes, challenging the model to hypothesize multiple missing variables.
We evaluate the reasoning capability of LLMs through prompting. We represent the graph \(\mathcal{G}^*\) using a prompt template \(P_{\text{LLM}}(\cdot)\) which enables LLMs to parse causal relationships in the DAG.
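As an illustration of \(P_{\text{LLM}}(\cdot)\), the sketch below verbalizes a partial DAG edge by edge; the exact wording used in the paper follows [32], so this phrasing is only a hedged approximation.

```python
# Hedged sketch of a textual prompt template for a partial causal DAG.
import networkx as nx

def verbalize_dag(G: nx.DiGraph) -> str:
    """Render each directed edge as a 'cause causes effect' sentence."""
    return "\n".join(f"{u} causes {v}." for u, v in G.edges)

def build_prompt(G_partial: nx.DiGraph, question: str) -> str:
    return "You are given a partial causal graph:\n" + verbalize_dag(G_partial) + "\n" + question

G_star = nx.DiGraph([("Smoking", "Cancer"), ("Cancer", "Dyspnoea")])
print(build_prompt(G_star, "Which variable is missing from this causal graph?"))
```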
Motivation. To assess whether LLMs can infer missing variables in causal graphs, we begin with a controlled multiple-choice setting that serves as a baseline. This task isolates the core challenge: identifying a single missing variable from a causal DAG. By restricting the search space to a fixed set of options, including the correct variable and out-of-context distractors, we evaluate whether the model can distinguish the variable that meaningfully completes the causal structure.
The partial DAG \(\mathcal{G}^*\) is created by removing one variable, denoted as \(v_x\), from the original DAG \(\mathcal{G}\). The role of the LLM is to select a variable from the multiple choices, \(\text{MCQ}_{v_x}\), that can be used to complete the graph.
Figure 2: Leveraging an LLM to identify the missing variable in a causal DAG in the presence of out-of-context distractors (a), and with an in-context distractor alongside out-of-context distractors (b). a — Task 1, b — Task 2
The out-of-context distractors are unrelated to the causal domain of the given DAG, chosen to minimize any contextual overlap with the true missing variable. Let \(v_x^*\) represent the variable selected by the LLM to complete \(\mathcal{G}^*\). \[v_x^* = P_{\text{LLM}}(\mathcal{G}^*, \text{MCQ}_{v_x}) \: \: \: \forall v_x \in \boldsymbol{{V}}\]
Motivation. In real-world domains like healthcare and finance, missing or unobserved variables often challenge causal inference [48], [49]. This task simulates such ambiguity by requiring LLMs to identify a relevant missing variable when presented with multiple plausible options, going beyond the baseline.
Here, instead of removing one node from the ground truth DAG \(\mathcal{G}\), two nodes, \(v_{x_1}\) and \(v_{x_2}\), are now removed to create the partial graph, \(\mathcal{G}^{*}\). \[\mathcal{G}^{*} = \mathcal{G} \setminus \{v_{x_1}, v_{x_2}\} \quad \text{for} \quad v_{x_1}, v_{x_2} \in \mathbf{V}\] The MCQA paradigm provides multiple choices, including the missing variables \(v_{x_1}\) and \(v_{x_2}\). The task for the LLM here is to select the correct variable \(v_{x_1}\) only, given an in-context choice \(v_{x_2}\) and out-of-context choices. The in-context variables are plausible within the same causal graph, allowing the LLM to use DAG-defined context inference to distinguish the relevant from the irrelevant options. We ensure \(v_{x_1}\) and \(v_{x_2}\) are not directly connected i.e., neither is a parent of the other. \[\begin{align} v_{x_1}^{*} &= P_{\text{LLM}}(\mathcal{G^*}, \text{MCQ}_{v_{x_1}, v_{x_2}}) \:\:\: \forall \: v_{x_1}, v_{x_2} \in \mathbf{V} \: \: \end{align}\] \[\begin{align} \text{and} \: v_{x_1} \not\rightarrow v_{x_2}, \:\: v_{x_2} \not\rightarrow v_{x_1} \end{align}\]
Motivation. Previous tasks constrained the model to select from predefined options. However, real-world reasoning rarely offers such scaffolding. This task increases complexity by removing the multiple-choice format entirely.
Given a partial DAG \(\mathcal{G}^*\), formed by removing a node \(v_x\), the model must generate potential missing variables without any provided candidates (see Figure 3 (a)). The output is a ranked list of hypotheses \(\{v_{x,1}^{*},..., v_{x,k}^{*}\}\) of \(k\) suggestions, simulating open-ended discovery. \[\{v_{x,1}^{*}, v_{x,2}^{*}, ..., v_{x,k}^{*}\} =P_{\text{LLM}}(\mathcal{G}^{*}) \: \forall \: v_{x} \in \mathbf{V}\]
Figure 3: Leveraging an LLM to hypothesize the missing variable in a causal DAG in an open-world setting for one variable (a), and in an iterative fashion for multiple missing mediators (b). a — Task 3, b — Task 4
Motivation. Building on the open-world setting, we further increase task difficulty by removing multiple nodes from the causal graph. The goal is no longer to recover a single missing variable but to iteratively hypothesize a set of mediators that link a treatment to an outcome.
Given a partial DAG \(\mathcal{G}^* = \mathcal{G} \setminus \{v_{x_1}, \ldots, v_{x_M}\}\), the task (illustrated in Figure 3 (b)) involves generating a sequence of missing mediators \(M = \{v_{m_1}, v_{m_2}, ..., v_{m_H}\}\) that plausibly connect a treatment variable \(v_t\) to an outcome variable \(v_y\).
At each iteration \(i\), the LLM is prompted with the current partial graph and returns a hypothesis for the next mediator. This process continues until all of the mediators are inferred. \[v_{m_i}^* = P_{\text{LLM}}(\mathcal{G}^* \cup \{v_{m_1}^*, ..., v_{m_{i-1}}^*\}),\] for \(i = 1, ..., H\). By default, the order in which the mediators \(M = \{v_{m_1}, v_{m_2}, ..., v_{m_H}\}\) are queried is chosen at random.
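A minimal sketch of this iterative loop is shown below (our illustration, not the released code); the `ask` callable stands in for a call to \(P_{\text{LLM}}\), and how the accepted mediator is wired back into the graph is a simplifying assumption.

```python
# Iterative mediator hypothesis loop for Task 4 (illustrative sketch).
import networkx as nx

def hypothesize_mediators(G_star: nx.DiGraph, treatment: str, outcome: str,
                          n_missing: int, ask) -> list[str]:
    graph, mediators = G_star.copy(), []
    for _ in range(n_missing):
        prompt = ("Partial causal graph:\n"
                  + "\n".join(f"{u} -> {v}" for u, v in graph.edges)
                  + f"\nSuggest one missing mediator between {treatment} and {outcome}.")
        v_m = ask(prompt)
        # Simplifying assumption: wire the accepted hypothesis between treatment and
        # outcome so the next iteration is prompted with the updated partial graph.
        graph.add_edge(treatment, v_m)
        graph.add_edge(v_m, outcome)
        mediators.append(v_m)
    return mediators

G_star = nx.DiGraph([("Smoking", "LungCancer"), ("LungCancer", "Dyspnoea")])
canned = iter(["Tar deposits", "Airway inflammation"])  # dummy LLM answers for the demo
print(hypothesize_mediators(G_star, "Smoking", "Dyspnoea", 2, lambda _: next(canned)))
```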
To assess how mediator order affects performance, we draw on mediation analysis concepts [50], specifically the Natural Direct Effect (NDE)—the treatment’s effect not mediated by a variable—and the Natural Indirect Effect (NIE)—the portion mediated by it (see Appendix 10.4). We propose the Mediation Influence Score (MIS) to quantify each mediator’s impact between a treatment and outcome. Defined as the ratio of NIE to NDE, MIS is a scale-free, positive measure of a mediator’s relative contribution: \[\begin{align} \text{MIS} \: (v_{m_i}) = \left| \frac{\text{NIE}(v_{m_i})}{\text{NDE}(v_{m_i})}\right| \quad \text{for} \quad i = 1, ..., H. \end{align}\] This metric quantifies the relative importance of the indirect effect (through the mediator) compared to the direct impact. Mediators are then ranked and prioritized based on their MIS scores, with higher scores indicating a stronger mediation effect.
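The ranking step reduces to sorting mediators by this ratio; a small sketch with made-up NDE/NIE values (not estimates from the paper) is given below.

```python
# Rank mediators by the Mediation Influence Score MIS = |NIE / NDE| (illustrative values).
def mis(nie: float, nde: float, eps: float = 1e-8) -> float:
    return abs(nie) / (abs(nde) + eps)   # eps guards against a vanishing direct effect

effects = {                 # mediator -> (NIE, NDE), hypothetical numbers
    "either_tub_or_cancer": (0.30, 0.10),
    "bronchitis":           (0.05, 0.40),
}
ranked = sorted(effects, key=lambda m: mis(*effects[m]), reverse=True)
print(ranked)   # descending MIS order: strongest mediation effect first
```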
Graphs. We evaluate a variety of causal graphs spanning diverse domains. We use the semi-synthetic DAGs from the BNLearn repository - Cancer [51], Survey [52], Asia [53], Child [54], Insurance [55], and Alarm [56]. We also evaluate our approach on a realistic Alzheimer’s Disease graph [30], developed by five domain experts, and the Law graph [57]. See Appendix 10.1 for further details.
| Graph | \(\mathbf{\textrm{V}}\) | \(\mathbf{\textrm{E}}\) | Description |
|---|---|---|---|
| Cancer | \(5\) | \(4\) | Factors around lung cancer |
| Survey | \(6\) | \(6\) | Factors for choosing transportation |
| Asia | \(8\) | \(8\) | Factors affecting dyspnoea |
| Law | \(8\) | \(20\) | Factors around the legal system |
| Alzheimer | \(9\) | \(16\) | Factors around Alzheimer’s Disease |
| Child | \(20\) | \(25\) | Lung related illness for a child |
| Insurance | \(27\) | \(52\) | Factors affecting car accident insurance |
| Alarm | \(37\) | \(46\) | Patient monitoring system |
Models. We evaluate our setups across different open-source and closed models. The models we use are GPT-4o [58], GPT-4 [59], Llama3-chat-8b [60], Mistral-7B-Instruct-v0.2 [61], Mixtral-8x7B-Instruct-v0.1 [62], Zephyr-7b-Beta [63] and Neural-chat-7b-v3-1 [64].
Prompt. We used the textual prompting strategy from [32] after performing experiments on some of the proposed encoding methods (see Appendix 11.10). Implementation details are in Appendix 10 and prompts in Appendix 15. Our code will be available after the anonymity period.
Setup. The input to the LLM consists of a partial DAG \(\mathcal{G}^*\) and multiple choices, including the correct missing variable \(v_x\) and several out-of-context distractors. This task includes 120 queries. We use accuracy over the \(N\) queries to assess the LLM’s prediction of \(v_x\). \[\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1} \big(v_{x,i}^{*} = v_{x,i}\big)\]
Figure 4: Accuracy of LLMs in identifying the missing causal variable from multiple choices with out-of-context distractors (a), and from both out-of-context and in-context distractors (b). a — Task 1 Result, b — Task 2 Result
Results. In 4 (a), we report the accuracy of different LLMs in identifying the missing variable. GPT-4, followed closely by Mixtral and GPT-4o, consistently performs well, achieving perfect accuracy on most of the graphs. Other models, including Mistral, Llama, Neural, and Zephyr, have varying degrees of success. Insurance remains the most challenging graph, potentially due to the high number of edges in the DAG. All models significantly outperform the random baseline. However, we conjecture that the high performance could be partially attributed to the simplicity of the task: the models might be using the context of the graph domain to exclude unrelated distractors rather than engaging in deeper causal reasoning among multiple plausible choices. To investigate this, we introduce an in-domain choice among the multiple choices in the next experiment.
Setup. This is a more challenging task where the partial graph has two missing nodes. In addition to out-of-context distractors and the ground-truth variable \(v_{x_1}\), the multiple-choice set includes the second missing variable \(v_{x_2}\) as an in-context distractor. This setup tests the model’s ability to reason contextually over indirect causal relations to identify the correct variable. This task results in over 3800 queries. To evaluate performance, we use two metrics: Accuracy and False Node Accuracy (FNA). FNA captures how often the model incorrectly selects the in-context distractor instead: \[\text{FNA} \downarrow = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(v_{x_1,i}^{*} = v_{x_2,i}\big)\]
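For reference, a minimal sketch of the two metrics (our illustration, with made-up predictions) is:

```python
# Accuracy over the true missing node and False Node Accuracy (FNA) over the
# in-context distractor, computed from per-query predictions (illustrative data).
def accuracy(preds: list[str], truths: list[str]) -> float:
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def false_node_accuracy(preds: list[str], in_context: list[str]) -> float:
    return sum(p == d for p, d in zip(preds, in_context)) / len(preds)

preds      = ["Smoking", "Pollution", "Dyspnoea"]   # LLM selections
truths     = ["Smoking", "XRay",      "Dyspnoea"]   # true missing variables v_x1
in_context = ["Cancer",  "Pollution", "Cancer"]     # second missing variables v_x2
print(accuracy(preds, truths))                 # ~0.67
print(false_node_accuracy(preds, in_context))  # ~0.33
```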
| Model | Cancer Sim | Cancer LLM-J | Survey Sim | Survey LLM-J | Asia Sim | Asia LLM-J | Law Sim | Law LLM-J | Alzheimers Sim | Alzheimers LLM-J | Child Sim | Child LLM-J | Insurance Sim | Insurance LLM-J | Alarm Sim | Alarm LLM-J | Avg Sim | Avg LLM-J |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | \({0.36}\) | \({0.61}\) | \({0.34}\) | \({0.60}\) | \({0.45}\) | \({0.66}\) | \(0.41\) | \({0.70}\) | \({0.35}\) | \({0.75}\) | \({0.51}\) | \({0.70}\) | \({0.45}\) | \({0.44}\) | \({0.46}\) | \({0.69}\) | \({0.42}\) | \({0.63}\) |
| Mixtral | \(0.41\) | \(0.66\) | \(0.39\) | \(0.66\) | \(\mathbf{0.66}\) | \(0.75\) | \(0.38\) | \(0.69\) | \(0.31\) | \(0.77\) | \(\mathbf{0.53}\) | \(\mathbf{0.77}\) | \(0.46\) | \({0.56}\) | \(\mathbf{0.50}\) | \(0.72\) | \(0.46\) | \(0.70\) |
| Neural | \(0.38\) | \(0.77\) | \(0.43\) | \(0.55\) | \(0.53\) | \(0.55\) | \(0.47\) | \(0.72\) | \(0.44\) | \(0.71\) | \(0.48\) | \(0.70\) | \(0.47\) | \(0.43\) | \(0.47\) | \(0.67\) | \(0.45\) | \(0.63\) |
| Llama | \(0.40\) | \(0.48\) | \(0.40\) | \(0.54\) | \(0.53\) | \(0.58\) | \(\mathbf{0.67}\) | \(0.65\) | \(0.45\) | \(0.61\) | \(0.48\) | \(0.63\) | \(0.42\) | \(0.34\) | \(0.46\) | \(0.65\) | \(0.45\) | \(0.55\) |
| Mistral | \(0.33\) | \(0.67\) | \(0.44\) | \(0.65\) | \(0.60\) | \(0.73\) | \(0.49\) | \(0.67\) | \(0.34\) | \(0.76\) | \(0.48\) | \(0.68\) | \(0.46\) | \(0.47\) | \(0.47\) | \(0.71\) | \(0.44\) | \(0.67\) |
| GPT-4 | \(\mathbf{0.49}\) | \(\mathbf{0.90}\) | \(\mathbf{0.51}\) | \(0.67\) | \(\mathbf{0.66}\) | \({0.76}\) | \(0.55\) | \({0.78}\) | \(0.47\) | \(\mathbf{0.98}\) | \(0.36\) | \(0.53\) | \(0.52\) | \(0.56\) | \(0.49\) | \(0.75\) | \(0.50\) | \(0.73\) |
| GPT-4o | \(0.52\) | \(0.89\) | \(0.50\) | \(\mathbf{0.71}\) | \(\mathbf{0.66}\) | \(\mathbf{0.78}\) | \(0.58\) | \(\mathbf{0.80}\) | \(\mathbf{0.50}\) | \(0.91\) | \(0.40\) | \(0.60\) | \(\mathbf{0.54}\) | \(\mathbf{0.58}\) | \(0.44\) | \(\mathbf{0.76}\) | \(\mathbf{0.54}\) | \(\mathbf{0.76}\) |
Results. In 4 (b), we report Accuracy and False Node Accuracy (FNA) across graphs. Accuracy reflects how often the correct missing variable is chosen, while FNA measures how often the model incorrectly selects the in-context distractor—another missing variable included to test deeper causal reasoning. Since there are 5 options, random accuracy is \(0.2\), and FNA under random guessing would be around \(0.2\) as well. GPT-4 and GPT-4o achieve high accuracy and low FNA, showing that they reliably distinguish the true missing node from both distractors and the in-context variable. GPT-4o slightly outperforms GPT-4 on several graphs. Open models like Mistral, Zephyr, and Mixtral show more variability, performing well on simpler graphs like Cancer but struggling on complex ones like Alarm. While most models exceed random chance, higher FNA in some cases highlights a tendency to confuse plausible but incorrect variables, emphasizing the difficulty of reasoning over multiple missing nodes.
Setup. In real-world settings, partial causal graphs provided by domain experts often lack ground truth and multiple choices. Hypotheses may vary depending on context, data, or domain knowledge. To simulate this, we prompt the LLM to generate hypotheses without any candidate options: it produces \(k=5\) suggestions for the missing node \(v_{x}\). This task has 120 queries. We compare suggestions to the ground truth, recognizing that real-world cases often lack a single correct answer. Since traditional metrics may miss contextual nuances, we use two evaluations: semantic similarity and LLM-as-Judge (see Appendix 11.4).
Semantic Similarity. We compute the cosine similarity between the embeddings of the predictions, \(v_{x_{1:5}}^*\), and the ground truth \(v_{x}\), averaging the highest similarity scores across all nodes \(v_x \in \boldsymbol{V}\) (see Appendix 10.5 for details).
LLM-Judge. Inspired by [65], this two-step metric assesses contextual semantic similarity beyond exact matches. First, the LLM ranks the suggestions \(v_{x_{1:5}}^*\) based on how well they fit the partial graph. Second, it rates the best match on a 1–10 scale. Scores are averaged across nodes for an overall measure (see Appendix 10.6).
Results. We report models’ performance using both semantic similarity and LLM-Judge metrics in 2. For brevity, we provide the variances in Appendix 11.1. We provide a detailed analysis of each metric across different types of node variables (defined in Section 3). We evaluate sources, sinks, colliders, and mediators for each of the partial causal graphs. The results, fine-grained by node type, are given in Figure 5, which shows each model’s average performance across graphs, with detailed per-graph performance in 11. GPT-4, GPT-4o and Mixtral generally achieve higher semantic similarity and LLM-as-Judge scores across most graphs (11). We observe that semantic similarity is a stricter metric than LLM-as-Judge since it cannot encode contextual information about the causal DAG (see example in 10). Despite different scales, the two metrics are fairly correlated. Figure 5 shows that models display stronger performance for colliders and mediators on average. This suggests that these models are better at reasoning about common causes and indirect causal relationships. Sinks are typically the nodes that represent the outcomes or effects of interventions (treatments) applied to other nodes, while source nodes represent the causes in a causal graph. Lower performance on these nodes indicates that reasoning about the potential causes and outcomes of the causal graphs is more difficult.
In 10 (a), model performance improves with more suggestions (\(k\)). 10 (b) shows that accuracy also correlates with node degree (\(d_{in} + d_{out}\)), indicating that more context aids prediction. Overall, LLMs perform well on many nodes, especially mediators and colliders, making them promising tools for real-world causal discovery where treatments and outcomes are known.
Figure 5: Task 3 Results. Visualizing each model’s performance, averaged across the different graphs, for Sink, Source, Mediator, and Collider nodes. a — Semantic similarity, b — LLM-as-Judge
Backdoor paths are alternative causal pathways that confound the estimation of causal effects and introduce bias if not accounted for. Hence, hypothesizing and controlling for confounders is an important task in causal inference [66]. We extract confounder subgraphs from the Sachs [67], Alarm, and Insurance graphs.
| Model | Sachs | Alarm | Insurance |
|---|---|---|---|
| Zephyr | \(0.10\) | \(0.45\) | \(0.53\) |
| Mixtral | \(\mathbf{0.95}\) | \(\mathbf{0.85}\) | \(0.63\) |
| Neural | \(0.30\) | \(0.45\) | \(0.61\) |
| Llama | \(0.20\) | \(0.47\) | \(0.63\) |
| Mistral | \(0.20\) | \(\mathbf{0.85}\) | \(0.61\) |
| GPT-4 | \(\mathbf{0.95}\) | \(0.73\) | \(\mathbf{0.78}\) |
| GPT-4o | \(\mathbf{0.95}\) | \(0.70\) | \(0.73\) |
From 3 and Appendix 13, we find that while LLMs accurately hypothesize some confounders, models struggle with domain-specific graphs like Sachs. Larger models like GPT-4o do not always perform best, underscoring the need for diverse benchmarks.
Setup. We adopt an iterative approach for hypothesizing mediators, allowing the model to refine predictions step-by-step, unlike global prediction, which yields lower performance (Appendix 11.6). This aligns with Chain-of-Thought [68] reasoning and improves accuracy. There are more than 140 queries for this task, with 1 to 10 missing mediators per query. For unordered evaluation, mediators are given in random order and scored via average semantic similarity. For ordered evaluation, we rank mediators using the Mediation Influence Score (MIS) and compare model performance when prompted in ascending vs. descending MIS order. We define a metric, \(\Delta\), to capture this difference.
| Model | Asia Sim | Asia \(\Delta\) | Child Sim | Child \(\Delta\) | Insurance Sim | Insurance \(\Delta\) | Alarm Sim | Alarm \(\Delta\) |
|---|---|---|---|---|---|---|---|---|
| Zephyr | \(0.61\) | \(-0.02\) | \(\mathbf{0.54}\) | \(\;\;\;\;0.17\) | \(0.47\) | \(\;\;\;\;0.19\) | \(0.51\) | \(\;\;\;\;0.20\) |
| Mixtral | \(\mathbf{0.87}\) | \(\;\;\;\;0.01\) | \(0.50\) | \(\;\;\;\;0.18\) | \(0.48\) | \(\;\;\;\; 0.15\) | \(0.52\) | \(\;\;\;\;0.13\) |
| Neural | \(0.65\) | \(\;\;\;\;0.04\) | \(0.48\) | \(\;\;\;\;0.21\) | \(0.42\) | \(\;\;\;\;0.16\) | \(0.46\) | \(\;\;\;\;0.12\) |
| Llama | \(0.80\) | \(\;\;\;\;0.07\) | \(0.49\) | \(-0.05\) | \(0.44\) | \(\;\;\;\;0.21\) | \(0.51\) | \(\;\;\;\;0.07\) |
| Mistral | \(0.33\) | \(\;\;\;\;0.02\) | \(0.50\) | \(\;\;\;\;0.12\) | \(0.48\) | \(\;\;\;\;0.13\) | \(0.47\) | \(\;\;\;\;0.11\) |
| GPT-4 | \(0.49\) | \(\;\;\;\;0.04\) | \(0.39\) | \(\;\;\;\;0.16\) | \(\mathbf{0.52}\) | \(\;\;\;\;0.14\) | \({0.60}\) | \(-0.07\) |
| GPT-4o | \(0.55\) | \(\;\;\;{0.00}\) | \(0.48\) | \(\;\;\;{0.10}\) | \(0.51\) | \(\;\;\;{0.08}\) | \(\mathbf{0.62}\) | \(\;\;\;{0.01}\) |
Results. The results of this experiment are in 4. Results with variances are provided in Appendix 11.1. In this highly complex environment with more than one node missing and with open-world search space, LLMs can still maintain their performance. Unlike the overall consistent performance of GPT-4 across all graphs, other models showed superior performance in Insurance and Alarm graphs only. As the complexity of the graph increases, we observe larger differences in hypothesizing the mediators according to the MIS order. Positive \(\Delta\) values suggest that prompting the LLM based on the MIS metric leads to higher semantic similarity between the mediator hypotheses and the ground truth variables. In summary, we observe that LLMs can be effective in iteratively hypothesizing multiple mediators in a DAG, and if present, some domain knowledge about the significance of the mediator can boost the performance.
A concern in evaluating pretrained LLMs on knowledge-intensive tasks is contamination, i.e., memorization of evaluation data seen during training. This is especially relevant for public datasets like those in the BNLearn repository, which may have appeared in training corpora.
To assess this, we tested whether models could recall the number and names of variables from each of the eight datasets in our benchmark. This included well-known BNLearn graphs (e.g., Asia, Child, Insurance, Alarm) and less common ones (e.g., Law, Alzheimer’s). We prompted each model to report node counts and variable names, including explicit references to BNLearn for relevant datasets, to detect signs of memorization.
| Model | Cancer | Survey | Asia | Law | Alz | Child | Insurance | Alarm |
|---|---|---|---|---|---|---|---|---|
| Zephyr | ||||||||
| Mixtral | 0.71 | 0.13 | ||||||
| Neural | ||||||||
| Llama | ✔ ||||||||
| Mistral | ✔ | |||||||
| GPT-4 | ✔ | ✔ | ✔ | 0.55 | ✔ | ✔ | ✔ | |
| GPT-4o | ✔ | ✔ | ✔ | 0.45 | ✔ | ✔ | ✔ |
In 5, we observe that full reconstruction of the graphs’ details was rare; only the GPT-family models exhibited partial recall for some widely known BNLearn datasets. This recall was consistently absent for lesser-known datasets such as Law and Alzheimer’s, which are less likely to have appeared during pretraining. While these findings cannot eliminate memorization with certainty, they suggest that it is not predominant for most models.
To further test GPT-4, we explicitly mentioned the graph provenance (e.g., “This graph is from BNLearn”) during “Task 3”, shown in 18. GPT-4’s performance improved across most graphs. This suggests that its initial responses were not purely reciting these graphs but potentially based on broader parametric knowledge.
The results show that LLMs effectively hypothesize missing variables, especially mediators, though performance varies with task complexity. Simple tasks, like identifying missing variables from controlled options, had high success rates. Performance differences across domains may stem from biases in LLM training data, affecting parametric memory. For instance, confounder hypothesis quality varied across graphs, with domain-specific gaps lowering accuracy, like in the Sachs graph (Appendix 13).
We explored fine-tuning and few-shot prompting to enhance performance, but the small sizes of the available DAGs limited the training data, yielding mixed results (Appendix 12.1). While fine-tuning may help specialization, it can also reduce reliance on general parametric knowledge [69]. Future work could explore domain-specific fine-tuning.
Though model training data is undisclosed, we used a recently released graph [30] that postdates the models’ training cut-off dates (at the time of performing experiments). Our novel task and verbalization approach further reduce the risks of memorization. Table 2 confirms LLMs generate novel hypotheses rather than retrieving memorized patterns, with no evidence of direct graph reconstruction. Our work relies on reasoning via parametric knowledge rather than explicit memorization.
Our setup assumes known edges among missing variables for controlled evaluation, which future work can extend. We envision this as a human-LLM collaboration under expert supervision, as LLMs cannot self-assess plausibility or confidence [70]. Future work could also refine filtering mechanisms and improve performance on source and sink nodes.
To complement our automatic evaluation metrics, we conducted a small-scale human evaluation on three representative graphs (Cancer, Survey, and Asia). Two independent annotators (a CS PhD student and a CS PhD graduate) rated the quality of LLM-suggested variables. We then measured the agreement between human judgments, semantic similarity, and our LLM-judge using Spearman correlation.
| | Correlation | p-value |
|---|---|---|
| Sim – LLM-judge | 0.430 | 0.2475 |
| Sim – R1 | 0.781 | 0.0130 |
| Sim – R2 | 0.623 | 0.0732 |
| LLM-judge – R1 | 0.622 | 0.0738 |
| LLM-judge – R2 | 0.831 | 0.0055 |
| R1 – R2 | 0.823 | 0.0065 |
6 indicates strong correlations among human annotators and between human judgments and the automatic metrics. In particular, the LLM-judge shows high alignment with both annotators, suggesting that it serves as a reliable proxy for human evaluation. This supports the use of our automatic evaluation framework as a scalable approach for benchmarking causal reasoning tasks.
Most causality research focuses on identifying relationships from observed data, while hypothesizing which variables to observe remains largely reliant on expert knowledge. We propose using LLMs as proxies for this step and introduce a novel task: hypothesizing missing variables in causal graphs. We formalize this with a benchmark that spans varying levels of difficulty and ground-truth knowledge. Our results highlight LLMs’ strengths in inferring variables on backdoor paths, including colliders, confounders, and mediators, which often lead to biased causal inference when unaccounted for. Our work shows that LLMs can serve as useful tools for early-stage hypothesis generation, supporting scientists in formulating plausible causal variables before data collection. By evaluating models across different levels of graph completeness and open- and closed-world settings, we highlight their potential and limitations.
While this work presents promising advancements in leveraging LLMs for hypothesizing missing variables in causal graphs, there are some limitations to consider. Our evaluation relies on established DAGs and comparisons with known ground truth, limiting assessment in scenarios without a defined baseline. Future work can include validation using human-in-the-loop evaluation and can integrate our approach into the full causal discovery pipeline together with statistical data.
Our work leverages LLMs for hypothesis generation in causal discovery but comes with ethical risks. Biases from training data may lead to skewed hypotheses, and over-reliance on AI without expert validation could result in misleading conclusions. While we design our task to minimize memorization, risks of data leakage remain. Additionally, LLM performance varies across domains, making errors in high-stakes fields like healthcare particularly concerning. To mitigate these risks, we emphasize human-AI collaboration, transparency in model limitations, and improved evaluation frameworks for reliability.
This work was partially funded by ELSA – European Lighthouse on Secure and Safe AI funded by the European Union under grant agreement No. 101070617 and the German Federal Ministry of Education and Research (BMBF) under the grant AIgenCY (16KIS2012). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.
We use 8 real-world-based graphs spanning different domains. Each comes with a ground-truth graph along with observational data. The simplest graph used is the Cancer graph with 4 edges and 5 node variables. In addition to the semi-synthetic graphs from the BNLearn library, we also evaluate our approach on a realistic Alzheimer’s Disease graph [30], which was developed by five domain experts. Given that each expert created a different causal graph, the final causal DAG comprises only those edges that were agreed upon by consensus.
| Graph | \(\mathbf{\textrm{V}}\) | \(\mathbf{\textrm{E}}\) | Description |
|---|---|---|---|
| Cancer | \(5\) | \(4\) | Factors around lung cancer |
| Survey | \(6\) | \(6\) | Factors for choosing transportation |
| Asia | \(8\) | \(8\) | Factors affecting dyspnoea |
| Law | \(8\) | \(20\) | Factors around the legal system |
| Alzheimer | \(9\) | \(16\) | Factors around Alzheimer’s Disease |
| Child | \(20\) | \(25\) | Lung related illness for a child |
| Insurance | \(27\) | \(52\) | Factors affecting car accident insurance |
| Alarm | \(37\) | \(46\) | Patient monitoring system |
For reproducibility, we used a temperature of \(0\) and a top-p value of \(1\) across all of the models. We also report the snapshot of each model used. We have also included the prompts and examples below. Our code will be released upon acceptance. The graphs are under CC BY-SA 3.0, which allows us to freely modify the graphs for benchmarking. Our benchmark will be released under the CC BY-SA License.
GPT-4o and GPT-4 were accessed via API. The rest of the models were run on 1 A100 GPU. Since we used off-the-shelf LLMs, no training was performed. Because many of the models were accessed via API, it is difficult to calculate the total computation; however, all of the experiments for each model took \(\approx 6\) hours.
For variable identification, we generate multiple choices that remain consistent across all missing nodes and all of the graphs. The distractor words were chosen to be semantically distant from the graph nodes. The options chosen were weather, book sales, and movie ratings. We also made sure that the options were not from one specific domain, so that the LLM could perform a process of elimination based on the graph’s context.
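A small sketch of how such a multiple-choice query can be assembled (our illustration; only the three distractor strings are taken from the text) is:

```python
# Assemble Task 1 / Task 2 multiple-choice options around the true missing variable.
import random
from typing import Optional

OUT_OF_CONTEXT = ["weather", "book sales", "movie ratings"]  # fixed distractors from the text

def build_mcq(missing_variable: str, in_context: Optional[str] = None, seed: int = 0) -> list[str]:
    options = [missing_variable] + OUT_OF_CONTEXT
    if in_context is not None:       # Task 2 adds the second missing node as a distractor
        options.append(in_context)
    random.Random(seed).shuffle(options)
    return options

print(build_mcq("Smoking"))                       # Task 1 style (4 options)
print(build_mcq("Smoking", in_context="XRay"))    # Task 2 style (5 options)
```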
Average Treatment Effect (ATE) quantifies the expected change in the outcome \(v_y\) caused by a unit change of the treatment \(v_t\). ATE is part of the causal do-calculus introduced by [21]. We consider binary causal DAGs, i.e., each variable can take either \(0\) or \(1\) as its value. \[\text{ATE} = \mathbb{E}[v_y|\text{do}(v_t=1)] - \mathbb{E}[v_y|\text{do}(v_t=0)]\] where the \(\text{do}(\cdot)\) operator represents an intervention. \(\mathbb{E}[v_y|\text{do}(v_t=1)]\) is the expected value of the outcome variable \(v_y\) when we intervene to set the treatment variable \(v_t\) to \(1\) (i.e., apply the treatment), and \(\mathbb{E}[v_y|\text{do}(v_t=0)]\) is the expected value of \(v_y\) when we set \(v_t\) to \(0\) (i.e., do not apply the treatment).
Mediation analysis is used to quantify the effect of a treatment on the outcome via a third variable, the mediator. The total effect can be decomposed into the Natural Direct Effect (NDE) and the Natural Indirect Effect (NIE). The NDE is the effect of the treatment on the outcome that is not mediated by the mediator variable, while the NIE is the portion of the effect that is mediated by it. \[\begin{align} \text{NDE} &= \mathbb{E}\big[v_y(v_t{=}1, v_m(v_t{=}0)) - v_y(v_t{=}0, v_m(v_t{=}0))\big] \end{align}\] NDE compares the expected outcome when the treatment is set to \(1\) while the mediator is fixed at the value it would take under the control treatment \(v_t=0\), against the expected outcome when both the treatment and the mediator are at their control levels. \[\begin{align} \text{NIE} &= \mathbb{E}\big[v_y(v_t{=}0, v_m(v_t{=}1)) - v_y(v_t{=}0, v_m(v_t{=}0))\big] \end{align}\] NIE holds the treatment at its control level \(v_t=0\) and compares the expected outcome when the mediator takes the value it would under treatment (\(v_t=1\)) against the expected outcome when the mediator takes its control value.
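The following toy sketch (ours; the paper works with binary mechanisms, while we use a linear SCM with made-up coefficients for brevity) shows how NDE, NIE, and the resulting MIS can be simulated:

```python
# Monte Carlo estimates of NDE, NIE, and MIS on a toy linear SCM T -> M -> Y with a
# direct edge T -> Y. Shared exogenous noise keeps the counterfactuals paired.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u_m, u_y = rng.normal(0, 1, n), rng.normal(0, 1, n)

mediator = lambda t: 0.8 * t + u_m               # M := 0.8*T + U_M
outcome  = lambda t, m: 0.3 * t + 0.5 * m + u_y  # Y := 0.3*T + 0.5*M + U_Y

m0, m1 = mediator(0), mediator(1)                # mediator under control / treatment
nde = np.mean(outcome(1, m0) - outcome(0, m0))   # ~0.3, effect not through M
nie = np.mean(outcome(0, m1) - outcome(0, m0))   # ~0.4, effect through M
print(f"NDE={nde:.2f}  NIE={nie:.2f}  MIS={abs(nie / nde):.2f}")
```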
Given the task of hypothesizing missing nodes in a partial graph \(\mathcal{G}^{*}\) in the absence of multiple choices, we evaluate the semantic similarity between the model’s predictions and the ground-truth node variable. We leverage an open model, namely ‘all-mpnet-base-v2’, to transform the textual representations of the model’s predictions and the ground truth into high-dimensional vector-space embeddings. After transforming the textual representations into embeddings and normalizing them, we calculate the cosine similarity. Scores closer to 1 indicate high semantic similarity, suggesting the model’s predictions align well with the ground truth. This metric gives a similarity score without contextual knowledge of the causal graph. We iterate over every node of the ground-truth graph, treating each in turn as the missing node. For each node, we take the highest semantic similarity among its suggestions, and we report the average of these scores.
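A minimal sketch of this computation with sentence-transformers (our illustration; the example strings are taken from the tables below) is:

```python
# Best-match cosine similarity between a ground-truth node and the LLM suggestions,
# using the 'all-mpnet-base-v2' encoder named above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def best_similarity(ground_truth: str, suggestions: list[str]) -> float:
    gt = model.encode(ground_truth, convert_to_tensor=True, normalize_embeddings=True)
    sg = model.encode(suggestions, convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(gt, sg).max().item()

print(best_similarity("Smoking status",
                      ["Smoking", "Alcohol Consumption", "Poor Diet"]))
```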
| Ground Truth: | Smoking status | ||||
| LLM Suggestions: | Smoking | Alcohol Consumption | Exposure to Radiation | Poor Diet | Genetic Predisposition |
| Semantic similarity : | \(0.72\) | \(0.38\) | \(0.22\) | \(0.22\) | \(0.17\) |
| Ground Truth: | Employee or self-employed | ||||
| LLM Suggestions: | Income Level | Job Location | Environmental Awareness | Lifestyle Preferences | Health Consciousness |
| Semantic similarity : | \(0.30\) | \(0.25\) | \(0.17\) | \(0.15\) | \(0.10\) |
| Ground Truth: | Dyspnea laboured breathing | ||||
| LLM Suggestions: | Shortness of breath | Chest Pain | Coughing | Fatigue | Weight Loss |
| Semantic similarity : | \(0.57\) | \(0.41\) | \(0.36\) | \(0.29\) | \(0.11\) |
| Ground Truth: | Montreal Cognitive Assessment score | ||||
| LLM Suggestions: | Cognitive Function | Neurological Function | Mental Health Status | Risk of Alzheimer’s Disease | Memory Performance |
| Semantic similarity : | \(0.60\) | \(0.47\) | \(0.38\) | \(0.36\) | \(0.16\) |
| Ground Truth: | Grunting in infants | ||||
| LLM Suggestions: | Respiratory distress | Asthma | Pneumonia | Pulmonary infection | Bronchopulmonary dysplasia (BPD) |
| Semantic similarity : | \(0.22\) | \(0.18\) | \(0.17\) | \(0.11\) | \(0.01\) |
| Ground Truth: | Driving history | ||||
| LLM Suggestions: | Previous accidents | Distance driven daily | Type of car insurance | Frequency of car maintenance | Location of parking |
| Semantic similarity : | \(0.55\) | \(0.42\) | \(0.27\) | \(0.26\) | \(0.18\) |
| Ground Truth: | Heart rate blood pressure | ||||
| LLM Suggestions: | Pulse Rate | Blood Pressure | Respiratory Rate | EKG Reading | Blood Oxygen Level |
| Semantic similarity : | \(0.78\) | \(0.78\) | \(0.57\) | \(0.49\) | \(0.42\) |
To capture the domain knowledge of an expert who selects the most relevant causal variable, we use an LLM-as-Judge as a proxy expert. This also allows evaluation based on contextual DAG knowledge. Given the impressive results of GPT-4 in [65], we use GPT-4 as the judge for all of the experiments.
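A hedged sketch of the two-step protocol is given below; the prompt wording and the `judge` callable are our illustration (the paper’s exact prompts are in Appendix 15), and a dummy judge is used so the example runs end to end.

```python
# Two-step LLM-as-Judge: (1) rank the suggestions against the partial graph,
# (2) rate the top-ranked suggestion on a 1-10 scale (illustrative prompts).
def judge_score(graph_text: str, ground_truth: str, suggestions: list[str], judge) -> float:
    ranking_prompt = (f"Partial causal graph:\n{graph_text}\n"
                      f"Rank these candidates by how well they complete the graph: {suggestions}\n"
                      "Return only the best candidate.")
    best = judge(ranking_prompt).strip()
    rating_prompt = (f"Ground-truth variable: {ground_truth}\nBest candidate: {best}\n"
                     "Rate their contextual match from 1 to 10. Answer with a number only.")
    return float(judge(rating_prompt))

canned = iter(["Shortness of breath", "9.5"])   # dummy judge responses standing in for GPT-4
print(judge_score("Smoking causes Cancer.", "Dyspnoea laboured breathing",
                  ["Shortness of breath", "Chest pain"], lambda _: next(canned)))
```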
| Ground Truth: | Education up to high school or university degree |
| Top ranked suggestion: | Education level |
| Rating : | \(9.5\) |
| Ground Truth: | Pollution |
| Top ranked suggestion: | Smoking history |
| Rating : | \(2.0\) |
| Ground Truth: | Bronchitis |
| Top ranked suggestion: | smoking behavior |
| Rating : | \(2.0\) |
| Ground Truth: | Lung XRay report |
| Top ranked suggestion: | Lung Damage |
| Rating : | \(8.0\) |
| Ground Truth: | Socioeconomic status |
| Top ranked suggestion: | Driver’s lifestyle |
| Rating : | \(7.0\) |
LLM-as-Judge uses GPT-4 as the judge model, which could be biased towards some data. Since the training data of this model are not public, it is hard to judge how such biases might affect the final score. Hence, for robust evaluation, we also evaluate using semantic similarity.
| Ground Truth: Dyspnea laboured breathing | |
| LLM Suggestion: Shortness of breath | |
| Semantic similarity to GT: \(0.57\) | |
| LLM-as-Judge score: \(9.5\) |
For each order, the algorithm prompts the LLM to generate mediator suggestions, selects the suggestion with the highest semantic similarity to the context, and iteratively updates the partial graph with these mediators. \(\Delta\) quantifies the impact of mediator ordering by comparing the average highest semantic similarity scores obtained from the descending and ascending orders. This evaluation sheds light on how the sequence in which mediators are considered might affect the LLM’s ability to generate contextually relevant and accurate predictions.
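Under the assumption, consistent with the text, that \(\Delta\) is the descending-order average minus the ascending-order average, the computation is simply:

```python
# Delta: difference between the average best-match similarity obtained when mediators
# are prompted in descending vs. ascending MIS order (illustrative scores).
def delta(desc_scores: list[float], asc_scores: list[float]) -> float:
    avg = lambda xs: sum(xs) / len(xs)
    return avg(desc_scores) - avg(asc_scores)

print(round(delta([0.62, 0.55, 0.58], [0.48, 0.50, 0.46]), 2))  # positive => MIS ordering helps
```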
For brevity, we did not include variances in the main text; the following tables report the results with variances:
| Model | Cancer Sim | Cancer LLM-J | Survey Sim | Survey LLM-J | Asia Sim | Asia LLM-J | Alzheimers Sim | Alzheimers LLM-J | Child Sim | Child LLM-J | Insurance Sim | Insurance LLM-J | Alarm Sim | Alarm LLM-J | Avg Sim | Avg LLM-J |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | \(\underset{\pm{0.04}}{0.36}\) | \(\underset{\pm{0.06}}{0.61}\) | \(\underset{\pm{0.07}}{0.34}\) | \(\underset{\pm{0.05}}{0.60}\) | \(\underset{\pm{0.05}}{0.45}\) | \(\underset{\pm{0.04}}{0.66}\) | \(\underset{\pm{0.03}}{0.35}\) | \(\underset{\pm{0.03}}{0.75}\) | \(\underset{\pm{0.02}}{0.51}\) | \(\underset{\pm{0.04}}{0.70}\) | \(\underset{\pm{0.04}}{0.45}\) | \(\underset{\pm{0.05}}{0.44}\) | \(\underset{\pm{0.03}}{0.46}\) | \(\underset{\pm{0.02}}{0.69}\) | \(\underset{\pm{0.04}}{0.42}\) | \(\underset{\pm{0.04}}{0.63}\) |
| Mixtral | \(\underset{\pm{0.03}}{0.41}\) | \(\underset{\pm{0.04}}{0.66}\) | \(\underset{\pm{0.05}}{0.39}\) | \(\underset{\pm{0.06}}{0.66}\) | \(\underset{\pm{0.02}}{\mathbf{0.66}}\) | \(\underset{\pm{0.03}}{0.75}\) | \(\underset{\pm{0.04}}{0.31}\) | \(\underset{\pm{0.02}}{0.77}\) | \(\underset{\pm{0.03}}{\mathbf{0.53}}\) | \(\underset{\pm{0.02}}{\mathbf{0.77}}\) | \(\underset{\pm{0.03}}{0.46}\) | \(\underset{\pm{0.04}}{\mathbf{0.56}}\) | \(\underset{\pm{0.03}}{\mathbf{0.50}}\) | \(\underset{\pm{0.06}}{0.72}\) | \(\underset{\pm{0.03}}{0.46}\) | \(\underset{\pm{0.05}}{0.70}\) |
| Neural | \(\underset{\pm{0.02}}{0.38}\) | \(\underset{\pm{0.05}}{0.77}\) | \(\underset{\pm{0.02}}{0.43}\) | \(\underset{\pm{0.03}}{0.55}\) | \(\underset{\pm{0.03}}{0.53}\) | \(\underset{\pm{0.04}}{0.55}\) | \(\underset{\pm{0.05}}{0.44}\) | \(\underset{\pm{0.03}}{0.71}\) | \(\underset{\pm{0.04}}{0.48}\) | \(\underset{\pm{0.03}}{0.70}\) | \(\underset{\pm{0.04}}{0.47}\) | \(\underset{\pm{0.05}}{0.43}\) | \(\underset{\pm{0.02}}{0.47}\) | \(\underset{\pm{0.03}}{0.67}\) | \(\underset{\pm{0.03}}{0.45}\) | \(\underset{\pm{0.04}}{0.63}\) |
| Llama | \(\underset{\pm{0.03}}{0.40}\) | \(\underset{\pm{0.05}}{0.48}\) | \(\underset{\pm{0.04}}{0.40}\) | \(\underset{\pm{0.05}}{0.54}\) | \(\underset{\pm{0.03}}{0.53}\) | \(\underset{\pm{0.06}}{0.58}\) | \(\underset{\pm{0.05}}{0.45}\) | \(\underset{\pm{0.03}}{0.61}\) | \(\underset{\pm{0.04}}{0.48}\) | \(\underset{\pm{0.03}}{0.63}\) | \(\underset{\pm{0.01}}{0.42}\) | \(\underset{\pm{0.05}}{0.34}\) | \(\underset{\pm{0.02}}{0.46}\) | \(\underset{\pm{0.03}}{0.65}\) | \(\underset{\pm{0.03}}{0.45}\) | \(\underset{\pm{0.04}}{0.55}\) |
| Mistral | \(\underset{\pm{0.01}}{0.33}\) | \(\underset{\pm{0.05}}{0.67}\) | \(\underset{\tiny{\pm{0.05}}}{0.44}\) | \(\underset{\pm{0.04}}{0.65}\) | \(\underset{\pm{0.03}}{0.60}\) | \(\underset{\pm{0.04}}{0.73}\) | \(\underset{\pm{0.04}}{0.34}\) | \(\underset{\pm{0.02}}{0.76}\) | \(\underset{\pm{0.04}}{0.48}\) | \(\underset{\pm{0.03}}{0.68}\) | \(\underset{\pm{0.03}}{0.46}\) | \(\underset{\pm{0.01}}{0.47}\) | \(\underset{\pm{0.03}}{0.47}\) | \(\underset{\pm{0.03}}{0.71}\) | \(\underset{\pm{0.03}}{0.44}\) | \(\underset{\pm{0.03}}{0.67}\) |
| GPT-4 | \(\underset{\pm{0.02}}{\mathbf{0.49}}\) | \(\underset{\pm{0.03}}{\mathbf{0.90}}\) | \(\underset{\pm{0.06}}{\mathbf{0.51}}\) | \(\underset{\pm{0.04}}{0.67}\) | \(\underset{\pm{0.02}}{\mathbf{0.66}}\) | \(\underset{\pm{0.03}}{\mathbf{0.76}}\) | \(\underset{\pm{0.02}}{\mathbf{0.47}}\) | \(\underset{\pm{0.02}}{0.98}\) | \(\underset{\pm{0.05}}{0.36}\) | \(\underset{\pm{0.04}}{0.53}\) | \(\underset{\pm{0.03}}{\mathbf{0.52}}\) | \(\underset{\pm{0.03}}{\mathbf{0.56}}\) | \(\underset{\pm{0.06}}{0.49}\) | \(\underset{\pm{0.02}}{\mathbf{0.75}}\) | \(\underset{\pm{0.04}}{\mathbf{0.50}}\) | \(\underset{\pm{0.03}}{\mathbf{0.73}}\) |
| Model | Asia Sim | Asia \(\Delta\) | Child Sim | Child \(\Delta\) | Insurance Sim | Insurance \(\Delta\) | Alarm Sim | Alarm \(\Delta\) |
|---|---|---|---|---|---|---|---|---|
| Zephyr | \(\underset{\pm0.03}{0.61}\) | \(\underset{\pm0.01}{-0.02}\) | \(\underset{\pm0.04}{\mathbf{0.54}}\) | \(\underset{\pm0.02}{0.17}\) | \(\underset{\pm0.05}{0.47}\) | \(\underset{\pm0.02}{0.19}\) | \(\underset{\pm0.05}{0.51}\) | \(\underset{\pm0.02}{0.20}\) |
| Mixtral | \(\underset{\pm0.02}{\mathbf{0.87}}\) | \(\underset{\pm0.01}{0.01}\) | \(\underset{\pm0.05}{0.50}\) | \(\underset{\pm0.02}{0.18}\) | \(\underset{\pm0.05}{0.48}\) | \(\underset{\pm0.02}{0.15}\) | \(\underset{\pm0.05}{0.52}\) | \(\underset{\pm0.01}{0.13}\) |
| Neural | \(\underset{\pm0.06}{0.65}\) | \(\underset{\pm0.02}{0.04}\) | \(\underset{\pm0.05}{0.48}\) | \(\underset{\pm0.02}{0.21}\) | \(\underset{\pm0.04}{0.42}\) | \(\underset{\pm0.02}{0.16}\) | \(\underset{\pm0.04}{0.46}\) | \(\underset{\pm0.01}{0.12}\) |
| Llama | \(\underset{\pm0.08}{0.80}\) | \(\underset{\pm0.02}{0.07}\) | \(\underset{\pm0.05}{0.49}\) | \(\underset{\pm0.01}{-0.05}\) | \(\underset{\pm0.06}{0.44}\) | \(\underset{\pm0.02}{0.21}\) | \(\underset{\pm0.05}{0.51}\) | \(\underset{\pm0.01}{0.07}\) |
| Mistral | \(\underset{\pm0.03}{0.33}\) | \(\underset{\pm0.01}{0.02}\) | \(\underset{\pm0.05}{0.50}\) | \(\underset{\pm0.01}{0.12}\) | \(\underset{\pm0.05}{0.48}\) | \(\underset{\pm0.02}{0.13}\) | \(\underset{\pm0.04}{0.47}\) | \(\underset{\pm0.01}{0.11}\) |
| GPT-4 | \(\underset{\pm0.07}{0.49}\) | \(\underset{\pm0.01}{0.04}\) | \(\underset{\pm0.05}{0.39}\) | \(\underset{\pm0.02}{0.16}\) | \(\underset{\pm0.05}{\mathbf{0.52}}\) | \(\underset{\pm0.02}{0.14}\) | \(\underset{\pm0.06}{\mathbf{0.60}}\) | \(\underset{\pm0.01}{-0.07}\) |
Figure 10: L: Plot of semantic similarity with an increasing number of suggestions for GPT-4 on the Alarm graph. R: Plot of semantic similarity against the total number of incoming and outgoing edges for GPT-4 on the Alarm graph.
We observed notable differences in the accuracy of LLM predictions for missing nodes within causal graphs when context was provided versus when it was absent. Specifically, the inclusion of contextual information about the causal graph significantly enhanced the LLMs’ ability to generate accurate and relevant predictions. In realistic settings, a scientist using this setup would provide the context of the task along with the partial graph. When context was not provided, the models often struggled to identify the most appropriate variables, leading to a decrease in prediction accuracy, especially for smaller models. Unsurprisingly, providing context was more important for smaller graphs than for larger graphs: for larger graphs, LLMs could infer the context from the many other nodes present in the graph.
| | Cancer \(X\) | Cancer \(✔\) | Survey \(X\) | Survey \(✔\) | Asia \(X\) | Asia \(✔\) | Insurance \(X\) | Insurance \(✔\) | Alarm \(X\) | Alarm \(✔\) |
|---|---|---|---|---|---|---|---|---|---|---|
| In-Context | \(0.75\) | \(1.00\) | \(0.67\) | \(1.00\) | \(0.68\) | \(0.88\) | \(0.85\) | \(0.90\) | \(0.96\) | \(0.96\) |
| Out-of-Context | \(0.00\) | \(0.25\) | \(0.33\) | \(0.33\) | \(0.53\) | \(0.61\) | \(0.58\) | \(0.58\) | \(0.60\) | \(0.57\) |
| Open world Hypothesis | \(0.39\) | \(0.41\) | \(0.40\) | \(0.39\) | \({0.63}\) | \(0.66\) | \(0.49\) | \(0.50\) | \(0.44\) | \(0.46\) |
When using LLMs to hypothesize the missing nodes within the causal graph in the open-world setting, an additional variation asks the model to provide an explanation for each of its predictions. This was motivated by the idea that incorporating a rationale behind each prediction might enhance the semantic similarity of the model’s outputs. We present the results in the table below. We observe that evaluating semantic similarity with explanations leads to a decrease in performance compared to the earlier setting where the language model returned only short phrases. This is because semantic similarity, as a metric, evaluates the closeness of the model’s predictions to the ground truth in a high-dimensional vector space, focusing on the semantic content encapsulated within the embeddings. It is a metric that leaves little room for interpretative flexibility, focusing strictly on the degree of semantic congruence between the predicted and actual variables. The introduction of explanations, while enriching the model’s outputs with contextual insights, did not translate into improved semantic alignment with the ground truth.
| | Cancer \(X\) | Cancer \(✔\) | Survey \(X\) | Survey \(✔\) | Asia \(X\) | Asia \(✔\) | Insurance \(X\) | Insurance \(✔\) | Alarm \(X\) | Alarm \(✔\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Sim | \(\underset{\pm{0.02}}{0.49}\) | \(\underset{\pm{0.07}}{0.38}\) | \(\underset{\pm{0.06}}{0.51}\) | \(\underset{\pm{0.10}}{0.44}\) | \(\underset{\pm{0.02}}{0.66}\) | \(\underset{\pm{0.09}}{0.57}\) | \(\underset{\pm{0.03}}{0.52}\) | \(\underset{\pm{0.07}}{0.40}\) | \(\underset{\pm{0.06}}{0.49}\) | \(\underset{\pm{0.06}}{0.40}\) |
| LLM-Judge | \(\underset{\pm{0.03}}{0.90}\) | \(\underset{\pm{0.02}}{0.91}\) | \(\underset{\pm{0.04}}{0.67}\) | \(\underset{\pm{0.02}}{0.69}\) | \(\underset{\pm{0.03}}{0.76}\) | \(\underset{\pm{0.04}}{0.76}\) | \(\underset{\pm{0.03}}{0.56}\) | \(\underset{\pm{0.03}}{0.55}\) | \(\underset{\pm{0.02}}{0.75}\) | \(\underset{\pm{0.02}}{0.75}\) |
An important linguistic concern is that an ambiguous hypothesis from the LLM may carry the same meaning as the ground truth yet be missed by the semantic similarity metric. This further motivates the LLM-Judge metric, whose input is the context of the causal graph, the partial causal graph, the ground-truth variable, and the model predictions. Given this rich context, we expect the LLM-Judge to be able to overcome such ambiguity. We prompted the model to justify its hypothesized variables with explanations. As shown in Table 14, we observed a drop in performance for semantic similarity when explanations are included, whereas the LLM-Judge metric remains similar or improves slightly when the explanation of the model’s hypothesis is given.
Chain-of-Thought (CoT) prompting has gained popularity due to its impressive performance in improving the quality of LLMs’ output [71], including metadata-based causal reasoning [16]. We also incorporated CoT prompting into our prompts and perform ablation studies in 15. We observe that CoT particularly improves the performance of the identification experiments.
| | Cancer \(X\) | Cancer \(✔\) | Survey \(X\) | Survey \(✔\) | Asia \(X\) | Asia \(✔\) | Insurance \(X\) | Insurance \(✔\) | Alarm \(X\) | Alarm \(✔\) |
|---|---|---|---|---|---|---|---|---|---|---|
| In-Context | \(1.00\) | \(1.00\) | \(0.83\) | \(1.00\) | \(0.75\) | \(0.88\) | \(0.74\) | \(0.90\) | \(0.91\) | \(0.96\) |
| Out-of-Context | \(0.50\) | \(0.25\) | \(0.18\) | \(0.33\) | \(0.57\) | \(0.61\) | \(0.56\) | \(0.58\) | \(0.54\) | \(0.57\) |
For Task 4, we iteratively hypothesize the missing variables (mediators). Our choice was primarily driven by the complexity of Task 4, which involves predicting multiple missing mediators, ranging from 1 to 10. For a Task with 10 missing mediators, the model would have to predict 50 suggestions at once. We initially hypothesized that LLMs might struggle with making multiple predictions across different variables simultaneously. This was indeed reflected in our results and GPT-4 outputs from Table X. The iterative approach allows the model’s prediction to narrow the search space, which would not be possible in a non-iterative approach. This method is more aligned with the scientific discovery process, where hypotheses are often refined iteratively based on new findings. Furthermore, our approach simulates a human-in-the-loop scenario, where the most plausible answer is selected and used to guide the next prediction.
| | Asia | Child | Insurance | Alarm |
|---|---|---|---|---|
| Non-iterative | 0.42 ± 0.07 | 0.33 ± 0.06 | 0.45 ± 0.09 | 0.54 ± 0.05 |
| Iterative | 0.49 ± 0.05 | 0.39 ± 0.03 | 0.52 ± 0.02 | 0.60 ± 0.04 |
We added a new graph, the neuropathic pain graph [72], which is not part of common LLM training corpora, as it must be downloaded with a Python script. The full graph consists of 221 nodes and 770 edges; for feasibility, we selected a subset of the graph for evaluation and ran experiments for Tasks 1, 2, and 3.
| Model | Task 1 | Task 2 Result | Task 2 FNA | Task 3 Sim | Task 3 LLM-J |
|---|---|---|---|---|---|
| Mistral | 0.64 | 0.51 | 0.32 | 0.38 | 0.53 |
| Mixtral | 0.83 | 0.55 | 0.34 | 0.45 | 0.69 |
| Llama | 0.78 | 0.49 | 0.27 | 0.44 | 0.63 |
| GPT-4 | 0.94 | 0.68 | 0.24 | 0.51 | 0.76 |
Figure 11: Detailed spider plots for Semantic similarity. a — Cancer, b — Survey, c — Alzheimers, d — Asia, e — Child, f — Insurance, g — Alarm
Figure 12: Detailed spider plots for LLM-as-judge metric. a — Cancer, b — Survey, c — Alzheimers, d — Asia, e — Child, f — Insurance, g — Alarm
| Dataset | Current Sim | Memorization | Est. Sim w/ BNLearn Context |
|---|---|---|---|
| Cancer | 0.49 | ✔ | 0.60 |
| Survey | 0.51 | ✔ | 0.62 |
| Asia | 0.66 | ✔ | 0.78 |
| Law | 0.55 | | 0.56 |
| Alzheimers | 0.47 | 0.55 | 0.52 |
| Child | 0.36 | ✔ | 0.52 |
| Insurance | 0.52 | ✔ | 0.64 |
| Alarm | 0.49 | ✔ | 0.62 |
To test whether GPT-4’s original performance was driven by the retrieval of memorized content, we reran the variable inference task with explicit prompts stating that the graphs were from the BNLearn repository. We observed modest gains in similarity for well-known graphs (e.g., Asia, Alarm), indicating that GPT-4 can retrieve additional details when cued. However, the performance in the original setting, without such cues, was already strong, suggesting that the model was not merely retrieving memorized structures. Instead, its responses appear to reflect contextual reasoning and generalization beyond rote recall.
We observe that different graph representations yield similar performance across tasks; the largest variation occurs on Task 2, with two missing variables, for the Mistral and Mixtral models.
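To make the comparison concrete, the snippet below serializes the same toy partial graph into the three formats we compare, using `networkx` for the JSON (node-link) and GraphML forms; the node names and the textual edge-list phrasing are illustrative, not the exact template from our prompts.

```python
# Sketch: serializing the same partial causal graph as JSON, GraphML, and text.
import json
import networkx as nx

# Illustrative toy graph; in practice this is the partial graph of a benchmark dataset.
edges = [("smoking", "cancer"), ("cancer", "dyspnoea")]
G = nx.DiGraph(edges)

# JSON: node-link representation
json_repr = json.dumps(nx.node_link_data(G), indent=2)

# GraphML: XML-based graph markup
graphml_repr = "\n".join(nx.generate_graphml(G))

# Textual: plain natural-language edge list (illustrative template)
text_repr = "\n".join(f"{u} causes {v}" for u, v in G.edges())

print(json_repr, graphml_repr, text_repr, sep="\n\n")
```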
| Representation | Asia | Child | Insurance | Alarm |
|---|---|---|---|---|
| JSON | 0.80 | 0.79 | 0.50 | 0.85 |
| GraphML | 0.80 | 0.78 | 0.47 | 0.85 |
| Textual (ours) | 0.78 | 0.80 | 0.49 | 0.85 |
| Representation | Asia | Child | Insurance | Alarm |
|---|---|---|---|---|
| JSON | 0.73/0.16 | 0.45/0.30 | 0.37/0.17 | 0.50/0.21 |
| GraphML | 0.70/0.15 | 0.41/0.29 | 0.37/0.12 | 0.53/0.22 |
| Textual (ours) | 0.73/0.17 | 0.42/0.31 | 0.40/0.12 | 0.51/0.22 |
| Representation | Asia | Child | Insurance | Alarm |
|---|---|---|---|---|
| JSON | 0.75 | 0.67 | 0.46 | 0.69 |
| GraphML | 0.71 | 0.67 | 0.50 | 0.72 |
| Textual (ours) | 0.73 | 0.68 | 0.47 | 0.71 |
We aim to assess the LLM’s causal reasoning via prompting. The following are the reasons why fine-tuning is not the most practical solution:
- Pretrained models come with a wealth of general knowledge, which we aim to leverage; fine-tuning could limit their ability to draw on this broad knowledge base. We also want to assess the utility of pretrained models as-is, since fine-tuning large models such as GPT-4 is not always feasible.
- The graphs are too small for fine-tuning: even a comparatively large, 52-edge graph such as Insurance yields only 27 data points, and Alarm yields 37. Additionally:
  - Using the same graph in both the training and test sets would lead to training-data leakage.
  - Using different graphs for training and testing introduces a domain shift between the two, and the model may overfit to the domain of the training graph.
Nevertheless, to illustrate this point, we performed supervised fine-tuning with QLoRA on the Mistral-7B-Instruct model for hypothesizing in the open-world task. The training set consists of all graphs except the one held out for testing; we tested on the Survey, Insurance, and Alzheimers graphs. The model was trained to give one best-fit suggestion for the missing variable.
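A minimal sketch of this QLoRA fine-tuning setup is given below, assuming the Hugging Face `transformers`/`peft`/`trl` stack; the checkpoint name, hyperparameters, and prompt template are illustrative, and the exact `SFTTrainer` argument names vary across `trl` versions.

```python
# Hypothetical sketch of the leave-one-graph-out QLoRA supervised fine-tuning.
# Hyperparameters, checkpoint, and dataset fields are illustrative, not the
# exact configuration used in our experiments.
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# train_records: prompts built from all graphs except the held-out test graph,
# each paired with the single best-fit missing variable as the target.
train_records = [{"text": "<partial graph prompt> ### Answer: <missing variable>"}]
train_dataset = Dataset.from_list(train_records)

trainer = SFTTrainer(               # argument names may differ across trl versions
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
)
trainer.train()
```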
| | Insurance | Survey | Alzheimers |
|---|---|---|---|
| No fine-tuning | 0.42 ± 0.03 | 0.44 ± 0.05 | 0.34 ± 0.04 |
| Fine-tuned | 0.39 ± 0.04 | 0.39 ± 0.03 | 0.36 ± 0.07 |
From the above results, it is evident that fine-tuning does not significantly improve over the prompting results. During training, the LLM becomes biased towards the domains of the training graphs, which are contextually distant from the test domain given the diversity of the graphs chosen. One might expect training to help the LLM understand the task, but the prompt-based outputs already show that the LLM can follow the instructions. In summary, we were able to elicit the LLM's knowledge via prompting; domain-specific fine-tuning could be examined more closely in future work.
Similar to fine-tuning, the success of few-shot learning depends on balancing domain specificity and generality. To prevent test examples from appearing among the shots, the demonstrations must come from a different domain. Given the complexity of the Alarm graph, we used examples from it as the in-context demonstrations, and performed 1-shot and 5-shot experiments with the Mixtral 8x7B model.
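For clarity, the sketch below shows one way a k-shot prompt can be assembled when the demonstrations come from a different graph (here Alarm) than the test graph; the template text is illustrative.

```python
# Sketch: building a k-shot prompt whose demonstrations come from the Alarm
# graph while the query comes from a different (test) graph. Template text
# is illustrative, not the exact prompt used in our experiments.
from typing import List, Tuple


def build_few_shot_prompt(alarm_examples: List[Tuple[str, str]], query: str, k: int) -> str:
    """alarm_examples: (partial-graph description, missing-variable answer) pairs."""
    shots = "\n\n".join(
        f"Partial graph:\n{graph}\nMissing variable: {answer}"
        for graph, answer in alarm_examples[:k]
    )
    return f"{shots}\n\nPartial graph:\n{query}\nMissing variable:"
```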
| Graph | 0-shot | 1-shot | 5-shot |
|---|---|---|---|
| Cancer | 0.41 | 0.43 | 0.46 |
| Survey | 0.39 | 0.38 | 0.36 |
| Asia | 0.66 | 0.70 | 0.72 |
| Alzheimer’s | 0.31 | 0.33 | 0.34 |
| Child | 0.53 | 0.55 | 0.56 |
| Insurance | 0.46 | 0.42 | 0.45 |
Note that Alarm is a medical graph, which means that providing more examples from a different domain might hinder model performance; a drop in performance when changing domains for in-context learning has been discussed in [73] and [74].
| Model | Sachs | Alarm1 | Alarm2 | Ins1 | Ins2 | Ins3 | Ins4 | Ins5 | Ins6 | Ins7 |
|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | 0.12 | 0.37 | 0.29 | 0.45 | 0.49 | 0.37 | 0.29 | 0.33 | 0.46 | 0.73 |
| Mixtral | 0.89 | 0.54 | 0.57 | 0.57 | 1.0 | 0.32 | 0.23 | 0.38 | 0.28 | 1.0 |
| Neural | 0.34 | 0.27 | 0.28 | 0.42 | 0.47 | 0.34 | 0.48 | 0.48 | 0.38 | 0.48 |
| Llama | 0.27 | 0.39 | 0.44 | 0.55 | 1.0 | 0.29 | 0.22 | 0.57 | 0.45 | 1.0 |
| Mistral | 0.23 | 0.62 | 0.46 | 0.58 | 1.0 | 0.28 | 0.28 | 0.28 | 0.28 | 1.0 |
| GPT-4 | 0.91 | 0.49 | 0.44 | 0.62 | 0.39 | 0.58 | 0.44 | 0.58 | 0.52 | 1.0 |
| Model | Sachs | Alarm1 | Alarm2 | Ins1 | Ins2 | Ins3 | Ins4 | Ins5 | Ins6 | Ins7 |
|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | 0.10 | 0.40 | 0.30 | 0.45 | 0.60 | 0.40 | 0.40 | 0.30 | 0.70 | 0.80 |
| Mixtral | 0.95 | 0.70 | 1.0 | 0.75 | 1.0 | 0.80 | 0.20 | 0.20 | 0.20 | 1.0 |
| Neural | 0.30 | 0.60 | 0.30 | 1.0 | 0.60 | 0.30 | 0.80 | 0.30 | 0.40 | 0.60 |
| Llama | 0.20 | 0.50 | 0.44 | 0.40 | 1.0 | 0.50 | 0.20 | 0.70 | 0.45 | 1.0 |
| Mistral | 0.20 | 0.90 | 0.80 | 0.55 | 1.0 | 0.30 | 0.20 | 0.70 | 0.30 | 1.0 |
| GPT-4 | 0.95 | 0.65 | 0.80 | 0.60 | 0.70 | 0.80 | 0.85 | 0.80 | 0.75 | 1.0 |
The causal sufficiency of \(\mathcal{G}\), by definition, implies that for every pair of variables within \(\mathbf{V}\), all common causes are also included within \(\mathbf{V}\). Extending this assumption to \(\mathcal{G}^*\), we assume that the partial graph inherits causal sufficiency for its variable subset \(V^*\), given that all edges among these variables are preserved as in \(\mathcal{G}\). This preservation ensures that the observed relationships within \(V^*\) are not confounded by omitted common causes. Similarly, the faithfulness of \(\mathcal{G}\) ensures that the observed conditional independencies among variables in \(\mathbf{V}\) are accurately reflected by the causal structure represented by \(\mathbf{E}\); by maintaining, for the subset \(V^*\), the same edges as in \(\mathbf{E}\), we uphold the faithfulness assumption within the partial graph.
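For reference, the two inherited assumptions can be restated compactly, using the notation above, with \(V^*\) denoting the variables of the partial graph \(\mathcal{G}^*\) and \(\mathbf{Z} \subseteq V^*\) an arbitrary conditioning set:

\[
\text{Causal sufficiency of } \mathcal{G}^*:\quad \forall\, V_i, V_j \in V^*,\ \text{every common cause of } V_i \text{ and } V_j \text{ is contained in } V^*.
\]

\[
\text{Faithfulness of } \mathcal{G}^*:\quad V_i \perp\!\!\!\perp V_j \mid \mathbf{Z} \;\Longrightarrow\; V_i \text{ and } V_j \text{ are d-separated by } \mathbf{Z} \text{ in } \mathcal{G}^*.
\]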
Code available at https://github.com/ivaxi0s/inferring-causal-variables↩︎