October 23, 2025
Instrumental variable methods are fundamental to causal inference when treatment assignment is confounded by unobserved variables. In this article, we develop a general nonparametric framework for identification and learning with multi-categorical or continuous instrumental variables. Specifically, we propose an additive instrumental variable framework to identify mean potential outcomes and the average treatment effect with a weighting function. Leveraging semiparametric theory, we derive efficient influence functions and construct consistent, asymptotically normal estimators via debiased machine learning. Extensions to longitudinal data, dynamic treatment regimes, and multiplicative instrumental variables are further developed. We demonstrate the proposed method by employing simulation studies and analyzing real data from the Job Training Partnership Act program.
Identification and Debiased Learning of Causal Effects with General Instrumental Variables
Keywords: Additive instrumental variables, Causal inference, Debiased machine learning, Dynamic treatment regimes, Longitudinal data.
Observational studies are commonly employed to estimate treatment effects in biomedical and economic research [1]. In the presence of unmeasured confounding, instrumental variable (IV) methods have been widely used to identify causal effects [2]–[6]. These approaches exploit exogenous variation in treatment induced by instruments that influence treatment assignment but are conditionally independent of the latent confounding effects.
Under the monotonicity condition that the instrument does not decrease (or increase) the probability of receiving treatment for any individual, IV methods identify causal effects within the complier subpopulation. [3], [6], [7] showed that, with multi-categorical instruments, identification is feasible for local treatment effects within specific complier subgroups. Building on this, [8], [9] proposed estimators for the local IV effect curve, capturing the treatment effect among individuals who would comply when the instrument exceeds a certain threshold. [10] relaxed the monotonicity assumption, establishing identification of the average treatment effect in the nudge subgroup, which involves mixtures of compliers and defiers.
Unlike approaches relying on the monotonicity condition, the identification strategy of [11], [12] does not require monotonicity. Instead, they identify the average treatment effect (ATE) by imposing no-interaction assumptions between the IV and latent confounding in the treatment model. [13]–[15] extended this framework to the marginal structural Cox model, while [16], [17] further generalized it to longitudinal settings. Furthermore, [18] proposed to identify optimal treatment regimes under a no unmeasured common effect modifier assumption, and [19] extended the approach to learn optimal treatment regimes for censored survival data. Building on this foundation, [20] proposed an identifying condition that excludes any multiplicative interaction between the IV and latent confounders in the treatment model, thereby enabling the nonparametric identification of the average treatment effect on the treated (ATT).
In parallel, [21], [22] developed IV quantile regression (IVQR) methods to estimate quantile treatment effects (QTE), allowing both the instrument and treatment variables to be non-binary. [23] adopt a novel type of copula invariance condition to identify the treatment effects for the entire population, allowing the treatment to be binary, multi-categorical or continuous. [24], [25] investigate scenarios involving invalid instrumental variables, where the exclusion restriction may be violated.
Collectively, these approaches, however, do not offer nonparametric identification strategies for the ATE in settings where either the instrument or the treatment variable is non-binary, which limits their applicability in empirical contexts involving multi-categorical or continuous treatments and instruments.
Although they do not target nonparametric identification of the ATE, several classical frameworks accommodate continuous IVs. The generalized method of moments (GMM) provides a unifying and flexible parametric framework in the IV setting [26]. It estimates structural parameters of interest by solving the corresponding moment conditions. It encompasses traditional methods such as two-stage least squares [27], [28] as special cases and extends naturally to situations with multiple instruments, heteroskedasticity, and other complexities.
Nonparametric IV methods represent another flexible and widely adopted framework, particularly suited for settings where the structural relationships between variables are complex and cannot be adequately captured by parametric models [29]–[34]. By avoiding restrictive assumptions on functional forms, these methods enable flexible, data-driven estimation of treatment and outcome models. Notably, proximal causal inference methods are closely related to nonparametric IV techniques and have emerged as effective approaches to address unmeasured confounding [35]–[37].
Inspired by existing literature, our study investigates the fundamental problem of estimating the mean potential outcomes [38], [39] in settings involving multi-categorical or continuous IVs and treatments, which substantially broadens the scope beyond the conventional binary treatment–binary IV framework. This generalization not only captures a wider range of empirical applications but also poses new methodological challenges for achieving valid identification.
We now outline the contents of our paper. In Section 2, we generalize the no-interaction condition between a binary IV and latent confounding [11], and introduce the additive IV framework for multi-categorical or continuous IVs to establish identification of mean potential outcomes and the ATE. We connect our strategy to the solution of a specific nonparametric IV problem, providing both theoretical insight and intuitive interpretation of the proposed estimands.
In Section 3, we use semiparametric theory [40]–[42] to derive efficient influence functions (EIFs) for target estimands defined by different weighting functions. For ATE identification, we show that under homoskedastic latent confounding, the optimal weighting function achieves the efficiency bound. Using the debiased machine learning (DML) framework with cross-fitting [43], [44], we construct two estimators for the ATE. The first estimator uses a fixed weighting function, and the second estimator leverages an adaptive procedure that selects the optimal weighting function. Both estimators are consistent and asymptotically normal.
In Section 4, we extend the identification strategy from point-exposure settings to longitudinal data. Other extensions to dynamic treatment regimes and multiplicative instrumental variables are given in the appendix. In Section 5, we conduct simulation studies to assess the validity of the proposed estimators under both point-exposure and longitudinal settings. We analyze Job Training Partnership Act program data in Section 6. We conclude our paper with a discussion in Section 7.
Let \(A \in \mathcal{A} := \{0, \ldots, M\}\) denote a multi-categorical treatment variable, where \(M = 1\) corresponds to the binary treatment setting. Let \(Z \in \mathcal{Z}\subseteq\mathbb{R}^{|\mathcal{Z}|}\) denote the IV, which may be multi-categorical or continuous. Let \(U\in\mathcal{U} \subseteq\mathbb{R}^{|\mathcal{U}|}\) represent unmeasured confounders, \(L\in\mathcal{L} \subseteq\mathbb{R}^{|\mathcal{L}|}\) observable confounders, \(Y\in\mathcal{Y} \subseteq\mathbb{R}\) the observed outcome, and \(Y(a)\) the potential outcome under treatment level \(A = a\). The observed data consist of \(O = \{Z, A, Y, L\} \in \mathcal{Z}\times \mathcal{A}\times \mathcal{Y}\times\mathcal{L}\). We introduce four fundamental assumptions under the IV setting.
Assumption 1 (Consistency). \(Y=Y(A)\).
Assumption 2 (Latent ignorability). For any \(a\in\mathcal{A}\), \(Y(a)\perp\!\!\!\perp\{A,Z\}\mid U,L\).
Assumption 3 (IV independence). \(Z\perp\!\!\!\perp U\mid L\).
Assumption 4 (IV relevance). For any \(a\in\mathcal{A}\) and \(l\in\mathcal{L}\), \(Z\not\perp\!\!\!\perp I\{A=a\}\mid L=l\). That is, there exist two distinct values \(z_0,z_1\in\mathcal{Z}\) such that \(\Pr(A=a\mid Z=z_0,L=l)\neq \Pr(A=a\mid Z=z_1,L=l).\)
Assumption 2 posits that, conditional on both the observed covariates and unmeasured confounders, the potential outcome \(Y(a)\) is independent of the treatment and IVs. Notably, it implies the IV exclusion restriction: the instrument \(Z\) affects the outcome \(Y\) only through its effect on the treatment \(A\). Assumption 3 states that \(Z\) is independent of the unmeasured confounders \(U\) given the observed covariates \(L\).
Assumption 4 requires that \(Z\) has a nontrivial effect on the treatment \(A\), conditional on any level of \(L\). This condition is slightly stronger than \(A \not\perp\!\!\!\perp Z \mid L\), which only requires the existence of some \(l \in \mathcal{L}\), \(a \in \mathcal{A}\), and \(z_0, z_1 \in \mathcal{Z}\) such that Assumption 4 holds.
Next, we provide the definition of additive IV, which serves as a key condition for identifying the causal estimands of interest.
Definition 1 (Additive IV). For each \(a \in \mathcal{A}\), \(Z\) is an *additive IV* (AIV) for \(A = a\) if there exist functions \(b(U,L)\) and \(c(Z,L)\) such that \(\Pr(A = a \mid Z, U, L) = b(U, L) + c(Z, L)\). Furthermore, \(Z\) is an AIV for \(A\) if it is an AIV for \(A=a\) for all \(a\in\mathcal{A}\).
According to [11], the definition of AIV originates from the no-interaction condition between \(Z\) and \(U\) in the treatment model. [24], [25], [45] also use this type of no-interaction condition to make inference with an invalid IV. The following proposition gives an alternative definition for AIV.
Proposition 1. For each \(a \in \mathcal{A}\), \(Z\) is an AIV for \(A = a\) if and only if, for any \(\pi(Z,L)\), \(\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid U,L\}=\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid L\}\), where \(A^{(a)}:=I\{A=a\}\).
Specifically, when \(Z\) is binary, Definition 1 holds if and only if \(\Pr(A=a \mid Z=1, U, L) - \Pr(A=a \mid Z=0, U, L) \perp\!\!\!\perp U \mid L,\) meaning that the differential effect of the instrument on treatment \(A\) is conditionally independent of the unmeasured confounders \(U\), given the observed covariates \(L\).
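For concreteness, a treatment model of the mixture form used later in our simulation design [A1] is additive in the sense of Definition 1, whereas a logistic model in which \(Z\) and \(U\) enter the same link, as in design [A2], generally admits no such decomposition:
\[
\Pr(A=1\mid Z,U,L)=\underbrace{0.7\,\Phi(-2Z+2L)}_{c(Z,L)}+\underbrace{0.3\,\Phi(3U-L)}_{b(U,L)}.
\]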
Next, we introduce the concept of a regular weighting function, which is analogous to the positivity condition commonly assumed in causal inference.
Definition 2 (Regular weighting function). For each \(a\in\mathcal{A}\), a function \(\pi(Z,L)\) is a *regular weighting function* (RWF) for \(A = a\) if it is uniformly bounded and there exists a positive constant \(\epsilon_0\) such that \(|\mathrm{Cov}\!\{I\{A=a\}, \pi(Z,L) \mid L\} | \geq \epsilon_0\) uniformly for all \(L\).
The existence of an RWF for every \(a \in \mathcal{A}\) implies the IV relevance condition in Assumption 4. Specifically, requiring the absolute value of the conditional covariance to be uniformly bounded below by a positive constant \(\epsilon_0\) rules out scenarios where \(\pi(Z,L)\) is irrelevant to \(A^{(a)}\), which would undermine the stability and validity of causal effect identification. To guarantee the existence of an RWF, the following positivity assumption is required.
Assumption 5 (Positivity). There exists a positive constant \(\epsilon_0\) such that for any \(l\in\mathcal{L}\) and \(a\in\mathcal{A}\), \(\mathrm{Var}\!\{\Pr(A=a\mid Z,L) \mid L=l\} \geq \epsilon_0.\)
Assumption 5 clearly entails the IV relevance condition in Assumption 4. The following proposition shows that it is, moreover, necessary and sufficient for the existence of an RWF. In particular, when \(Z\) is binary, Assumption 5 reduces to the standard positivity condition in the IV literature.
Proposition 2 (Existence). There exists an RWF \(\pi(Z,L)\) for \(A\) if and only if Assumption 5 holds. If there exists an RWF \(\pi(Z,L)\) for \(A=a\), then \(\pi^o(Z,L):=\Pr(A=a\mid Z,L)\) must be an RWF for \(A=a\).
In this subsection, we propose a strategy to identify the potential outcome mean \(\mathbb{E}[Y(a)]\) by formulating and solving a class of nonparametric IV models. Throughout, we define \(A^{(a)} := I\{A = a\}\) for each \(a \in \mathcal{A}\) for notational convenience. Our first primary goal is, for each \(a \in \mathcal{A}\), to identify a function \(f_a^o(A^{(a)}, L)\) that satisfies the conditional moment restriction given by \[\label{eq:32npiv} \mathbb{E}[A^{(a)} Y \mid Z, L] = \mathbb{E}[f_a(A^{(a)}, L) \mid Z, L].\tag{1}\]
Conditional moment equations of this form are common in the nonparametric IV literature [29]. The following theorem establishes the uniqueness of the solution to Equation 1 and provides an explicit representation in terms of any RWF \(\pi(Z, L)\).
Theorem 1 (Uniqueness and closed form solution). Under Assumption 4, for each \(a\in\mathcal{A}\), if a solution \(f_a^o(A^{(a)}, L)\) to Equation 1 exists, it is unique. Moreover, for any RWF \(\pi(Z, L)\) for \(A=a\) with \(\mathrm{Cov}\!\{A^{(a)}, \pi(Z, L) \mid L\} \neq 0\), the solution satisfies \[\label{eq:32explicit32form} \begin{align} f_a^o(0,L) &= -\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}} \mathbb{E}[A^{(a)}\mid Z,L] + \mathbb{E}[A^{(a)}Y\mid Z,L],\\ f_a^o(1,L) &= \dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}} \{1-\mathbb{E}[A^{(a)}\mid Z,L]\} + \mathbb{E}[A^{(a)}Y\mid Z,L]. \end{align}\qquad{(1)}\]
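To see where the closed form comes from (a sketch of our reading of the argument), write \(m(Z,L):=\mathbb{E}[A^{(a)}\mid Z,L]\) and note that \(\mathbb{E}[f_a(A^{(a)},L)\mid Z,L]=f_a(0,L)+\{f_a(1,L)-f_a(0,L)\}\,m(Z,L)\). Taking the conditional covariance of both sides of Equation 1 with any RWF \(\pi(Z,L)\) given \(L\) yields
\[
\mathrm{Cov}\!\{A^{(a)}Y,\pi(Z,L)\mid L\}=\{f_a(1,L)-f_a(0,L)\}\,\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\},
\]
so \(f_a(1,L)-f_a(0,L)=\mathrm{Cov}\!\{A^{(a)}Y,\pi(Z,L)\mid L\}/\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}\) whenever the denominator is nonzero. Substituting this contrast back into Equation 1 recovers the level \(f_a^o(0,L)\), which gives the displayed expressions and shows that any solution is unique.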
However, Theorem 1 ensures uniqueness without guaranteeing existence of a solution to Equation 1 . Indeed, Equation 1 is over-identified when \(Z\) is non-binary, meaning that a solution may not exist in general. In the proximal causal inference literature, the existence of a solution to the nonparametric IV (bridge) equation is typically guaranteed under a completeness condition [36]. Likewise, within the IV framework, the AIV condition plays a central role in ensuring existence. This is formally stated in the following proposition.
Proposition 3 (Identification of mean potential outcomes). Under Assumptions 1–4, for each \(a\in\mathcal{A}\), if \(Z\) is an AIV for \(A=a\), then there exists a unique solution \(f_a^o(A^{(a)}, L)\) to Equation 1 , given by \[\begin{align} f_a^o(0,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\},\\ f_a^o(1,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\} + \mathbb{E}[Y(a)\mid L]. \end{align}\] In particular, if Assumption 5 holds and \(\pi(Z,L)\) is an RWF for \(A=a\), \[\begin{align} \label{eq:32identification32AIV} \mathbb{E}[Y(a)] = \mathbb{E}\left[\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}}\right]. \end{align}\qquad{(2)}\]
Our identification strategy holds even when the treatment space \(\mathcal{A}\) is multi-categorical. Intuitively, this is because we transform the multi-categorical treatment \(A\) into a binary variable \(A^{(a)}\) for each \(a \in \mathcal{A}\).
Moreover, if there are no latent confounders \(U\), then \(f_a^o(0, L) \equiv 0\) by Proposition 3, since \(\mathbb{E}[Y(a)\mid U,L]\) then reduces to a function of \(L\) alone and the conditional covariance vanishes. In other words, a significant deviation of \(f_a^o(0, L)\) from zero indicates the presence of latent confounding. This observation suggests a novel criterion for detecting unmeasured confounding, although developing a formal test is beyond the scope of this article.
Furthermore, when \(Z\) is a non-binary IV, Proposition 3 provides a practical way to assess whether \(Z\) qualifies as an AIV for \(A = a\), which is a relatively strong condition. Specifically, under Assumptions 1–4, if \(Z\) is indeed an AIV for \(A = a\), the value of the right-hand side of Equation (2) does not depend on the choice of the weighting function \(\pi(Z, L)\). Therefore, one can select two distinct RWFs \(\pi_1(Z, L)\) and \(\pi_2(Z, L)\) and compare the resulting values; a discrepancy between them indicates a violation of the AIV condition.
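As a rough illustration of this diagnostic, the sketch below compares in-sample plug-in estimates of the right-hand side of Equation (2) under two candidate RWFs, \(\pi_1(Z,L)=Z\) and \(\pi_2(Z,L)=I\{Z>\mathrm{median}(Z)\}\); the conditional moments given \(L\) are estimated with penalized splines from the mgcv package, assuming a univariate continuous \(L\). The function name and the particular weighting functions are ours for illustration only; a formal comparison would also require a standard error for the difference.

```r
library(mgcv)

# Plug-in estimate of E[ Cov{A_a*Y, pi | L} / Cov{A_a, pi | L} ]  (Equation (2))
# for a binary indicator A_a = I{A = a} and a user-supplied weighting function pi_w.
# No cross-fitting here: this is an in-sample check, not the estimator of Section 3.
aiv_plugin <- function(Y, A_a, pi_w, L) {
  dat <- data.frame(Y = Y, A_a = A_a, pi_w = pi_w, L = L)
  # Conditional means given L, estimated with penalized splines (univariate L assumed)
  m_AY  <- fitted(gam(I(A_a * Y)        ~ s(L), data = dat))
  m_A   <- fitted(gam(A_a               ~ s(L), data = dat))
  m_pi  <- fitted(gam(pi_w              ~ s(L), data = dat))
  m_AYp <- fitted(gam(I(A_a * Y * pi_w) ~ s(L), data = dat))
  m_Ap  <- fitted(gam(I(A_a * pi_w)     ~ s(L), data = dat))
  # Conditional covariances: Cov{X, pi | L} = E[X*pi | L] - E[X | L] * E[pi | L]
  cov_AY_pi <- m_AYp - m_AY * m_pi
  cov_A_pi  <- m_Ap  - m_A  * m_pi
  mean(cov_AY_pi / cov_A_pi)
}

# Hypothetical usage: compare two regular weighting functions.
# A discrepancy well beyond sampling error suggests the AIV condition fails.
# psi1 <- aiv_plugin(Y, as.numeric(A == 1), Z, L)
# psi2 <- aiv_plugin(Y, as.numeric(A == 1), as.numeric(Z > median(Z)), L)
# c(psi1, psi2)
```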
In practical applications with a binary treatment \(A\), researchers are also interested in the ATE. The following proposition summarizes the identification results for the ATE.
Proposition 4 (Identification of ATE). Assume \(A\) is binary. Under Assumptions 1–5, if either \(Z\) is an AIV for \(A\) or \(Y(1)-Y(0)\perp\!\!\!\perp U \mid L\), \[\label{eq:32npiv32Y40141-Y40041} \mathbb{E}[Y\mid Z,L]=\mathbb{E}[f(A,L)\mid Z,L]\qquad{(3)}\] has a unique solution \(f^o(A, L)\) as \[\begin{align} f^o(0,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(1)-Y(0)\mid U,L], \Pr(A=1\mid Z,U,L)\mid L\} + \mathbb{E}[Y(0)\mid L],\\ f^o(1,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(1)-Y(0)\mid U,L], \Pr(A=1\mid Z,U,L)\mid L\} + \mathbb{E}[Y(1)\mid L]. \end{align}\] In particular, for any RWF \(\pi(Z, L)\) for \(A\), the ATE is identified as \[\begin{align} \label{eq:32identification32AIV32Y40141-Y40041} \mathbb{E}[Y(1)-Y(0)] = \psi_{\pi}^o := \mathbb{E}\left[\dfrac{\mathrm{Cov}\!\{Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}\right]. \end{align}\qquad{(4)}\]
Proposition 4 shows that the AIV condition is not strictly necessary for identifying the ATE. Instead, the alternative condition \(Y(1) - Y(0) \perp\!\!\!\perp U \mid L\) also ensures the existence of a solution to Equation (3).
Remark 1. In fact, under Assumptions 1–5, without the AIV condition, the following equation still holds: \[\begin{align} \psi_{\pi}^o= \mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\dfrac{\mathrm{Cov}\!\{A, \pi(Z,L) \mid U,L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}\right]. \end{align}\] This result suggests that, even when \(Z\) fails to be an AIV, the estimand \(\psi_{\pi}^o\) can still be interpreted as a weighted average of the conditional average treatment effect \(\mathbb{E}[Y(1)-Y(0)\mid U,L]\), as long as \(\mathrm{Cov}\!\{A, \pi(Z,L) \mid U,L\}\) and \(\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}\) have the same sign. This representation is analogous to the assumption-lean inference of [46], [47], in which the estimands remain meaningful and interpretable even under model misspecification. In addition, Equation (4) holds when \(\mathrm{Cov}\!\{Y(1)-Y(0),\mathrm{Cov}\!\{A, \pi(Z,L) \mid U,L\}\mid L\}=0.\) This result can be viewed as an extension of the findings in [18].
Remark 2. When \(L = \emptyset\) and \(Z\) is univariate, the parameter \(\psi_{\pi}^o\) in Equation (4) reduces to a form resembling the two-stage least squares (TSLS) estimator if we set \(\pi(Z,L) = Z\). Moreover, several classical works interpret \(\psi_{\pi}^o\) as \(\tau_0\), the solution to the conditional moment equation \(\mathbb{E}[Y - \tau_0 A \mid Z] = 0\) [26], [28], [48], where \(\tau_0\) represents the effect of a marginal change in the endogenous variable \(A\) on the outcome. Consequently, our identification strategy can be viewed as a nonparametric generalization of the TSLS and GMM approaches, which additionally accounts for confounding effects through \(L\).
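To make the connection concrete, take \(L=\emptyset\) and \(\pi(Z,L)=Z\). The moment condition \(\mathbb{E}[Y-\tau_0 A\mid Z]=0\) implies \(\mathbb{E}[(Y-\tau_0A)\,h(Z)]=0\) for any function \(h\), in particular \(h(Z)=Z-\mathbb{E}[Z]\), so
\[
\mathrm{Cov}(Y,Z)-\tau_0\,\mathrm{Cov}(A,Z)=0
\quad\Longrightarrow\quad
\tau_0=\frac{\mathrm{Cov}(Y,Z)}{\mathrm{Cov}(A,Z)}=\psi_{\pi}^o\big|_{L=\emptyset,\;\pi(Z,L)=Z},
\]
which is the probability limit of the simple IV (TSLS) estimator with a single instrument.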
Remark 3. For a continuous treatment, we provide a similar identification result in the appendix.
In this section, we consider the case of a binary treatment assignment \(A\), a setting commonly encountered in causal inference. Our main objective is to rigorously characterize the semiparametric efficiency bound for the ATE under this framework. Leveraging semiparametric theory [42], we identify the minimum asymptotic variance attainable by any regular, asymptotically linear estimator of the ATE. This efficiency bound serves as a benchmark for guiding the construction of adaptive estimators.
Proposition 4 establishes that, under Assumptions 1–4, if \(Z\) is an AIV for the binary treatment \(A\), the choice of the RWF \(\pi(Z, L)\) does not affect the value of \(\psi_{\pi}^o\) in Equation (4). However, the semiparametric efficiency bound for \(\psi_{\pi}^o\) still depends on the choice of \(\pi(Z, L)\). Consequently, it is necessary to derive the EIFs corresponding to all possible RWFs \(\pi(Z, L)\). To this end, we define the following nuisance functions: \[\begin{align} &\delta^o(L) := \mathbb{E}[A \mid L], && \eta^o(L) := \mathbb{E}[Y \mid L],\\ &\kappa_{\pi}^o(L) := \mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}, && \zeta_{\pi}^o(L) := \mathbb{E}[Y \pi(Z,L) \mid L],\\ &\rho_{\pi}^o(L) := \mathbb{E}[\pi(Z,L) \mid L], && \gamma_{\pi}^o(L) := \dfrac{\mathrm{Cov}\!\{Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}. \end{align}\] For notational convenience, we collect them into a unified nuisance vector: \[\begin{align} \label{eq:32nuisance32function32fixed32pi} \alpha_{\pi}^o(L) := [\delta^o(L), \kappa_{\pi}^o(L), \rho_{\pi}^o(L), \eta^o(L), \gamma_{\pi}^o(L)]. \end{align}\tag{2}\] Note that \(\zeta_{\pi}^o(L)\) serves only as an intermediate nuisance function used in our proofs and does not appear in the unified vector \(\alpha_{\pi}^o(L)\). The following theorem derives the efficient influence function (EIF) of \(\psi_{\pi}^o\) for any choice of RWF \(\pi(Z,L)\).
Theorem 2. Under Assumptions 1–5, for any RWF \(\pi(Z,L)\) for \(A\), the EIF for \(\psi_{\pi}^o\) in Equation (4) is given by \(\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o)\), where \(\varphi_{\pi}(O;\psi_{\pi},\alpha_{\pi})\) equals \[\begin{align} \dfrac{1}{\kappa_{\pi}(L)} \left\{\pi(Z,L)-\rho_{\pi}(L)\right\} \{Y-\eta(L)\} - \psi_{\pi} + \left(1 - \dfrac{\{\pi(Z,L)-\rho_{\pi}(L)\}\{A-\delta(L)\}}{\kappa_{\pi}(L)} \right) \gamma_{\pi}(L). \end{align}\]
Remark 4. We carry out our semiparametric analysis in the fully nonparametric model. In this case, the tangent space coincides with \(L_2(O)\), and the unique influence function corresponds to the EIF.
This theorem underpins the construction of efficient estimators for \(\psi_{\pi}^o\). In practice, the true nuisance vector \(\alpha_{\pi}^o\) is unknown and must be estimated from the data. The following proposition quantifies the bias introduced by substituting an estimated \(\alpha_{\pi}\) for the true vector. This mixed bias property is well-documented in the existing literature [44], [49].
Proposition 5 (Mixed bias property). Under Assumptions 1–5, for any RWF \(\pi(Z,L)\) and any fixed \(\alpha_{\pi}\) in Equation 2 , \(\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi})\) satisfies that \[\begin{align} &\mathbb{E}[\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi})] =\mathbb{E}\left[\dfrac{1}{\kappa_{\pi}(L)}\left\{\begin{array}{l} \{\kappa_{\pi}(L)-\kappa_{\pi}^o(L)\}\{\gamma_{\pi}(L)-\gamma_{\pi}^o(L)\}\\ +\{\rho_{\pi}(L)-\rho_{\pi}^o(L)\}\{\eta(L)-\eta^o(L)\}\\ +\{\rho_{\pi}(L)-\rho_{\pi}^o(L)\}\{\delta(L)-\delta^o(L)\}\gamma_{\pi}(L) \end{array}\right\}\right]. \end{align}\]
Next, we aim to select a function \(\pi(Z, L)\) from the class of RWFs that minimizes the asymptotic variance of the estimator for \(\psi_{\pi}^o\). Specifically, we seek the optimal \(\pi(Z, L)\) that minimizes the second moment of the efficient influence function. Minimizing this quantity yields the most statistically efficient estimator among all estimators based on different RWFs.
Intuitively, Proposition 2 suggests that \(\pi^o(Z,L) := \Pr(A = 1 \mid Z, L)\) is a natural candidate for the optimal weighting function. The following proposition characterizes the \(\pi(Z,L)\) that attains this minimum variance bound.
Proposition 6 (Optimal RWF). Under the conditions of Theorem 2, suppose that the solution \(f^o(A,L)\) to the nonparametric IV problem in Equation (3) satisfies \[\label{eq:32homoskedastic} \mathbb{E}\left[\{Y-f^o(A,L)\}^2\middle| Z,L\right]\perp\!\!\!\perp Z\mid L.\qquad{(5)}\] Then the quantity \(\mathbb{E}[\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o)^2]\) attains its lower bound when \(\pi(Z,L)=\Pr(A=1\mid Z,L)\).
The condition in Equation (5) implies that, after conditioning on \(L\), the instrument \(Z\) provides no additional information about the residual variation in \(Y\). [50] leverage a similar assumption to derive the “optimal instruments” in the special case where \(L=\emptyset\).
As established in the previous subsection, setting \(\pi^o(Z,L) := \Pr(A = 1 \mid Z, L)\) yields the optimal efficiency for estimating \(\psi_{\pi}^o\) in the nonparametric model. In practice, however, \(\pi^o(Z,L)\) is unknown. A natural approach is to treat \(\pi^o(Z,L)\) as a nuisance function and estimate it from the data adaptively. Another important motivation is that Proposition 2 implies that if \(\pi^o(Z,L)\) fails to be a valid RWF for \(A\), then no alternative RWF exists.
Therefore, we develop a strategy for adaptively estimating the optimal weighting function. First, define the following nuisance functions: \[\begin{align} &\delta^o(L):=\mathbb{E}[A\mid L], &&\eta^o(L):=\mathbb{E}[Y\mid L],\\ &\kappa^o(L):=\mathrm{Cov}\!\{A,\pi^o(Z,L)\mid L\}, &&\zeta^o(L):=\mathbb{E}[Y\pi^o(Z,L)\mid L],\\ &\xi^o(Z,L):=\mathbb{E}[Y\mid Z,L], &&\gamma^o(L):=\dfrac{\mathrm{Cov}\!\{Y, \pi^o(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi^o(Z,L) \mid L\}}. \end{align}\] We collect these nuisance functions into a unified nuisance vector: \[\label{eq:32nuisance32function} \beta^o(Z,L):=[\pi^o(Z,L),\delta^o(L),\kappa^o(L), \xi^o(Z,L),\eta^o(L),\gamma^o(L)].\tag{3}\] Note that \(\zeta^o(L)\) is only an auxiliary quantity and is not included in \(\beta^o(Z,L)\). According to Proposition 4, the ATE can then be identified as \[\label{eq:32identification32AIV32Y40141-Y4004132unknown32weight} \psi_{ada}^o:=\mathbb{E}\left[\gamma^o(L)\right] =\mathbb{E}\left[\dfrac{\pi^o(Z,L)-\delta^o(L)}{\kappa^o(L)}Y\right].\tag{4}\] We now proceed to derive the EIF for \(\psi_{ada}^o\).
Theorem 3. Under Assumptions 1–5, the EIF for \(\psi_{ada}^o\) in Equation 4 is given by \(\varphi(O;\psi_{ada}^o,\beta^o)\), where \[\begin{align} \varphi(O;\psi_{ada},\beta)=&\dfrac{\pi(Z,L)-\delta(L)}{\kappa(L)}Y-\psi_{ada} +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \kappa(L) + (A-\pi(Z,L))^2 - (A-\delta(L))^2 \right\}\\ &+\dfrac{1}{\kappa(L)} \left\{\xi(Z,L)(A-\pi(Z,L))-\eta(L)(A-\delta(L))\right\}. \end{align}\]
Remark 5. Similar to Theorem 2, this theorem still holds even if \(Z\) fails to be an AIV for \(A\), because the definition of \(\psi_{ada}^o\) does not rely on the AIV condition.
Next, the EIF characterization in Theorem 3 forms the foundation for analyzing the robustness of the proposed estimator, which is demonstrated in the next proposition.
Proposition 7 (Mixed bias property). Under Assumptions 1–5, for any fixed nuisance vector \(\beta\) in Equation 3 , \(\varphi(O;\psi_{ada}^o,\beta)\) satisfies \[\begin{align} \mathbb{E}[\varphi(O;\psi_{ada}^o,\beta)]= \mathbb{E}\left[\dfrac{1}{\kappa(L)}\left\{\begin{array}{l} \{\gamma(L)-\gamma^o(L)\}\{\kappa(L) - \kappa^o(L)\}\\ - \gamma(L)(\delta^o(L)-\delta(L))^2\\ +\gamma(L)(\pi^o(Z,L)-\pi(Z,L))^2\\ -(\xi(Z,L)-\xi^o(Z,L))(\pi(Z,L)-\pi^o(Z,L))\\ +(\eta(L)-\eta^o(L))(\delta(L)-\delta^o(L)) \end{array}\right\}\right]. \end{align}\]
In this subsection, we adopt the cross-fitting procedure proposed by [43], [44] to construct debiased estimators for \(\psi_{\pi}^o\) and \(\psi_{ada}^o\), defined in Equations (4) and 4, respectively. Without loss of generality, assume that the sample size \(n\) is evenly divisible by the number of folds \(K\). We randomly partition the sample into \(K\) folds of equal size. Let \(I_k\) denote the set of indices belonging to the \(k\)-th fold, and let \(I_{-k}\) denote its complement. Denote by \(|I_k|\) the size of fold \(I_k\). For any random variable \(O\), define the empirical average over fold \(k\) as \(\mathbb{E}_{nk}[O] := \sum_{i \in I_k} O_i/|I_k|.\) We further define the \(L_2\) norm of the nuisance vector \(\alpha_{\pi}(L)\) from Equation 2 as \[\|\alpha_{\pi}(L)\|_2^2 := \|\delta(L)\|_2^2 + \|\eta(L)\|_2^2 + \|\kappa_{\pi}(L)\|_2^2 + \|\rho_{\pi}(L)\|_2^2 + \|\gamma_{\pi}(L)\|_2^2,\] and the \(L_2\) norm of the nuisance vector \(\beta(Z,L)\) from Equation 3 as \[\|\beta(Z,L)\|_2^2 := \|\pi(Z,L)\|_2^2 + \|\xi(Z,L)\|_2^2 + \|\delta(L)\|_2^2 + \|\eta(L)\|_2^2 + \|\kappa(L)\|_2^2 + \|\gamma(L)\|_2^2.\]
Next, for any fixed fold \(I_k\), the nuisance estimators \(\hat{\alpha}_{\pi}^{(n,k)}\) are trained using only the observations in \(I_{-k}\) with any suitable machine learning method. By construction, \(\hat{\alpha}_{\pi}^{(n,k)}\) is independent of the samples in \(I_k\). We derive the estimator \(\hat{\psi}_{\pi}^{(n)}\) as the solution to the equation \[\begin{align} \label{eq:32AUG32estimator32prespecified32weight} \sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})\right]=0. \end{align}\tag{5}\] Next, we establish the consistency and asymptotic normality of \(\hat{\psi}_{\pi}^{(n)}\) defined in Equation 5.
Theorem 4 (Asymptotic normality of \(\hat{\psi}_{\pi}^{(n)}\)). Under Assumptions 1–4, suppose that \(\pi(Z,L)\) is an RWF for \(A\). Assume further that, for any \(k=1,\ldots,K\), \(\mathbb{E}[\|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^o\|_2^2]=o(1)\), and that \[\begin{align} \left\{\begin{array}{l} \|\hat{\kappa}_{\pi}^{(n,k)}-\kappa_{\pi}^o\|_2\times \|\hat{\gamma}_{\pi}^{(n,k)}-\gamma_{\pi}^o\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}-\rho_{\pi}^o\|_2\times \|\hat{\delta}^{(n,k)}-\delta^o\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}-\rho_{\pi}^o\|_2\times \|\hat{\eta}^{(n,k)}-\eta^o\|_2\\ \end{array}\right\} =o_p(n^{-1/2}). \end{align}\] Then \(\sqrt{n}\left(\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^o\right)/\sigma_{\pi}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where the asymptotic variance is defined as \((\sigma_{\pi}^o)^2:=\mathbb{E}[\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o)^2]\). In addition, if we define the variance estimator for \((\sigma_{\pi}^o)^2\) as \((\hat{\sigma}_{\pi}^{(n)})^2:=\sum_{k=1}^K\mathbb{E}_{nk}[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2]/K,\) then \((\hat{\sigma}_{\pi}^{(n)})^2\) converges in probability to \((\sigma_{\pi}^o)^2\).
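A minimal sketch of the cross-fitted estimator in Equation 5 for the prespecified weight \(\pi(Z,L)=Z\) is given below, with nuisances fitted by penalized splines as in our numerical studies (mgcv). The sketch assumes a univariate continuous \(L\); the function and variable names are ours and not part of any released package.

```r
library(mgcv)

# Uncentered EIF of Theorem 2 with pi(Z,L) = Z; psi enters linearly with
# coefficient -1, so the cross-fitted estimator is the sample mean of this quantity.
eif_uncentered <- function(Y, A, Z, delta, kappa, rho, eta, gamma) {
  (Z - rho) * (Y - eta) / kappa +
    (1 - (Z - rho) * (A - delta) / kappa) * gamma
}

# Cross-fitted estimator of psi_pi (Equation 5) with K folds; assumes univariate L.
dml_fixed_pi <- function(Y, A, Z, L, K = 2) {
  n <- length(Y)
  folds <- sample(rep(1:K, length.out = n))
  phi <- numeric(n)
  for (k in 1:K) {
    tr <- folds != k
    ev <- folds == k
    d_tr <- data.frame(Y = Y[tr], A = A[tr], Z = Z[tr], L = L[tr])
    d_ev <- data.frame(Y = Y[ev], A = A[ev], Z = Z[ev], L = L[ev])
    # Nuisance regressions fitted on the training folds, evaluated on the held-out fold
    reg <- function(f) predict(gam(f, data = d_tr), newdata = d_ev)
    delta <- reg(A ~ s(L))                                # E[A | L]
    eta   <- reg(Y ~ s(L))                                # E[Y | L]
    rho   <- reg(Z ~ s(L))                                # E[Z | L]
    kappa <- reg(I(A * Z) ~ s(L)) - delta * rho           # Cov{A, Z | L}
    gamma <- (reg(I(Y * Z) ~ s(L)) - eta * rho) / kappa   # Cov{Y, Z | L} / Cov{A, Z | L}
    phi[ev] <- eif_uncentered(Y[ev], A[ev], Z[ev], delta, kappa, rho, eta, gamma)
  }
  psi_hat <- mean(phi)                          # solves Equation 5 with equal fold sizes
  se_hat  <- sqrt(mean((phi - psi_hat)^2) / n)  # based on the variance estimator in Theorem 4
  c(estimate = psi_hat, se = se_hat)
}
```

For a general RWF, one replaces \(Z\) by \(\pi(Z,L)\) throughout; the adaptive variant below additionally estimates the weight itself.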
Analogously, for any fixed fold \(k = 1, \ldots, K\), the nuisance estimators \(\hat{\beta}^{(n,k)}\) are trained using only the observations in \(I_{-k}\) with any suitable machine learning method. The estimator \(\hat{\psi}_{ada}^{(n)}\) is then defined as the solution to \[\begin{align} \label{eq:32AUG32estimator32adaptive32weight} \sum_{k=1}^K \mathbb{E}_{nk}\!\left[\varphi\!\left(O;\hat{\psi}_{ada}^{(n)},\hat{\beta}^{(n,k)}\right)\right] = 0. \end{align}\tag{6}\] Next, we establish the consistency and asymptotic normality of \(\hat{\psi}_{ada}^{(n)}\) defined in Equation 6.
Theorem 5 (Asymptotic normality of \(\hat{\psi}_{ada}^{(n)}\)). Under Assumptions 1–5, suppose that either \(Z\) is an AIV for \(A\) or \(Y(1)-Y(0)\perp\!\!\!\perp U\mid L\). Assume that for any \(k=1,\ldots,K\), \(\mathbb{E}[\|\hat{\beta}^{(n,k)}-\beta^o\|_2^2]=o(1)\), and that \[\begin{align} \left\{ \begin{array}{l} \|\hat{\gamma}^{(n,k)}-\gamma^o\|_2 \times \|\hat{\kappa}^{(n,k)}-\kappa^o\|_2 +\|\hat{\delta}^{(n,k)}-\delta^o\|_2^2+\|\hat{\pi}^{(n,k)}-\pi^o\|_2^2\\ +\|\hat{\xi}^{(n,k)}-\xi^o\|_2\times \|\hat{\pi}^{(n,k)}-\pi^o\|_2 +\|\hat{\eta}^{(n,k)}-\eta^o\|_2\times \|\hat{\delta}^{(n,k)}-\delta^o\|_2 \end{array}\right\} =o_p(n^{-1/2}) \end{align}\] Then \(\sqrt{n}\left(\hat{\psi}_{ada}^{(n)}-\psi_{ada}^o\right)/\sigma_{ada}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where the asymptotic variance is defined as \((\sigma_{ada}^o)^2:=\mathbb{E}[\varphi(O;\psi_{ada}^o,\beta^o)^2]\). In addition, if we define the variance estimator for \((\sigma_{ada}^o)^2\) as \((\hat{\sigma}_{ada}^{(n)})^2:=\sum_{k=1}^K\mathbb{E}_{nk}[\varphi(O;\hat{\psi}_{ada}^{(n)},\hat{\beta}^{(n,k)})^2]/K,\) then \((\hat{\sigma}_{ada}^{(n)})^2\) converges in probability to \((\sigma_{ada}^o)^2\).
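For the adaptive estimator, the only structural change relative to the previous sketch is that the weight itself becomes a nuisance: within each training fold one additionally fits \(\pi^o(Z,L)=\Pr(A=1\mid Z,L)\) and \(\xi^o(Z,L)=\mathbb{E}[Y\mid Z,L]\), and then evaluates the EIF of Theorem 3 on the held-out fold. A fold-specific step is sketched below under the same univariate-\(L\) assumption, with function names of our own choosing; averaging the returned values over all held-out observations yields \(\hat{\psi}_{ada}^{(n)}\) in Equation 6.

```r
# Fold-specific step for the adaptive estimator: estimate the optimal weight
# pi^o(Z,L) = Pr(A = 1 | Z, L) on the training fold, then build the nuisances
# entering the EIF of Theorem 3.  d_tr / d_ev hold columns Y, A, Z, L.
adaptive_fold_eif <- function(d_tr, d_ev) {
  pi_fit <- gam(A ~ s(Z) + s(L), family = binomial, data = d_tr)
  xi_fit <- gam(Y ~ s(Z) + s(L), data = d_tr)
  d_tr$pi_o <- predict(pi_fit, type = "response")                  # pi^o on training fold
  d_ev$pi_o <- predict(pi_fit, newdata = d_ev, type = "response")  # pi^o on held-out fold
  reg <- function(f) predict(gam(f, data = d_tr), newdata = d_ev)
  delta <- reg(A ~ s(L))                       # E[A | L]
  eta   <- reg(Y ~ s(L))                       # E[Y | L]
  e_pi  <- reg(pi_o ~ s(L))                    # E[pi^o | L]
  e_Api <- reg(I(A * pi_o) ~ s(L))             # E[A * pi^o | L]
  e_Ypi <- reg(I(Y * pi_o) ~ s(L))             # E[Y * pi^o | L]
  kappa <- e_Api - delta * e_pi                # Cov{A, pi^o | L}
  gamma <- (e_Ypi - eta * e_pi) / kappa        # Cov{Y, pi^o | L} / kappa
  xi    <- predict(xi_fit, newdata = d_ev)     # E[Y | Z, L]
  # Uncentered EIF of Theorem 3, evaluated on the held-out fold
  with(d_ev,
       (pi_o - delta) / kappa * Y +
         gamma / kappa * (kappa + (A - pi_o)^2 - (A - delta)^2) +
         (xi * (A - pi_o) - eta * (A - delta)) / kappa)
}
```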
In this section, we extend the identification strategy introduced in Section 2 to a longitudinal setting with a sequence of IVs observed at each time point. Specifically, consider a longitudinal study with measurements collected at \(T+1\) discrete time points, indexed by \(t = 0, 1, \ldots, T\), where \(T\) is a fixed nonnegative integer. The special case \(T = 0\) reduces to the point-exposure setting discussed in Section 2.
For notation, let \(\overline{a}_t := [a_0, a_1, \ldots, a_t]\), \(\underline{a}_t := [a_t, a_{t+1}, \ldots, a_T]\), \(a_t^s := [a_t, a_{t+1}, \ldots, a_s]\), and \(\overline{a} := [a_0, a_1, \ldots, a_T]\). By convention, we set \(\underline{a}_{T+1} = a_t^{t-1} := \emptyset\) for any \(t\), and note that \(\overline{a} = \underline{a}_0 = \overline{a}_T\) for brevity. At each time point \(t\), let \(L_t\in\mathcal{L}_t\) denote the vector of observed confounders, \(U_t\in\mathcal{U}_t\) the vector of unobserved confounders, \(Z_t\in\mathcal{Z}_t\) the IV (which may be multi-categorical or continuous), and \(A_t\in\mathcal{A}_t\) the multi-categorical treatment assignment. The observed data are given by \(O := [\overline{Z}_T, \overline{A}_T, \overline{L}_T, Y],\) where \(Y\) is the outcome of interest, observed only at time \(T+1\).
At each time point \(t\), define the history \(H_t := [\overline{Z}_{t-1}, \overline{A}_{t-1}, \overline{L}_t]\in\mathcal{H}_t\), and let \(H_{T+1} := O\) denote the full observed data. Notably, the histories satisfy the recursive relation \(H_t = [H_{t-1}, Z_{t-1}, A_{t-1}, L_t].\) Let \(Y(\overline{a})\) denote the potential outcome under treatment history \(\overline{A} = \overline{a}\). Next, we extend the preceding assumptions to a longitudinal setting with valid instrumental variables.
Assumption 6 (Consistency). \(Y=Y(\overline{A})\).
Assumption 7 (Latent ignorability). For any fixed \(t\), \(\{Z_t,A_t\}\perp\!\!\!\perp Y(\underline{a}_t) \mid H_t,\overline{U}_t\).
Assumption 8 (IV independence). For any fixed \(t\), \(Z_t\perp\!\!\!\perp\overline{U}_t\mid H_t\).
Assumption 9 (IV relevance). For any fixed \(t\), \(a_t\in\mathcal{A}_t\), \(h_t\in\mathcal{H}_t\), \(Z_t\not\perp\!\!\!\perp I\{A_t=a_t\}\mid H_t=h_t.\)
Assumption 10 (Positivity). For each time \(t=0,\ldots,T\), there exists a positive constant \(\epsilon_0\) such that for any \(h_t\in\mathcal{H}_t\) and \(a_t\in\mathcal{A}_t\), \(\mathrm{Var}\!\{\Pr(A_t=a_t\mid Z_t,H_t) \mid H_t=h_t\} \geq \epsilon_0.\)
Assumptions 6–10 can be regarded as a longitudinal extension of Assumptions 1–5. For illustration, Figure 1 displays a sequential directed acyclic graph (DAG) for the IV setting with \(T = 2\) under the one-step Markov property, where Assumptions 7 and 8 hold.
Next, we generalize the definitions of AIV and RWF from the point-exposure setting to accommodate longitudinal data.
Definition 3 (Longitudinal AIV). For any fixed \(t\), we say that \(Z_t\) is an AIV for \(A_t\) if there exist functions \(b_{t,a_t}(\overline{U}_t,H_t)\) and \(c_{t,a_t}(Z_t,H_t)\) such that, for all \(a_t \in \mathcal{A}_t\), \[\Pr(A_t = a_t \mid Z_t, \overline{U}_t, H_t) = b_{t,a_t}(\overline{U}_t, H_t) + c_{t,a_t}(Z_t, H_t).\]
Definition 4 (Longitudinal RWF). A function \(\pi_t(Z_t, H_t)\) is said to be an RWF for \(A_t\) if it is uniformly bounded and, for each \(a_t \in \mathcal{A}_t\), there exists a constant \(\epsilon_0 > 0\) such that \[\left| \mathrm{Cov}\!\left\{ I\{A_t = a_t\}, \pi_t(Z_t, H_t) \mid H_t \right\} \right| \geq \epsilon_0, \quad \text{uniformly over } H_t.\]
These definitions are required for identifying the longitudinal mean potential outcomes. For a fixed sequence of RWFs \(\pi_t(Z_t,H_t)\), define \(\gamma_{T+1,\underline{a}_{T+1}}^o(H_{T+1}) := Y.\) Then, for \(t = T, \ldots, 0\), recursively define the nuisance function \[\gamma_{t,\underline{a}_t}^o(H_t) := \frac{\mathrm{Cov}\!\left\{ I\{A_t = a_t\} \, \gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}), \pi_t(Z_t, H_t) \mid H_t \right\}}{\mathrm{Cov}\!\left\{ I\{A_t = a_t\}, \pi_t(Z_t, H_t) \mid H_t \right\}}.\] Intuitively, \(\gamma_{t,\underline{a}_t}^o(H_t)\) can be interpreted as the conditional mean potential outcome \(\mathbb{E}[Y(\underline{a}_t) \mid H_t]\); this relationship is formally established in the next proposition.
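For instance, with \(T = 1\) and a treatment sequence \((a_0, a_1)\), the recursion unfolds in two steps (a direct specialization of the display above):
\[
\gamma_{1,a_1}^o(H_1)=\frac{\mathrm{Cov}\!\{I\{A_1=a_1\}\,Y,\ \pi_1(Z_1,H_1)\mid H_1\}}{\mathrm{Cov}\!\{I\{A_1=a_1\},\ \pi_1(Z_1,H_1)\mid H_1\}},\qquad
\gamma_{0,\overline{a}}^o(H_0)=\frac{\mathrm{Cov}\!\{I\{A_0=a_0\}\,\gamma_{1,a_1}^o(H_1),\ \pi_0(Z_0,H_0)\mid H_0\}}{\mathrm{Cov}\!\{I\{A_0=a_0\},\ \pi_0(Z_0,H_0)\mid H_0\}},
\]
so that \(\mathbb{E}[Y(a_0,a_1)]=\mathbb{E}[\gamma_{0,\overline{a}}^o(H_0)]\), which corresponds to the case \(s=0\) and \(r=T+1\) discussed below.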
Proposition 8 (Longitudinal AIV identification). Under Assumptions 6–9, let \(0 \le s \le T+1\), \(r \ge 0\), and \(s + r \le T+1\). Suppose that, for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\), and that \(\pi_t(Z_t, H_t)\) is an RWF for \(A_t\). Then, the mean potential outcomes \(\mathbb{E}\bigl[Y(\underline{a}_{s})\bigr]\) can be expressed as \[\label{eq:32identification32AIV32longitudinal} \begin{align} \mathbb{E}\left[ \prod_{t=s}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r}) \right]. \end{align}\qquad{(6)}\]
Notably, consider the special case of Proposition 8 with \(s = 0\) and \(r = 0\), which corresponds to identifying the mean potential outcome from the initial time point through the final time \(T\) without truncation. In particular, setting \(\pi_t(Z_t,H_t) = Z_t\), the identification formula simplifies to \[\begin{align} \mathbb{E}[Y(\overline{a})] = \psi_{\overline{a}}^o := \mathbb{E}\Biggl[ \prod_{t=0}^{T} \frac{(Z_t - \mathbb{E}[Z_t \mid H_t]) \, I\{A_t = a_t\}}{\mathrm{Cov}\!\{ I\{A_t = a_t\}, Z_t \mid H_t \}} \times Y \Biggr].\label{eq:32identification32AIV32longitudinal322} \end{align}\tag{7}\] Alternatively, by setting \(s = 0\) and \(r = T+1\), the identification formula reduces to \(\psi_{\overline{a}}^o = \mathbb{E}\bigl[\gamma_{0,\overline{a}}^o(H_0)\bigr].\)
In this subsection, without loss of generality, we focus on the case where \(Z_t\) is univariate and \(\pi_t(Z_t, H_t)=Z_t\) is an RWF for \(A_t\). We derive the EIFs for \(\psi_{\overline{a}}^o\) in Equation 7 when \(\pi_t(Z_t, H_t) = Z_t\); for a general RWF \(\pi_t(Z_t, H_t)\), we can define \(Z_t^{\pi} := \pi_t(Z_t, H_t)\) and replace \(Z_t\) with \(Z_t^{\pi}\) in the subsequent analysis.
For notational convenience, define \(\gamma_{T+1,\underline{a}_{T+1}}^o(H_{T+1}) := Y\) and \(A_t^{(a_t)} := I\{A_t = a_t\}\). For \(t = T, \ldots, 0\), recursively define the nuisance functions: \[\begin{align} &\kappa_{t,a_t}^o(H_t) := \mathrm{Cov}\!\{A_t^{(a_t)}, Z_t \mid H_t\},\\ &\delta_{t,a_t}^o(H_t) := \mathbb{E}[A_t^{(a_t)} \mid H_t], && \eta_{t,\underline{a}_t}^o(H_t) := \mathbb{E}\bigl[A_t^{(a_t)} \gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}) \mid H_t\bigr],\\ &\rho_t^o(H_t) := \mathbb{E}[Z_t \mid H_t], && \gamma_{t,\underline{a}_t}^o(H_t) := \frac{\mathrm{Cov}\!\{ A_t^{(a_t)} \gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}), Z_t \mid H_t \}}{\mathrm{Cov}\!\{ A_t^{(a_t)}, Z_t \mid H_t \}}. \end{align}\] We summarize these nuisance functions into a unified nuisance vector: \[\begin{align} \label{eq:32nuisance32function32longitudinal} \alpha_{\overline{a}}^o := \{\alpha_{t,\underline{a}_t}^o\}_{t=0}^T, \qquad \alpha_{t,\underline{a}_t}^o := [ \delta_{t,a_t}^o, \kappa_{t,a_t}^o, \rho_t^o, \eta_{t,\underline{a}_t}^o, \gamma_{t,\underline{a}_t}^o]. \end{align}\tag{8}\] We now proceed to derive the EIF for \(\psi_{\overline{a}}^o\) in Equation 7 in the following theorem.
Theorem 6. Under Assumptions 6–9, suppose that for each \(t=0,\ldots,T\), \(\pi_t(Z_t, H_t)=Z_t\) is an RWF for \(A_t\). Then, the EIF for \(\psi_{\overline{a}}^o\) consists of \(\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o)\), where \[\begin{align} &\varphi_{\overline{a}}(O;\psi_{\overline{a}},\alpha_{\overline{a}}):= \prod_{t=0}^{T} \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} Y-\psi_{\overline{a}} +\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\right) \\&\times \left\{ \left(1-\dfrac{\{Z_t-\rho_t(H_t)\}\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}}{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\times\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\right\}. \end{align}\]
The next proposition derives the mixed bias property for the EIF in Theorem 6.
Proposition 9 (Mixed bias property). Under the conditions of Theorem 6, for any fixed nuisance vector \(\alpha_{\overline{a}}\) in Equation 8 , \(\mathbb{E}[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}})]\) equals to \[\begin{align} &\mathbb{E}\left[\begin{array}{l} \displaystyle\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\right) \times \dfrac{1}{\kappa_{t,a_t}(H_t)}\\ \times\left(\begin{array}{l} \left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ +\left\{\rho_t(H_t) - \rho_t^o(H_t)\right\}\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ +\left\{\rho_t(H_t) - \rho_t^o(H_t)\right\}\left\{\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t)\right\} \gamma_{t,\underline{a}_t}(H_t) \end{array}\right) \end{array}\right]. \end{align}\]
We develop a cross-fitting procedure for estimating \(\psi_{\overline{a}}^o\). This estimator can be intuitively understood as resulting from a backward fitting strategy. Let \(\hat{\mathbb{E}}^{(n,k)}[\phi(O) \mid H_t]\) denote an estimate of the conditional expectation \(\mathbb{E}[\phi(O)\mid H_t]\), and \(\widehat{\text{Cov}}^{(n,k)}\{\phi_1(O), \phi_2(O)\mid H_t\}\) denote an estimate of the conditional covariance \(\text{Cov}\{\phi_1(O),\phi_2(O)\mid H_t\}\). Both estimates are obtained using only the observations in \(I_{-k}\) and fitted using an appropriate machine learning method. Algorithm 2 summarizes the backward cross-fitting procedure.
Figure 3 provides a graphical illustration of Algorithm 2 for \(T=2\). The top row represents the evaluation set (\(I_k\)), while the bottom row corresponds to the training set (\(I_{-k}\)). The middle row depicts the nuisance estimators \(\hat{\alpha}_{t,\overline{a}}^{(n,k)}\), which are fitted using the training data and contribute to the final predictions. Importantly, \(\hat{\alpha}_{t,\overline{a}}^{(n,k)}\) is constructed without using any samples from the evaluation set \(I_k\), ensuring their independence from the evaluation data.
In particular, in Algorithm 2, we introduce the random variable \(\hat{\Psi}_{t,\underline{a}_{t}}^{(n,k)}\). By induction, one can verify that for any \(t=0,\ldots,T\), \[\begin{align} &\hat{\Psi}_{t,\underline{a}_{t}}^{(n,k)}=\prod_{s=t}^{T} \dfrac{\left\{Z_s-\hat{\rho}_s^{(n,k)}(H_s)\right\}A_s^{(a_s)}} {\hat{\kappa}_{s,a_s}^{(n,k)}(H_s)} Y +\sum_{s=t}^T\left(\displaystyle\prod_{r=t}^{s-1}\dfrac{\left\{Z_r-\hat{\rho}_r^{(n,k)}(H_r)\right\}A_r^{(a_r)}} {\hat{\kappa}_{r,a_r}^{(n,k)}(H_r)}\right) \\&\times \left\{\begin{array}{c} \left(1-\dfrac{\{Z_s-\hat{\rho}_s^{(n,k)}(H_s)\}\{A_s^{(a_s)}-\hat{\delta}_{s,a_s}^{(n,k)}(H_s)\}}{\hat{\kappa}_{s,a_s}^{(n,k)}(H_s)} \right)\hat{\gamma}_{s,\underline{a}_s}^{(n,k)}(H_s) \\- \dfrac{(Z_s-\hat{\rho}_s^{(n,k)}(H_s))\times\hat{\eta}_{s,\underline{a}_s}^{(n,k)}(H_s)}{\hat{\kappa}_{s,a_s}^{(n,k)}(H_s)} \end{array}\right\}. \end{align}\] Intuitively, \(\hat{\Psi}_{t,\underline{a}_t}^{(n,k)}\) provides an estimate of the true conditional mean potential outcomes \(\gamma_{t,\underline{a}_t}^o(H_t)\). That is, if the nuisance functions all equal to the truth, then \(\mathbb{E}[\hat{\Psi}_{t,\underline{a}_t}^{(n,k)}\mid H_t]=\gamma_{t,\underline{a}_t}^o(H_t)\). This type of estimator is referred to as a DR-Learner (or IF-Learner) in [51]–[53], where its theoretical properties are also established.
In addition, one can verify that \(\varphi_{\overline{a}}(O;\psi_{\overline{a}},\hat{\alpha}_{\overline{a}}^{(n,k)}) = \hat{\Psi}_{0,\overline{a}}^{(n,k)} - \psi_{\overline{a}}.\) This representation naturally motivates the construction of the estimator \(\hat{\psi}_{\overline{a}}^{(n)}\) as defined by the corresponding estimating equation in Algorithm 2 (the longitudinal analogue of Equation 5).
Next, we establish that, under the IV assumptions and the required convergence rates for the nuisance estimators, \(\hat{\psi}_{\overline{a}}^{(n)}\) is asymptotically normal, and its variance estimator is consistent.
Theorem 7 (Asymptotic normality of \(\hat{\psi}_{\overline{a}}^{(n)}\)). Under Assumptions 6–9, suppose that for each \(t = 0, \ldots, T\), \(Z_t\) serves as an AIV for \(A_t\), and that \(\pi_t(Z_t, H_t) = Z_t\) is an RWF for \(A_t\). Further, for each \(t = 0, \ldots, T\) and \(k = 1, \ldots, K\), suppose that the following rate condition holds for the nuisance functions \(\hat{\alpha}_{t,\overline{a}}^{(n,k)}\) defined in Algorithm 2: \[\begin{align} \left\{\begin{array}{l} \|\hat{\kappa}_{t,a_t}^{(n,k)}- \kappa_{t,a_t}^{o}\|_2 \times\|\hat{\gamma}_{t,\underline{a}_t}^{(n,k)}-\gamma_{t,\underline{a}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\eta}_{t,\underline{a}_t}^{(n,k)}-\eta_{t,\underline{a}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\delta}_{t,a_t}^{(n,k)}- \delta_{t,a_t}^{o}\|_2 \end{array}\right\}=o_p(n^{-1/2}). \end{align}\] Furthermore, assume that for any fixed \(k\) and time \(t\), \(\mathbb{E}[\|\hat{\alpha}_{t,\overline{a}}^{(n,k)}- \alpha_{t,\overline{a}}^o\|_2^2]=o(1)\). Then \(\sqrt{n}\{\hat{\psi}_{\overline{a}}^{(n)}-\psi_{\overline{a}}^o\}/\sigma_{\overline{a}}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where \(\hat{\psi}_{\overline{a}}^{(n)}\) is defined by the estimating equation in Algorithm 2, and \((\sigma_{\overline{a}}^o)^2:=\mathbb{E}[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o)^2]\). In addition, \(\hat{\sigma}_{\overline{a}}^{(n)}\) converges in probability to \(\sigma_{\overline{a}}^o\).
We note that one can adaptively select the RWF \(\pi_t(Z_t,H_t)\), as discussed in Section 3, to obtain an efficient estimator of \(\psi_{\overline{a}}^o\). A detailed discussion of this adaptive selection is provided in the appendix. Intuitively, a natural candidate for the adaptive RWF is the conditional probability \(\pi_t^o(Z_t,H_t) = \Pr(A_t = a_t \mid Z_t,H_t)\), which directly characterizes the treatment assignment mechanism given the instrument and history.
Finally, although we focus on the evaluation of static treatment rules, the proposed methods can be extended to the evaluation of dynamic treatment rules, as also demonstrated in the appendix. This extension allows for the incorporation of time-varying treatment strategies and enables more comprehensive assessments of treatment effects over time, enhancing the applicability of our approach to dynamic decision-making processes.
In this section, we conduct simulation studies with a binary treatment and a continuous IV to illustrate the asymptotic results established in Section 3.
We generate \(U\) and \(L\) independently from the uniform distribution \(U(-1,1)\). The continuous IV is constructed as \(Z = L + \sin(3L) + 2\epsilon_Z\), where \(\epsilon_Z\) is an exogenous error term drawn from the standard normal distribution. The binary treatment \(A\) is generated under the following two designs:
[A1] \(A \sim \mathrm{Bernoulli}\big(0.7\Phi(-2Z + 2L) + 0.3\Phi(3U - L)\big)\), where \(\Phi\) denotes the cumulative distribution function of the standard normal distribution.
[A2] \(A \sim \mathrm{Bernoulli}\big(\{1 + \exp(-(Z - L + U))\}^{-1}\big)\).
It is straightforward to verify that \(Z\) is an AIV for \(A\) in [A1], whereas in [A2] the AIV conditions are violated. Next, we independently generate \(\epsilon_Y \sim N(0,1)\). The outcome \(Y\) is then generated according to the following two models:
[Y1] \(Y = Y(A) = 2U - 2L + 4AL + \epsilon_Y\).
[Y2] \(Y = Y(A) = (1 - A)\{3\cos(2U) - 3\cos(2L)\} + A\{3\sin(2U) + 2L\} + \epsilon_Y\).
In [Y1], the condition \(Y(1) - Y(0)\perp\!\!\!\perp U \mid L\) holds, while in [Y2] this condition is violated. We set the sample size to \(n=5000\) and use \(K=2\) folds for cross-fitting. Nuisance functions are estimated via spline methods implemented in the mgcv package in R [54].
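For reference, a minimal sketch of the data-generating step under designs [A1] and [Y1] is given below (plain R, no packages required); the function name is ours.

```r
# Generate one dataset of size n under designs [A1] and [Y1].
gen_data_A1Y1 <- function(n) {
  U <- runif(n, -1, 1)
  L <- runif(n, -1, 1)
  Z <- L + sin(3 * L) + 2 * rnorm(n)                          # continuous IV
  pA <- 0.7 * pnorm(-2 * Z + 2 * L) + 0.3 * pnorm(3 * U - L)  # additive in (Z, U)
  A <- rbinom(n, 1, pA)                                       # design [A1]
  Y <- 2 * U - 2 * L + 4 * A * L + rnorm(n)                   # design [Y1]
  data.frame(Z = Z, A = A, Y = Y, L = L)
}
# dat <- gen_data_A1Y1(5000)
```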
The results are summarized in Table 1. The simulations suggest that both estimators are nearly unbiased under correct model specification, with estimated standard errors closely tracking empirical variability and coverage rates remaining near the nominal 95% level. Under misspecification ([Y2], [A2]), we observe somewhat larger bias and a modest decline in coverage. We also find that the adaptive weighting method generally yields smaller variance, indicating that it tends to be more efficient.
| DGP | Metric | Adaptive: Treated | Adaptive: Control | Adaptive: ATE | Prespecified \(\pi(Z,L)=Z\): Treated | Prespecified \(\pi(Z,L)=Z\): Control | Prespecified \(\pi(Z,L)=Z\): ATE |
|---|---|---|---|---|---|---|---|
| [Y1],[A1] | Bias | .0011 | .0011 | .0019 | .0018 | .0032 | .0028 |
| | SE | .0543 | .0545 | .0807 | .0626 | .0626 | .0920 |
| | SD | .0577 | .0581 | .0844 | .0612 | .0648 | .0922 |
| | CR | .927 | .938 | .939 | .947 | .935 | .950 |
| [Y2],[A1] | Bias | .0023 | .0078 | .0064 | .0043 | .0044 | .0012 |
| | SE | .0864 | .0611 | .1060 | .1006 | .0691 | .1220 |
| | SD | .0906 | .0637 | .1092 | .0987 | .0703 | .1200 |
| | CR | .926 | .946 | .941 | .951 | .951 | .952 |
| [Y1],[A2] | Bias | .0006 | .0020 | .0013 | .0006 | .0013 | .0003 |
| | SE | .0554 | .0561 | .0820 | .0565 | .0569 | .0833 |
| | SD | .0547 | .0575 | .0830 | .0563 | .0557 | .0828 |
| | CR | .954 | .934 | .938 | .956 | .956 | .944 |
| [Y2],[A2] | Bias | .0000 | .0318 | .0315 | .0027 | .0269 | .0301 |
| | SE | .0886 | .0615 | .1079 | .0908 | .0621 | .1100 |
| | SD | .0891 | .0619 | .1089 | .0915 | .0618 | .1110 |
| | CR | .948 | .912 | .936 | .955 | .924 | .941 |
To simplify the setting, we fix the time horizon at \(T = 1\). For each time point \(t = 0,\,1\), we independently generate noise terms \(\epsilon_{U_t}\), \(\epsilon_{L_t}\), \(\epsilon_{Z_t}\), and \(\epsilon_Y\) from standard normal distributions. Based on these, the variables are simulated according to the following DGP: \[\begin{align} &L_0 \sim 1.5\epsilon_{L_0},\quad U_0 \mid H_0 \sim 1.5\epsilon_{U_0};\\ &Z_0 \mid H_0,U_0 \sim 0.3 L_0 + \sin(1.5 L_0)+ 2\epsilon_{Z_0};\\ &A_0 \mid H_0,U_0,Z_0 \sim \text{Ber}(1,\,0.7\Phi(-2Z_0+0.6L_0) + 0.3\Phi(3U_0-L_0));\\ &L_1 \mid H_0,U_0,Z_0,A_0 \sim (A_0-0.5)+0.5L_0+0.3U_0+0.5 \epsilon_{L_1};\\ &U_1 \mid H_1,U_0 \sim (A_0-0.5)+0.5U_0+0.3L_1+0.5 \epsilon_{U_1};\\ &Z_1 \mid H_1,\overline{U}_1 \sim 0.5L_1-0.5(A_0-0.5)-0.3Z_0+2\epsilon_{Z_1};\\ &A_1 \mid H_1,\overline{U}_1,Z_1 \sim \text{Ber}(1,\,0.7\Phi(-2Z_1+L_1) + 0.3\Phi(3U_1-L_1));\\ &Y \mid H_1,\overline{U}_1,Z_1,A_1 \sim (A_1-0.5)+2L_1+U_1+0.5\epsilon_{Y}. \end{align}\] Notably, “Ber” represents the binomial distribution. \(Z_0\) and \(Z_1\) serve as AIVs for \(A_0\) and \(A_1\), respectively.
Data are generated using the R package simcausal [55]. The sample sizes are set to \(2000\) and \(5000\), with cross-fitting performed using \(K=2\) folds. Nuisance functions are estimated via spline methods implemented in the R package mgcv. Because the true treatment effects are not analytically available under the constructed DGPs, we estimate them using 100,000 samples by simulating potential outcomes under modified data-generating processes, where the pairs of treatment assignments \((A_0,A_1)\) are set deterministically to \((0,0)\), \((1,1)\), \((0,1)\), \((1,0)\), \((A_0,1)\), and \((A_0,0)\) (we use the notation \((A_0,0)\) to denote a dynamic treatment regime that follows the natural treatment rule for \(A_0\) while fixing \(A_1 = 0\)), as sketched in the code below.
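A rough sketch of how these ground-truth values could be approximated by Monte Carlo under the DGP displayed above (plain R; the function name is ours, and NA entries keep the natural treatment rule):

```r
# Monte Carlo approximation of E[Y(a0, a1)] under the longitudinal DGP above.
truth_mc <- function(a0 = NA, a1 = NA, n = 1e5) {
  L0 <- 1.5 * rnorm(n)
  U0 <- 1.5 * rnorm(n)
  Z0 <- 0.3 * L0 + sin(1.5 * L0) + 2 * rnorm(n)
  A0 <- rbinom(n, 1, 0.7 * pnorm(-2 * Z0 + 0.6 * L0) + 0.3 * pnorm(3 * U0 - L0))
  if (!is.na(a0)) A0 <- rep(a0, n)                  # intervene on A0
  L1 <- (A0 - 0.5) + 0.5 * L0 + 0.3 * U0 + 0.5 * rnorm(n)
  U1 <- (A0 - 0.5) + 0.5 * U0 + 0.3 * L1 + 0.5 * rnorm(n)
  Z1 <- 0.5 * L1 - 0.5 * (A0 - 0.5) - 0.3 * Z0 + 2 * rnorm(n)
  A1 <- rbinom(n, 1, 0.7 * pnorm(-2 * Z1 + L1) + 0.3 * pnorm(3 * U1 - L1))
  if (!is.na(a1)) A1 <- rep(a1, n)                  # intervene on A1
  Y  <- (A1 - 0.5) + 2 * L1 + U1 + 0.5 * rnorm(n)
  mean(Y)
}
# e.g. truth_mc(1, 0) approximates E[Y(1, 0)]; truth_mc(NA, 0) the regime (A0, 0).
```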
Table 2 summarizes the simulation results adopting Algorithm 2 under different intervention strategies and sample sizes. Across all scenarios, the estimators display small bias, and the estimated standard errors closely match the empirical standard deviations, indicating accurate variance estimation. As expected, increasing the sample size from 2000 to 5000 reduces variability and tightens confidence intervals. The empirical coverage rates are generally close to the nominal 95% level, though a modest decline is observed in certain intervention settings. Overall, the results demonstrate that the proposed method performs well and provides reliable inference in most cases.
| Size | Metric | \((0,0)\) | \((0,1)\) | \((1,0)\) | \((1,1)\) | \((A_0,0)\) | \((A_0,1)\) |
|---|---|---|---|---|---|---|---|
| 2000 | Bias | .0234 | .0071 | .0147 | .0189 | .0016 | .0033 |
| | SE | .6090 | .5267 | .5266 | .5960 | .1401 | .1403 |
| | SD | .6102 | .5382 | .5203 | .6343 | .1395 | .1460 |
| | CR | .955 | .959 | .960 | .938 | .938 | .929 |
| 5000 | Bias | .0134 | .0078 | .0057 | .0155 | .0057 | .0037 |
| | SE | .2143 | .1882 | .1962 | .2115 | .0658 | .0651 |
| | SD | .2211 | .1949 | .1870 | .2004 | .0607 | .0612 |
| | CR | .940 | .930 | .942 | .956 | .956 | .950 |
We apply our framework to study returns to schooling and post-school training as sequential treatments and to conduct a policy analysis, using the dataset provided in the supplementary materials of [56]. Schooling and post-school training are two central interventions influencing labor market outcomes such as earnings and employment [57]. To enable such analysis, [56] merged data from the Job Training Partnership Act (JTPA) Title II with additional sources on high school (HS) education, thereby constructing a dataset suitable for evaluating the effects of HS diplomas and subsidized job training as sequential treatments. The final sample comprises 9,223 individuals.
We now describe the key features of this dataset. Let \(A_0\) denote whether an individual obtains a high school (HS) diploma, and let \(A_1\) indicate participation in the job training program. Define \(L_0\) as sex (a baseline covariate) and \(L_1\) as pre-program earnings, which serve as time-varying confounders. The initial treatment \(A_0\) influences subsequent pre-program earnings \(L_1\), and the allocation of \(A_1\) may adapt to \(L_1\). The instruments are given by \(Z_0\), the number of high schools per square mile, and \(Z_1\), a random assignment to job training. Our target outcome is \(Y\), the indicator that the potential terminal earnings exceed their empirical median.
We consider the dynamic treatment regime (DTR) \((g_0, g_1) \in \{0,1,\text{x}\} \times \{0,1,d^+,d^-\}\). For the first-stage rule \(g_0\), the value ‘0’ assigns \(A_0=0\), ‘1’ assigns \(A_0=1\), and ‘x’ follows the natural selection rule (i.e., the observed assignment). For the second-stage rule \(g_1\), the value ‘0’ assigns \(A_1=0\), ‘1’ assigns \(A_1=1\), ‘\(d^+\)’ assigns \(A_1=1\) only when \(L_1\) is below the 80% quantile, and ‘\(d^-\)’ assigns \(A_1=1\) only when \(L_1\) is above the 80% quantile. The target is to estimate \(\mathbb{E}[Y(g_0(H_0),g_1(H_1))]\). We use the spline methods in the R package mgcv to estimate the nuisance parameters and compute the bootstrap mean and standard deviation based on 1,000 replications.
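As an illustration of how the second-stage rules might be encoded (a sketch under the assumption that the threshold is the empirical 80% quantile of pre-program earnings; function and variable names are ours):

```r
# Encode the second-stage rules d+ and d- given pre-program earnings L1.
dtr_stage2 <- function(L1, rule = c("d+", "d-")) {
  rule <- match.arg(rule)
  q80 <- quantile(L1, 0.8)
  if (rule == "d+") as.numeric(L1 < q80)    # treat only low earners
  else              as.numeric(L1 >= q80)   # treat only high earners
}
```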
| DTRs | 00 | 01 | 0\(d^+\) | 0\(d^-\) | 10 | 11 | 1\(d^+\) | 1\(d^-\) | x0 | x1 | x\(d^+\) | x\(d^-\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EST | .292 | .321 | .343 | .264 | .653 | .668 | .732 | .589 | .485 | .550 | .548 | .487 |
| SE | .124 | .089 | .094 | .120 | .155 | .122 | .132 | .147 | .017 | .012 | .013 | .016 |
| SD | .108 | .074 | .081 | .103 | .129 | .089 | .102 | .126 | .017 | .011 | .012 | .015 |
The results are reported in Table 3, which presents the estimated mean potential outcomes (EST), the mean estimated standard errors (SE), and the empirical standard deviations (SD) for various DTRs. The DTRs are denoted by combinations of values in \(\{0,1,\text{x}\}\) for the first-stage rule \(g_0\) and \(\{0,1,d^+,d^-\}\) for the second-stage rule \(g_1\), yielding 12 distinct combinations.
We observe that the SE and SD for (x0, x1, x\(d^+\), x\(d^-\)) are smaller than those for the other DTRs. This is likely due to the relatively weak correlation between \(Z_0\) and \(A_0\): rules that intervene on \(A_0\) must rely on this weak first-stage instrument, whereas the natural rule ‘x’ does not. Furthermore, the SE and SD values for the same DTR are fairly consistent with each other, supporting the validity of the variance estimator in our algorithm.
Turning to the EST values, the DTRs involving the first-stage rule ‘1’ (10, 11, 1\(d^+\), 1\(d^-\)) generally show higher estimates than those involving ‘0’ (00, 01, 0\(d^+\), 0\(d^-\)), with the natural rules (x0, x1, x\(d^+\), x\(d^-\)) falling in between, suggesting that obtaining a high school diploma at the first stage has a positive influence on terminal income.
On the other hand, the estimates for DTRs (0\(d^+\), 1\(d^+\), x\(d^+\)), which assign the training program only to low-earning individuals, are higher than those for DTRs (00, 10, x0) and (01, 11, x1), whereas the estimates for DTRs (0\(d^-\), 1\(d^-\), x\(d^-\)), which assign the program only to high-earning individuals, are lower. This pattern is consistent with the findings of [56], indicating that targeting the job-training program at low-earning individuals has a positive influence on terminal income.
In this article, we develop an AIV framework for identifying causal effects with multi-categorical or continuous IVs and treatment variables. We elucidate the connection between classical TSLS estimation and the identification of causal estimands under the IV framework. Our methods are illustrated in several classical causal inference settings, including marginal structural models (MSMs) and longitudinal data. Furthermore, we analyze the efficiency, asymptotic normality, and asymptotic unbiasedness of the proposed estimators when different RWFs are employed to estimate the ATE.
Looking ahead, a promising direction is to identify other conditions, analogous to the AIV condition, that guarantee the existence of a solution to the nonparametric IV problem in Equation 1. Additionally, accommodating right censoring and estimating counterfactual survival curves under a general IV framework constitute important directions for future research [58]. Furthermore, given the proposed identification result for continuous treatments, it would be of interest to develop a debiased learning approach by leveraging the techniques of [59]. Finally, identifying the optimal weighting function under more general settings within the proposed framework represents another avenue for further investigation.
The code for the simulation studies and the real data analysis is publicly available at https://github.com/chensy123-sys/Additive-IV.
In this subsection, we illustrate the identification strategy under a dynamic treatment regime. Let \(\overline{g} := [g_0, \ldots, g_T]\) denote a sequence of deterministic DTRs, where each mapping satisfies \(g_t: \mathcal{H}_t \rightarrow \mathcal{A}_t\). Define the potential outcome under regime \(\overline{g}\) as \[Y(\overline{g}) := Y\{g_0(H_0), \ldots, g_T(H_T)\},\] that is, the outcome observed when the subject follows the treatment rule \(A_t = g_t(H_t)\) at each time \(t\). Similarly, for \(s = 0, \ldots, T\), define \[Y(\underline{g}_s) := Y(A_0, \ldots, A_{s-1}, g_s(H_s), \ldots, g_T(H_T))\] as the potential outcome when the subject follows the rule \(A_t = g_t(H_t)\) for all \(t \ge s\).
Without loss of generality, we assume that \(\pi_t(Z_t, H_t) = Z_t\), which is taken to be an RWF. We now introduce the nuisance functions used in the dynamic treatment setting. Define \(\gamma_{T+1,\underline{g}_{T+1}}^o(H_{T+1}) := Y\). For \(t = T, \ldots, 0\), recursively define the nuisance functions: \[\begin{align} &\delta_{t,g_t}^o(H_t) := \mathbb{E}[I\{A_t=g_t(H_t)\} \mid H_t], \\ & \eta_{t,\underline{g}_t}^o(H_t) := \mathbb{E}\bigl[I\{A_t=g_t(H_t)\} \gamma_{t+1,\underline{g}_{t+1}}^o(H_{t+1}) \mid H_t\bigr],\\ &\rho_t^o(H_t) := \mathbb{E}[Z_t \mid H_t], \\ &\kappa_{t,g_t}^o(H_t) := \mathrm{Cov}\!\{I\{A_t=g_t(H_t)\}, Z_t \mid H_t\}, \\ & \gamma_{t,\underline{g}_t}^o(H_t) := \frac{\mathrm{Cov}\!\{ I\{A_t=g_t(H_t)\} \gamma_{t+1,\underline{g}_{t+1}}^o(H_{t+1}), Z_t \mid H_t \}}{\mathrm{Cov}\!\{ I\{A_t=g_t(H_t)\}, Z_t \mid H_t \}}. \end{align}\] Denote the nuisance vector as \[\begin{align} \alpha_{\overline{g}}^o := \{\alpha_{t,\underline{g}_t}^o\}_{t=0}^T, \qquad \alpha_{t,\underline{g}_t}^o := [ \delta_{t,g_t}^o, \kappa_{t,g_t}^o, \rho_t^o, \eta_{t,\underline{g}_t}^o, \gamma_{t,\underline{g}_t}^o]. \end{align}\] Then the identification strategy is derived in the next proposition.
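Before stating the proposition, we note that each backward step of this recursion involves only conditional means given \(H_t\). The following is a minimal sketch, under hypothetical column names (Zt, At, and two history summaries H1, H2), of how one such step could be fitted with mgcv smoothers; it is an illustration rather than the implementation used in our experiments.

```r
# A minimal sketch of one backward step of the recursion; Zt, At, H1, H2 are
# hypothetical column names, and gamma_next plays the role of gamma_{t+1}.
library(mgcv)

backward_step <- function(dat, gamma_next, g_t) {
  # dat: data frame with the time-t instrument Zt, treatment At, and history H1, H2
  # gamma_next: gamma_{t+1, g_{t+1}}(H_{t+1}) computed at the previous (later) step;
  #             at t = T this is simply the outcome Y
  # g_t: vector of treatments assigned by the rule g_t(H_t)
  dat$Ig  <- as.numeric(dat$At == g_t)   # I{A_t = g_t(H_t)}
  dat$W   <- dat$Ig * gamma_next         # I{A_t = g_t(H_t)} * gamma_{t+1}
  dat$IgZ <- dat$Ig * dat$Zt
  dat$WZ  <- dat$W * dat$Zt

  rho   <- fitted(gam(Zt  ~ s(H1) + s(H2), data = dat))                # rho_t(H_t)
  delta <- fitted(gam(Ig  ~ s(H1) + s(H2), data = dat))                # delta_{t,g_t}(H_t)
  eta   <- fitted(gam(W   ~ s(H1) + s(H2), data = dat))                # eta_{t,g_t}(H_t)
  kappa <- fitted(gam(IgZ ~ s(H1) + s(H2), data = dat)) - delta * rho  # Cov{I{A_t=g_t}, Z_t | H_t}
  num   <- fitted(gam(WZ  ~ s(H1) + s(H2), data = dat)) - eta * rho    # Cov{I{A_t=g_t} gamma_{t+1}, Z_t | H_t}

  list(rho = rho, delta = delta, eta = eta, kappa = kappa, gamma = num / kappa)
}
```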
Proposition 10. Under Assumptions 6–9, let \(0 \le s \le T+1\), \(r \ge 0\), and \(s + r \le T+1\). Suppose that, for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\), and that \(\pi_t(Z_t, H_t)\) is an RWF for \(A_t\). Then, \[\begin{align} \mathbb{E}\bigl[Y(\underline{g}_{s})\bigr]= \mathbb{E}\left[ \prod_{t=s}^{T-r} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)I\{A_t=g_t(H_t)\}}{ \kappa_{t,g_t}^o(H_t) } \gamma_{T+1-r,\underline{g}_{T+1-r}}^o(H_{T+1-r}) \right]. \end{align}\]
The proof of Proposition 10 is similar to that of Proposition 8 and is thus omitted. Setting \(s=0\) and \(r=0\), we see that the mean potential outcome can be identified as \[\begin{align} \mathbb{E}\bigl[Y(\overline{g})\bigr]=\psi_{\overline{g}}^o:= \mathbb{E}\left[ \prod_{t=0}^{T} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)I\{A_t=g_t(H_t)\}} { \kappa_{t,g_t}^o(H_t) }Y \right]. \end{align}\] Next, we derive the EIF for \(\psi_{\overline{g}}^o\).
Theorem 8. Under Assumptions 6–9, suppose that for each \(t=0,\ldots,T\), \(\pi_t(Z_t, H_t)=Z_t\) is an RWF for \(A_t\). Then, if we define \(A_t^{(g_t)}:=I\{A_t = g_t(H_t)\}\), the EIF for \(\psi_{\overline{g}}^o\) consists of \(\varphi_{\overline{g}}(O;\psi_{\overline{g}}^o,\alpha_{\overline{g}}^o)\), where \[\begin{align} &\varphi_{\overline{g}}(O;\psi_{\overline{g}},\alpha_{\overline{g}}):= \prod_{t=0}^{T} \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(g_t)}} {\kappa_{t,g_t}(H_t)} Y-\psi_{\overline{g}} +\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(g_s)}} {\kappa_{s,g_s}(H_s)}\right) \\&\times \left\{ \left(1-\dfrac{\{Z_t-\rho_t(H_t)\}\{A_t^{(g_t)}-\delta_{t,g_t}(H_t)\}}{\kappa_{t,g_t}(H_t)} \right)\gamma_{t,\underline{g}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\times\eta_{t,\underline{g}_t}(H_t)}{\kappa_{t,g_t}(H_t)}\right\}. \end{align}\]
For estimation, we can replace \(A_t^{(a_t)}\) with \(A_{t}^{(g_t)}=I\{A_t=g_t(H_t)\}\), and \(\underline{a}_t\) with \(\underline{g}_t\) in Algorithm 2, and construct the corresponding backward cross-fitting algorithm for dynamic treatment regimes. Concretely, Algorithm 4 can be viewed as a generalization of Algorithm 2 to estimating the mean potential outcomes under the DTRs \(\overline{g}\). An analogous version of Theorem 7 can be derived for the potential outcome mean estimator \(\hat{\psi}_{\overline{g}}^{(n)}\) in Algorithm 4, establishing the asymptotic consistency and normality of our proposed estimator.
Theorem 9 (Asymptotic normality of \(\hat{\psi}_{\overline{g}}^{(n)}\) in Algorithm 4). Under Assumptions 6–10, suppose that for each \(t = 0, \ldots, T\), \(Z_t\) serves as an AIV for \(A_t\). Further, for each \(t = 0, \ldots, T\) and \(k = 1, \ldots, K\), suppose that the following rate condition holds for the nuisance functions \(\hat{\alpha}_{t,\overline{g}}^{(n,k)}\) defined in Algorithm 4: \[\begin{align} \left\{\begin{array}{l} \|\hat{\kappa}_{t,g_t}^{(n,k)}- \kappa_{t,g_t}^{o}\|_2 \times\|\hat{\gamma}_{t,\underline{g}_t}^{(n,k)}-\gamma_{t,\underline{g}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\eta}_{t,\underline{g}_t}^{(n,k)}-\eta_{t,\underline{g}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\delta}_{t,g_t}^{(n,k)}- \delta_{t,g_t}^{o}\|_2 \end{array}\right\}=o_p(n^{-1/2}). \end{align}\] Furthermore, assume that for any fixed \(k\) and time \(t\), \(\mathbb{E}[\|\hat{\alpha}_{t,\overline{g}}^{(n,k)}- \alpha_{t,\overline{g}}^o\|_2^2]=o(1)\). Then \(\sqrt{n}\{\hat{\psi}_{\overline{g}}^{(n)}-\psi_{\overline{g}}^o\}/\sigma_{\overline{g}}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where \(\hat{\psi}_{\overline{g}}^{(n)}\) is the estimator constructed in Algorithm 4, and \((\sigma_{\overline{g}}^o)^2:=\mathbb{E}[\varphi_{\overline{g}}(O;\psi_{\overline{g}}^o,\alpha_{\overline{g}}^o)^2]\). In addition, \(\hat{\sigma}_{\overline{g}}^{(n)}\) converges in probability to \(\sigma_{\overline{g}}^o\).
The proofs of the two theorems above are omitted, since they are similar to those of Theorems 4 and 6.
In Section 3, we discussed that one can adaptively choose the weighting function by setting \(\pi^o(Z,L)=\mathbb{E}[A\mid Z,L]\) when estimating the ATE of interest. In this subsection, we generalize this strategy to longitudinal data. First, we define the nuisance functions as \[\begin{align} &\pi_{t,a_t}^o(Z_t, H_t):=\mathbb{E}[A_t^{(a_t)}\mid Z_t, H_t],\\ &\delta_{t,a_t}^o(H_t):=\mathbb{E}[A_t^{(a_t)}\mid H_t],\\ &\kappa_{t,a_t}^o(H_t):=\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t)\mid H_t\}, \\ &\xi_{t,\underline{a}_t}^o(Z_t,H_t):=\mathbb{E}[A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\mid Z_t,H_t], \\ &\eta_{t,\underline{a}_t}^o(H_t):=\mathbb{E}[A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\mid H_t],\\ &\gamma_{t,\underline{a}_t}^o(H_t):=\dfrac{\mathrm{Cov}\!\{A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}), \pi_{t,a_t}^o(Z_t,H_t) \mid H_t\}}{\mathrm{Var}\!\{ \pi_{t,a_t}^o(Z_t,H_t) \mid H_t\}}. \end{align}\] Define the nuisance vector as \[\begin{align} \beta_{\overline{a}} := \{\beta_{t,\underline{a}_t}\}_{t=0}^T,\quad\beta_{t,\underline{a}_t} := [\pi_{t,a_t}, \delta_{t,a_t}, \kappa_{t,a_t}, \xi_{t,\underline{a}_t}, \eta_{t,\underline{a}_t}, \gamma_{t,\underline{a}_t}]. \end{align}\]
Following the same logic as in Proposition 8, we derive the following proposition, which enables identification using the adaptively selected weights.
Proposition 11. Under Assumptions 6–10, let \(0 \le s \le T+1\), \(r \ge 0\), and \(s + r \le T+1\). Suppose that, for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\). Then, the mean potential outcomes \(\mathbb{E}\bigl[Y(\underline{a}_{s})\bigr]\) can be expressed as \[\begin{align} \mathbb{E}\left[ \left(\prod_{t=s}^{T-r} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)}\right) \times \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r}) \right]. \end{align}\]
The proof of this proposition is similar to that of Proposition 8 and is thus omitted. In particular, one can identify the mean potential outcome \(\mathbb{E}[Y(\overline{a})]\) by \[\begin{align} \psi_{\overline{a},ada}^o := \mathbb{E}\left[ \prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \right]. \label{eq:32identification32AIV32longitudinal39322} \end{align}\tag{9}\] A single-period plug-in sketch of this identification formula is given below; the next theorem then derives the EIF for \(\psi_{\overline{a},ada}^o\).
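As an aside, the following is a minimal single-period (\(T=0\)) plug-in sketch of Equation 9, with hypothetical column names Z (a continuous instrument), A, L, and Y, and a fixed treatment level a; it illustrates the formula only and ignores cross-fitting and debiasing.

```r
# Single-period plug-in sketch of Equation 9; column names Z, A, L, Y are hypothetical.
library(mgcv)

psi_ada_plugin <- function(dat, a) {
  dat$Ia     <- as.numeric(dat$A == a)                         # A^{(a)} = I{A = a}
  dat$pihat  <- fitted(gam(Ia ~ s(Z) + s(L), data = dat))      # pi_a(Z, L) = E[A^{(a)} | Z, L]
  dat$pihat2 <- dat$pihat^2
  delta      <- fitted(gam(Ia ~ s(L), data = dat))             # delta_a(L) = E[A^{(a)} | L]
  # Var{pi_a(Z, L) | L}; note E[pi_a(Z, L) | L] = delta_a(L) by the tower property
  kappa      <- fitted(gam(pihat2 ~ s(L), data = dat)) - delta^2
  mean((dat$pihat - delta) / kappa * dat$Ia * dat$Y)           # plug-in estimate of E[Y(a)]
}
```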
Theorem 10. Under Assumptions 6–10, suppose that for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\). Then, the EIF for \(\psi_{\overline{a},ada}^o\) consists of \(\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)\), where \[\begin{align} &\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada},\beta_{\overline{a}}):= \prod_{t=0}^{T} \frac{\pi_{t,a_t}(Z_t,H_t) - \delta_{t,a_t}(H_t)}{\kappa_{t,a_t}(H_t)}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}\\ &+\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}(Z_s,H_s) - \delta_{s,a_s}(H_s)} {\kappa_{s,a_s}(H_s)}A_s^{(a_s)}\right)\times \dfrac{1}{\kappa_{t,a_t}(H_t)}\\ &\times \left\{\begin{array}{l} \{A_t^{(a_t)} - \pi_{t,a_t}(Z_t,H_t)\} \xi_{t,\underline{a}_t}(Z_t,H_t) -\{A_t^{(a_t)} - \delta_{t,a_t}(H_t)\} \eta_{t,\underline{a}_t}(H_t)\\ +\gamma_{t,\underline{a}_t}(H_t) \times\left(\kappa_t(H_t)+\{A_t^{(a_t)} - \pi_{t,a_t}(Z_t,H_t)\}^2-\{A_t^{(a_t)} - \delta_{t,a_t}(H_t)\}^2\right) \end{array}\right\}. \end{align}\]
Algorithm 5 presents the backward cross-fitting procedure for constructing a debiased estimator of \(\psi_{\overline{a},ada}^o\). Finally, we establish the asymptotic normality of the resulting estimator \(\hat{\psi}_{\overline{a},ada}^{(n)}\) of the functional in Equation 9.
Theorem 11 (Asymptotic normality of \(\hat{\psi}_{\overline{a},ada}^{(n)}\) in Algorithm 5). Under Assumptions 6–10, suppose that \(Z_t\) is an AIV for \(A_t\). Assume that for any \(k=1,\ldots,K\) and \(t=0,\ldots,T\), \(\mathbb{E}[\|\hat{\beta}_{t,\underline{a}_t}^{(n,k)}-\beta_{t,\underline{a}_t}^o\|_2^2]=o(1)\), and that \[\begin{align} \left\{ \begin{array}{l} \|\hat{\gamma}_{t,\underline{a}_t}^{(n,k)}-\gamma_{t,\underline{a}_t}^o\|_2 \times \|\hat{\kappa}_{t,a_t}^{(n,k)}-\kappa_{t,a_t}^o\|_2\\ +\|\hat{\delta}_{t,a_t}^{(n,k)}-\delta_{t,a_t}^o\|_2^2+\|\hat{\pi}_{t,a_t}^{(n,k)}-\pi_{t,a_t}^o\|_2^2\\ +\|\hat{\xi}_{t,\underline{a}_t}^{(n,k)}-\xi_{t,\underline{a}_t}^o\|_2\times \|\hat{\pi}_{t,a_t}^{(n,k)}-\pi_{t,a_t}^o\|_2\\ +\|\hat{\eta}_{t,\underline{a}_t}^{(n,k)}-\eta_{t,\underline{a}_t}^o\|_2\times \|\hat{\delta}_{t,a_t}^{(n,k)}-\delta_{t,a_t}^o\|_2 \end{array}\right\} =o_p(n^{-1/2}). \end{align}\] Then \(\sqrt{n}\left(\hat{\psi}_{\overline{a},ada}^{(n)}-\psi_{\overline{a},ada}^o\right)/\sigma_{\overline{a},ada}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where the asymptotic variance is defined as \((\sigma_{\overline{a},ada}^o)^2:=\mathbb{E}[\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)^2]\). In addition, the variance estimator \((\hat{\sigma}_{\overline{a},ada}^{(n)})^2\) from Algorithm 5 converges in probability to \((\sigma_{\overline{a},ada}^o)^2\).
The proof of this theorem is omitted, since it is similar to the proof of Theorem 4.
The AIV condition is relatively strong and may fail to hold in some empirical settings. As an alternative, we consider an identification strategy based on a natural generalization of the multiplicative IV framework for binary instruments, originally proposed by [20]. This generalized framework relaxes the strict additive structure required by the AIV condition, while preserving the essential exclusion and relevance properties of a valid instrument. It highlights that the AIV condition is not the only means to ensure the well-posedness of the nonparametric IV problem in Equation 1 .
Definition 5 (Multiplicative IV). For each \(a\in\mathcal{A}\), we say that \(Z\) is a multiplicative IV (MIV) for \(A = a\) if there exist functions \(b(U,L)\) and \(c(Z,L)\) such that \[\Pr(A\neq a\mid Z,U,L) = b(U,L)\cdot c(Z,L).\]
The MIV condition implies that, conditional on the observed confounders \(L\), the instrument–treatment association on the multiplicative scale is unaffected by unmeasured confounding, effectively ruling out any \(U\)–\(Z\) interaction. Accordingly, it relies on the analyst’s ability to observe and adjust for a sufficiently rich set of covariates to ensure that the instrument’s effect on treatment remains stable across levels of the hidden confounder. The following proposition presents the resulting identification strategy under an MIV.
Proposition 12 (MIV identification). Under Assumptions 1–4, for \(a\in\mathcal{A}\), assume that \(Z\) is an MIV for \(A=a\), and that \(\pi(Z,L)\) is an RWF. Then there exists a unique solution \(f_a^o(A^{(a)}, L)\) to Equation 1 , which has explicit form \[\begin{align} &f_a^o(0,L)=\mathbb{E}[Y(a)\mid L]-\mathbb{E}[Y(a)\mid L,A\neq a],\\ &f_a^o(1,L)=\mathbb{E}[Y(a)\mid L]. \end{align}\] In particular, for any regular \(\pi(Z,L)\) for \(A=a\), it holds that \[\begin{align} \mathbb{E}[Y(a)]=\psi_{a,MIV}^o:=\mathbb{E}\left[(1-A^{(a)})\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}}+A^{(a)}Y\right]. \end{align}\]
Example 1 (Binary IV and treatment). Without loss of generality, set \(\pi(Z,L)=Z\). Under Assumptions 1–4, if \(Z\) is an MIV for \(A\), one can recover a variation of the results in [20]. Specifically, Proposition 12 asserts that \[\begin{align} &\mathbb{E}[Y(1)] =\mathbb{E}\left[AY+(1-A)\frac{\mathbb{E}[AY\mid Z=1,L] - \mathbb{E}[AY\mid Z=0,L]}{\mathbb{E}[A\mid Z=1,L] - \mathbb{E}[A\mid Z=0,L]}\right]. \end{align}\]
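A minimal plug-in sketch of this display, with a binary instrument Z, binary treatment A, a single covariate L, and outcome Y (all hypothetical column names), is as follows; it uses simple mgcv regressions for the conditional means and is an illustration only.

```r
# Plug-in sketch of E[Y(1)] under the MIV condition; Z, A, L, Y are hypothetical column names.
library(mgcv)

miv_plugin <- function(dat) {
  dat$AY <- dat$A * dat$Y
  mAY <- gam(AY ~ Z + s(L), data = dat)                           # E[AY | Z, L]
  mA  <- gam(A  ~ Z + s(L), data = dat)                           # E[A  | Z, L]
  d1  <- transform(dat, Z = 1)
  d0  <- transform(dat, Z = 0)
  num <- predict(mAY, newdata = d1) - predict(mAY, newdata = d0)  # E[AY | Z=1, L] - E[AY | Z=0, L]
  den <- predict(mA,  newdata = d1) - predict(mA,  newdata = d0)  # E[A  | Z=1, L] - E[A  | Z=0, L]
  mean(dat$A * dat$Y + (1 - dat$A) * num / den)                   # plug-in estimate of E[Y(1)]
}
```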
At this stage, one might regard the MIV condition as a viable alternative to the AIV condition. However, as we demonstrate in the subsequent subsections, the AIV condition possesses several desirable properties that are not, in general, ensured under the MIV condition. This highlights the distinctive advantages of the AIV condition.
Without loss of generality, we just take \(\pi(Z,L)=Z\). Denote the nuisance functions as \[\begin{align} &\delta_a^o(L) := \mathbb{E}[A^{(a)} \mid L], && \eta_a^o(L) := \mathbb{E}[A^{(a)}Y \mid L],\\ &\kappa_{a}^o(L) := \mathrm{Cov}\!\{A^{(a)}, Z \mid L\}, && \zeta_{a}^o(L) := \mathbb{E}[A^{(a)}Y Z \mid L],\\ &\rho^o(L) := \mathbb{E}[Z \mid L], && \gamma_{a}^o(L) := \dfrac{\mathrm{Cov}\!\{A^{(a)}Y, Z \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, Z \mid L\}}. \end{align}\] We can unify these nuisance functions into a nuisance vector as \[\begin{align} \alpha_{a,MIV}^o = [\delta_a^o,\eta_a^o,\kappa_{a}^o, \zeta_{a}^o,\rho^o,\gamma_{a}^o]. \end{align}\] Next, we derive the EIF for \(\psi_{a,MIV}^o\).
Theorem 12. Under Assumptions 1–4, for \(a\in\mathcal{A}\), assume that \(\pi(Z,L)=Z\) is an RWF for \(A=a\). Then, the EIF for \(\psi_{a,MIV}^o\) is \[\begin{align} \varphi_{a,MIV}(O;\psi_{a,MIV}^o,\alpha_{a,MIV}^o)=& (1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left(A^{(a)}Y-\eta_a^o(L)\right)(Z-\rho^o(L))\\ &-\dfrac{(1-\delta_a^o(L))}{\kappa_a^o(L)}\gamma_a^o(L) \left(A^{(a)}-\delta_a^o(L)\right)(Z - \rho^o(L)). \end{align}\]
As an extension of Proposition 3, we identify the parameter of interest in the parametric marginal structural mean model, specified for each \(a \in \mathcal{A}\), \[\label{eq:32MSM} \mathbb{E}[Y(a) \mid V] = g(a, V; \psi_{MSM}^o),\tag{10}\] where \(g(a, V; \psi_{MSM}^o)\) is a known function, \(V\) is a subset of the observed confounders \(L\), and \(\psi_{MSM}^o \in \mathbb{R}^q\) is the finite-dimensional parameter of interest. This type of model has been extensively studied in [16], [17], [60], [61]. Specifically, for any RWF \(\pi(Z,L)\) of \(A\), we denote the propensity score function as \[\begin{align} \omega_{\pi}^o(a,Z,L):=\dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}}. \end{align}\]
Proposition 13. Under Assumptions 1–4, suppose that \(Z\) is an AIV for \(A\), and that the potential outcome \(Y(a)\) follows the parametric marginal structural model in Equation 10 . Then, for any RWF \(\pi(Z, L)\) of \(A\), the parameter \(\psi_{MSM}^o \in\mathbb{R}^q\) satisfies \[\mathbb{E}\left[\omega_{\pi}^o(A, Z, L) \left\{Y - g(A, V; \psi_{MSM}^o)\right\}\mid A,V\right] = 0.\]
Notably, when \(Z\) is binary, Proposition 13 reduces to the identification result in Proposition 3. One may then construct an estimator for \(\psi_{MSM}^o\) using the GMM framework, as sketched below; a detailed exploration of such estimation procedures is beyond the scope of this article.
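For illustration, the following is a minimal GMM-style sketch under a hypothetical linear MSM \(g(a, V; \psi) = \psi_1 + \psi_2 a\) with hypothetical column names and a pre-computed weight vector; it is not a full GMM implementation with optimal weighting.

```r
# GMM-style sketch of the moment condition in Proposition 13 for a hypothetical
# linear MSM g(a, V; psi) = psi1 + psi2 * a; column names A, Y are hypothetical.
gmm_msm <- function(dat, omega) {
  # omega: vector of estimated weights omega_pi(A_i, Z_i, L_i)
  moments <- function(psi) {
    resid <- dat$Y - (psi[1] + psi[2] * dat$A)   # Y - g(A, V; psi)
    m <- cbind(1, dat$A) * omega * resid         # instruments h(A, V) = (1, A)
    colMeans(m)
  }
  # solve the sample moment equations by minimizing their squared norm
  optim(c(0, 0), function(psi) sum(moments(psi)^2))$par
}
```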
Rather than directly identifying the causal effect as in Proposition 3, some studies adopt the practice of dichotomizing a continuous instrument \(Z\) into a binary IV [14], [62]. A key observation is that if the original continuous \(Z\) satisfies the AIV condition for \(A\), then its discretized counterpart also satisfies the AIV condition. The following proposition formalizes this result, providing theoretical guarantees for identification based on discretized IVs and thereby enabling valid inference via the bounded IV approach of [11].
Proposition 14 (Discretized AIV identification). Let \(\mathcal{S}=\{S_1, \ldots, S_M\}\) be an arbitrary partition of \(\mathcal{Z}\) such that \(\Pr(Z\in S_M)>0\). For \(m=1,\ldots,M\), define \(Z_{\mathcal{S}} := m\) if \(Z \in S_m\). Assume that for any \(a\in\mathcal{A}\) and \(l\in \mathcal{L}\), there exist two distinct values \(z_1,z_2\) such that \[\label{eq:32IV32relevance32discrete} \Pr(A=a\mid Z_{\mathcal{S}}=z_1,L=l) \neq \Pr(A=a\mid Z_{\mathcal{S}}=z_2,L=l).\qquad{(7)}\] Under Assumptions 1–3, if \(Z\) is an AIV for \(A\), the nonparametric IV equation \[\mathbb{E}[A^{(a)}Y\mid Z_{\mathcal{S}},L]=\mathbb{E}[f_{a,\mathcal{S}}^o(A^{(a)},L)\mid Z_{\mathcal{S}},L]\] has the unique solution \(f_{a,\mathcal{S}}^o(A^{(a)},L)\) given by \[\begin{align} f_{a,\mathcal{S}}^o(0,L)=&\;\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z_{\mathcal{S}},U,L)\mid L\},\\ f_{a,\mathcal{S}}^o(1,L)=&\;\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z_{\mathcal{S}},U,L)\mid L\} +\mathbb{E}[Y(a)\mid L]. \end{align}\]
This constitutes a suboptimal identification strategy when the original \(Z\) already satisfies the AIV condition. The key drawback is that once \(Z\) is discretized into a binary instrument \(Z_{\mathcal{S}}\), the relevance condition in Proposition 14 becomes more restrictive than its counterpart in Assumption 4. Thus, although discretization may simplify estimation, we do not recommend this practice, as it can compromise estimator stability and reduce statistical efficiency.
So far, we have assumed that the treatment is either binary or multi-categorical. In practice, however, the treatment of interest may be continuous [59], [63]. For completeness, we now extend our theory and derive the identification strategy for a continuous treatment variable \(A\). Let \(p_{A\mid Z,U,L}(a \mid Z, U, L)\) denote the conditional probability density function of \(A\) given \((Z,U,L)\).
Definition 6 (Additive IV). For each \(a \in \mathcal{A}\), we say that \(Z\) is an additive IV (AIV) for \(A = a\) if there exist functions \(b(U,L)\) and \(c(Z,L)\) such that \[p_{A\mid Z,U,L}(a \mid Z, U, L) = b(U, L) + c(Z, L).\] Moreover, we say that \(Z\) is an AIV for \(A\) if, for every \(a\in\mathcal{A}\), \(Z\) is an AIV for \(A=a\).
Definition 7 (Regular weighting function). For each \(a\in\mathcal{A}\), we say that \(\pi(Z,L)\) is a regular weighting function (RWF) for \(A = a\) if there exists a positive constant \(\epsilon_0\) such that \[\big| \mathbb{E}[\pi(Z,L)\mid A=a,L] - \mathbb{E}[\pi(Z,L)\mid L] \big|\geq \epsilon_0 \quad \text{uniformly over } L.\]
We make several remarks. First, the definitions of continuous AIV and RWF are natural extensions of their multi-categorical counterparts introduced in Section 2. Second, a continuous AIV \(Z\) for a continuous treatment \(A\) can be constructed, for example, via a Gaussian mixture model; one hypothetical construction of this type is sketched below. We then present an identification strategy for the continuous treatment-response curve \(\mathbb{E}[Y(a)]\).
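For instance, one hypothetical construction is the two-component mixture \[p_{A\mid Z,U,L}(a \mid Z, U, L) = w\,\phi\{a; m(U,L), \sigma^2\} + (1-w)\,\phi\{a; \tilde{m}(Z,L), \sigma^2\},\] where \(\phi(\cdot;\mu,\sigma^2)\) denotes the normal density, \(w\in(0,1)\) is a fixed mixing weight, and \(m\), \(\tilde{m}\) are arbitrary mean functions. For every fixed \(a\), this density decomposes as \(b(U,L)+c(Z,L)\) with \(b(U,L)=w\,\phi\{a;m(U,L),\sigma^2\}\) and \(c(Z,L)=(1-w)\,\phi\{a;\tilde{m}(Z,L),\sigma^2\}\), so \(Z\) is an AIV for \(A\) in the sense of Definition 6.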
Proposition 15 (AIV identification). Under Assumptions 1–4, for each \(a\in\mathcal{A}\), if \(Z\) is an AIV for \(A=a\), then for any RWF \(\pi(Z,L)\), it holds that \[\begin{align} \mathbb{E}[Y(a)] = \mathbb{E}\left[ \dfrac{ \mathbb{E}[Y\,\pi(Z,L)\mid A=a,L] - \mathbb{E}[Y\mid A=a,L] \, \mathbb{E}[\pi(Z,L)\mid L]}{ \mathbb{E}[\pi(Z,L)\mid A=a,L] - \mathbb{E}[\pi(Z,L)\mid L]} \right]. \end{align}\]
Example 2 (Binary IV with a continuous treatment). Assume that \(Z\) takes values in \(\{0,1\}\) and the treatment variable \(A\) is continuous. Let \(\pi(Z,L) = Z\), which serves as an RWF for \(A = a\). Under Assumptions 1–4, if \(Z\) is an AIV for \(A\), Proposition 15 implies \[\begin{align} \mathbb{E}[Y(a)] = \mathbb{E}\left[\dfrac{\mathbb{E}[YZ \mid A=a,L] - \mathbb{E}[Y \mid A=a,L]\mathbb{E}[Z \mid L]} {\mathbb{E}[Z \mid A=a,L] - \mathbb{E}[Z \mid L]}\right]. \end{align}\]
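A minimal plug-in sketch of this display, with hypothetical column names Z, A, L, Y (binary Z and continuous A), is as follows; it illustrates the formula only and does not implement a semiparametric, multiply robust estimator.

```r
# Plug-in sketch of the dose-response formula in Example 2; Z, A, L, Y are hypothetical
# column names, with binary Z and continuous A.
library(mgcv)

dose_response <- function(dat, a) {
  dat$YZ <- dat$Y * dat$Z
  mYZ <- gam(YZ ~ s(A) + s(L), data = dat)     # E[YZ | A, L]
  mY  <- gam(Y  ~ s(A) + s(L), data = dat)     # E[Y  | A, L]
  mZA <- gam(Z  ~ s(A) + s(L), data = dat)     # E[Z  | A, L]
  mZ  <- gam(Z  ~ s(L),        data = dat)     # E[Z  | L]
  da  <- transform(dat, A = a)                 # evaluate the (A, L)-regressions at A = a
  num <- predict(mYZ, newdata = da) - predict(mY, newdata = da) * predict(mZ, newdata = dat)
  den <- predict(mZA, newdata = da) - predict(mZ, newdata = dat)
  mean(num / den)                              # plug-in estimate of E[Y(a)]
}
```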
Notably, a continuous IV can be discretized into a binary (or multi-categorical) IV while preserving the AIV property, as established in Proposition 14, even when the treatment variable is continuous. This allows the identification strategy to be naturally simplified in settings with discretized instruments. Moreover, semiparametric techniques, analogous to those developed in [59], can be applied to the result in Proposition 15 to construct a multiply robust and efficient estimator, which lies beyond the scope of this article.
| | Treated | Control | ATE |
|---|---|---|---|
| EST | .4291 | .1074 | .1450 |
| SE | .2610 | .1199 | .2184 |
| SD | .4572 | .1450 | .2276 |
The CollegeDistance dataset, available in the AER package in R, originates from the High School and Beyond survey conducted by the U.S. Department of Education in 1980, with a follow-up survey in
1986. The survey includes data on 4,739 students from approximately 1,100 high schools. This dataset was originally analyzed in [64], who studied the impact of community colleges on educational attainment.
The dataset contains demographic, socioeconomic, and geographic variables commonly used in applied econometrics, particularly in instrumental variable (IV) analyses of educational attainment. Key variables include parental education
(fcollege, mcollege), family characteristics (home, income), local economic conditions (unemp, wage), and measures of college accessibility such as the distance to
the nearest four-year college and average state tuition. We adaptively estimate the weighting function \(\pi^o(Z,L)\) and use the R package mgcv to estimate nuisance functions.
The treatment variable is education, measured as the number of years of schooling completed by 1986, ranging from 12 years (high school completion) to 18 years (graduate degree). For analytical convenience, we dichotomize
education into two groups: individuals with 14 or more years of schooling (treated group) and those with 13 years or fewer (control group). The outcome of interest is income, a binary indicator of whether the annual family income
in 1980 exceeded $25,000 (in 1980 U.S. dollars). We use distance and tuition as instrumental variables, and include fcollege and mcollege as baseline covariates \(L\).
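A minimal sketch, using the column names of the CollegeDistance data shipped with the AER package, of how the analysis variables described above could be constructed is as follows; the subsequent weighting-function and nuisance estimation follow the procedures described earlier.

```r
# Construct the treatment, outcome, instruments, and covariates from the AER data.
library(AER)
data("CollegeDistance", package = "AER")

dat <- within(CollegeDistance, {
  A <- as.numeric(education >= 14)       # treated: 14 or more years of schooling
  Y <- as.numeric(income == "high")      # family income above $25,000 in 1980
})
Z <- dat[, c("distance", "tuition")]     # instrumental variables
L <- dat[, c("fcollege", "mcollege")]    # baseline covariates
```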
Table 4 reports the bootstrap results based on 500 replications. The estimated mean outcome for the treated group is approximately 0.43, whereas that for the control group is around 0.11, and the estimated average treatment effect (ATE) is roughly 0.15. Reported standard errors range from 0.12 to 0.26, depending on the subgroup, while the empirical standard deviations are somewhat larger, particularly for the treated group. Overall, the results indicate a moderate positive effect of educational attainment on family income.
The proofs of Theorems 4, 5, 7, 9 and 11 are analogous; thus, we only provide the proof of Theorem 4.
Suppose there exist two solutions \(f_a^o\) and \(f_a'\) to Equation 1. Then it holds that \[\begin{align} 0=&\mathbb{E}[A^{(a)}Y-A^{(a)}Y\mid Z,L]=\mathbb{E}[f_a'(A^{(a)},L)-f_a^o(A^{(a)},L)\mid Z,L]\\ =&\{f_a'(1,L)-f_a^o(1,L)\}\mathbb{E}[A^{(a)}\mid Z,L] +\{f_a'(0,L)-f_a^o(0,L)\}\mathbb{E}[1-A^{(a)}\mid Z,L]\\ =&\{f_a'(1,L)-f_a'(0,L)-f_a^o(1,L)+f_a^o(0,L)\}\mathbb{E}[A^{(a)}\mid Z,L] +f_a'(0,L)-f_a^o(0,L). \end{align}\] From Assumption 4, for any \(l\), there exist two distinct values \(z_1,z_2\) such that \[\mathbb{E}[A^{(a)}\mid Z=z_1,L=l]\neq \mathbb{E}[A^{(a)}\mid Z=z_2,L=l].\] Taking the difference, we get \[\begin{align} &\{f_a'(1,l)-f_a'(0,l)-f_a^o(1,l)+f_a^o(0,l)\}\\ &\cdot \{\mathbb{E}[A^{(a)}\mid Z=z_1,L=l]-\mathbb{E}[A^{(a)}\mid Z=z_2,L=l]\}=0. \end{align}\] It follows that \[\begin{align} &f_a'(1,L)-f_a'(0,L)-f_a^o(1,L)+f_a^o(0,L)\equiv 0,\\ &f_a'(0,L)-f_a^o(0,L)\equiv 0. \end{align}\] Therefore \(f_a^o(A^{(a)},L)\equiv f_a'(A^{(a)},L)\). This is equivalent to saying that the solution to Equation 1 is unique. Second, \[\begin{align} &\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}\\ =&\mathbb{E}[A^{(a)}Y \pi(Z,L)\mid L]-\mathbb{E}[A^{(a)}Y\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}[\mathbb{E}[A^{(a)}Y\mid Z,L] \pi(Z,L)\mid L]-\mathbb{E}[\mathbb{E}[A^{(a)}Y\mid Z,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}[\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L] \pi(Z,L)\mid L]-\mathbb{E}[\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}[f_a^o(A^{(a)},L)\pi(Z,L)\mid L]-\mathbb{E}[f_a^o(A^{(a)},L)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&f_a^o(1,L)\mathbb{E}[A^{(a)}\pi(Z,L)\mid L]+f_a^o(0,L)\mathbb{E}[(1-A^{(a)})\pi(Z,L)\mid L]\\ &-\{f_a^o(1,L)\mathbb{E}[A^{(a)}\mid L]+f_a^o(0,L)\mathbb{E}[(1-A^{(a)})\mid L]\}\mathbb{E}[\pi(Z,L)\mid L]\\ =&f_a^o(1,L)\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\} +f_a^o(0,L)\mathrm{Cov}\!\{1-A^{(a)},\pi(Z,L)\mid L\}\\ =&\{f_a^o(1,L)-f_a^o(0,L)\}\cdot\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}. \end{align}\] This shows that \[\begin{align} f_a^o(1,L)-f_a^o(0,L)=\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}}. \end{align}\] In addition, by Equation 1, we know that \[\begin{align} &\mathbb{E}[A^{(a)}Y\mid Z,L]=\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L]\\ =&\{f_a^o(1,L)-f_a^o(0,L)\}\mathbb{E}[A^{(a)}\mid Z,L]+f_a^o(0,L)\\ =&\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}}\mathbb{E}[A^{(a)}\mid Z,L]+f_a^o(0,L). \end{align}\] Now we can deduce the explicit form of \(f_a^o\) stated in Theorem 1, finishing the proof for Theorem 1.
For any pathwise differentiable parameterization \(p_\theta(O)\), we denote \(\require{physics} s(O):=\nabla_{\theta}\log p_\theta(O)\eval_{\theta=0}\) as the score function, and \(\mathbb{E}_\theta\) as the expectation taken with respect to \(p_\theta(O)\). Denote \[\begin{align} \psi_{\pi;\theta}=& \mathbb{E}_\theta\left[\dfrac{\mathbb{E}_\theta[Y\pi(Z,L)\mid L]-\mathbb{E}_\theta[Y\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}{\mathbb{E}_\theta[A\pi(Z,L)\mid L]-\mathbb{E}_\theta[A\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}\right]. \end{align}\] We calculate the path-wise derivative as \[\require{physics} \begin{align} \nabla_{\theta}\psi_{\pi;\theta}\eval_{\theta=0}=&\nabla_{\theta} \mathbb{E}_\theta\left[\dfrac{\mathbb{E}_\theta[Y\pi(Z,L)\mid L]-\mathbb{E}_\theta[Y\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}{\mathbb{E}_\theta[A\pi(Z,L)\mid L]-\mathbb{E}_\theta[A\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}\right]\eval{\theta=0}\\ =&\mathbb{E}[\{\gamma_{\pi}^o(L)-\psi_{\pi}^o\}s(O)]\\ &+\mathbb{E}\left[\dfrac{\{Y\pi(Z,L)-\zeta_{\pi}^o(L)\}s(O)}{\mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L]}\right]\\ &-\mathbb{E}\left[\dfrac{\rho_{\pi}^o(L)\{Y-\eta^o(L)\}s(O)}{\mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L]}\right]\\ &-\mathbb{E}\left[\dfrac{\eta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\}s(O)}{\mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L]}\right]\\ &-\mathbb{E}\left[\dfrac{\gamma_{\pi}^o(L)}{\kappa_{\pi}^o(L)}\left\{ \begin{array}{l} A\pi(Z,L)-\kappa_{\pi}^o(L)-\delta^o(L)\rho_{\pi}^o(L)\\ -\rho_{\pi}^o(L)\{A-\delta^o(L)\}\\ -\delta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array} \right\}s(O)\right]. \end{align}\] Thus, one influence function for \(\psi_{\pi}^o\) consists of \[\begin{align} &\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o):=\gamma_{\pi}^o(L)-\psi_{\pi}^o\\ &+\dfrac{1}{\kappa_{\pi}^o(L)} \cdot\left\{\begin{array}{l} Y\pi(Z,L)-\zeta_{\pi}^o(L)\\ -\rho_{\pi}^o(L)\{Y-\eta^o(L)\}\\ -\eta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array}\right\}\\ &-\dfrac{\gamma_{\pi}^o(L)}{\kappa_{\pi}^o(L)}\left\{ \begin{array}{l} A\pi(Z,L)-\kappa_{\pi}^o(L)-\delta^o(L)\rho_{\pi}^o(L)\\ -\rho_{\pi}^o(L)\{A-\delta^o(L)\}\\ -\delta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array} \right\}\\ =&\dfrac{\left\{\pi(Z,L)-\rho_{\pi}^o(L)\right\}} {\kappa_{\pi}^o(L)} Y-\psi_{\pi}^o \\&+ \left(1-\dfrac{\{A-\delta^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}}{\kappa_{\pi}^o(L)} \right)\gamma_{\pi}^o(L) \\ &-\dfrac{\{\pi(Z,L)-\rho_{\pi}^o(L)\}\eta^o(L)}{\kappa_{\pi}^o(L)}. \end{align}\] Since the tangent space spanned by the score function \(s(O)\) equals \(L_2(O)\), any influence function is automatically the EIF.
Recall that \[\begin{align} \gamma^o(L):=\dfrac{\zeta^o(L) - \eta^o(L)\delta^o(L)}{\kappa^o(L)},\qquad \gamma(L):=\dfrac{\zeta(L) - \eta(L)\delta(L)}{\kappa(L)}. \end{align}\] For any pathwise differentiable parameterization \(p_\theta(O)\) (probability density function), we denote \(\require{physics} s(O):=\nabla_{\theta}\log p_\theta(O)\eval_{\theta=0}\) as the score function, \(\mathbb{E}_\theta\) as taking expectation with respect to \(p_\theta(O)\). For \(\zeta^o(L;\theta)=\mathbb{E}_\theta[Y\mathbb{E}_\theta[A\mid Z,L]\mid L]\), we calculate \[\require{physics} \begin{align} &\nabla_{\theta}\zeta(L;\theta)\eval_{\theta=0}=\nabla_{\theta}\mathbb{E}_\theta[Y\mathbb{E}_\theta[A\mid Z,L]\mid L]\eval_{\theta=0}\\ =&\nabla_{\theta}\mathbb{E}_\theta[Y\mathbb{E}[A\mid Z,L]\mid L]\eval_{\theta=0} +\mathbb{E}\left[Y\nabla_{\theta}\mathbb{E}_\theta[A\mid Z,L]\eval_{\theta=0}\mid L\right]\\ =&\mathbb{E}[\{Y\pi^o(Z,L)-\zeta^o(L)\}s(O)\mid L] +\mathbb{E}[Y\mathbb{E}[\{A-\pi^o(Z,L)\}s(O)\mid Z,L]\mid L]\\ =&\mathbb{E}[\{Y\pi^o(Z,L)-\zeta^o(L)\}s(O)\mid L] +\mathbb{E}[\xi^o(Z,L)\{A-\pi^o(Z,L)\}s(O)\mid L]. \end{align}\] Similarly, for \(\kappa(L;\theta)=\mathbb{E}_\theta[A\mathbb{E}_\theta[A\mid Z,L]\mid L]-\mathbb{E}_{\theta}[A\mid L]^2\), it holds that \[\require{physics} \begin{align} &\nabla_{\theta}\kappa(L;\theta)\eval_{\theta=0}= \nabla_{\theta}\mathbb{E}_\theta[A\mathbb{E}_\theta[A\mid Z,L]\mid L]\eval_{\theta=0} -\nabla_{\theta}\mathbb{E}_\theta[A\mid L]^2\eval_{\theta=0}\\ =&\mathbb{E}\left[\{A\pi^o(Z,L)-\kappa^o(L)-\delta^o(L)^2\}s(O)\middle| L\right]\\ &+\mathbb{E}\left[\pi^o(Z,L)\{A-\pi^o(Z,L)\}s(O)\middle| L\right]\\ &-\mathbb{E}[2\delta^o(L)(A-\delta^o(L))s(O)\mid L]. \end{align}\] For \(\eta(L;\theta)=\mathbb{E}_\theta[Y\mid L]\), we calculate \[\require{physics} \begin{align} &\nabla_{\theta}\eta(L;\theta)\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta[Y\mid L]\eval_{\theta=0}\\ =&\mathbb{E}[\{Y-\mathbb{E}[Y\mid L]\}s(O)\mid L]\\ =&\mathbb{E}\left[\{Y-\eta^o(L)\}s(O)\middle| L\right]. \end{align}\] For \(\delta(L;\theta)=\mathbb{E}_\theta[A\mid L]\), we calculate \[\require{physics} \begin{align} &\nabla_{\theta}\delta(L;\theta)\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta[A\mid L]\eval_{\theta=0} =\mathbb{E}[\{A-\delta^o(L)\}s(O)\mid L]. \end{align}\] Now we calculate the EIF as follows. 
For \[\psi_{ada,\theta}=\mathbb{E}_\theta\left[ \dfrac{\zeta(L;\theta) - \eta(L;\theta)\delta(L;\theta)}{\kappa(L;\theta)} \right],\] we deduce the path-wise derivative as \[\require{physics} \begin{align} &\nabla_{\theta}\psi_{ada,\theta}\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta\left[ \dfrac{\zeta(L;\theta) - \eta(L;\theta)\delta(L;\theta)}{\kappa(L;\theta)} \right]\eval_{\theta=0}\notag\\ =&\mathbb{E}[(\gamma^o(L)-\psi_{ada}^o)s(O)]\notag\\ &+\mathbb{E}\left[\dfrac{\nabla_{\theta}\zeta(L;\theta)\eval_{\theta=0} -\nabla_{\theta}\{\eta(L;\theta)\delta(L;\theta)\}\eval_{\theta=0}}{ \kappa^o(L)}\right]\notag\\ &-\mathbb{E}\left[\gamma^o(L)\dfrac{ \nabla_{\theta}\kappa(L;\theta)\eval_{\theta=0} }{ \kappa^o(L)}\right]\notag\\ =&\mathbb{E}[(\gamma^o(L)-\psi_{ada}^o)s(O)]\\ &+\mathbb{E}\left[\dfrac{1}{\kappa^o(L)} \left\{\begin{array}{l} \{Y\pi^o(Z,L)-\zeta^o(L)\}\\ +\xi^o(Z,L)\{A-\pi^o(Z,L)\} \end{array}\right\}s(O) \right]\\ &-\mathbb{E}\left[\dfrac{1}{\kappa^o(L)}\left\{ \begin{array}{l} \delta^o(L)\{Y-\eta^o(L)\}\\ +\eta^o(L)\{A-\delta^o(L)\} \end{array} \right\}s(O)\right]\\ &+\mathbb{E}\left[\dfrac{\gamma^o(L)}{\kappa^o(L)}2\delta^o(L)\left\{ A-\delta^o(L) \right\}s(O)\right]\\ &-\mathbb{E}\left[\dfrac{\gamma^o(L)}{\kappa^o(L)}\left\{ \begin{array}{l} \{A\pi^o(Z,L)-\kappa^o(L)-\delta^o(L)^2\}\\ +\pi^o(Z,L)\{A-\pi^o(Z,L)\} \end{array} \right\}s(O)\right]. \end{align}\] The efficient influence function finally consists of \[\begin{align} &\varphi(O;\psi_{ada}^o,\beta^o)=\gamma^o(L)-\psi_{ada}^o\\ &+\dfrac{1}{\kappa^o(L)} \left\{\begin{array}{l} \{Y\pi^o(Z,L)-\zeta^o(L)\}\\ +\xi^o(Z,L)\{A-\pi^o(Z,L)\}\\ -\delta^o(L)\{Y-\eta^o(L)\}\\ -\eta^o(L)\{A-\delta^o(L)\} \end{array}\right\}\\ &-\dfrac{\gamma^o(L)}{\kappa^o(L)}\left\{ \begin{array}{l} A\pi^o(Z,L)-\kappa^o(L)-\delta^o(L)^2\\ +\pi^o(Z,L)\{A-\pi^o(Z,L)\}\\ -2\delta^o(L)\left\{A-\delta^o(L)\right\} \end{array} \right\}\\ =&\dfrac{\pi^o(Z,L)-\delta^o(L)}{\kappa^o(L)}Y-\psi_{ada}^o\\ &+\dfrac{1}{\kappa^o(L)} \left\{ \xi^o(Z,L)(A-\pi^o(Z,L)) -\eta^o(L)(A-\delta^o(L)) \right\}\\ &+\gamma^o(L)\left\{ 1 - \dfrac{(\pi^o(Z,L)-\delta^o(L))(2A-\pi^o(Z,L)-\delta^o(L))}{\kappa^o(L)} \right\}. \end{align}\]
Define the conditional expectation on the \(k\)-th fold as \(\mathbb{E}_k[O]=\mathbb{E}[O\mid I_{-k}]\). From the definition of \(\psi_{\pi}^o\) in Equation 5, \[\begin{align} 0=&\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})\right]\notag\\ =&\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})\right]+\sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}\notag\\ =&\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]+\sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}\notag\\ &+\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{k}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\tag{11}\\ &+\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\{\mathbb{E}_{nk}-\mathbb{E}_{k}\}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right].\tag{12} \end{align}\] From Proposition 5, we can deduce that the quantity in Equation 11 is \(o_p(1)\), since \[\begin{align} &\mathbb{E}_{k}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\\ \lesssim& \left\{\begin{array}{l} \|\hat{\kappa}_{\pi}^{(n,k)}(L)-\kappa_{\pi}^o(L)\|_2\times \|\hat{\gamma}_{\pi}^{(n,k)}(L)-\gamma_{\pi}^o(L)\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}(L)-\rho_{\pi}^o(L)\|_2\times \|\hat{\delta}^{(n,k)}(L)-\delta^o(L)\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}(L)-\rho_{\pi}^o(L)\|_2\times \|\hat{\eta}^{(n,k)}(L)-\eta^o(L)\|_2 \end{array}\right\} =o_p(n^{-1/2}). \end{align}\] Define the empirical process as \(\mathbb{G}_{nk}[f(O)]:=\sqrt{n_k}\{\mathbb{E}_{nk}-\mathbb{E}_{k}\}[f(O)].\) By Chebyshev's inequality, \[\begin{align} &\Pr\left(\mathbb{G}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\geq \epsilon_0\middle| I_{-k}\right)\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathrm{Var}\left[\mathbb{G}_{nk}\left\{\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right\}\middle| I_{-k}\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathrm{Var}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\middle| I_{-k}\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathbb{E}\left[\left\{\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right\}^2\middle| I_{-k}\right]\\ \lesssim&\|\hat{\rho}_{\pi}(L)-\rho_{\pi}^o(L)\|_2^2+ \|\hat{\delta}^{(n,k)}(L)-\delta^o(L)\|_2^2+ \|\hat{\eta}^{(n,k)}(L)-\eta^o(L)\|_2^2\\&+ \|\hat{\kappa}^{(n,k)}(L)-\kappa^o(L)\|_2^2+ \|\hat{\gamma}_{\pi}^{(n,k)}(L)-\gamma_{\pi}^o(L)\|_2^2\\\lesssim& \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^o\|_2^2. \end{align}\] We can now integrate \(I_{-k}\) out to deduce that \[\begin{align} \Pr\left(\mathbb{G}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\geq \epsilon_0\right)\lesssim \mathbb{E}[\|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^o\|_2^2]=o(1). \end{align}\] This is equivalent to saying that the quantity in Equation 12 is \(o_p(1)\).
Next, we can see that \[\begin{align} \sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}=&-\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]+o_p(1)\\ =&-\dfrac{1}{\sqrt{n}}\sum_{i = 1}^n\varphi_{\pi}(O_i;\psi_{\pi}^{o},\alpha_{\pi}^{o})+o_p(1). \end{align}\] From the central limit theorem and Slutsky's lemma, we know that \(\sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}\) converges in distribution to \(\mathcal{N}(0,(\sigma_{\pi}^o)^2)\). Next, we prove the consistency of the variance estimator \[\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2:=\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2\right].\] First, we define \(\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2:=\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right]=\mathbb{E}_n\left[ \varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right].\) By the law of large numbers, \(\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\) converges to \(\left(\sigma_{\pi}^{o}\right)^2\) in probability. Simultaneously, by the boundedness of \(\varphi_{\pi}(O;\psi_{\pi},\alpha_{\pi})\), we know that \[\begin{align} &\Pr\left(\left|\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2-\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\right|\geq \epsilon_0\middle| I_{-k}\right) \leq \dfrac{1}{\epsilon_0^2}\mathbb{E}_k\left[\left|\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2-\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\right|^2\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2} \mathbb{E}_{k}\left[\left|\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right]\right|\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathbb{E}_{k}\left[\left|\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right|\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\left(\mathbb{E}_{k}\left[\left|\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right|^2\right]\right)^{1/2}\\ \lesssim&\dfrac{1}{\epsilon_0^2}\left\{\left|\hat{\psi}_{\pi}^{(n,k)}-\psi_{\pi}^{o}\right|^2+ \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^{o}\|_2^2\right\}^{1/2}. \end{align}\] Now we can integrate \(I_{-k}\) out to deduce that \[\begin{align} &\Pr\left(\left|\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2-\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\right|\geq \epsilon_0\right)\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathbb{E}\left[\left\{\left|\hat{\psi}_{\pi}^{(n,k)}-\psi_{\pi}^{o}\right|^2+ \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^{o}\|_2^2\right\}^{1/2}\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\left\{\mathbb{E}\left[\left|\hat{\psi}_{\pi}^{(n,k)}-\psi_{\pi}^{o}\right|^2+ \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^{o}\|_2^2\right]\right\}^{1/2}=o(1). \end{align}\] Now \(\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2=\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2+o_p(1)=\left(\sigma_{\pi}^{o}\right)^2+o_p(1)\), finishing the proof for Theorem 4.
For any pathwise differentiable parameterization \(p_\theta(O)\), we denote \(\require{physics} s(O):=\nabla_{\theta}\log p_\theta(O)\eval_{\theta=0}\) as the score function, \(\mathbb{E}_\theta\) as taking expectation with respect to \(p_\theta(O)\). Recall that the true parameter of interest is \[\begin{align} \psi_{\overline{a}}^o:=\mathbb{E}\left[ \prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y \right]. \end{align}\] Concretely, \[\require{physics} \begin{align} &\nabla_{\theta}\psi_{\overline{a},\theta}\eval_{\theta=0} \\=&\nabla_{\theta} \mathbb{E}_{\theta}\left[ \prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}_{\theta}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}_{\theta}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}_{\theta}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}_{\theta}\left[Z_t\middle| H_t\right]} \times Y \right]\eval_{\theta=0}\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\times Y\\ \times\left\{\begin{array}{l} \nabla_{\theta}\dfrac{\left(Z_t-\mathbb{E}_{\theta}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]}\displaystyle\eval_{\theta=0}\\ -\dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2}\\ \times\nabla_{\theta}\left(\begin{array}{l} \mathbb{E}_{\theta}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}_{\theta}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}_{\theta}\left[Z_t\middle| H_t\right] \end{array}\right)\eval_{\theta=0} \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\times Y\\ \times\left\{\begin{array}{l} -\dfrac{\mathbb{E}[(Z_t-\rho_t^o(H_t))s(O)\mid H_t]A_t^{(a_t)}} 
{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]}\\ -\dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2}\\ \times\left(\begin{array}{l} \mathbb{E}\left[(A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))s(O)\middle| H_t\right]\\ -\mathbb{E}\left[(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t))s(O)\middle| H_t\right] \rho_t^o(H_t)\\ -\mathbb{E}\left[(Z_t-\rho_t^o(H_t))s(O)\middle| H_t\right] \delta_{t,a_t}^o(H_t) \end{array}\right) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\mathbb{E}\left[A_t^{(a_t)}\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[ A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}Y\middle| H_{t}\right]\\ \times\left\{\begin{array}{l} \dfrac{(Z_t-\rho_t^o(H_t))s(O)} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \end{array}\right\} \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\mathbb{E}\left[\begin{array}{l} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2}\\ \times\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\times Y \end{array}\middle| H_{t}\right]\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))s(O)\\ -(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)s(O)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t)s(O) \end{array}\right) \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\mathbb{E}\left[A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\middle| H_{t}\right]\\ \times\left\{\begin{array}{l} 
\dfrac{(Z_t-\rho_t^o(H_t))s(O)} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \end{array}\right\} \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\begin{array}{l} \dfrac{\mathbb{E}\left[\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}\times\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\middle| H_{t}\right]} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2} \end{array}\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))s(O)\\ -(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)s(O)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t)s(O) \end{array}\right) \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\left\{\begin{array}{l} \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)s(O)} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \end{array}\right\} \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\begin{array}{l} \dfrac{\zeta_{t,\underline{a}_t}^o(H_t)-\rho_t^o(H_t)\eta_{t,\underline{a}_t}^o(H_t)} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2} \end{array}\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))\\ -(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t) \end{array}\right)s(O) \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)A_t^{(a_t)}} {\kappa_{s,a_s}^o(H_s)} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_t^o(H_t)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}\\ \times\left\{\begin{array}{l} \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}^o(H_t)} \end{array}\right\}s(O) \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_t^o(H_t)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}\\ \times\begin{array}{l} \dfrac{\zeta_{t,\underline{a}_t}^o(H_t)-\rho_t^o(H_t)\eta_{t,\underline{a}_t}^o(H_t)} {\left\{\kappa_{t,a_t}^o(H_t)\right\}^2} \end{array}\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))\\ 
-(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t) \end{array}\right)s(O) \end{array} \right]\\ =&\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o)s(O)\right], \end{align}\] where \[\begin{align} &\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o):= \prod_{t=0}^{T} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \times Y-\psi_{\overline{a}}^o\\ &-\sum_{t=0}^T\displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_s^o(H_s)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}\times \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)}{\kappa_{t,a_t}^o(H_t)}\\ &+\sum_{t=0}^T\displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_s^o(H_s)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)} \times\gamma_{t,\underline{a}_t}^o(H_t) \\&\times \left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)\}\{Z_t-\rho_t^o(H_t)\}}{\kappa_{t,a_t}^o(H_t)} \right). \end{align}\] This finishes the proof for Theorem 6.
Analogous to the proof for Theorem 6, we calculate the path-wise derivative for \[\begin{align} \psi_{\overline{a},\theta} := \mathbb{E}_\theta\left[ \prod_{t=0}^{T} \frac{\pi_{t,a_t,\theta}^o(Z_t,H_t) - \delta_{t,a_t,\theta}^o(H_t)}{\mathrm{Var}_\theta\!\{\pi_{t,a_t,\theta}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \right]. \end{align}\] Concretely, \[\require{physics} \begin{align} &\nabla_{\theta}\psi_{\overline{a},\theta}\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta\left[ \prod_{t=0}^{T} \frac{\pi_{t,a_t,\theta}^o(Z_t,H_t) - \delta_{t,a_t,\theta}^o(H_t)}{\mathrm{Var}_\theta\!\{\pi_{t,a_t,\theta}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \right]\eval_{\theta=0}\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}_\theta\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}_\theta\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times\nabla_{\theta} \dfrac{\pi_{t,a_t,\theta}^o(Z_t,H_t) - \delta_{t,a_t,\theta}^o(H_t)} {\mathrm{Var}_\theta\!\{\pi_{t,a_t,\theta}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)}\displaystyle\eval_{\theta=0} \times Y \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\mathbb{E}[\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)\mid Z_t,H_t] - \mathbb{E}[\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)\mid H_t]} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}^2}A_t^{(a_t)} \times Y\\ \times\mathbb{E}\left[(\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t))s(O)\middle| H_t\right] \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} 
{\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\mathbb{E}[\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)\mid Z_t,H_t]} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\mathbb{E}[\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)\mid H_t]} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}} \times A_t^{(a_t)}Y \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}^2} \times A_t^{(a_t)}Y\\ \times\mathbb{E}\left[(\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t))s(O)\middle| H_t\right] \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times \dfrac{\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}\\ \mathbb{E}\left[\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\times A_t^{(a_t)} Y\middle| Z_t,H_t\right] \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\dfrac{\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}\\ \mathbb{E}\left[\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\times A_t^{(a_t)} Y\middle| H_t\right] \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times\mathbb{E}\left[\displaystyle\prod_{s=t}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times Y\middle| H_t\right]\\ \times\dfrac{\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}s(O) \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} 
\times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times \dfrac{\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}} \xi_{t,\underline{a}_t}^o(Z_t,H_t) \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\dfrac{\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}} \eta_{t,\underline{a}_t}^o(H_t) \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\gamma_{t,\underline{a}_t}^o(H_t)\\ \times\dfrac{\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}s(O) \end{array} \right] \end{align}\] In summary, we see that the influence function of \(\psi_{\overline{a}_t}\) consists of \[\begin{align} &\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\\ &+\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\right)\times \dfrac{1}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}\\ &\times \left\{\begin{array}{l} \{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\} \xi_{t,\underline{a}_t}^o(Z_t,H_t) -\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)\\ +\gamma_{t,\underline{a}_t}^o(H_t) \times\left(\kappa_t^o(H_t)+\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}^2-\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}^2\right) \end{array}\right\}\\ =&\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\kappa_{t,a_t}^o(H_t)}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\\ &+\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\kappa_{s,a_s}^o(H_s)}A_s^{(a_s)}\right)\times \dfrac{1}{\kappa_{t,a_t}^o(H_t)}\\ &\times \left\{\begin{array}{l} \{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\} \xi_{t,\underline{a}_t}^o(Z_t,H_t) -\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)\\ +\gamma_{t,\underline{a}_t}^o(H_t) \times\left(\kappa_t^o(H_t)+\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}^2-\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}^2\right) \end{array}\right\}. \end{align}\] This finishes the proof that \(\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)\) is the influence function for \(\psi_{\overline{a},ada}^o\). Since the tangent space spanned by the score functions equals \(L_2(O)\), we know that the EIF is just \(\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)\), finishing the proof for Theorem 10.
Analogous to the proof for Theorem 6, we calculate the path-wise derivative for \[\begin{align} \psi_{a,MIV}^o:= \mathbb{E}\left[ (1-A^{(a)})\dfrac{\mathrm{Cov}\{A^{(a)}Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid L\}} +A^{(a)}Y \right]. \end{align}\] The EIF can be derived as \[\require{physics} \begin{align} &\nabla_{\theta} \mathbb{E}_\theta\left[ (1-A^{(a)})\dfrac{\mathrm{Cov}_\theta\{A^{(a)}Y,Z\mid L\}}{\mathrm{Cov}_\theta\{A^{(a)},Z\mid L\}} +A^{(a)}Y \right]\eval_{\theta=0}\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +(1-A^{(a)})\nabla_{\theta}\left(\dfrac{\mathbb{E}_\theta[A^{(a)}Y(Z-\mathbb{E}_\theta[Z\mid L])\mid L]}{\mathrm{Cov}\{A^{(a)},Z\mid L\}}\right)\displaystyle\eval_{\theta=0}\\ +(1-A^{(a)})\nabla_{\theta}\left(\dfrac{\mathrm{Cov}\{A^{(a)}Y,Z\mid L\}} {\mathrm{Cov}_\theta\{A^{(a)},Z\mid L\}}\right)\displaystyle\eval_{\theta=0} \end{array} \right]\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +\dfrac{(1-A^{(a)})} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} \mathbb{E}[\left\{A^{(a)}Y (Z - \mathbb{E}[Z\mid L]) - \gamma_{a}^o(L)\kappa_{a}^o(L)\right\} s(O)\mid L]\\ -\mathbb{E}[A^{(a)}Y \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ]\mid L] \end{array} \right\}\\ -\dfrac{(1-A^{(a)})\gamma_a^o(L)}{\kappa_{a}^o(L)} \left\{\begin{array}{l} \mathbb{E}[\left\{A^{(a)} (Z - \mathbb{E}[Z\mid L]) - \kappa_{a}^o(L)\right\} s(O)\mid L]\\ -\mathbb{E}[A^{(a)} \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ]\mid L] \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} \left\{A^{(a)}Y (Z - \mathbb{E}[Z\mid L]) - \gamma_{a}^o(L)\kappa_{a}^o(L)\right\} s(O)\\ -A^{(a)}Y \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ] \end{array} \right\}\\ -\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} \left\{A^{(a)} (Z - \mathbb{E}[Z\mid L]) - \kappa_{a}^o(L)\right\} s(O)\\ -A^{(a)} \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ] \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} \left\{A^{(a)}Y (Z - \mathbb{E}[Z\mid L]) - \gamma_{a}^o(L)\kappa_{a}^o(L)\right\} s(O)\\ -\eta_a^o(L) (Z-\rho^o(L))s(O) \end{array} \right\}\\ -\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} \left\{A^{(a)} (Z - \mathbb{E}[Z\mid L]) - \kappa_{a}^o(L)\right\} s(O)\\ -\delta_a^o(L) (Z-\rho^o(L))s(O) \end{array}\right\} \end{array} \right]. 
\end{align}\] The EIF corresponds to be \[\begin{align} &(1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} A^{(a)}Y (Z - \rho^o(L)) - \gamma_{a}^o(L)\kappa_{a}^o(L) \\ -\eta_a^o(L) (Z-\rho^o(L)) \end{array} \right\}\\ &-\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} A^{(a)} (Z - \rho^o(L)) - \kappa_{a}^o(L) \\ -\delta_a^o(L) (Z-\rho^o(L)) \end{array}\right\}\\ =&(1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} A^{(a)}Y (Z - \rho^o(L)) \\ -\eta_a^o(L) (Z-\rho^o(L)) \end{array} \right\}\\ &-\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} A^{(a)} (Z - \rho^o(L)) \\ -\delta_a^o(L) (Z-\rho^o(L)) \end{array}\right\}\\ =&(1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left(A^{(a)}Y-\eta_a^o(L)\right)(Z-\rho^o(L))\\ &-\dfrac{(1-\delta_a^o(L))}{\kappa_a^o(L)}\gamma_a^o(L) \left(A^{(a)}-\delta_a^o(L)\right)(Z - \rho^o(L)). \end{align}\] This finishes the proof for Theorem 12.
If \(Z\) is an AIV for \(A = a\), then there exist \(b(U,L)\) and \(c(Z,L)\) such that \[\begin{align} \mathbb{E}[A^{(a)}\pi(Z,L)\mid U,L] =& \mathbb{E}[\mathbb{E}[A^{(a)}\mid Z,U,L]\pi(Z,L)\mid U,L]\\ =&\mathbb{E}[\{b(U,L)+c(Z,L)\}\pi(Z,L)\mid U,L]\\ =&b(U,L)\mathbb{E}[\pi(Z,L)\mid U,L] + \mathbb{E}[c(Z,L)\pi(Z,L)\mid U,L]\\ =&b(U,L)\mathbb{E}[\pi(Z,L)\mid L] + \mathbb{E}[c(Z,L)\pi(Z,L)\mid L];\\ \mathbb{E}[A^{(a)}\mid U,L]=&b(U,L) + \mathbb{E}[c(Z,L)\mid L];\\ \mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid U,L\}=& \mathbb{E}[c(Z,L)\pi(Z,L)\mid L]- \mathbb{E}[c(Z,L)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathrm{Cov}\{c(Z,L),\pi(Z,L)\mid L\}. \end{align}\] Analogously, one can verify that \[\begin{align} \label{exp:100} \mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid L\} = \mathrm{Cov}\{c(Z,L),\pi(Z,L)\mid L\} =\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid U,L\}. \end{align}\tag{13}\] Conversely, if Equation 13 holds for any \(\pi(Z,L)\), then, since \(\mathbb{E}[\pi(Z,L)\mid U,L]=\mathbb{E}[\pi(Z,L)\mid L]\) by Assumption 3, we know that \[\begin{align} &\mathbb{E}[A^{(a)}\pi(Z,L)\mid U,L]-\mathbb{E}[A^{(a)}\pi(Z,L)\mid L]\\ =&\{\mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]\}\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\]
For any subset \(\mathcal{S}\subseteq \mathcal{Z}\) with \(\Pr(Z\in\mathcal{S}\mid L)>0\), we take \(\pi(Z,L)=I(Z\in \mathcal{S})\) and obtain \[\begin{align} &\mathbb{E}[A^{(a)}I(Z\in \mathcal{S})\mid U,L]-\mathbb{E}[A^{(a)}I(Z\in \mathcal{S})\mid L]\\ =&\{\mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]\}\mathbb{E}[I(Z\in \mathcal{S})\mid L].\\ \Rightarrow\quad & \mathbb{E}[A^{(a)}\mid Z\in \mathcal{S}, U,L]-\mathbb{E}[A^{(a)}\mid Z\in \mathcal{S},L]= \mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]. \end{align}\] Since \(\mathcal{S}\) is arbitrary, this implies that \[\begin{align} \mathbb{E}[A^{(a)}\mid Z, U,L]-\mathbb{E}[A^{(a)}\mid Z,L]= \mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]. \end{align}\] We can take \(b(U,L) = \mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]\) and \(c(Z,L) = \mathbb{E}[A^{(a)}\mid Z,L]\), and verify that \[\begin{align} \mathbb{E}[A^{(a)}\mid Z, U,L] = b(U,L)+c(Z,L), \end{align}\] finishing the proof for Proposition 1.
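To make the additive structure in Proposition 1 concrete, the following is a minimal Monte Carlo sketch (our own illustration; the discrete design, variable names, and constants below are assumptions made purely for this example). It generates data with \(\Pr(A=1\mid Z,U,L)=b(U,L)+c(Z,L)\) and \(Z\perp\!\!\!\perp U\mid L\), and checks numerically that the conditional covariance in Equation 13 is unchanged after further conditioning on \(U\).

```python
# Minimal Monte Carlo sketch (illustration only; design and constants are assumptions).
# Under the additive model Pr(A = 1 | Z, U, L) = b(U, L) + c(Z, L) with Z independent of
# U given L, Cov{A, pi(Z, L) | L} should agree with Cov{A, pi(Z, L) | U, L} (Equation 13).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

L = rng.integers(0, 2, n)                     # observed covariate
U = rng.integers(0, 2, n)                     # unmeasured confounder
Z = rng.integers(0, 3, n)                     # instrument, independent of U given L

b = 0.15 + 0.20 * U + 0.05 * L                # b(U, L)
c = 0.10 * Z + 0.05 * L * Z                   # c(Z, L)
A = rng.binomial(1, b + c)                    # additive treatment model, probabilities in (0, 1)

pi = (Z >= 1) * (1.0 + L)                     # an arbitrary bounded weighting function pi(Z, L)

def cond_cov(x, y, groups):
    """Within-group sample covariances Cov{x, y | groups}."""
    return {g: np.cov(x[groups == g], y[groups == g])[0, 1] for g in np.unique(groups)}

# Keys 0/1 index L; pairs (L, U) are encoded as L + 2 * U.  Up to Monte Carlo error,
# the cells (L = l, U = 0) and (L = l, U = 1) both reproduce the value for L = l.
print(cond_cov(A, pi, L))
print(cond_cov(A, pi, L + 2 * U))
```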
Without loss of generality, we only consider the case that \(A=1\). If Assumption 5 holds, we can take \(\pi^{o}(Z,L) = \Pr(A=1 \mid Z,L)\) to verify that \(\pi^{o}(Z,L)\) is an RWF. Indeed, \(\pi^{o}(Z,L)\) is uniformly bounded by one, and \[\mathrm{Cov}\!\{A,\pi^{o}(Z,L) \mid L\} = \mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\}\] is uniformly bounded below by some positive constant \(\epsilon_{0} > 0\).
Conversely, suppose there exists an RWF \(\pi(Z,L)\). By the Cauchy–Schwarz inequality, we have \[\mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\} \ge \frac{\left|\mathrm{Cov}\!\{\Pr(A=1 \mid Z,L), \pi(Z,L) \mid L\}\right|^2}{\mathrm{Var}\!\{\pi(Z,L) \mid L\}}.\] Noting that \(\mathrm{Cov}\!\{\Pr(A=1 \mid Z,L), \pi(Z,L) \mid L\} = \mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}\), the RWF property yields \[\mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\} \ge \frac{\epsilon_{0}^2}{\sup_{L \in \mathcal{L}} \mathrm{Var}\!\{\pi(Z,L) \mid L\}}.\]
The uniform boundedness of \(\pi(Z,L)\) implies that \(\mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\}\) is uniformly bounded away from zero. Thus, \(\pi^o(Z,L):=\Pr(A=1 \mid Z,L)\) is itself an RWF for \(A\). This completes the proof of the statement concerning Assumption 5.
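As a concrete illustration (our own, not part of the proof), consider a binary instrument \(Z\in\{0,1\}\) and write \(e(L)=\Pr(Z=1\mid L)\) and \(p_z(L)=\Pr(A=1\mid Z=z,L)\). A direct calculation gives \[\mathrm{Var}\!\{\Pr(A=1\mid Z,L)\mid L\} = e(L)\{1-e(L)\}\{p_1(L)-p_0(L)\}^2,\] so the required lower bound holds, for example, whenever both \(e(L)\{1-e(L)\}\) and \(|p_1(L)-p_0(L)|\) are uniformly bounded away from zero, that is, the instrument is non-degenerate and uniformly relevant.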
Define \(g_a(U,L):=\mathbb{E}[Y(a)\mid U,L]\). We can observe that the left side of Equation 1 equals to \[\begin{align} &\mathbb{E}[A^{(a)}Y\mid Z,L]=\mathbb{E}[\mathbb{E}[A^{(a)}Y\mid U,Z,L]\mid Z,L]\notag\\ =&\mathbb{E}\left[\mathbb{E}[Y\mid U,L,Z,A=a]\Pr(A=a\mid Z,U,L)\middle| Z,L\right]\notag\\ =&\mathbb{E}\left[\mathbb{E}[Y(a)\mid U,L,Z,A=a]\Pr(A=a\mid Z,U,L)\middle| Z,L\right]\tag{14}\\ =&\mathbb{E}\left[g_a(U,L)\Pr(A=a\mid Z,U,L)\middle| Z,L\right].\tag{15} \end{align}\] Equation 14 follows from Assumption 1 that \(Y=Y(a)\). Equation 15 follows from Assumption 2 that \(Y(a)\perp\!\!\!\perp\{Z,A\}\mid L,U\). Similarly, the right side of Equation 1 equals to \[\begin{align} &\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L]=\mathbb{E}[\mathbb{E}[f_a^o(A^{(a)},L)\mid U,Z,L]\mid Z,L]\\ =&\mathbb{E}\left[f_a^o(1,L)\Pr(A=a\mid Z,U,L)\middle| Z,L\right]+\mathbb{E}\left[f_a^o(0,L)\{1-\Pr(A=a\mid Z,U,L)\}\middle| Z,L\right]\\ =&\mathbb{E}\left[\{f_a^o(1,L)-f_a^o(0,L)\}\Pr(A=a\mid Z,U,L)\middle| Z,L\right]+f_a^o(0,L). \end{align}\] Next, from Assumption 3 that \(Z\perp\!\!\!\perp U\mid L\), we see that for any \(z\in \mathcal{Z}\), \[\label{exp:6} \begin{align} &\mathbb{E}\left[g_a(U,L)\Pr(A=a\mid Z=z,U,L)\middle| L\right]\\ =&\mathbb{E}\left[\{f_a^o(1,L)-f_a^o(0,L)\}\Pr(A=a\mid Z=z,U,L)\middle| L\right] +f_a^o(0,L). \end{align}\tag{16}\] Then from Equation 16 and Assumption 1 that there exists \(b(U,L)\) and \(c(Z,L)\), such that \[\Pr(A=a\mid Z,U,L)=b(U,L)+c(Z,L).\] Then for any \(z\), it holds that \[\begin{align} f_a^o(0,L) =&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\{b(U,L)+c(z,L)\}\middle| L\right]\notag \\=&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L\right]\label{exp:1}\\ &+\mathbb{E}\left[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\middle| L\right]c(z,L).\notag \end{align}\tag{17}\] From Assumption 4, for any \(l\), there exists \(z_1,z_2\), such that \(c(z_1,l)\neq c(z_2,l)\), and we can take difference to get \[\begin{align} &\mathbb{E}[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\mid L=l]\{c(z_1,l)-c(z_2,l)\}= 0.\\ \Rightarrow& \mathbb{E}[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\mid L=l]= 0. \end{align}\] Then we see that \(\mathbb{E}[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\mid L]\equiv 0.\) Next, note that for \(a\in\mathcal{A}\), \[\mathbb{E}[g_a(U,L)\mid L]=\mathbb{E}[\mathbb{E}[Y(a)\mid L,U]\mid L]=\mathbb{E}[Y(a)\mid L].\] Now we know that \(f_a^o(1,L)-f_a^o(0,L)=\mathbb{E}[Y(a)\mid L].\) We can now substitute this into Equation 17 to deduce that \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[\{\mathbb{E}[Y(a)\mid U,L]-\mathbb{E}[Y(a)\mid L]\}b(U,L)\middle| L\right]\\ =&\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\},\\ f_a^o(1,L)=&\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\} +\mathbb{E}[Y(a)\mid L]. \end{align}\] This step follows from Assumption 3 that \(\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L],c(Z,L)\mid L\}=0\). This is equivalent to say that there only exist one solution \(f_a^o(A^{(a)},L)\) to Equation 1 , finishing the proof for Proposition 3.
From the proof for Proposition 3, we can deduce that \[\begin{align} &\mathbb{E}[Y\mid Z=z,L]=\mathbb{E}[\mathbb{E}[AY\mid Z,U,L]+\mathbb{E}[(1-A)Y\mid Z,U,L]\mid Z=z,L]\\ =&\mathbb{E}\left[g_1(U,L)\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[g_0(U,L)\Pr(A=0\mid Z=z,U,L)\middle| L\right]\\ =&\mathbb{E}\left[\{g_1(U,L)-g_0(U,L)\}\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[g_0(U,L)\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]. \end{align}\] Then from Equation ?? , we know that \[\label{exp:3} \begin{align} &\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}\left[\{f^o(1,L)-f^o(0,L)\}\Pr(A=1\mid Z=z,U,L)\middle| L\right] +f^o(0,L). \end{align}\tag{18}\] First, we consider the case when \(Z\) is an AIV for \(A\). In this case, there exist \(b(U,L)\) and \(c(Z,L)\) such that \(\Pr(A=1\mid Z,U,L)=b(U,L)+c(Z,L)\); then \[\begin{align} f^o(0,L)=&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &-\mathbb{E}\left[\{f^o(1,L)-f^o(0,L)\}\Pr(A=1\mid Z=z,U,L)\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\{b(U,L)+c(z,L)\}\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &-\mathbb{E}\left[\{f^o(1,L)-f^o(0,L)\}\{b(U,L)+c(z,L)\}\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid U,L]b(U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &+c(z,L)\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid U,L]\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid U,L]b(U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &+c(z,L)\mathbb{E}\left[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\middle| L\right]. \end{align}\] Under Assumption 4, for any \(l\in\mathcal{L}\), there exist \(z_1,z_2\in\mathcal{Z}\) such that \[c(z_1,l)-c(z_2,l)=\Pr(A=1\mid Z=z_1,U,L=l)- \Pr(A=1\mid Z=z_2,U,L=l)\neq 0,\] then we know that \[\begin{align} &\{c(z_1,l)-c(z_2,l)\}\mathbb{E}\left[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\middle| L=l\right]\equiv 0.\\ \Rightarrow &f^o(1,L)-f^o(0,L)= \mathbb{E}\left[Y(1)-Y(0)\middle|L\right]. \end{align}\] Now we can substitute \(\mathbb{E}\left[Y(1)-Y(0)\middle|L\right]\) for \(f^o(1,L)-f^o(0,L)\) to deduce that \[\begin{align} f^o(0,L)=&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-\mathbb{E}\left[Y(1)-Y(0)\middle|L\right]\mid U,L]b(U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],b(U,L)\mid L\right\}+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(0)\middle| L\right],\\ f^o(1,L)=&f^o(0,L)+\mathbb{E}\left[Y(1)-Y(0)\middle|L\right]\\ =&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(1)\middle| L\right]. \end{align}\] Second, we consider the case when \(Y(1)-Y(0)\perp\!\!\!\perp U\mid L\). In this case, we only need to show that \[\begin{align} f^o(0,L)=&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}\left[Y(0)\middle| L\right],\\ f^o(1,L)=&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(1)\middle| L\right]\\ =&\mathbb{E}\left[Y(1)\middle| L\right].
\end{align}\] From Equation 18, we see that \[\begin{align} &\mathbb{E}[Y(1)-Y(0)\mid L]\mathbb{E}\left[\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\{f^o(1,L)-f^o(0,L)\}\mathbb{E}\left[\Pr(A=1\mid Z=z,U,L)\middle| L\right] +f^o(0,L). \end{align}\] Now we deduce that \[\begin{align} f^o(0,L)=&\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid L]\\ &\times\mathbb{E}\left[\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid L]\\ &\times\Pr(A=1\mid Z=z,L)+\mathbb{E}\left[Y(0)\middle| L\right]. \end{align}\] Under Assumption 4, for any \(l\in\mathcal{L}\), there exist \(z_1,z_2\in\mathcal{Z}\) such that \(\Pr(A=1\mid Z=z_1,L=l)- \Pr(A=1\mid Z=z_2,L=l)\neq 0\). Then we know that \[\begin{align} &\{\Pr(A=1\mid Z=z_1,L=l)-\Pr(A=1\mid Z=z_2,L=l)\}\\ &\times\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid L=l]=0,\text{ for any } l\in\mathcal{L}.\\ \Rightarrow&f^o(1,L)-f^o(0,L) = \mathbb{E}[Y(1)-Y(0)\mid L]. \end{align}\] Substituting this back into the previous display, the first factor vanishes, which establishes that \(f^o(0, L) = \mathbb{E}[Y(0) \mid L]\) and hence \(f^o(1, L) = \mathbb{E}[Y(1) \mid L]\). The proof of Equation ?? follows analogously to the arguments in Proposition 3, and is therefore omitted for brevity. This concludes the proof of Proposition 4.
Finally, when the AIV condition does not hold, we prove that \[\mathbb{E}\left[\dfrac{\mathrm{Cov}\{Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\{A,\pi(Z,L)\mid L\}}\right] =\mathbb{E}\left[\dfrac{\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}}{\mathrm{Cov}\{A,\pi(Z,L)\mid L\}}\mathbb{E}[Y(1)-Y(0)\mid U,L]\right],\] which is a weighted average of the conditional ATE \(\mathbb{E}[Y(1)-Y(0)\mid U,L]\). In fact, \[\begin{align} &\mathrm{Cov}\{Y,\pi(Z,L)\mid L\} = \mathbb{E}\left[\mathbb{E}[Y|Z,L]\pi(Z,L)\middle| L\right] - \mathbb{E}[Y\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\mathbb{E}[A\{Y(1)-Y(0)\}+Y(0)|Z,L]\pi(Z,L)\middle| L\right] \\&- \mathbb{E}[A\{Y(1)-Y(0)\}+Y(0)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\left\{\mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]+\mathbb{E}[Y(0)\mid U,L]\mid Z,L]\right\}\pi(Z,L)\middle| L\right] \\&- \mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]+\mathbb{E}[Y(0)\mid U,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[A\mathbb{E}[Y(1)-Y(0)\mid U,L]\pi(Z,L)\middle| L\right] +\mathbb{E}\left[\mathbb{E}[Y(0)\mid L]\pi(Z,L)\middle| L\right] \\ &- \mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L] - \mathbb{E}[Y(0)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\mathbb{E}[A\pi(Z,L)\mid U,L]\mathbb{E}[Y(1)-Y(0)\mid U,L]\middle| L\right] \\ &- \mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\mathbb{E}[A\pi(Z,L)\mid U,L]\mathbb{E}[Y(1)-Y(0)\mid U,L]\middle| L\right] \\ &- \mathbb{E}[\mathbb{E}[\pi(Z,L)\mid L]\mathbb{E}[A\mid U,L]\mathbb{E}[Y(1)-Y(0)\mid U,L]\mid L]\\ =&\mathbb{E}\left[\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}\mathbb{E}[Y(1)-Y(0)\mid U,L]\middle| L\right]. \end{align}\] Then we can show that \[\begin{align} \dfrac{\mathrm{Cov}\{Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\{A,\pi(Z,L)\mid L\}} =\mathbb{E}\left[\dfrac{\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}\mathbb{E}[Y(1)-Y(0)\mid U,L]} {\mathrm{Cov}\{A,\pi(Z,L)\mid L\}}\middle| L\right], \end{align}\] and the stated identity follows by taking expectations over \(L\).
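For completeness, we add a short supporting calculation (our own) showing that the weights above average to one within levels of \(L\): since \(Z\perp\!\!\!\perp U\mid L\) implies \(\mathbb{E}[\pi(Z,L)\mid U,L]=\mathbb{E}[\pi(Z,L)\mid L]\), \[\mathbb{E}\left[\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}\middle| L\right] = \mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L] = \mathrm{Cov}\{A,\pi(Z,L)\mid L\},\] so the display is a genuine weighted average of \(\mathbb{E}[Y(1)-Y(0)\mid U,L]\), although the weights need not be nonnegative without further conditions.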
The proof of this theorem follows the same approach as Proposition 9, which establishes a parallel result in the longitudinal setting.
Recall that \[\begin{align} &\pi^o(Z,L)=\Pr(A=1\mid Z,L), \\ &\gamma^o(L):=\dfrac{\mathrm{Cov}\!\{Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}=\dfrac{\zeta_{\pi}^o(L)-\eta^o(L)\rho_{\pi}^o(L)} {\kappa_{\pi}^o(L)}. \end{align}\] We can see from Proposition 4 that \(\gamma^o(L)\) does not depend on the choice of \(\pi(Z,L)\) and that it solves the nonparametric IV problem in Equation ?? , namely \[\begin{align} \mathbb{E}[Y\mid Z,L]=\mathbb{E}[f^o(A,L)\mid Z,L]=\mathbb{E}[f^o(0,L)+A\gamma^o(L)\mid Z,L]. \end{align}\] From this equality, we can see that \(\mathbb{E}[Y-A\gamma^o(L)\mid Z,L]=f^o(0,L)\). First, \[\begin{align} \mathrm{Var}\!\{\mathbb{E}[\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\mid L]\}= \mathrm{Var}\left\{\dfrac{\zeta_{\pi}^o(L)-\eta^o(L)\rho_{\pi}^o(L)} {\kappa_{\pi}^o(L)}\right\} =\mathrm{Var}\left\{\gamma^o(L)\right\}, \end{align}\] which is a constant that does not depend on \(\pi(Z,L)\). Next, \[\begin{align} &\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o) \mid L\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} Y\{\pi(Z,L)-\rho_{\pi}^o(L)\}-\eta^o(L)\pi(Z,L)\\ -\gamma^o(L)\cdot \left\{A(\pi(Z,L)-\rho_{\pi}^o(L))-\delta^o(L)\pi(Z,L)\right\} \end{array}\middle| L \right\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} \{Y-A\gamma^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\\ +\{\gamma^o(L)\delta^o(L)-\eta^o(L)\}\pi(Z,L) \end{array}\middle| L \right\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} \{Y-A\gamma^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\\ -\mathbb{E}[Y-A\gamma^o(L)\mid L]\pi(Z,L) \end{array}\middle| L \right\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} \left(\{Y-A\gamma^o(L)\}-\mathbb{E}[Y-A\gamma^o(L)\mid L]\right)\\ \cdot\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array}\middle| L \right\}. \end{align}\] Moreover, \(\mathbb{E}[Y-A\gamma^o(L)\mid L]=\mathbb{E}[\mathbb{E}[Y-A\gamma^o(L)\mid Z,L]\mid L]=f^o(0,L)\). We denote \[\begin{align} W=&Y-A\gamma^o(L)-\mathbb{E}[\{Y-A\gamma^o(L)\}\mid L]\\ =&Y-A\gamma^o(L)-f^o(0,L)=Y-f^o(A,L). \end{align}\] Since \(\mathbb{E}[W\mid Z,L]=\mathbb{E}[Y-f^o(A,L)\mid Z,L]=0\), and from the assumption that \(\mathbb{E}[W^2\mid Z,L]\) does not depend on \(Z\), we apply the Cauchy–Schwarz inequality to deduce that \[\begin{align} &\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o) \mid L\} =\dfrac{\mathbb{E}[W^2\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]}{\{\kappa_{\pi}^o(L)\}^2}\\ =&\dfrac{\mathbb{E}[W^2\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]} {\{\mathbb{E}[\{A-\delta^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\mid L]\}^2}\\ =&\dfrac{\mathbb{E}[\mathbb{E}[W^2\mid Z,L]\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]} {\{\mathbb{E}[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\mid L]\}^2}\\ =&\dfrac{\mathbb{E}[W^2\mid L]\mathbb{E}[\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]} {\{\mathbb{E}[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\mid L]\}^2}\\ \geq&\mathbb{E}[W^2\mid L]\left\{\mathbb{E}\left[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}^2\middle| L\right]\right\}^{-1}. \end{align}\] The inequality holds with equality only when there exists \(f(L)\) such that \[\begin{align} \pi(Z,L)-\rho_{\pi}^o(L)=f(L)\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}. \end{align}\] In particular, when \(\pi(Z,L)=\pi^o(Z,L)\), the lower bound is attained.
Now we verify the fact that \[\begin{align} &\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\}\\ =&\mathrm{Var}\!\{\mathbb{E}[\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\mid L]\} +\mathbb{E}\left[\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\mid L\}\right]\\ \geq&\mathrm{Var}\left\{\gamma^o(L)\right\}+\mathbb{E}\left[ \mathbb{E}[W^2\mid L]\left\{\mathbb{E}\left[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}^2\middle| L\right]\right\}^{-1}\right]\\ =&\mathrm{Var}\!\{\varphi_{\pi^o}(O;\psi_{ada}^o,\alpha_{\pi^o}^o)\}, \end{align}\] finishing the proof for Proposition 6.
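A small remark of ours, which follows directly from the equality condition in the Cauchy–Schwarz step above: under the same condition on \(\mathbb{E}[W^2\mid Z,L]\), the optimal weighting function is unique only up to conditionally affine transformations, since any \[\tilde{\pi}(Z,L) = f(L)\,\pi^o(Z,L) + g(L),\qquad \inf_{l}|f(l)|>0,\quad \sup_{l}\{|f(l)|+|g(l)|\}<\infty,\] satisfies \(\tilde{\pi}(Z,L)-\mathbb{E}[\tilde{\pi}(Z,L)\mid L]=f(L)\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}\) and therefore attains the same variance bound.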
We finish the proof by calculating \[\begin{align} &\mathbb{E}\left[\varphi(O;\psi_{ada}^o,\beta)\right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\pi(Z,L)\xi^o(Z,L)-\delta(L)\eta^o(L)}{\kappa(L)}-\psi_{ada}^o\\ +\gamma(L)\left\{ 1 - \dfrac{(A-\delta(L))^2-(A-\pi(Z,L))^2}{\kappa(L)} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\xi(Z,L)(\pi^o(Z,L)-\pi(Z,L))-\eta(L)(\delta^o(L)-\delta(L))\right\} \end{array} \right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\pi^o(Z,L)\xi^o(Z,L)-\delta^o(L)\eta^o(L)}{\kappa(L)}-\psi_{ada}^o\\ +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \begin{array}{l} \kappa(L) - (A-\delta^o(L)+\delta^o(L)-\delta(L))^2\\ +(A-\pi^o(Z,L)+\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\begin{array}{l} (\xi(Z,L)-\xi^o(Z,L))(\pi^o(Z,L)-\pi(Z,L))\\ -(\eta(L)-\eta^o(L))(\delta^o(L)-\delta(L)) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\kappa^o(L)}{\kappa(L)}\gamma^o(L)-\psi_{ada}^o\\ +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \begin{array}{l} \kappa(L) - \kappa^o(L) - (\delta^o(L)-\delta(L))^2\\ +(\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\begin{array}{l} (\xi(Z,L)-\xi^o(Z,L))(\pi^o(Z,L)-\pi(Z,L))\\ -(\eta(L)-\eta^o(L))(\delta^o(L)-\delta(L)) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\kappa^o(L)}{\kappa(L)}\gamma^o(L) +\dfrac{\gamma(L)}{\kappa(L)}\{\kappa(L) - \kappa^o(L)\} -\psi_{ada}^o\\ +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \begin{array}{l} - (\delta^o(L)-\delta(L))^2\\ +(\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\begin{array}{l} (\xi(Z,L)-\xi^o(Z,L))(\pi^o(Z,L)-\pi(Z,L))\\ -(\eta(L)-\eta^o(L))(\delta^o(L)-\delta(L)) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[\dfrac{1}{\kappa(L)}\left\{\begin{array}{l} \{\gamma(L)-\gamma^o(L)\}\{\kappa(L) - \kappa^o(L)\}\\ +\gamma(L)\left\{ \begin{array}{l} - (\delta^o(L)-\delta(L))^2\\ +(\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ -(\xi(Z,L)-\xi^o(Z,L))(\pi(Z,L)-\pi^o(Z,L))\\ +(\eta(L)-\eta^o(L))(\delta(L)-\delta^o(L)) \end{array}\right\} \right]. \end{align}\]
We provide a proof for the following result, which is stronger than the results in Equation ?? . For any \(s\) with \(0 \leq s \leq T+1\) and \(r\) with \(0 \leq r \leq T+1-s\), we have \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s}) \mid H_s]\notag\\ =&\mathbb{E}\left[\prod_{t=s}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r})\middle| H_s \right]\label{exp:13}. \end{align}\tag{19}\] which is clearly a stronger version of Equation ?? . We establish the result by induction. When \(s = T+1\), Equation 19 reduces to \(\mathbb{E}[Y(\overline{A}) \mid H_{T+1}] = \mathbb{E}[Y \mid H_{T+1}]\), which follows directly from the consistency assumption (Assumption 6). Now, assume that Equation 19 holds for all \(s_c = T+1, T, \ldots, s+1\) and \(r_c\) satisfying \(s_c + r_c \leq T+1\). We verify that the equation also holds for \(s_c = s\) and all \(r_c = 0, \ldots, T+1 - s\). This completes the proof by induction. In particular, we begin by verifying the case \(r_c = T+1 - s\). By the definition of \(\gamma_{s,\underline{a}_s}^o\), this is equivalent to verifying that \[\label{exp:11} \begin{align} &\mathbb{E}[Y(\underline{a}_{s})|H_s] =\gamma_{s,\underline{a}_s}^o(H_{s}) :=\frac{\mathrm{Cov}\bigl\{ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}), \pi_s(Z_s, H_s) \mid H_s \bigr\}}{\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}. \end{align}\tag{20}\] By induction, \(\mathbb{E}[Y(\underline{a}_{s+1})|H_{s+1}]=\gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\). Then \[\begin{align} &\mathbb{E}\left[ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ I\{A_s = a_s\} \mathbb{E}[Y(\underline{a}_{s+1})\mid H_{s+1}]\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ I\{A_s = a_s\} Y(\underline{a}_{s+1})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ I\{A_s = a_s\} Y(\underline{a}_{s})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ \mathbb{E}[Y(\underline{a}_{s})\mid A_s=a_s,Z_s,H_s]\times \mathbb{E}[I\{A_s = a_s\}\mid Z_s,H_s] \pi_t(Z_s, H_s)\middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times\mathbb{E}\left[ \Pr(A_s = a_s\mid Z_s,H_s) \pi_t(Z_s, H_s)\middle| H_s \right]. \end{align}\] Since \(A_s\) is an AIV for \(H_s\), there exists functions \(b_{s,a_s}(\overline{U}_s,H_s)\) and \(c_{s,a_s}(Z_s,H_s)\) such that, for any \(a_s \in \mathcal{A}_s\), \[\Pr(A_s = a_s \mid Z_s, \overline{U}_s, H_s) = b_{s,a_s}(\overline{U}_s, H_s) + c_{s,a_s}(Z_s, H_s).\] Then we can deduce that \[\begin{align} &\mathbb{E}\left[ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times\mathbb{E}\left[ \left\{b_{s,a_s}(\overline{U}_s, H_s) + c_{s,a_s}(Z_s, H_s)\right\} \pi_t(Z_s, H_s)\middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \left\{\begin{array}{l} \mathbb{E}\left[ b_{s,a_s}(\overline{U}_s, H_s)\middle| H_s\right]\times\mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right]\\ +\mathbb{E}\left[ c_{s,a_s}(Z_s, H_s) \pi_t(Z_s, H_s)\middle| H_s \right] \end{array}\right\}. 
\end{align}\] Similarly, we can calculate \[\begin{align} &\mathbb{E}\left[ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}) \middle| H_s \right] \times\mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \left\{\begin{array}{l} \mathbb{E}\left[ b_{s,a_s}(\overline{U}_s, H_s)\middle| H_s\right]\\ +\mathbb{E}\left[ c_{s,a_s}(Z_s, H_s) \middle| H_s \right] \end{array}\right\}\times\mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right];\\\\ &\mathrm{Cov}\bigl\{ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \left\{\begin{array}{l} \mathbb{E}\left[ c_{s,a_s}(Z_s, H_s) \pi_t(Z_s, H_s)\middle| H_s \right]\\ -\mathbb{E}\left[ c_{s,a_s}(Z_s, H_s)\middle| H_s \right]\times \mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right] \end{array}\right\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{c_{s,a_s}(Z_s, H_s), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{ b_{s,a_s}(\overline{U}_s, H_s) +c_{s,a_s}(Z_s, H_s), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{\Pr(A_s = a_s \mid Z_s, \overline{U}_s, H_s), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{I\{A_s = a_s\} , \pi_s(Z_s, H_s) \mid H_s \bigr\}. \end{align}\] This finishes the proof for Equation 20 , and we finish the proof for \(s_c=s\) and \(r_c=T+1-s\). For \(r_c=T-s\), \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s})|H_s] =\frac{\mathrm{Cov}\bigl\{ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}), \pi_s(Z_s, H_s) \mid H_s \bigr\}}{\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}\notag\\ =&\mathbb{E}\left[\frac{ \left\{\pi_s(Z_s, H_s)-\mathbb{E}\left[\pi_s(Z_s, H_s)\middle| H_s\right]\right\} I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})}{\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}\middle| H_s\right].\label{exp:12} \end{align}\tag{21}\] By induction, \(\gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\) equals to \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s+1}) \mid H_{s+1}]\notag\\ =&\mathbb{E}\left[\prod_{t=s+1}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r})\middle| H_{s+1} \right]. \end{align}\] We can substitute this quantity into Equation 21 to deduce that \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s})|H_s]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{ \left\{\pi_s(Z_s, H_s)-\mathbb{E}\left[\pi_s(Z_s, H_s)\middle| H_s\right]\right\} I\{A_s = a_s\}} {\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}\\ \times \displaystyle\prod_{t=s+1}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r}) \end{array}\middle| H_s\right]\\ =&\mathbb{E}\left[\prod_{t=s}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r})\middle| H_{s} \right]. \end{align}\] This finishes the proof for Equation 19 .
We first calculate that \[\begin{align} &\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t,\overline{a}},\underline{\alpha}_{t+1,\overline{a}}^o) -\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t-1,\overline{a}},\underline{\alpha}_{t,\overline{a}}^o)\\ =&\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)} \times\left( \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} -\dfrac{\left\{Z_t-\rho_t^o(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \right)\\ &\times\left\{ \begin{array}{l} \displaystyle\prod_{s=t+1}^{T}\dfrac{\left\{Z_s-\rho_t^o(H_t)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}Y-\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\\ +\displaystyle\sum_{s=t+1}^T\left(\displaystyle\prod_{r=t+1}^{s-1} \dfrac{\left\{Z_r-\rho_r^o(H_r)\right\}A_r^{(a_r)}} {\kappa_{r,a_r}^o(H_r) - \delta_{r,a_r}^o(H_r)\rho_r^o(H_r)}\right)\\ \times\left\{ \left(1-\dfrac{\{A_s^{(a_s)}-\delta_{s,a_s}^o(H_s)\}\{Z_s-\rho_s^o(H_s)\}}{\kappa_{s,a_s}^o(H_s)} \right)\gamma_{s,\underline{a}_s}^o(H_s) - \dfrac{(Z_s-\rho_s^o(H_s))\times\eta_{s,\underline{a}_s}^o(H_s)}{\kappa_{s,a_s}^o(H_s)}\right\} \end{array} \right\}\\ &+\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\\ &\times\left\{\begin{array}{l} \left( \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} -\dfrac{\left\{Z_t-\rho_t^o(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \right)\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\\ +\left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}\{Z_t-\rho_t(H_t)\}}{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\times\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\\ -\left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)\}\{Z_t-\rho_t^o(H_t)\}}{\kappa_{t,a_t}^o(H_t)} \right)\gamma_{t,\underline{a}_t}^o(H_t) - \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)}{\kappa_{t,a_t}^o(H_t)} \end{array}\right\}\\ =&\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)} \times\left( \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} -\dfrac{\left\{Z_t-\rho_t^o(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \right)Q_{11}\\ &+\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)} \left\{ Q_{12}+Q_{13}-Q_{14} \right\}. \end{align}\] Since the nuisance functions in \(Q_{11}\) all equal to the corresponding true form, it is straight forward to verify that \(\mathbb{E}[Q_{11}\mid H_{t+1}]\equiv 0\). Similarly, one can verify that \(\mathbb{E}[Q_{14}\mid H_t]\equiv 0\). 
From \[\begin{align} &\mathbb{E}\left[\dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})} {\kappa_{t,a_t}(H_t)}\middle| H_t\right]\\ =&\dfrac{\mathbb{E}\left[\left\{ Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\middle| H_t\right]} {\kappa_{t,a_t}(H_t)}\\ =&\dfrac{\zeta_{t,\underline{a}_t}^o(H_t)-\rho_t(H_t)\eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}(H_t)} =\dfrac{\gamma_{t,\underline{a}_t}^o(H_t)\kappa_{t,a_t}^o(H_t)+\{\rho_t^o(H_t)-\rho_t(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}(H_t)}, \end{align}\] we know that \[\begin{align} &\mathbb{E}[Q_{12}\mid H_t]=\gamma_{t,\underline{a}_t}^o(H_t)\left(\dfrac{\kappa_{t,a_t}^o(H_t)}{\kappa_{t,a_t}(H_t)}-1\right) +\dfrac{\{\rho_t^o(H_t)-\rho_t(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}(H_t)};\\ &\mathbb{E}[Q_{13}\mid H_t]\\=& \mathbb{E}\left[\left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}\{Z_t-\rho_t(H_t)\}}{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\middle| H_t\right]\\=& \left(1-\dfrac{ \mathbb{E}\left[\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}\{Z_t-\rho_t(H_t)\}\middle| H_t\right] }{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(\rho_t^o(H_t)-\rho_t(H_t))\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\\=& \left(1-\dfrac{ \kappa_{t,a_t}^o(H_t) }{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) +\dfrac{1}{\kappa_{t,a_t}(H_t)}(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t)\\&- \dfrac{(\rho_t^o(H_t)-\rho_t(H_t))\times\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)};\\ &\mathbb{E}[Q_{12}+Q_{13}\mid H_t]\\=& \dfrac{1}{\kappa_{t,a_t}(H_t)}\left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ &+\dfrac{1}{\kappa_{t,a_t}(H_t)}(\rho_t(H_t)-\rho_t^o(H_t))\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ &+\dfrac{1}{\kappa_{t,a_t}(H_t)}(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t). \end{align}\] In summary, we can deduce that \[\begin{align} &\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t,\overline{a}},\underline{\alpha}_{t+1,\overline{a}}^o) -\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t-1,\overline{a}},\underline{\alpha}_{t,\overline{a}}^o)\middle| H_t\right] =\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\\ &\times \dfrac{1}{\kappa_{t,a_t}(H_t)}\left\{\begin{array}{l} \left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\rho_t(H_t)-\rho_t^o(H_t))\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t). \end{array}\right\}. 
\end{align}\] Finally, we deduce that \[\begin{align} &\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}})\right]\\ =&\sum_{t=0}^T\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t,\overline{a}},\underline{\alpha}_{t+1,\overline{a}}^o) -\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t-1,\overline{a}},\underline{\alpha}_{t,\overline{a}}^o)\right]\\ =&\mathbb{E}\left[\begin{array}{l} \displaystyle\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\right) \times \dfrac{1}{\kappa_{t,a_t}(H_t)}\\ \times\left\{\begin{array}{l} \left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\rho_t(H_t)-\rho_t^o(H_t))\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t) \end{array}\right\} \end{array}\right]. \end{align}\] This finishes the proof for Proposition 9.
From Equation 16 , we see that \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[g_a(U,L)\{ 1-\Pr(A\neq a\mid Z=z,U,L)\}\middle| L\right]\\ &-\mathbb{E}\left[\{f_a^o(1,L)-f_a^o(0,L)\}\{1-\Pr(A\neq a\mid Z=z,U,L)\}\middle| L\right]. \end{align}\] If there exists \(b(U,L)\) and \(c(Z,L)\), such that \(\Pr(A\neq a\mid Z=z,U,L)=b(U,L)c(Z,L)\), and \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\{1-b(U,L)c(z,L)\}\middle| L\right] \notag\\ =&\mathbb{E}\left[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\middle| L\right]\label{exp:2}\\ &-c(z,L)\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L\right].\notag \end{align}\tag{22}\] If for any \(l\), there exists \(z_1,z_2\), such that \(c(z_1,l)\neq c(z_2,l)\), then we can take difference to deduce that \[\begin{align} &\{c(z_1,l)-c(z_2,l)\}\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L=l\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\Pr(A\neq a\mid U,L)\middle| L\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}I(A\neq a)\middle| L\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\middle| L,A\neq a\right]=0. \end{align}\] Now we know that \[\begin{align} &f_a^o(1,L)-f_a^o(0,L)=\mathbb{E}\left[g_a(U,L)\middle| L,A\neq a\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(a)\mid U,L]\middle| L,A\neq a\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(a)\mid U,L,A\neq a]\middle| L,A\neq a\right]\\ =&\mathbb{E}\left[Y(a)\middle| L,A\neq a\right]. \end{align}\] We can now substitute this expression into Equation 22 to deduce that \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\middle| L\right]\\ =&\mathbb{E}[Y(a)\mid L]-\mathbb{E}[Y(a)\mid L,A\neq a].\\ f_a^o(1,L)=&f_a^o(0,L)+\mathbb{E}[Y(a)\mid L,A\neq a]=\mathbb{E}[Y(a)\mid L]. \end{align}\] This finishes the proof for Proposition 12.
For any measurable and bounded \(f(A,V)\), we directly calculate \[\begin{align} &\mathbb{E}\left[\omega_{\pi}^o(A, Z, L) f(A, V) \left\{Y - g(A, V; \psi_{MSM}^o)\right\}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\mathbb{E}\left[I\{A=a\}\omega_{\pi}^o(a, Z, L) \left\{Y - g(a, V; \psi_{MSM}^o)\right\}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\mathbb{E}\left[I\{A=a\} \dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} \left\{Y - g(a, V; \psi_{MSM}^o)\right\}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\mathbb{E}\left[ \dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} A^{(a)}Y\middle| L\right]\\ &-\sum_{a\in\mathcal{A}}f(a, V)g(a, V; \psi_{MSM}^o)\mathbb{E}\left[ \dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} A^{(a)}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\dfrac{\mathrm{Cov}\!\{I\{A=a\}Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} \\ &-\sum_{a\in\mathcal{A}}f(a, V)g(a, V; \psi_{MSM}^o)\dfrac{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} {\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}}\\ =&\mathbb{E}\left[\sum_{a\in\mathcal{A}}f(a, V)\{Y(a)-g(a,V;\psi_{MSM}^o)\}\middle| L\right]. \end{align}\] The last step follows from Proposition 3. Now we can take expectation with respect to \(L\) and get \[\begin{align} &\mathbb{E}\left[\omega_{\pi}^o(A, Z, L) f(A, V) \left\{Y - g(A, V; \psi_{MSM}^o)\right\}\right]\\ =&\mathbb{E}\left[\sum_{a\in\mathcal{A}}f(a, V)\{Y(a)-g(a,V;\psi_{MSM}^o)\}\right] \\ =&\mathbb{E}\left[\sum_{a\in\mathcal{A}}f(a, V)\{\mathbb{E}[Y(a)\mid V]-g(a,V;\psi_{MSM}^o)\}\right]=0. \end{align}\] The final step follows from the definition of \(g(a,V;\psi_{MSM}^o)\) in Equation 10 . Since this equality holds for any \(f(A,V)\), we conclude that \[\begin{align} \mathbb{E}\left[\omega_{\pi}^o(A, Z, L)\left\{Y - g(A, V; \psi_{MSM}^o)\right\}\mid A,V\right]=0. \end{align}\]
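The display above suggests a simple way to solve the weighted estimating equation in practice. Below is a self-contained sketch (our own illustration, not the authors' implementation): it takes a binary treatment, an empty effect-modifier set \(V\), the saturated working model \(g(a;\psi)=\psi_0+\psi_1 a\), \(f(A,V)=(1,A)^{\top}\), and \(\pi(Z,L)=Z\), with all nuisance quantities estimated by stratifying on a discrete \(L\). The data-generating design and constants are assumptions made purely for illustration; under this additive treatment model the solution should be close to \((\mathbb{E}[Y(0)],\,\text{ATE})\).

```python
# Sketch (illustration only): solve sum_i w_i f(A_i) {Y_i - psi0 - psi1 A_i} = 0, where
# w_i = {pi(Z_i, L_i) - E[pi | L_i]} / Cov{I(A = A_i), pi | L_i} is a plug-in estimate of
# the weighting function omega_pi, with pi(Z, L) = Z and nuisances estimated per L stratum.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

L = rng.integers(0, 2, n)
U = rng.integers(0, 2, n)                             # unmeasured confounder
Z = rng.integers(0, 3, n)                             # instrument, independent of U given L
p = 0.15 + 0.25 * U + 0.05 * L + 0.10 * Z             # additive treatment model (AIV holds)
A = rng.binomial(1, p)
Y = 1.0 + 2.0 * A + 1.0 * L + 1.5 * U + rng.normal(0, 1, n)   # true ATE = 2, E[Y(0)] = 2.25

pi = Z.astype(float)                                  # weighting function pi(Z, L) = Z

# Stratum-wise nuisances: E[pi | L] and Cov{I(A = a), pi | L}.
w = np.empty(n)
for l in np.unique(L):
    m = L == l
    pi_bar = pi[m].mean()
    for a in (0, 1):
        cov_a = np.cov((A[m] == a).astype(float), pi[m])[0, 1]
        idx = m & (A == a)
        w[idx] = (pi[idx] - pi_bar) / cov_a

# Solve the two-equation linear system with f(a) = (1, a)'.
F = np.column_stack([np.ones(n), A])
lhs = F.T @ (w[:, None] * F)
rhs = F.T @ (w * Y)
psi = np.linalg.solve(lhs, rhs)
print(psi)   # approximately (E[Y(0)], ATE) = (2.25, 2.0) up to Monte Carlo error
```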
We first verify an important fact: if \(Z\) is an AIV for \(A=a\), then \(Z_\mathcal{S}\) is also an AIV for \(A=a\). Since \(Z\) is an AIV for \(A=a\), there exist \(b(U,L)\) and \(c(Z,L)\) such that \[\Pr(A = a \mid Z, U, L) = b(U, L) + c(Z, L).\] We define \(\tilde{c}(Z_{\mathcal{S}},L)=\int c(z,L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid Z_\mathcal{S},L)\). Then we can use the law of total probability to verify that \[\begin{align} &\Pr(A = a \mid Z_{\mathcal{S}}=z_\mathcal{S}, U, L)\notag\\ =&\int\Pr(A = a \mid Z=z, Z_{\mathcal{S}}=z_\mathcal{S}, U, L)\text{d} P_{Z\mid Z_{\mathcal{S}},U,L}(z\mid z_\mathcal{S},U,L)\notag\\ =&\int\Pr(A = a \mid Z=z, Z_{\mathcal{S}}=z_\mathcal{S}, U, L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\tag{23}\\ =&\int\Pr(A = a \mid Z=z, U, L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\tag{24}\\ =&\int \{b(U,L)+c(z,L)\}\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\notag\\ =&b(U,L)+\int c(z,L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\notag\\ =&b(U,L)+\tilde{c}(z_{\mathcal{S}},L).\notag \end{align}\] Equation 23 follows from Assumption 3 that \(Z\perp\!\!\!\perp U\mid L\). Equation 24 follows because, whenever \(Z=z\) and \(P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)>0\), it must hold that \(Z_{\mathcal{S}}=z_\mathcal{S}\).
This is equivalent to saying that \(Z_\mathcal{S}\) is an AIV for \(A=a\). Furthermore, Assumptions 1–4 still hold with \(Z_\mathcal{S}\) in place of \(Z\). We can therefore apply Proposition 3 to \(Z_\mathcal{S}\) to deduce the result.
For any fixed \(a\in\mathcal{A}\), denote \(g(u,L):=\mathbb{E}[Y(a)\mid U,L]\). We first observe that \[\begin{align} &\mathbb{E}[Y\pi(Z,L)\mid A=a,L]=\mathbb{E}[\mathbb{E}[Y\mid Z,A=a,U,L]\pi(Z,L)\mid A=a,L]\\ =&\mathbb{E}[\mathbb{E}[Y(a)\mid U,L]\pi(Z,L)\mid A=a,L]\\ =&\int g(u,L)\pi(z,L)p_{Z,U\mid A,L}(z,u\mid a,L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)p_{A,Z,U\mid L}(a,z,u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)p_{A\mid Z,U,L}(a\mid z,u,L)p_{Z\mid L}(z\mid L)p_{U\mid L}(u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)b(z,L)p_{Z\mid L}(z\mid L)p_{U\mid L}(u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)c(u,L)p_{Z\mid L}(z\mid L)p_{U\mid L}(u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u)\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)c(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u)\int \pi(z,L)p_{Z\mid L}(z\mid L)\text{d}\mu_Z(z)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)c(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u)\cdot\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\] Similarly, we can just set \(\pi(Z,L)\equiv 1\) to verify that \[\begin{align} \mathbb{E}[Y\mid A=a,L]=&\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)c(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u). \end{align}\] Finally, we conclude that \[\begin{align} &\mathbb{E}[Y\pi(Z,L)\mid A=a,L]-\mathbb{E}[Y\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &-\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\] Similarly, we can verify that \[\begin{align} &\mathbb{E}[1\cdot\pi(Z,L)\mid A=a,L]-\mathbb{E}[1\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &-\dfrac{1}{p_{A\mid L}(a\mid L)}\cdot\int b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\] Now we get the result that \[\begin{align} &\mathbb{E}[Y\pi(Z,L)\mid A=a,L]-\mathbb{E}[Y\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\{\mathbb{E}[\pi(Z,L)\mid A=a,L]-\mathbb{E}[\pi(Z,L)\mid L]\}\mathbb{E}[Y(a)\mid L]. \end{align}\] From this equality, we know that \[\begin{align} \mathbb{E}[Y(a)]=\mathbb{E}\left[\dfrac{\mathbb{E}[Y\pi(Z,L)\mid A=a,L]-\mathbb{E}[Y\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]} {\mathbb{E}[\pi(Z,L)\mid A=a,L]-\mathbb{E}[\pi(Z,L)\mid L]}\right]. \end{align}\]
The authors were partially supported by the National Key R&D Program of China (2024YFA1015600) and the National Natural Science Foundation of China (12471266 and U23A2064). Correspondence to cuiyf@zju.edu.cn.