October 23, 2025
Instrumental variable methods are fundamental to causal inference when treatment assignment is confounded by unobserved variables. In this article, we develop a general nonparametric framework for identification and learning with multi-categorical or continuous instrumental variables. Specifically, we propose an additive instrumental variable framework to identify mean potential outcomes and the average treatment effect with a weighting function. Leveraging semiparametric theory, we derive efficient influence functions and construct consistent, asymptotically normal estimators via debiased machine learning. Extensions to longitudinal data, dynamic treatment regimes, and multiplicative instrumental variables are further developed. We demonstrate the proposed method by employing simulation studies and analyzing real data from the Job Training Partnership Act program.
Identification and Debiased Learning of Causal Effects with General Instrumental Variables
Keywords: Additive instrumental variables, Causal inference, Debiased machine learning, Dynamic treatment regimes, Longitudinal data.
Observational studies are commonly employed to estimate treatment effects in biomedical and economic research [1]. In the presence of unmeasured confounding, instrumental variable (IV) methods have been widely used to identify causal effects [2]–[6]. These approaches exploit exogenous variation in treatment induced by instruments that influence treatment assignment but are conditionally independent of the latent confounding effects.
Under the monotonicity condition that the instrument does not decrease (or increase) the probability of receiving treatment for any individual, IV methods identify causal effects within the complier subpopulation. [3], [6], [7] showed that, with multi-categorical instruments, identification is feasible for local treatment effects within specific complier subgroups. Building on this, [8], [9] proposed estimators for the local IV effect curve, capturing the treatment effect among individuals who would comply when the instrument exceeds a certain threshold. [10] relaxed the monotonicity assumption, establishing identification of the average treatment effect in the nudge subgroup, which involves mixtures of compliers and defiers.
Unlike approaches relying on the monotonicity condition, the identification strategy of [11], [12] does not require monotonicity. Instead, they identify the average treatment effect (ATE) by imposing no-interaction assumptions between the IV and latent confounding in the treatment model. [13]–[15] extended this framework to the marginal structural Cox model, while [16], [17] further generalized it to longitudinal settings. Furthermore, [18] proposed to identify optimal treatment regimes under a no unmeasured common effect modifier assumption, and [19] extended the approach to learn optimal treatment regimes for censored survival data. Building on this foundation, [20] proposed an identifying condition that excludes any multiplicative interaction between the IV and latent confounders in the treatment model, thereby enabling the nonparametric identification of the average treatment effect on the treated (ATT).
In parallel, [21], [22] developed IV quantile regression (IVQR) methods to estimate quantile treatment effects (QTE), allowing both the instrument and treatment variables to be non-binary. [23] adopt a novel type of copula invariance condition to identify the treatment effects for the entire population, allowing the treatment to be binary, multi-categorical or continuous. [24], [25] investigate scenarios involving invalid instrumental variables, where the exclusion restriction may be violated.
Collectively, these approaches, however, do not offer nonparametric identification strategies for the ATE in settings where either the instrument or the treatment variable is non-binary, which limits their applicability in empirical contexts involving multi-categorical or continuous treatments and instruments.
Although they do not target nonparametric identification of the ATE, several classical frameworks accommodate continuous IVs. The generalized method of moments (GMM) provides a unifying and flexible parametric framework in the IV setting [26]. It estimates structural parameters of interest by solving the corresponding moment conditions. It encompasses traditional methods such as two-stage least squares [27], [28] as special cases and extends naturally to situations with multiple instruments, heteroskedasticity, and other complexities.
Nonparametric IV methods represent another flexible and widely adopted framework, particularly suited for settings where the structural relationships between variables are complex and cannot be adequately captured by parametric models [29]–[34]. By avoiding restrictive assumptions on functional forms, these methods enable flexible, data-driven estimation of treatment and outcome models. Notably, proximal causal inference methods are closely related to nonparametric IV techniques and have emerged as effective approaches to address unmeasured confounding [35]–[37].
Inspired by existing literature, our study investigates the fundamental problem of estimating the mean potential outcomes [38], [39] in settings involving multi-categorical or continuous IVs and treatments, which substantially broadens the scope beyond the conventional binary treatment–binary IV framework. This generalization not only captures a wider range of empirical applications but also poses new methodological challenges for achieving valid identification.
We now outline the contents of our paper. In Section 2, we generalize the no-interaction condition between a binary IV and latent confounding [11], and introduce the additive IV framework for multi-categorical or continuous IVs to establish identification of mean potential outcomes and the ATE. We connect our strategy to the solution of a specific nonparametric IV problem, providing both theoretical insight and intuitive interpretation of the proposed estimands.
In Section 3, we use semiparametric theory [40]–[42] to derive efficient influence functions (EIFs) for target estimands defined by different weighting functions. For ATE identification, we show that under homoskedastic latent confounding, the optimal weighting function achieves the efficiency bound. Using the debiased machine learning (DML) framework with cross-fitting [43], [44], we construct two estimators for the ATE. The first estimator uses a fixed weighting function, and the second estimator leverages an adaptive procedure that selects the optimal weighting function. Both estimators are consistent and asymptotically normal.
In Section 4, we extend the identification strategy from point-exposure settings to longitudinal data. Other extensions to dynamic treatment regimes and multiplicative instrumental variables are given in the appendix. In Section 5, we conduct simulation studies to assess the validity of the proposed estimators under both point-exposure and longitudinal settings. We analyze Job Training Partnership Act program data in Section 6. We conclude our paper with a discussion in Section 7.
Let \(A \in \mathcal{A} := \{0, \ldots, M\}\) denote a multi-categorical treatment variable, where \(M = 1\) corresponds to the binary treatment setting. Let \(Z \in \mathcal{Z}\subseteq\mathbb{R}^{|\mathcal{Z}|}\) denote the IV, which may be multi-categorical or continuous. Let \(U\in\mathcal{U} \subseteq\mathbb{R}^{|\mathcal{U}|}\) represent unmeasured confounders, \(L\in\mathcal{L} \subseteq\mathbb{R}^{|\mathcal{L}|}\) observable confounders, \(Y\in\mathcal{Y} \subseteq\mathbb{R}\) the observed outcome, and \(Y(a)\) the potential outcome under treatment level \(A = a\). The observed data consist of \(O = \{Z, A, Y, L\} \in \mathcal{Z}\times \mathcal{A}\times \mathcal{Y}\times\mathcal{L}\). We introduce four fundamental assumptions under the IV setting.
Assumption 1 (Consistency). \(Y=Y(A)\).
Assumption 2 (Latent ignorability). For any \(a\in\mathcal{A}\), \(Y(a)\perp\!\!\!\perp\{A,Z\}\mid U,L\).
Assumption 3 (IV independence). \(Z\perp\!\!\!\perp U\mid L\).
Assumption 4 (IV relevance). For any \(a\in\mathcal{A}\) and \(l\in\mathcal{L}\), \(Z\not\perp\!\!\!\perp I\{A=a\}\mid L=l\). That is, there exist two distinct values \(z_0,z_1\in\mathcal{Z}\) such that \(\Pr(A=a\mid Z=z_0,L=l)\neq \Pr(A=a\mid Z=z_1,L=l).\)
Assumption 2 posits that, conditional on both the observed covariates and unmeasured confounders, the potential outcome \(Y(a)\) is independent of the treatment and IVs. Notably, it implies the IV exclusion restriction: the instrument \(Z\) affects the outcome \(Y\) only through its effect on the treatment \(A\). Assumption 3 states that \(Z\) is independent of the unmeasured confounders \(U\) given the observed covariates \(L\).
Assumption 4 requires that \(Z\) has a nontrivial effect on the treatment \(A\), conditional on any level of \(L\). This condition is slightly stronger than \(A \not\perp\!\!\!\perp Z \mid L\), which only requires the existence of some \(l \in \mathcal{L}\), \(a \in \mathcal{A}\), and \(z_0, z_1 \in \mathcal{Z}\) such that Assumption 4 holds.
Next, we provide the definition of additive IV, which serves as a key condition for identifying the causal estimands of interest.
Definition 1 (Additive IV). For each \(a \in \mathcal{A}\), \(Z\) is an *additive IV* (AIV) for \(A = a\) if there exist functions \(b(U,L)\) and \(c(Z,L)\) such that \(\Pr(A = a \mid Z, U, L) = b(U, L) + c(Z, L)\). Furthermore, \(Z\) is an AIV for \(A\) if it is an AIV for \(A=a\) for all \(a\in\mathcal{A}\).
According to [11], the definition of AIV originates from the no-interaction condition between \(Z\) and \(U\) in the treatment model. [24], [25], [45] also use this type of no-interaction condition to make inference with an invalid IV. The following proposition gives an alternative definition for AIV.
Proposition 1. For each \(a \in \mathcal{A}\), \(Z\) is an AIV for \(A = a\) if and only if, for any \(\pi(Z,L)\), \(\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid U,L\}=\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid L\}\), where \(A^{(a)}:=I\{A=a\}\).
Specifically, when \(Z\) is binary, Definition 1 holds if and only if \(\Pr(A=a \mid Z=1, U, L) - \Pr(A=a \mid Z=0, U, L) \perp\!\!\!\perp U \mid L,\) meaning that the differential effect of the instrument on treatment \(A\) is conditionally independent of the unmeasured confounders \(U\), given the observed covariates \(L\).
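For concreteness, a treatment model of the mixture form used later in our simulation design [A1] is additive in the sense of Definition 1, whereas a logistic model in which \(Z\) and \(U\) enter the same link, as in design [A2], generally admits no such decomposition:
\[
\Pr(A=1\mid Z,U,L)=\underbrace{0.7\,\Phi(-2Z+2L)}_{c(Z,L)}+\underbrace{0.3\,\Phi(3U-L)}_{b(U,L)}.
\]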
Next, we introduce the concept of a regular weighting function, which is analogous to the positivity condition commonly assumed in causal inference.
Definition 2 (Regular weighting function). For each \(a\in\mathcal{A}\), a function \(\pi(Z,L)\) is a *regular weighting function* (RWF) for \(A = a\) if it is uniformly bounded and there exists a positive constant \(\epsilon_0\) such that \(|\mathrm{Cov}\!\{I\{A=a\}, \pi(Z,L) \mid L\} | \geq \epsilon_0\) uniformly for all \(L\).
The existence of an RWF for every \(a \in \mathcal{A}\) implies the IV relevance condition in Assumption 4. Specifically, requiring the absolute value of the conditional covariance to be uniformly bounded below by a positive constant \(\epsilon_0\) rules out scenarios where \(\pi(Z,L)\) is irrelevant to \(A^{(a)}\), which would undermine the stability and validity of causal effect identification. To guarantee the existence of an RWF, the following positivity assumption is required.
Assumption 5 (Positivity). There exists a positive constant \(\epsilon_0\) such that for any \(l\in\mathcal{L}\) and \(a\in\mathcal{A}\), \(\mathrm{Var}\!\{\Pr(A=a\mid Z,L) \mid L=l\} \geq \epsilon_0.\)
Assumption 5 clearly entails the IV relevance condition in Assumption 4. The following proposition shows that it is, moreover, necessary and sufficient for the existence of an RWF. In particular, when \(Z\) is binary, Assumption 5 reduces to the standard positivity condition in the IV literature.
Proposition 2 (Existence). There exists an RWF \(\pi(Z,L)\) for \(A\) if and only if Assumption 5 holds. If there exists an RWF \(\pi(Z,L)\) for \(A=a\), then \(\pi^o(Z,L):=\Pr(A=a\mid Z,L)\) must be an RWF for \(A=a\).
In this subsection, we propose a strategy to identify the potential outcome mean \(\mathbb{E}[Y(a)]\) by formulating and solving a class of nonparametric IV models. Throughout, we define \(A^{(a)} := I\{A = a\}\) for each \(a \in \mathcal{A}\) for notational convenience. Our first primary goal is, for each \(a \in \mathcal{A}\), to identify a function \(f_a^o(A^{(a)}, L)\) that satisfies the conditional moment restriction given by \[\label{eq:32npiv} \mathbb{E}[A^{(a)} Y \mid Z, L] = \mathbb{E}[f_a(A^{(a)}, L) \mid Z, L].\tag{1}\]
Conditional moment equations of this form are common in the nonparametric IV literature [29]. The following theorem establishes the uniqueness of the solution to Equation 1 and provides an explicit representation in terms of any RWF \(\pi(Z, L)\).
Theorem 1 (Uniqueness and closed form solution). Under Assumption 4, for each \(a\in\mathcal{A}\), if a solution \(f_a^o(A^{(a)}, L)\) to Equation 1 exists, it is unique. Moreover, for any RWF \(\pi(Z, L)\) for \(A=a\) with \(\mathrm{Cov}\!\{A^{(a)}, \pi(Z, L) \mid L\} \neq 0\), the solution satisfies \[\label{eq:32explicit32form} \begin{align} f_a^o(0,L) &= -\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}} \mathbb{E}[A^{(a)}\mid Z,L] + \mathbb{E}[A^{(a)}Y\mid Z,L],\\ f_a^o(1,L) &= \dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}} \{1-\mathbb{E}[A^{(a)}\mid Z,L]\} + \mathbb{E}[A^{(a)}Y\mid Z,L]. \end{align}\qquad{(1)}\]
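To see where the closed form comes from (a sketch of our reading of the argument), write \(m(Z,L):=\mathbb{E}[A^{(a)}\mid Z,L]\) and note that \(\mathbb{E}[f_a(A^{(a)},L)\mid Z,L]=f_a(0,L)+\{f_a(1,L)-f_a(0,L)\}\,m(Z,L)\). Taking the conditional covariance of both sides of Equation 1 with any RWF \(\pi(Z,L)\) given \(L\) yields
\[
\mathrm{Cov}\!\{A^{(a)}Y,\pi(Z,L)\mid L\}=\{f_a(1,L)-f_a(0,L)\}\,\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\},
\]
so \(f_a(1,L)-f_a(0,L)=\mathrm{Cov}\!\{A^{(a)}Y,\pi(Z,L)\mid L\}/\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}\) whenever the denominator is nonzero. Substituting this contrast back into Equation 1 recovers the level \(f_a^o(0,L)\), which gives the displayed expressions and shows that any solution is unique.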
However, Theorem 1 ensures uniqueness without guaranteeing existence of a solution to Equation 1 . Indeed, Equation 1 is over-identified when \(Z\) is non-binary, meaning that a solution may not exist in general. In the proximal causal inference literature, the existence of a solution to the nonparametric IV (bridge) equation is typically guaranteed under a completeness condition [36]. Likewise, within the IV framework, the AIV condition plays a central role in ensuring existence. This is formally stated in the following proposition.
Proposition 3 (Identification of mean potential outcomes). Under Assumptions 1–4, for each \(a\in\mathcal{A}\), if \(Z\) is an AIV for \(A=a\), then there exists a unique solution \(f_a^o(A^{(a)}, L)\) to Equation 1 , given by \[\begin{align} f_a^o(0,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\},\\ f_a^o(1,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\} + \mathbb{E}[Y(a)\mid L]. \end{align}\] In particular, if Assumption 5 holds and \(\pi(Z,L)\) is an RWF for \(A=a\), \[\begin{align} \label{eq:32identification32AIV} \mathbb{E}[Y(a)] = \mathbb{E}\left[\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}}\right]. \end{align}\qquad{(2)}\]
Our identification strategy holds even when the treatment space \(\mathcal{A}\) is multi-categorical. Intuitively, this is because we transform the multi-categorical treatment \(A\) into a binary variable \(A^{(a)}\) for each \(a \in \mathcal{A}\).
Moreover, if there are no latent confounders \(U\), then \(f_a^o(0, L) \equiv 0\) by Proposition 3, since \(\mathbb{E}[Y(a)\mid U,L]\) then reduces to a function of \(L\) alone and the conditional covariance vanishes. In other words, a significant deviation of \(f_a^o(0, L)\) from zero indicates the presence of latent confounding. This observation suggests a novel criterion for detecting unmeasured confounding, although developing a formal test is beyond the scope of this article.
Furthermore, when \(Z\) is a non-binary IV, Proposition 3 provides a practical way to assess whether \(Z\) qualifies as an AIV for \(A = a\), which is a relatively strong condition. Specifically, under Assumptions 1–4, if \(Z\) is indeed an AIV for \(A = a\), the value of the right-hand side of Equation (2) does not depend on the choice of the weighting function \(\pi(Z, L)\). Therefore, one can select two distinct RWFs \(\pi_1(Z, L)\) and \(\pi_2(Z, L)\) and compare the resulting values; a discrepancy between them indicates a violation of the AIV condition.
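As a rough illustration of this diagnostic, the sketch below compares in-sample plug-in estimates of the right-hand side of Equation (2) under two candidate RWFs, \(\pi_1(Z,L)=Z\) and \(\pi_2(Z,L)=I\{Z>\mathrm{median}(Z)\}\); the conditional moments given \(L\) are estimated with penalized splines from the mgcv package, assuming a univariate continuous \(L\). The function name and the particular weighting functions are ours for illustration only; a formal comparison would also require a standard error for the difference.

```r
library(mgcv)

# Plug-in estimate of E[ Cov{A_a*Y, pi | L} / Cov{A_a, pi | L} ]  (Equation (2))
# for a binary indicator A_a = I{A = a} and a user-supplied weighting function pi_w.
# No cross-fitting here: this is an in-sample check, not the estimator of Section 3.
aiv_plugin <- function(Y, A_a, pi_w, L) {
  dat <- data.frame(Y = Y, A_a = A_a, pi_w = pi_w, L = L)
  # Conditional means given L, estimated with penalized splines (univariate L assumed)
  m_AY  <- fitted(gam(I(A_a * Y)        ~ s(L), data = dat))
  m_A   <- fitted(gam(A_a               ~ s(L), data = dat))
  m_pi  <- fitted(gam(pi_w              ~ s(L), data = dat))
  m_AYp <- fitted(gam(I(A_a * Y * pi_w) ~ s(L), data = dat))
  m_Ap  <- fitted(gam(I(A_a * pi_w)     ~ s(L), data = dat))
  # Conditional covariances: Cov{X, pi | L} = E[X*pi | L] - E[X | L] * E[pi | L]
  cov_AY_pi <- m_AYp - m_AY * m_pi
  cov_A_pi  <- m_Ap  - m_A  * m_pi
  mean(cov_AY_pi / cov_A_pi)
}

# Hypothetical usage: compare two regular weighting functions.
# A discrepancy well beyond sampling error suggests the AIV condition fails.
# psi1 <- aiv_plugin(Y, as.numeric(A == 1), Z, L)
# psi2 <- aiv_plugin(Y, as.numeric(A == 1), as.numeric(Z > median(Z)), L)
# c(psi1, psi2)
```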
In practical applications with a binary treatment \(A\), researchers are also interested in the ATE. The following proposition summarizes the identification results for the ATE.
Proposition 4 (Identification of ATE). Assume \(A\) is binary. Under Assumptions 1–5, if either \(Z\) is an AIV for \(A\) or \(Y(1)-Y(0)\perp\!\!\!\perp U \mid L\), \[\label{eq:32npiv32Y40141-Y40041} \mathbb{E}[Y\mid Z,L]=\mathbb{E}[f(A,L)\mid Z,L]\qquad{(3)}\] has a unique solution \(f^o(A, L)\) as \[\begin{align} f^o(0,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(1)-Y(0)\mid U,L], \Pr(A=1\mid Z,U,L)\mid L\} + \mathbb{E}[Y(0)\mid L],\\ f^o(1,L) &= \mathrm{Cov}\!\{\mathbb{E}[Y(1)-Y(0)\mid U,L], \Pr(A=1\mid Z,U,L)\mid L\} + \mathbb{E}[Y(1)\mid L]. \end{align}\] In particular, for any RWF \(\pi(Z, L)\) for \(A\), the ATE is identified as \[\begin{align} \label{eq:32identification32AIV32Y40141-Y40041} \mathbb{E}[Y(1)-Y(0)] = \psi_{\pi}^o := \mathbb{E}\left[\dfrac{\mathrm{Cov}\!\{Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}\right]. \end{align}\qquad{(4)}\]
Proposition 4 shows that the AIV condition is not strictly necessary for identifying the ATE. Instead, the alternative condition \(Y(1) - Y(0) \perp\!\!\!\perp U \mid L\) also ensures the existence of a solution to Equation (3).
Remark 1. In fact, under Assumptions 1–5, without the AIV condition, the following equation still holds: \[\begin{align} \psi_{\pi}^o= \mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\dfrac{\mathrm{Cov}\!\{A, \pi(Z,L) \mid U,L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}\right]. \end{align}\] This result suggests that, even when \(Z\) fails to be an AIV, the estimand \(\psi_{\pi}^o\) can still be interpreted as a weighted average of the conditional average treatment effect \(\mathbb{E}[Y(1)-Y(0)\mid U,L]\), as long as \(\mathrm{Cov}\!\{A, \pi(Z,L) \mid U,L\}\) and \(\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}\) have the same sign. This representation is analogous to the assumption-lean inference of [46], [47], in which the estimands remain meaningful and interpretable even under model misspecification. In addition, Equation (4) holds when \(\mathrm{Cov}\!\{Y(1)-Y(0),\mathrm{Cov}\!\{A, \pi(Z,L) \mid U,L\}\mid L\}=0.\) This result can be viewed as an extension of the findings in [18].
Remark 2. When \(L = \emptyset\) and \(Z\) is univariate, the parameter \(\psi_{\pi}^o\) in Equation (4) reduces to a form resembling the two-stage least squares (TSLS) estimator if we set \(\pi(Z,L) = Z\). Moreover, several classical works interpret \(\psi_{\pi}^o\) as \(\tau_0\), the solution to the conditional moment equation \(\mathbb{E}[Y - \tau_0 A \mid Z] = 0\) [26], [28], [48], where \(\tau_0\) represents the effect of a marginal change in the endogenous variable \(A\) on the outcome. Consequently, our identification strategy can be viewed as a nonparametric generalization of the TSLS and GMM approaches, which additionally accounts for confounding effects through \(L\).
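To make the connection concrete, take \(L=\emptyset\) and \(\pi(Z,L)=Z\). The moment condition \(\mathbb{E}[Y-\tau_0 A\mid Z]=0\) implies \(\mathbb{E}[(Y-\tau_0A)\,h(Z)]=0\) for any function \(h\), in particular \(h(Z)=Z-\mathbb{E}[Z]\), so
\[
\mathrm{Cov}(Y,Z)-\tau_0\,\mathrm{Cov}(A,Z)=0
\quad\Longrightarrow\quad
\tau_0=\frac{\mathrm{Cov}(Y,Z)}{\mathrm{Cov}(A,Z)}=\psi_{\pi}^o\big|_{L=\emptyset,\;\pi(Z,L)=Z},
\]
which is the probability limit of the simple IV (TSLS) estimator with a single instrument.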
Remark 3. For a continuous treatment, we provide a similar identification result in the appendix.
In this section, we consider the case of a binary treatment assignment \(A\), a setting commonly encountered in causal inference. Our main objective is to rigorously characterize the semiparametric efficiency bound for the ATE under this framework. Leveraging semiparametric theory [42], we identify the minimum asymptotic variance attainable by any regular, asymptotically linear estimator of the ATE. This efficiency bound serves as a benchmark for guiding the construction of adaptive estimators.
Proposition 4 establishes that, under Assumptions 1–4, if \(Z\) is an AIV for the binary treatment \(A\), the choice of the RWF \(\pi(Z, L)\) does not affect the value of \(\psi_{\pi}^o\) in Equation (4). However, the semiparametric efficiency bound for \(\psi_{\pi}^o\) still depends on the choice of \(\pi(Z, L)\). Consequently, it is necessary to derive the EIFs corresponding to all possible RWFs \(\pi(Z, L)\). To this end, we define the following nuisance functions: \[\begin{align} &\delta^o(L) := \mathbb{E}[A \mid L], && \eta^o(L) := \mathbb{E}[Y \mid L],\\ &\kappa_{\pi}^o(L) := \mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}, && \zeta_{\pi}^o(L) := \mathbb{E}[Y \pi(Z,L) \mid L],\\ &\rho_{\pi}^o(L) := \mathbb{E}[\pi(Z,L) \mid L], && \gamma_{\pi}^o(L) := \dfrac{\mathrm{Cov}\!\{Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}. \end{align}\] For notational convenience, we collect them into a unified nuisance vector: \[\begin{align} \label{eq:32nuisance32function32fixed32pi} \alpha_{\pi}^o(L) := [\delta^o(L), \kappa_{\pi}^o(L), \rho_{\pi}^o(L), \eta^o(L), \gamma_{\pi}^o(L)]. \end{align}\tag{2}\] Note that \(\zeta_{\pi}^o(L)\) serves only as an intermediate nuisance function used in our proofs and does not appear in the unified vector \(\alpha_{\pi}^o(L)\). The following theorem derives the efficient influence function (EIF) of \(\psi_{\pi}^o\) for any choice of RWF \(\pi(Z,L)\).
Theorem 2. Under Assumptions 1–5, for any RWF \(\pi(Z,L)\) for \(A\), the EIF for \(\psi_{\pi}^o\) in Equation (4) is given by \(\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o)\), where \(\varphi_{\pi}(O;\psi_{\pi},\alpha_{\pi})\) equals \[\begin{align} \dfrac{1}{\kappa_{\pi}(L)} \left\{\pi(Z,L)-\rho_{\pi}(L)\right\} \{Y-\eta(L)\} - \psi_{\pi} + \left(1 - \dfrac{\{\pi(Z,L)-\rho_{\pi}(L)\}\{A-\delta(L)\}}{\kappa_{\pi}(L)} \right) \gamma_{\pi}(L). \end{align}\]
Remark 4. We carry out our semiparametric analysis in the fully nonparametric model. In this case, the tangent space coincides with \(L_2(O)\), and the unique influence function corresponds to the EIF.
This theorem underpins the construction of efficient estimators for \(\psi_{\pi}^o\). In practice, the true nuisance vector \(\alpha_{\pi}^o\) is unknown and must be estimated from the data. The following proposition quantifies the bias introduced by substituting an estimated \(\alpha_{\pi}\) for the true vector. This mixed bias property is well-documented in the existing literature [44], [49].
Proposition 5 (Mixed bias property). Under Assumptions 1–5, for any RWF \(\pi(Z,L)\) and any fixed \(\alpha_{\pi}\) in Equation 2 , \(\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi})\) satisfies that \[\begin{align} &\mathbb{E}[\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi})] =\mathbb{E}\left[\dfrac{1}{\kappa_{\pi}(L)}\left\{\begin{array}{l} \{\kappa_{\pi}(L)-\kappa_{\pi}^o(L)\}\{\gamma_{\pi}(L)-\gamma_{\pi}^o(L)\}\\ +\{\rho_{\pi}(L)-\rho_{\pi}^o(L)\}\{\eta(L)-\eta^o(L)\}\\ +\{\rho_{\pi}(L)-\rho_{\pi}^o(L)\}\{\delta(L)-\delta^o(L)\}\gamma_{\pi}(L) \end{array}\right\}\right]. \end{align}\]
Next, we aim to select a function \(\pi(Z, L)\) from the class of RWFs that minimizes the asymptotic variance of the estimator for \(\psi_{\pi}^o\). Specifically, we seek the optimal \(\pi(Z, L)\) that minimizes the second moment of the efficient influence function. Minimizing this quantity yields the most statistically efficient estimator among all estimators based on different RWFs.
Intuitively, Proposition 2 suggests that \(\pi^o(Z,L) := \Pr(A = 1 \mid Z, L)\) is a natural candidate for the optimal weighting function. The following proposition characterizes the \(\pi(Z,L)\) that attains this minimum variance bound.
Proposition 6 (Optimal RWF). Under the conditions of Theorem 2, suppose that the solution \(f^o(A,L)\) to the nonparametric IV problem in Equation (3) satisfies \[\label{eq:32homoskedastic} \mathbb{E}\left[\{Y-f^o(A,L)\}^2\middle| Z,L\right]\perp\!\!\!\perp Z\mid L.\qquad{(5)}\] Then the quantity \(\mathbb{E}[\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o)^2]\) attains its lower bound when \(\pi(Z,L)=\Pr(A=1\mid Z,L)\).
The condition in Equation (5) implies that, after conditioning on \(L\), the instrument \(Z\) provides no additional information about the residual variation in \(Y\). [50] leverage a similar assumption to derive the “optimal instruments” in the special case where \(L=\emptyset\).
As established in the previous subsection, setting \(\pi^o(Z,L) := \Pr(A = 1 \mid Z, L)\) yields the optimal efficiency for estimating \(\psi_{\pi}^o\) in the nonparametric model. In practice, however, \(\pi^o(Z,L)\) is unknown. A natural approach is to treat \(\pi^o(Z,L)\) as a nuisance function and estimate it from the data adaptively. Another important motivation is that Proposition 2 implies that if \(\pi^o(Z,L)\) fails to be a valid RWF for \(A\), then no alternative RWF exists.
Therefore, we develop a strategy for adaptively estimating the optimal weighting function. First, define the following nuisance functions: \[\begin{align} &\delta^o(L):=\mathbb{E}[A\mid L], &&\eta^o(L):=\mathbb{E}[Y\mid L],\\ &\kappa^o(L):=\mathrm{Cov}\!\{A,\pi^o(Z,L)\mid L\}, &&\zeta^o(L):=\mathbb{E}[Y\pi^o(Z,L)\mid L],\\ &\xi^o(Z,L):=\mathbb{E}[Y\mid Z,L], &&\gamma^o(L):=\dfrac{\mathrm{Cov}\!\{Y, \pi^o(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi^o(Z,L) \mid L\}}. \end{align}\] We collect these nuisance functions into a unified nuisance vector: \[\label{eq:32nuisance32function} \beta^o(Z,L):=[\pi^o(Z,L),\delta^o(L),\kappa^o(L), \xi^o(Z,L),\eta^o(L),\gamma^o(L)].\tag{3}\] Note that \(\zeta^o(L)\) is only an auxiliary quantity and is not included in \(\beta^o(Z,L)\). According to Proposition 4, the ATE can then be identified as \[\label{eq:32identification32AIV32Y40141-Y4004132unknown32weight} \psi_{ada}^o:=\mathbb{E}\left[\gamma^o(L)\right] =\mathbb{E}\left[\dfrac{\pi^o(Z,L)-\delta^o(L)}{\kappa^o(L)}Y\right].\tag{4}\] We now proceed to derive the EIF for \(\psi_{ada}^o\).
Theorem 3. Under Assumptions 1–5, the EIF for \(\psi_{ada}^o\) in Equation 4 is given by \(\varphi(O;\psi_{ada}^o,\beta^o)\), where \[\begin{align} \varphi(O;\psi_{ada},\beta)=&\dfrac{\pi(Z,L)-\delta(L)}{\kappa(L)}Y-\psi_{ada} +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \kappa(L) + (A-\pi(Z,L))^2 - (A-\delta(L))^2 \right\}\\ &+\dfrac{1}{\kappa(L)} \left\{\xi(Z,L)(A-\pi(Z,L))-\eta(L)(A-\delta(L))\right\}. \end{align}\]
Remark 5. Similar to Theorem 2, this theorem still holds even if \(Z\) fails to be an AIV for \(A\), because the definition of \(\psi_{ada}^o\) does not rely on the AIV condition.
Next, the EIF characterization in Theorem 3 forms the foundation for analyzing the robustness of the proposed estimator, which is demonstrated in the next proposition.
Proposition 7 (Mixed bias property). Under Assumptions 1–5, for any fixed nuisance vector \(\beta\) in Equation 3 , \(\varphi(O;\psi_{ada}^o,\beta)\) satisfies \[\begin{align} \mathbb{E}[\varphi(O;\psi_{ada}^o,\beta)]= \mathbb{E}\left[\dfrac{1}{\kappa(L)}\left\{\begin{array}{l} \{\gamma(L)-\gamma^o(L)\}\{\kappa(L) - \kappa^o(L)\}\\ - \gamma(L)(\delta^o(L)-\delta(L))^2\\ +\gamma(L)(\pi^o(Z,L)-\pi(Z,L))^2\\ -(\xi(Z,L)-\xi^o(Z,L))(\pi(Z,L)-\pi^o(Z,L))\\ +(\eta(L)-\eta^o(L))(\delta(L)-\delta^o(L)) \end{array}\right\}\right]. \end{align}\]
In this subsection, we adopt the cross-fitting procedure proposed by [43], [44] to construct debiased estimators for \(\psi_{\pi}^o\) and \(\psi_{ada}^o\), defined in Equations (4) and 4, respectively. Without loss of generality, assume that the sample size \(n\) is evenly divisible by the number of folds \(K\). We randomly partition the sample into \(K\) folds of equal size. Let \(I_k\) denote the set of indices belonging to the \(k\)-th fold, and let \(I_{-k}\) denote its complement. Denote by \(|I_k|\) the size of fold \(I_k\). For any random variable \(O\), define the empirical average over fold \(k\) as \(\mathbb{E}_{nk}[O] := \sum_{i \in I_k} O_i/|I_k|.\) We further define the \(L_2\) norm of the nuisance vector \(\alpha_{\pi}(L)\) from Equation 2 as \[\|\alpha_{\pi}(L)\|_2^2 := \|\delta(L)\|_2^2 + \|\eta(L)\|_2^2 + \|\kappa_{\pi}(L)\|_2^2 + \|\rho_{\pi}(L)\|_2^2 + \|\gamma_{\pi}(L)\|_2^2,\] and the \(L_2\) norm of the nuisance vector \(\beta(Z,L)\) from Equation 3 as \[\|\beta(Z,L)\|_2^2 := \|\pi(Z,L)\|_2^2 + \|\xi(Z,L)\|_2^2 + \|\delta(L)\|_2^2 + \|\eta(L)\|_2^2 + \|\kappa(L)\|_2^2 + \|\gamma(L)\|_2^2.\]
Next, for any fixed fold \(I_k\), the nuisance estimators \(\hat{\alpha}_{\pi}^{(n,k)}\) are trained using only the observations in \(I_{-k}\) with any suitable machine learning method. By construction, \(\hat{\alpha}_{\pi}^{(n,k)}\) is independent of the samples in \(I_k\). We derive the estimator \(\hat{\psi}_{\pi}^{(n)}\) as the solution to the equation \[\begin{align} \label{eq:32AUG32estimator32prespecified32weight} \sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})\right]=0. \end{align}\tag{5}\] Next, we establish the consistency and asymptotic normality of \(\hat{\psi}_{\pi}^{(n)}\) defined in Equation 5.
Theorem 4 (Asymptotic normality of \(\hat{\psi}_{\pi}^{(n)}\)). Under Assumptions 1–4, suppose that \(\pi(Z,L)\) is an RWF for \(A\). Assume further that, for any \(k=1,\ldots,K\), \(\mathbb{E}[\|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^o\|_2^2]=o(1)\), and that \[\begin{align} \left\{\begin{array}{l} \|\hat{\kappa}_{\pi}^{(n,k)}-\kappa_{\pi}^o\|_2\times \|\hat{\gamma}_{\pi}^{(n,k)}-\gamma_{\pi}^o\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}-\rho_{\pi}^o\|_2\times \|\hat{\delta}^{(n,k)}-\delta^o\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}-\rho_{\pi}^o\|_2\times \|\hat{\eta}^{(n,k)}-\eta^o\|_2\\ \end{array}\right\} =o_p(n^{-1/2}). \end{align}\] Then \(\sqrt{n}\left(\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^o\right)/\sigma_{\pi}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where the asymptotic variance is defined as \((\sigma_{\pi}^o)^2:=\mathbb{E}[\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o)^2]\). In addition, if we define the variance estimator for \((\sigma_{\pi}^o)^2\) as \((\hat{\sigma}_{\pi}^{(n)})^2:=\sum_{k=1}^K\mathbb{E}_{nk}[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2]/K,\) then \((\hat{\sigma}_{\pi}^{(n)})^2\) converges in probability to \((\sigma_{\pi}^o)^2\).
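A minimal sketch of the cross-fitted estimator in Equation 5 for the prespecified weight \(\pi(Z,L)=Z\) is given below, with nuisances fitted by penalized splines as in our numerical studies (mgcv). The sketch assumes a univariate continuous \(L\); the function and variable names are ours and not part of any released package.

```r
library(mgcv)

# Uncentered EIF of Theorem 2 with pi(Z,L) = Z; psi enters linearly with
# coefficient -1, so the cross-fitted estimator is the sample mean of this quantity.
eif_uncentered <- function(Y, A, Z, delta, kappa, rho, eta, gamma) {
  (Z - rho) * (Y - eta) / kappa +
    (1 - (Z - rho) * (A - delta) / kappa) * gamma
}

# Cross-fitted estimator of psi_pi (Equation 5) with K folds; assumes univariate L.
dml_fixed_pi <- function(Y, A, Z, L, K = 2) {
  n <- length(Y)
  folds <- sample(rep(1:K, length.out = n))
  phi <- numeric(n)
  for (k in 1:K) {
    tr <- folds != k
    ev <- folds == k
    d_tr <- data.frame(Y = Y[tr], A = A[tr], Z = Z[tr], L = L[tr])
    d_ev <- data.frame(Y = Y[ev], A = A[ev], Z = Z[ev], L = L[ev])
    # Nuisance regressions fitted on the training folds, evaluated on the held-out fold
    reg <- function(f) predict(gam(f, data = d_tr), newdata = d_ev)
    delta <- reg(A ~ s(L))                                # E[A | L]
    eta   <- reg(Y ~ s(L))                                # E[Y | L]
    rho   <- reg(Z ~ s(L))                                # E[Z | L]
    kappa <- reg(I(A * Z) ~ s(L)) - delta * rho           # Cov{A, Z | L}
    gamma <- (reg(I(Y * Z) ~ s(L)) - eta * rho) / kappa   # Cov{Y, Z | L} / Cov{A, Z | L}
    phi[ev] <- eif_uncentered(Y[ev], A[ev], Z[ev], delta, kappa, rho, eta, gamma)
  }
  psi_hat <- mean(phi)                          # solves Equation 5 with equal fold sizes
  se_hat  <- sqrt(mean((phi - psi_hat)^2) / n)  # based on the variance estimator in Theorem 4
  c(estimate = psi_hat, se = se_hat)
}
```

For a general RWF, one replaces \(Z\) by \(\pi(Z,L)\) throughout; the adaptive variant below additionally estimates the weight itself.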
Analogously, for any fixed fold \(k = 1, \ldots, K\), the nuisance estimators \(\hat{\beta}^{(n,k)}\) are trained using only the observations in \(I_{-k}\) with any suitable machine learning method. The estimator \(\hat{\psi}_{ada}^{(n)}\) is then defined as the solution to \[\begin{align} \label{eq:32AUG32estimator32adaptive32weight} \sum_{k=1}^K \mathbb{E}_{nk}\!\left[\varphi\!\left(O;\hat{\psi}_{ada}^{(n)},\hat{\beta}^{(n,k)}\right)\right] = 0. \end{align}\tag{6}\] Next, we establish the consistency and asymptotic normality of \(\hat{\psi}_{ada}^{(n)}\) defined in Equation 6.
Theorem 5 (Asymptotic normality of \(\hat{\psi}_{ada}^{(n)}\)). Under Assumptions 1–5, suppose that either \(Z\) is an AIV for \(A\) or \(Y(1)-Y(0)\perp\!\!\!\perp U\mid L\). Assume that for any \(k=1,\ldots,K\), \(\mathbb{E}[\|\hat{\beta}^{(n,k)}-\beta^o\|_2^2]=o(1)\), and that \[\begin{align} \left\{ \begin{array}{l} \|\hat{\gamma}^{(n,k)}-\gamma^o\|_2 \times \|\hat{\kappa}^{(n,k)}-\kappa^o\|_2 +\|\hat{\delta}^{(n,k)}-\delta^o\|_2^2+\|\hat{\pi}^{(n,k)}-\pi^o\|_2^2\\ +\|\hat{\xi}^{(n,k)}-\xi^o\|_2\times \|\hat{\pi}^{(n,k)}-\pi^o\|_2 +\|\hat{\eta}^{(n,k)}-\eta^o\|_2\times \|\hat{\delta}^{(n,k)}-\delta^o\|_2 \end{array}\right\} =o_p(n^{-1/2}) \end{align}\] Then \(\sqrt{n}\left(\hat{\psi}_{ada}^{(n)}-\psi_{ada}^o\right)/\sigma_{ada}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where the asymptotic variance is defined as \((\sigma_{ada}^o)^2:=\mathbb{E}[\varphi(O;\psi_{ada}^o,\beta^o)^2]\). In addition, if we define the variance estimator for \((\sigma_{ada}^o)^2\) as \((\hat{\sigma}_{ada}^{(n)})^2:=\sum_{k=1}^K\mathbb{E}_{nk}[\varphi(O;\hat{\psi}_{ada}^{(n)},\hat{\beta}^{(n,k)})^2]/K,\) then \((\hat{\sigma}_{ada}^{(n)})^2\) converges in probability to \((\sigma_{ada}^o)^2\).
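For the adaptive estimator, the only structural change relative to the previous sketch is that the weight itself becomes a nuisance: within each training fold one additionally fits \(\pi^o(Z,L)=\Pr(A=1\mid Z,L)\) and \(\xi^o(Z,L)=\mathbb{E}[Y\mid Z,L]\), and then evaluates the EIF of Theorem 3 on the held-out fold. A fold-specific step is sketched below under the same univariate-\(L\) assumption, with function names of our own choosing; averaging the returned values over all held-out observations yields \(\hat{\psi}_{ada}^{(n)}\) in Equation 6.

```r
# Fold-specific step for the adaptive estimator: estimate the optimal weight
# pi^o(Z,L) = Pr(A = 1 | Z, L) on the training fold, then build the nuisances
# entering the EIF of Theorem 3.  d_tr / d_ev hold columns Y, A, Z, L.
adaptive_fold_eif <- function(d_tr, d_ev) {
  pi_fit <- gam(A ~ s(Z) + s(L), family = binomial, data = d_tr)
  xi_fit <- gam(Y ~ s(Z) + s(L), data = d_tr)
  d_tr$pi_o <- predict(pi_fit, type = "response")                  # pi^o on training fold
  d_ev$pi_o <- predict(pi_fit, newdata = d_ev, type = "response")  # pi^o on held-out fold
  reg <- function(f) predict(gam(f, data = d_tr), newdata = d_ev)
  delta <- reg(A ~ s(L))                       # E[A | L]
  eta   <- reg(Y ~ s(L))                       # E[Y | L]
  e_pi  <- reg(pi_o ~ s(L))                    # E[pi^o | L]
  e_Api <- reg(I(A * pi_o) ~ s(L))             # E[A * pi^o | L]
  e_Ypi <- reg(I(Y * pi_o) ~ s(L))             # E[Y * pi^o | L]
  kappa <- e_Api - delta * e_pi                # Cov{A, pi^o | L}
  gamma <- (e_Ypi - eta * e_pi) / kappa        # Cov{Y, pi^o | L} / kappa
  xi    <- predict(xi_fit, newdata = d_ev)     # E[Y | Z, L]
  # Uncentered EIF of Theorem 3, evaluated on the held-out fold
  with(d_ev,
       (pi_o - delta) / kappa * Y +
         gamma / kappa * (kappa + (A - pi_o)^2 - (A - delta)^2) +
         (xi * (A - pi_o) - eta * (A - delta)) / kappa)
}
```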
In this section, we extend the identification strategy introduced in Section 2 to a longitudinal setting with a sequence of IVs observed at each time point. Specifically, consider a longitudinal study with measurements collected at \(T+1\) discrete time points, indexed by \(t = 0, 1, \ldots, T\), where \(T\) is a fixed nonnegative integer. The special case \(T = 0\) reduces to the point-exposure setting discussed in Section 2.
For notation, let \(\overline{a}_t := [a_0, a_1, \ldots, a_t]\), \(\underline{a}_t := [a_t, a_{t+1}, \ldots, a_T]\), \(a_t^s := [a_t, a_{t+1}, \ldots, a_s]\), and \(\overline{a} := [a_0, a_1, \ldots, a_T]\). By convention, we set \(\underline{a}_{T+1} = a_t^{t-1} := \emptyset\) for any \(t\), and note that \(\overline{a} = \underline{a}_0 = \overline{a}_T\) for brevity. At each time point \(t\), let \(L_t\in\mathcal{L}_t\) denote the vector of observed confounders, \(U_t\in\mathcal{U}_t\) the vector of unobserved confounders, \(Z_t\in\mathcal{Z}_t\) the IV (which may be multi-categorical or continuous), and \(A_t\in\mathcal{A}_t\) the multi-categorical treatment assignment. The observed data are given by \(O := [\overline{Z}_T, \overline{A}_T, \overline{L}_T, Y],\) where \(Y\) is the outcome of interest, observed only at time \(T+1\).
At each time point \(t\), define the history \(H_t := [\overline{Z}_{t-1}, \overline{A}_{t-1}, \overline{L}_t]\in\mathcal{H}_t\), and let \(H_{T+1} := O\) denote the full observed data. Notably, the histories satisfy the recursive relation \(H_t = [H_{t-1}, Z_{t-1}, A_{t-1}, L_t].\) Let \(Y(\overline{a})\) denote the potential outcome under treatment history \(\overline{A} = \overline{a}\). Next, we extend the preceding assumptions to a longitudinal setting with valid instrumental variables.
Assumption 6 (Consistency). \(Y=Y(\overline{A})\).
Assumption 7 (Latent ignorability). For any fixed \(t\), \(\{Z_t,A_t\}\perp\!\!\!\perp Y(\underline{a}_t) \mid H_t,\overline{U}_t\).
Assumption 8 (IV independence). For any fixed \(t\), \(Z_t\perp\!\!\!\perp\overline{U}_t\mid H_t\).
Assumption 9 (IV relevance). For any fixed \(t\), \(a_t\in\mathcal{A}_t\), \(h_t\in\mathcal{H}_t\), \(Z_t\not\perp\!\!\!\perp I\{A_t=a_t\}\mid H_t=h_t.\)
Assumption 10 (Positivity). For each time \(t=0,\ldots,T\), there exists a positive constant \(\epsilon_0\) such that for any \(h_t\in\mathcal{H}_t\) and \(a_t\in\mathcal{A}_t\), \(\mathrm{Var}\!\{\Pr(A_t=a_t\mid Z_t,H_t) \mid H_t=h_t\} \geq \epsilon_0.\)
Assumptions 6–10 can be regarded as a longitudinal extension of Assumptions 1–5. For illustration, Figure 1 displays a sequential directed acyclic graph (DAG) for the IV setting with \(T = 2\) under the one-step Markov property, where Assumptions 7 and 8 hold.
Next, we generalize the definitions of AIV and RWF from the point-exposure setting to accommodate longitudinal data.
Definition 3 (Longitudinal AIV). For any fixed \(t\), we say that \(Z_t\) is an AIV for \(A_t\) if there exist functions \(b_{t,a_t}(\overline{U}_t,H_t)\) and \(c_{t,a_t}(Z_t,H_t)\) such that, for all \(a_t \in \mathcal{A}_t\), \[\Pr(A_t = a_t \mid Z_t, \overline{U}_t, H_t) = b_{t,a_t}(\overline{U}_t, H_t) + c_{t,a_t}(Z_t, H_t).\]
Definition 4 (Longitudinal RWF). A function \(\pi_t(Z_t, H_t)\) is said to be an RWF for \(A_t\) if it is uniformly bounded and, for each \(a_t \in \mathcal{A}_t\), there exists a constant \(\epsilon_0 > 0\) such that \[\left| \mathrm{Cov}\!\left\{ I\{A_t = a_t\}, \pi_t(Z_t, H_t) \mid H_t \right\} \right| \geq \epsilon_0, \quad \text{uniformly over } H_t.\]
These definitions are required for identifying the longitudinal mean potential outcomes. For a fixed sequence of RWFs \(\pi_t(Z_t,H_t)\), define \(\gamma_{T+1,\underline{a}_{T+1}}^o(H_{T+1}) := Y.\) Then, for \(t = T, \ldots, 0\), recursively define the nuisance function \[\gamma_{t,\underline{a}_t}^o(H_t) := \frac{\mathrm{Cov}\!\left\{ I\{A_t = a_t\} \, \gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}), \pi_t(Z_t, H_t) \mid H_t \right\}}{\mathrm{Cov}\!\left\{ I\{A_t = a_t\}, \pi_t(Z_t, H_t) \mid H_t \right\}}.\] Intuitively, \(\gamma_{t,\underline{a}_t}^o(H_t)\) can be interpreted as the conditional mean potential outcome \(\mathbb{E}[Y(\underline{a}_t) \mid H_t]\); this relationship is formally established in the next proposition.
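For instance, with \(T = 1\) and a treatment sequence \((a_0, a_1)\), the recursion unfolds in two steps (a direct specialization of the display above):
\[
\gamma_{1,a_1}^o(H_1)=\frac{\mathrm{Cov}\!\{I\{A_1=a_1\}\,Y,\ \pi_1(Z_1,H_1)\mid H_1\}}{\mathrm{Cov}\!\{I\{A_1=a_1\},\ \pi_1(Z_1,H_1)\mid H_1\}},\qquad
\gamma_{0,\overline{a}}^o(H_0)=\frac{\mathrm{Cov}\!\{I\{A_0=a_0\}\,\gamma_{1,a_1}^o(H_1),\ \pi_0(Z_0,H_0)\mid H_0\}}{\mathrm{Cov}\!\{I\{A_0=a_0\},\ \pi_0(Z_0,H_0)\mid H_0\}},
\]
so that \(\mathbb{E}[Y(a_0,a_1)]=\mathbb{E}[\gamma_{0,\overline{a}}^o(H_0)]\), which corresponds to the case \(s=0\) and \(r=T+1\) discussed below.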
Proposition 8 (Longitudinal AIV identification). Under Assumptions 6–9, let \(0 \le s \le T+1\), \(r \ge 0\), and \(s + r \le T+1\). Suppose that, for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\), and that \(\pi_t(Z_t, H_t)\) is an RWF for \(A_t\). Then, the mean potential outcomes \(\mathbb{E}\bigl[Y(\underline{a}_{s})\bigr]\) can be expressed as \[\label{eq:32identification32AIV32longitudinal} \begin{align} \mathbb{E}\left[ \prod_{t=s}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r}) \right]. \end{align}\qquad{(6)}\]
Notably, consider the special case of Proposition 8 with \(s = 0\) and \(r = 0\), which corresponds to identifying the mean potential outcome from the initial time point through the final time \(T\) without truncation. In particular, setting \(\pi_t(Z_t,H_t) = Z_t\), the identification formula simplifies to \[\begin{align} \mathbb{E}[Y(\overline{a})] = \psi_{\overline{a}}^o := \mathbb{E}\Biggl[ \prod_{t=0}^{T} \frac{(Z_t - \mathbb{E}[Z_t \mid H_t]) \, I\{A_t = a_t\}}{\mathrm{Cov}\!\{ I\{A_t = a_t\}, Z_t \mid H_t \}} \times Y \Biggr].\label{eq:32identification32AIV32longitudinal322} \end{align}\tag{7}\] Alternatively, by setting \(s = 0\) and \(r = T+1\), the identification formula reduces to \(\psi_{\overline{a}}^o = \mathbb{E}\bigl[\gamma_{0,\overline{a}}^o(H_0)\bigr].\)
In this subsection, without loss of generality, we focus on the case where \(Z_t\) is univariate and \(\pi_t(Z_t, H_t)=Z_t\) is an RWF for \(A_t\). We derive the EIFs for \(\psi_{\overline{a}}^o\) in Equation 7 when \(\pi_t(Z_t, H_t) = Z_t\); for a general RWF \(\pi_t(Z_t, H_t)\), we can define \(Z_t^{\pi} := \pi_t(Z_t, H_t)\) and replace \(Z_t\) with \(Z_t^{\pi}\) in the subsequent analysis.
For notational convenience, define \(\gamma_{T+1,\underline{a}_{T+1}}^o(H_{T+1}) := Y\) and \(A_t^{(a_t)} := I\{A_t = a_t\}\). For \(t = T, \ldots, 0\), recursively define the nuisance functions: \[\begin{align} &\kappa_{t,a_t}^o(H_t) := \mathrm{Cov}\!\{A_t^{(a_t)}, Z_t \mid H_t\},\\ &\delta_{t,a_t}^o(H_t) := \mathbb{E}[A_t^{(a_t)} \mid H_t], && \eta_{t,\underline{a}_t}^o(H_t) := \mathbb{E}\bigl[A_t^{(a_t)} \gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}) \mid H_t\bigr],\\ &\rho_t^o(H_t) := \mathbb{E}[Z_t \mid H_t], && \gamma_{t,\underline{a}_t}^o(H_t) := \frac{\mathrm{Cov}\!\{ A_t^{(a_t)} \gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}), Z_t \mid H_t \}}{\mathrm{Cov}\!\{ A_t^{(a_t)}, Z_t \mid H_t \}}. \end{align}\] We summarize these nuisance functions into a unified nuisance vector: \[\begin{align} \label{eq:32nuisance32function32longitudinal} \alpha_{\overline{a}}^o := \{\alpha_{t,\underline{a}_t}^o\}_{t=0}^T, \qquad \alpha_{t,\underline{a}_t}^o := [ \delta_{t,a_t}^o, \kappa_{t,a_t}^o, \rho_t^o, \eta_{t,\underline{a}_t}^o, \gamma_{t,\underline{a}_t}^o]. \end{align}\tag{8}\] We now proceed to derive the EIF for \(\psi_{\overline{a}}^o\) in Equation 7 in the following theorem.
Theorem 6. Under Assumptions 6–9, suppose that for each \(t=0,\ldots,T\), \(\pi_t(Z_t, H_t)=Z_t\) is an RWF for \(A_t\). Then, the EIF for \(\psi_{\overline{a}}^o\) consists of \(\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o)\), where \[\begin{align} &\varphi_{\overline{a}}(O;\psi_{\overline{a}},\alpha_{\overline{a}}):= \prod_{t=0}^{T} \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} Y-\psi_{\overline{a}} +\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\right) \\&\times \left\{ \left(1-\dfrac{\{Z_t-\rho_t(H_t)\}\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}}{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\times\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\right\}. \end{align}\]
The next proposition derives the mixed bias property for the EIF in Theorem 6.
Proposition 9 (Mixed bias property). Under the conditions of Theorem 6, for any fixed nuisance vector \(\alpha_{\overline{a}}\) in Equation 8 , \(\mathbb{E}[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}})]\) equals to \[\begin{align} &\mathbb{E}\left[\begin{array}{l} \displaystyle\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\right) \times \dfrac{1}{\kappa_{t,a_t}(H_t)}\\ \times\left(\begin{array}{l} \left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ +\left\{\rho_t(H_t) - \rho_t^o(H_t)\right\}\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ +\left\{\rho_t(H_t) - \rho_t^o(H_t)\right\}\left\{\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t)\right\} \gamma_{t,\underline{a}_t}(H_t) \end{array}\right) \end{array}\right]. \end{align}\]
We develop a cross-fitting procedure for estimating \(\psi_{\overline{a}}^o\). This estimator can be intuitively understood as resulting from a backward fitting strategy. Let \(\hat{\mathbb{E}}^{(n,k)}[\phi(O) \mid H_t]\) denote an estimate of the conditional expectation \(\mathbb{E}[\phi(O)\mid H_t]\), and \(\widehat{\text{Cov}}^{(n,k)}\{\phi_1(O), \phi_2(O)\mid H_t\}\) denote an estimate of the conditional covariance \(\text{Cov}\{\phi_1(O),\phi_2(O)\mid H_t\}\). Both estimates are obtained using only the observations in \(I_{-k}\) and fitted using an appropriate machine learning method. Algorithm 2 summarizes the backward cross-fitting procedure.
Figure 3 provides a graphical illustration of Algorithm 2 for \(T=2\). The top row represents the evaluation set (\(I_k\)), while the bottom row corresponds to the training set (\(I_{-k}\)). The middle row depicts the nuisance estimators \(\hat{\alpha}_{t,\overline{a}}^{(n,k)}\), which are fitted using the training data and contribute to the final predictions. Importantly, \(\hat{\alpha}_{t,\overline{a}}^{(n,k)}\) is constructed without using any samples from the evaluation set \(I_k\), ensuring their independence from the evaluation data.
In particular, in Algorithm 2, we introduce the random variable \(\hat{\Psi}_{t,\underline{a}_{t}}^{(n,k)}\). By induction, one can verify that for any \(t=0,\ldots,T\), \[\begin{align} &\hat{\Psi}_{t,\underline{a}_{t}}^{(n,k)}=\prod_{s=t}^{T} \dfrac{\left\{Z_s-\hat{\rho}_s^{(n,k)}(H_s)\right\}A_s^{(a_s)}} {\hat{\kappa}_{s,a_s}^{(n,k)}(H_s)} Y +\sum_{s=t}^T\left(\displaystyle\prod_{r=t}^{s-1}\dfrac{\left\{Z_r-\hat{\rho}_r^{(n,k)}(H_r)\right\}A_r^{(a_r)}} {\hat{\kappa}_{r,a_r}^{(n,k)}(H_r)}\right) \\&\times \left\{\begin{array}{c} \left(1-\dfrac{\{Z_s-\hat{\rho}_s^{(n,k)}(H_s)\}\{A_s^{(a_s)}-\hat{\delta}_{s,a_s}^{(n,k)}(H_s)\}}{\hat{\kappa}_{s,a_s}^{(n,k)}(H_s)} \right)\hat{\gamma}_{s,\underline{a}_s}^{(n,k)}(H_s) \\- \dfrac{(Z_s-\hat{\rho}_s^{(n,k)}(H_s))\times\hat{\eta}_{s,\underline{a}_s}^{(n,k)}(H_s)}{\hat{\kappa}_{s,a_s}^{(n,k)}(H_s)} \end{array}\right\}. \end{align}\] Intuitively, \(\hat{\Psi}_{t,\underline{a}_t}^{(n,k)}\) provides an estimate of the true conditional mean potential outcomes \(\gamma_{t,\underline{a}_t}^o(H_t)\). That is, if the nuisance functions all equal to the truth, then \(\mathbb{E}[\hat{\Psi}_{t,\underline{a}_t}^{(n,k)}\mid H_t]=\gamma_{t,\underline{a}_t}^o(H_t)\). This type of estimator is referred to as a DR-Learner (or IF-Learner) in [51]–[53], where its theoretical properties are also established.
In addition, one can verify that \(\varphi_{\overline{a}}(O;\psi_{\overline{a}},\hat{\alpha}_{\overline{a}}^{(n,k)}) = \hat{\Psi}_{0,\overline{a}}^{(n,k)} - \psi_{\overline{a}}.\) This representation naturally motivates the construction of the estimator \(\hat{\psi}_{\overline{a}}^{(n)}\) as defined by the corresponding estimating equation in Algorithm 2 (the longitudinal analogue of Equation 5).
Next, we establish that, under the IV assumptions and the required convergence rates for the nuisance estimators, \(\hat{\psi}_{\overline{a}}^{(n)}\) is asymptotically normal, and its variance estimator is consistent.
Theorem 7 (Asymptotic normality of \(\hat{\psi}_{\overline{a}}^{(n)}\)). Under Assumptions 6–9, suppose that for each \(t = 0, \ldots, T\), \(Z_t\) serves as an AIV for \(A_t\), and that \(\pi_t(Z_t, H_t) = Z_t\) is an RWF for \(A_t\). Further, for each \(t = 0, \ldots, T\) and \(k = 1, \ldots, K\), suppose that the following rate condition holds for the nuisance functions \(\hat{\alpha}_{t,\overline{a}}^{(n,k)}\) defined in Algorithm 2: \[\begin{align} \left\{\begin{array}{l} \|\hat{\kappa}_{t,a_t}^{(n,k)}- \kappa_{t,a_t}^{o}\|_2 \times\|\hat{\gamma}_{t,\underline{a}_t}^{(n,k)}-\gamma_{t,\underline{a}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\eta}_{t,\underline{a}_t}^{(n,k)}-\eta_{t,\underline{a}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\delta}_{t,a_t}^{(n,k)}- \delta_{t,a_t}^{o}\|_2 \end{array}\right\}=o_p(n^{-1/2}). \end{align}\] Furthermore, assume that for any fixed \(k\) and time \(t\), \(\mathbb{E}[\|\hat{\alpha}_{t,\overline{a}}^{(n,k)}- \alpha_{t,\overline{a}}^o\|_2^2]=o(1)\). Then \(\sqrt{n}\{\hat{\psi}_{\overline{a}}^{(n)}-\psi_{\overline{a}}^o\}/\sigma_{\overline{a}}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where \(\hat{\psi}_{\overline{a}}^{(n)}\) is defined by the estimating equation in Algorithm 2, and \((\sigma_{\overline{a}}^o)^2:=\mathbb{E}[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o)^2]\). In addition, \(\hat{\sigma}_{\overline{a}}^{(n)}\) converges in probability to \(\sigma_{\overline{a}}^o\).
We note that one can adaptively select the RWF \(\pi_t(Z_t,H_t)\), as discussed in Section 3, to obtain an efficient estimator of \(\psi_{\overline{a}}^o\). A detailed discussion of this adaptive selection is provided in the appendix. Intuitively, a natural candidate for the adaptive RWF is the conditional probability \(\pi_t^o(Z_t,H_t) = \Pr(A_t = a_t \mid Z_t,H_t)\), which directly characterizes the treatment assignment mechanism given the instrument and history.
Finally, although we focus on the evaluation of static treatment rules, the proposed methods can be extended to the evaluation of dynamic treatment rules, as also demonstrated in the appendix. This extension allows for the incorporation of time-varying treatment strategies and enables more comprehensive assessments of treatment effects over time, enhancing the applicability of our approach to dynamic decision-making processes.
In this section, we conduct simulation studies with a binary treatment and a continuous IV to illustrate the asymptotic results established in Section 3.
We generate \(U\) and \(L\) independently from the uniform distribution \(U(-1,1)\). The continuous IV is constructed as \(Z = L + \sin(3L) + 2\epsilon_Z\), where \(\epsilon_Z\) is an exogenous error term drawn from the standard normal distribution. The binary treatment \(A\) is generated under the following two designs:
[A1] \(A \sim \mathrm{Bernoulli}\big(0.7\Phi(-2Z + 2L) + 0.3\Phi(3U - L)\big)\), where \(\Phi\) denotes the cumulative distribution function of the standard normal distribution.
[A2] \(A \sim \mathrm{Bernoulli}\big(\{1 + \exp(-(Z - L + U))\}^{-1}\big)\).
It is straightforward to verify that \(Z\) is an AIV for \(A\) in [A1], whereas in [A2] the AIV conditions are violated. Next, we independently generate \(\epsilon_Y \sim N(0,1)\). The outcome \(Y\) is then generated according to the following two models:
[Y1] \(Y = Y(A) = 2U - 2L + 4AL + \epsilon_Y\).
[Y2] \(Y = Y(A) = (1 - A)\{3\cos(2U) - 3\cos(2L)\} + A\{3\sin(2U) + 2L\} + \epsilon_Y\).
In [Y1], the condition \(Y(1) - Y(0)\perp\!\!\!\perp U \mid L\) holds, while in [Y2] this condition is violated. We set the sample size to \(n=5000\) and use \(K=2\) folds for cross-fitting. Nuisance functions are estimated via spline methods implemented in the mgcv package in R [54].
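For reference, a minimal sketch of the data-generating step under designs [A1] and [Y1] is given below (plain R, no packages required); the function name is ours.

```r
# Generate one dataset of size n under designs [A1] and [Y1].
gen_data_A1Y1 <- function(n) {
  U <- runif(n, -1, 1)
  L <- runif(n, -1, 1)
  Z <- L + sin(3 * L) + 2 * rnorm(n)                          # continuous IV
  pA <- 0.7 * pnorm(-2 * Z + 2 * L) + 0.3 * pnorm(3 * U - L)  # additive in (Z, U)
  A <- rbinom(n, 1, pA)                                       # design [A1]
  Y <- 2 * U - 2 * L + 4 * A * L + rnorm(n)                   # design [Y1]
  data.frame(Z = Z, A = A, Y = Y, L = L)
}
# dat <- gen_data_A1Y1(5000)
```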
The results are summarized in Table 1. The simulations suggest that both estimators are nearly unbiased under correct model specification, with estimated standard errors closely tracking empirical variability and coverage rates remaining near the nominal 95% level. Under misspecification ([Y2], [A2]), we observe somewhat larger bias and a modest decline in coverage. We also find that the adaptive weighting method generally yields smaller variance, indicating that it tends to be more efficient.
| DGP | Metric | Adaptive: Treated | Adaptive: Control | Adaptive: ATE | Prespecified \(\pi(Z,L)=Z\): Treated | Prespecified \(\pi(Z,L)=Z\): Control | Prespecified \(\pi(Z,L)=Z\): ATE |
|---|---|---|---|---|---|---|---|
| [Y1],[A1] | Bias | .0011 | .0011 | .0019 | .0018 | .0032 | .0028 |
| | SE | .0543 | .0545 | .0807 | .0626 | .0626 | .0920 |
| | SD | .0577 | .0581 | .0844 | .0612 | .0648 | .0922 |
| | CR | .927 | .938 | .939 | .947 | .935 | .950 |
| [Y2],[A1] | Bias | .0023 | .0078 | .0064 | .0043 | .0044 | .0012 |
| | SE | .0864 | .0611 | .1060 | .1006 | .0691 | .1220 |
| | SD | .0906 | .0637 | .1092 | .0987 | .0703 | .1200 |
| | CR | .926 | .946 | .941 | .951 | .951 | .952 |
| [Y1],[A2] | Bias | .0006 | .0020 | .0013 | .0006 | .0013 | .0003 |
| | SE | .0554 | .0561 | .0820 | .0565 | .0569 | .0833 |
| | SD | .0547 | .0575 | .0830 | .0563 | .0557 | .0828 |
| | CR | .954 | .934 | .938 | .956 | .956 | .944 |
| [Y2],[A2] | Bias | .0000 | .0318 | .0315 | .0027 | .0269 | .0301 |
| | SE | .0886 | .0615 | .1079 | .0908 | .0621 | .1100 |
| | SD | .0891 | .0619 | .1089 | .0915 | .0618 | .1110 |
| | CR | .948 | .912 | .936 | .955 | .924 | .941 |
To simplify the setting, we fix the time horizon at \(T = 1\). For each time point \(t = 0,\,1\), we independently generate noise terms \(\epsilon_{U_t}\), \(\epsilon_{L_t}\), \(\epsilon_{Z_t}\), and \(\epsilon_Y\) from standard normal distributions. Based on these, the variables are simulated according to the following DGP: \[\begin{align} &L_0 \sim 1.5\epsilon_{L_0},\quad U_0 \mid H_0 \sim 1.5\epsilon_{U_0};\\ &Z_0 \mid H_0,U_0 \sim 0.3 L_0 + \sin(1.5 L_0)+ 2\epsilon_{Z_0};\\ &A_0 \mid H_0,U_0,Z_0 \sim \text{Ber}(1,\,0.7\Phi(-2Z_0+0.6L_0) + 0.3\Phi(3U_0-L_0));\\ &L_1 \mid H_0,U_0,Z_0,A_0 \sim (A_0-0.5)+0.5L_0+0.3U_0+0.5 \epsilon_{L_1};\\ &U_1 \mid H_1,U_0 \sim (A_0-0.5)+0.5U_0+0.3L_1+0.5 \epsilon_{U_1};\\ &Z_1 \mid H_1,\overline{U}_1 \sim 0.5L_1-0.5(A_0-0.5)-0.3Z_0+2\epsilon_{Z_1};\\ &A_1 \mid H_1,\overline{U}_1,Z_1 \sim \text{Ber}(1,\,0.7\Phi(-2Z_1+L_1) + 0.3\Phi(3U_1-L_1));\\ &Y \mid H_1,\overline{U}_1,Z_1,A_1 \sim (A_1-0.5)+2L_1+U_1+0.5\epsilon_{Y}. \end{align}\] Notably, “Ber” represents the binomial distribution. \(Z_0\) and \(Z_1\) serve as AIVs for \(A_0\) and \(A_1\), respectively.
Data are generated using the R package simcausal [55]. The sample sizes are set to \(2000\) and \(5000\), with cross-fitting performed using \(K=2\) folds. Nuisance functions are estimated via spline methods implemented in the R package mgcv. Because the true treatment effects are not analytically available under the constructed DGPs, we estimate them using 100,000 samples by simulating potential outcomes under modified data-generating processes, where the pairs of treatment assignments \((A_0,A_1)\) are set deterministically to \((0,0)\), \((1,1)\), \((0,1)\), \((1,0)\), \((A_0,1)\), and \((A_0,0)\) (we use the notation \((A_0,0)\) to denote a dynamic treatment regime that follows the natural treatment rule for \(A_0\) while fixing \(A_1 = 0\)), as sketched in the code below.
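A rough sketch of how these ground-truth values could be approximated by Monte Carlo under the DGP displayed above (plain R; the function name is ours, and NA entries keep the natural treatment rule):

```r
# Monte Carlo approximation of E[Y(a0, a1)] under the longitudinal DGP above.
truth_mc <- function(a0 = NA, a1 = NA, n = 1e5) {
  L0 <- 1.5 * rnorm(n)
  U0 <- 1.5 * rnorm(n)
  Z0 <- 0.3 * L0 + sin(1.5 * L0) + 2 * rnorm(n)
  A0 <- rbinom(n, 1, 0.7 * pnorm(-2 * Z0 + 0.6 * L0) + 0.3 * pnorm(3 * U0 - L0))
  if (!is.na(a0)) A0 <- rep(a0, n)                  # intervene on A0
  L1 <- (A0 - 0.5) + 0.5 * L0 + 0.3 * U0 + 0.5 * rnorm(n)
  U1 <- (A0 - 0.5) + 0.5 * U0 + 0.3 * L1 + 0.5 * rnorm(n)
  Z1 <- 0.5 * L1 - 0.5 * (A0 - 0.5) - 0.3 * Z0 + 2 * rnorm(n)
  A1 <- rbinom(n, 1, 0.7 * pnorm(-2 * Z1 + L1) + 0.3 * pnorm(3 * U1 - L1))
  if (!is.na(a1)) A1 <- rep(a1, n)                  # intervene on A1
  Y  <- (A1 - 0.5) + 2 * L1 + U1 + 0.5 * rnorm(n)
  mean(Y)
}
# e.g. truth_mc(1, 0) approximates E[Y(1, 0)]; truth_mc(NA, 0) the regime (A0, 0).
```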
Table 2 summarizes the simulation results adopting Algorithm 2 under different intervention strategies and sample sizes. Across all scenarios, the estimators display small bias, and the estimated standard errors closely match the empirical standard deviations, indicating accurate variance estimation. As expected, increasing the sample size from 2000 to 5000 reduces variability and tightens confidence intervals. The empirical coverage rates are generally close to the nominal 95% level, though a modest decline is observed in certain intervention settings. Overall, the results demonstrate that the proposed method performs well and provides reliable inference in most cases.
| Size | Metric | \((0,0)\) | \((0,1)\) | \((1,0)\) | \((1,1)\) | \((A_0,0)\) | \((A_0,1)\) |
|---|---|---|---|---|---|---|---|
| 2000 | Bias | .0234 | .0071 | .0147 | .0189 | .0016 | .0033 |
| | SE | .6090 | .5267 | .5266 | .5960 | .1401 | .1403 |
| | SD | .6102 | .5382 | .5203 | .6343 | .1395 | .1460 |
| | CR | .955 | .959 | .960 | .938 | .938 | .929 |
| 5000 | Bias | .0134 | .0078 | .0057 | .0155 | .0057 | .0037 |
| | SE | .2143 | .1882 | .1962 | .2115 | .0658 | .0651 |
| | SD | .2211 | .1949 | .1870 | .2004 | .0607 | .0612 |
| | CR | .940 | .930 | .942 | .956 | .956 | .950 |
We apply our framework to study returns to schooling and post-school training as sequential treatments and to conduct a policy analysis, using the dataset provided in the supplementary materials of [56]. Schooling and post-school training are two central interventions influencing labor market outcomes such as earnings and employment [57]. To enable such analysis, [56] merged data from the Job Training Partnership Act (JTPA) Title II with additional sources on high school (HS) education, thereby constructing a dataset suitable for evaluating the effects of HS diplomas and subsidized job training as sequential treatments. The final sample comprises 9,223 individuals.
We now describe the key features of this dataset. Let \(A_0\) denote whether an individual obtains a high school (HS) diploma, and let \(A_1\) indicate participation in the job training program. Define \(L_0\) as sex (a baseline covariate) and \(L_1\) as pre-program earnings, which serve as time-varying confounders. The initial treatment \(A_0\) influences subsequent pre-program earnings \(L_1\), and the allocation of \(A_1\) may adapt to \(L_1\). The instruments are given by \(Z_0\), the number of high schools per square mile, and \(Z_1\), a random assignment to job training. Our target outcome is \(Y\), the indicator that the potential terminal earnings exceed their empirical median.
We consider the dynamic treatment regime (DTR) \((g_0, g_1) \in \{0,1,\text{x}\} \times \{0,1,d^+,d^-\}\). For the first-stage rule \(g_0\), the value ‘0’ assigns \(A_0=0\), ‘1’ assigns \(A_0=1\), and ‘x’ follows the natural selection rule (i.e., the observed assignment). For the second-stage rule \(g_1\), the value ‘0’ assigns \(A_1=0\), ‘1’ assigns \(A_1=1\), ‘\(d^+\)’ assigns \(A_1=1\) only when \(L_1\) is below the 80% quantile, and ‘\(d^-\)’ assigns \(A_1=1\) only when \(L_1\) is above the 80% quantile. The target is to estimate \(\mathbb{E}[Y(g_0(H_0),g_1(H_1))]\). We use the spline methods in the R package mgcv to estimate the nuisance parameters and compute the bootstrap mean and standard deviation based on 1,000 replications.
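As an illustration of how the second-stage rules might be encoded (a sketch under the assumption that the threshold is the empirical 80% quantile of pre-program earnings; function and variable names are ours):

```r
# Encode the second-stage rules d+ and d- given pre-program earnings L1.
dtr_stage2 <- function(L1, rule = c("d+", "d-")) {
  rule <- match.arg(rule)
  q80 <- quantile(L1, 0.8)
  if (rule == "d+") as.numeric(L1 < q80)    # treat only low earners
  else              as.numeric(L1 >= q80)   # treat only high earners
}
```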
| DTRs | 00 | 01 | 0\(d^+\) | 0\(d^-\) | 10 | 11 | 1\(d^+\) | 1\(d^-\) | x0 | x1 | x\(d^+\) | x\(d^-\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EST | .292 | .321 | .343 | .264 | .653 | .668 | .732 | .589 | .485 | .550 | .548 | .487 |
| SE | .124 | .089 | .094 | .120 | .155 | .122 | .132 | .147 | .017 | .012 | .013 | .016 |
| SD | .108 | .074 | .081 | .103 | .129 | .089 | .102 | .126 | .017 | .011 | .012 | .015 |
The results are reported in Table 3, which presents the estimated mean potential outcomes (EST), the mean estimated standard errors (SE), and the empirical standard deviations (SD) for various DTRs. The DTRs are denoted by combinations of values in \(\{0,1,\text{x}\}\) for the first-stage rule \(g_0\) and \(\{0,1,d^+,d^-\}\) for the second-stage rule \(g_1\), yielding 12 distinct combinations.
We observe that the SE and SD for (x0, x1, x\(d^+\), x\(d^-\)) are smaller than those for the other DTRs. This is likely due to the relatively weak correlation between \(Z_0\) and \(A_0\): rules that intervene on \(A_0\) must rely on this weak first-stage instrument, whereas the natural rule ‘x’ does not. Furthermore, the SE and SD values for the same DTR are fairly consistent with each other, supporting the validity of the variance estimator in our algorithm.
Turning to the EST values, the DTRs involving the first-stage rule ‘1’ (10, 11, 1\(d^+\), 1\(d^-\)) generally show higher estimates than those involving ‘0’ (00, 01, 0\(d^+\), 0\(d^-\)), with the natural rules (x0, x1, x\(d^+\), x\(d^-\)) falling in between, suggesting that obtaining a high school diploma at the first stage has a positive influence on terminal income.
On the other hand, the estimates for DTRs (0\(d^+\), 1\(d^+\), x\(d^+\)), which assign the training program only to low-earning individuals, are higher than those for DTRs (00, 10, x0) and (01, 11, x1), whereas the estimates for DTRs (0\(d^-\), 1\(d^-\), x\(d^-\)), which assign the program only to high-earning individuals, are lower. This pattern is consistent with the findings of [56], indicating that targeting the job-training program at low-earning individuals has a positive influence on terminal income.
In this article, we develop an AIV framework for identifying causal effects with multi-categorical or continuous IVs and treatment variables. We elucidate the connection between classical TSLS estimation and the identification of causal estimands under the IV framework. Our methods are illustrated in several classical causal inference settings, including marginal structural models (MSMs) and longitudinal data. Furthermore, we analyze the efficiency, asymptotic normality, and asymptotic unbiasedness of the proposed estimators when different RWFs are employed to estimate the ATE.
Looking ahead, a promising direction is to identify other conditions, analogous to the AIV condition, that guarantee the existence of a solution to the nonparametric IV problem in Equation 1. Additionally, accommodating right censoring and estimating counterfactual survival curves under a general IV framework constitute important directions for future research [58]. Furthermore, given the proposed identification result for continuous treatments, it would be of interest to develop a debiased learning approach by leveraging the techniques of [59]. Finally, identifying the optimal weighting function under more general settings within the proposed framework represents another avenue for further investigation.
The code for the simulation studies and the real data analysis is publicly available at https://github.com/chensy123-sys/Additive-IV.
In this subsection, we illustrate the identification strategy under a dynamic treatment regime. Let \(\overline{g} := [g_0, \ldots, g_T]\) denote a sequence of deterministic DTRs, where each mapping satisfies \(g_t: \mathcal{H}_t \rightarrow \mathcal{A}_t\). Define the potential outcome under regime \(\overline{g}\) as \[Y(\overline{g}) := Y\{g_0(H_0), \ldots, g_T(H_T)\},\] that is, the outcome observed when the subject follows the treatment rule \(A_t = g_t(H_t)\) at each time \(t\). Similarly, for \(s = 0, \ldots, T\), define \[Y(\underline{g}_s) := Y(A_0, \ldots, A_{s-1}, g_s(H_s), \ldots, g_T(H_T))\] as the potential outcome when the subject follows the rule \(A_t = g_t(H_t)\) for all \(t \ge s\).
Without loss of generality, we assume that \(\pi_t(Z_t, H_t) = Z_t\), which is taken to be an RWF. We now introduce the nuisance functions used in the dynamic treatment setting. Define \(\gamma_{T+1,\underline{g}_{T+1}}^o(H_{T+1}) := Y\). For \(t = T, \ldots, 0\), recursively define the nuisance functions: \[\begin{align} &\delta_{t,g_t}^o(H_t) := \mathbb{E}[I\{A_t=g_t(H_t)\} \mid H_t], \\ & \eta_{t,\underline{g}_t}^o(H_t) := \mathbb{E}\bigl[I\{A_t=g_t(H_t)\} \gamma_{t+1,\underline{g}_{t+1}}^o(H_{t+1}) \mid H_t\bigr],\\ &\rho_t^o(H_t) := \mathbb{E}[Z_t \mid H_t], \\ &\kappa_{t,g_t}^o(H_t) := \mathrm{Cov}\!\{I\{A_t=g_t(H_t)\}, Z_t \mid H_t\}, \\ & \gamma_{t,\underline{g}_t}^o(H_t) := \frac{\mathrm{Cov}\!\{ I\{A_t=g_t(H_t)\} \gamma_{t+1,\underline{g}_{t+1}}^o(H_{t+1}), Z_t \mid H_t \}}{\mathrm{Cov}\!\{ I\{A_t=g_t(H_t)\}, Z_t \mid H_t \}}. \end{align}\] Denote the nuisance vector as \[\begin{align} \alpha_{\overline{g}}^o := \{\alpha_{t,\underline{g}_t}^o\}_{t=0}^T, \qquad \alpha_{t,\underline{g}_t}^o := [ \delta_{t,g_t}^o, \kappa_{t,g_t}^o, \rho_t^o, \eta_{t,\underline{g}_t}^o, \gamma_{t,\underline{g}_t}^o]. \end{align}\] Then the identification strategy is derived in the next proposition.
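Before stating the proposition, we note that each backward step of this recursion involves only conditional means given \(H_t\). The following is a minimal sketch, under hypothetical column names (Zt, At, and two history summaries H1, H2), of how one such step could be fitted with mgcv smoothers; it is an illustration rather than the implementation used in our experiments.

```r
# A minimal sketch of one backward step of the recursion; Zt, At, H1, H2 are
# hypothetical column names, and gamma_next plays the role of gamma_{t+1}.
library(mgcv)

backward_step <- function(dat, gamma_next, g_t) {
  # dat: data frame with the time-t instrument Zt, treatment At, and history H1, H2
  # gamma_next: gamma_{t+1, g_{t+1}}(H_{t+1}) computed at the previous (later) step;
  #             at t = T this is simply the outcome Y
  # g_t: vector of treatments assigned by the rule g_t(H_t)
  dat$Ig  <- as.numeric(dat$At == g_t)   # I{A_t = g_t(H_t)}
  dat$W   <- dat$Ig * gamma_next         # I{A_t = g_t(H_t)} * gamma_{t+1}
  dat$IgZ <- dat$Ig * dat$Zt
  dat$WZ  <- dat$W * dat$Zt

  rho   <- fitted(gam(Zt  ~ s(H1) + s(H2), data = dat))                # rho_t(H_t)
  delta <- fitted(gam(Ig  ~ s(H1) + s(H2), data = dat))                # delta_{t,g_t}(H_t)
  eta   <- fitted(gam(W   ~ s(H1) + s(H2), data = dat))                # eta_{t,g_t}(H_t)
  kappa <- fitted(gam(IgZ ~ s(H1) + s(H2), data = dat)) - delta * rho  # Cov{I{A_t=g_t}, Z_t | H_t}
  num   <- fitted(gam(WZ  ~ s(H1) + s(H2), data = dat)) - eta * rho    # Cov{I{A_t=g_t} gamma_{t+1}, Z_t | H_t}

  list(rho = rho, delta = delta, eta = eta, kappa = kappa, gamma = num / kappa)
}
```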
Proposition 10. Under Assumptions 6–9, let \(0 \le s \le T+1\), \(r \ge 0\), and \(s + r \le T+1\). Suppose that, for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\), and that \(\pi_t(Z_t, H_t)\) is an RWF for \(A_t\). Then, \[\begin{align} \mathbb{E}\bigl[Y(\underline{g}_{s})\bigr]= \mathbb{E}\left[ \prod_{t=s}^{T-r} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)I\{A_t=g_t(H_t)\}}{ \kappa_{t,g_t}^o(H_t) } \gamma_{T+1-r,\underline{g}_{T+1-r}}^o(H_{T+1-r}) \right]. \end{align}\]
The proof of Proposition 10 is similar to that of Proposition 8 and is thus omitted. Setting \(s=0\) and \(r=0\), we see that the mean potential outcome can be identified as \[\begin{align} \mathbb{E}\bigl[Y(\overline{g})\bigr]=\psi_{\overline{g}}^o:= \mathbb{E}\left[ \prod_{t=0}^{T} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)I\{A_t=g_t(H_t)\}} { \kappa_{t,g_t}^o(H_t) }Y \right]. \end{align}\] Next, we derive the EIF for \(\psi_{\overline{g}}^o\).
Theorem 8. Under Assumptions 6–9, suppose that for each \(t=0,\ldots,T\), \(\pi_t(Z_t, H_t)=Z_t\) is an RWF for \(A_t\). Then, if we define \(A_t^{(g_t)}:=I\{A_t = g_t(H_t)\}\), the EIF for \(\psi_{\overline{g}}^o\) consists of \(\varphi_{\overline{g}}(O;\psi_{\overline{g}}^o,\alpha_{\overline{g}}^o)\), where \[\begin{align} &\varphi_{\overline{g}}(O;\psi_{\overline{g}},\alpha_{\overline{g}}):= \prod_{t=0}^{T} \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(g_t)}} {\kappa_{t,g_t}(H_t)} Y-\psi_{\overline{g}} +\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(g_s)}} {\kappa_{s,g_s}(H_s)}\right) \\&\times \left\{ \left(1-\dfrac{\{Z_t-\rho_t(H_t)\}\{A_t^{(g_t)}-\delta_{t,g_t}(H_t)\}}{\kappa_{t,g_t}(H_t)} \right)\gamma_{t,\underline{g}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\times\eta_{t,\underline{g}_t}(H_t)}{\kappa_{t,g_t}(H_t)}\right\}. \end{align}\]
For estimation, we can replace \(A_t^{(a_t)}\) with \(A_{t}^{(g_t)}=I\{A_t=g_t(H_t)\}\), and \(\underline{a}_t\) with \(\underline{g}_t\) in Algorithm 2, and construct the corresponding backward cross-fitting algorithm for dynamic treatment regimes. Concretely, Algorithm 4 can be viewed as a generalization of Algorithm 2 to estimating the mean potential outcomes under the DTRs \(\overline{g}\). An analogous version of Theorem 7 can be derived for the potential outcome mean estimator \(\hat{\psi}_{\overline{g}}^{(n)}\) in Algorithm 4, establishing the asymptotic consistency and normality of our proposed estimator.
Theorem 9 (Asymptotic normality of \(\hat{\psi}_{\overline{g}}^{(n)}\) in Algorithm 4). Under Assumptions 6–10, suppose that for each \(t = 0, \ldots, T\), \(Z_t\) serves as an AIV for \(A_t\). Further, for each \(t = 0, \ldots, T\) and \(k = 1, \ldots, K\), suppose that the following rate condition holds for the nuisance functions \(\hat{\alpha}_{t,\overline{g}}^{(n,k)}\) defined in Algorithm 4: \[\begin{align} \left\{\begin{array}{l} \|\hat{\kappa}_{t,g_t}^{(n,k)}- \kappa_{t,g_t}^{o}\|_2 \times\|\hat{\gamma}_{t,\underline{g}_t}^{(n,k)}-\gamma_{t,\underline{g}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\eta}_{t,\underline{g}_t}^{(n,k)}-\eta_{t,\underline{g}_t}^o\|_2\\ +\|\hat{\rho}_t^{(n,k)} - \rho_t^o\|_2\times \|\hat{\delta}_{t,g_t}^{(n,k)}- \delta_{t,g_t}^{o}\|_2 \end{array}\right\}=o_p(n^{-1/2}). \end{align}\] Furthermore, assume that for any fixed \(k\) and time \(t\), \(\mathbb{E}[\|\hat{\alpha}_{t,\overline{g}}^{(n,k)}- \alpha_{t,\overline{g}}^o\|_2^2]=o(1)\). Then \(\sqrt{n}\{\hat{\psi}_{\overline{g}}^{(n)}-\psi_{\overline{g}}^o\}/\sigma_{\overline{g}}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where \(\hat{\psi}_{\overline{g}}^{(n)}\) is the estimator constructed in Algorithm 4, and \((\sigma_{\overline{g}}^o)^2:=\mathbb{E}[\varphi_{\overline{g}}(O;\psi_{\overline{g}}^o,\alpha_{\overline{g}}^o)^2]\). In addition, \(\hat{\sigma}_{\overline{g}}^{(n)}\) converges in probability to \(\sigma_{\overline{g}}^o\).
The proofs of the two theorems above are omitted, since they are similar to those of Theorems 4 and 6.
In Section 3, we discussed that one can adaptively choose the weighting function by setting \(\pi^o(Z,L)=\mathbb{E}[A\mid Z,L]\) when estimating the ATE of interest. In this subsection, we generalize this strategy to longitudinal data. First, we define the nuisance functions as \[\begin{align} &\pi_{t,a_t}^o(Z_t, H_t):=\mathbb{E}[A_t^{(a_t)}\mid Z_t, H_t],\\ &\delta_{t,a_t}^o(H_t):=\mathbb{E}[A_t^{(a_t)}\mid H_t],\\ &\kappa_{t,a_t}^o(H_t):=\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t)\mid H_t\}, \\ &\xi_{t,\underline{a}_t}^o(Z_t,H_t):=\mathbb{E}[A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\mid Z_t,H_t], \\ &\eta_{t,\underline{a}_t}^o(H_t):=\mathbb{E}[A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\mid H_t],\\ &\gamma_{t,\underline{a}_t}^o(H_t):=\dfrac{\mathrm{Cov}\!\{A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1}), \pi_{t,a_t}^o(Z_t,H_t) \mid H_t\}}{\mathrm{Var}\!\{ \pi_{t,a_t}^o(Z_t,H_t) \mid H_t\}}. \end{align}\] Define the nuisance vector as \[\begin{align} \beta_{\overline{a}} := \{\beta_{t,\underline{a}_t}\}_{t=0}^T,\quad\beta_{t,\underline{a}_t} := [\pi_{t,a_t}, \delta_{t,a_t}, \kappa_{t,a_t}, \xi_{t,\underline{a}_t}, \eta_{t,\underline{a}_t}, \gamma_{t,\underline{a}_t}]. \end{align}\]
Following the same logic as in Proposition 8, we derive the following proposition, which enables identification using the adaptively selected weights.
Proposition 11. Under Assumptions 6–10, let \(0 \le s \le T+1\), \(r \ge 0\), and \(s + r \le T+1\). Suppose that, for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\). Then, the mean potential outcomes \(\mathbb{E}\bigl[Y(\underline{a}_{s})\bigr]\) can be expressed as \[\begin{align} \mathbb{E}\left[ \left(\prod_{t=s}^{T-r} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)}\right) \times \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r}) \right]. \end{align}\]
The proof of this proposition is similar to that of Proposition 8 and is thus omitted. In particular, one can identify the mean potential outcome \(\mathbb{E}[Y(\overline{a})]\) by \[\begin{align} \psi_{\overline{a},ada}^o := \mathbb{E}\left[ \prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \right]. \label{eq:32identification32AIV32longitudinal39322} \end{align}\tag{9}\] A single-period plug-in sketch of this identification formula is given below; the next theorem then derives the EIF for \(\psi_{\overline{a},ada}^o\).
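As an aside, the following is a minimal single-period (\(T=0\)) plug-in sketch of Equation 9, with hypothetical column names Z (a continuous instrument), A, L, and Y, and a fixed treatment level a; it illustrates the formula only and ignores cross-fitting and debiasing.

```r
# Single-period plug-in sketch of Equation 9; column names Z, A, L, Y are hypothetical.
library(mgcv)

psi_ada_plugin <- function(dat, a) {
  dat$Ia     <- as.numeric(dat$A == a)                         # A^{(a)} = I{A = a}
  dat$pihat  <- fitted(gam(Ia ~ s(Z) + s(L), data = dat))      # pi_a(Z, L) = E[A^{(a)} | Z, L]
  dat$pihat2 <- dat$pihat^2
  delta      <- fitted(gam(Ia ~ s(L), data = dat))             # delta_a(L) = E[A^{(a)} | L]
  # Var{pi_a(Z, L) | L}; note E[pi_a(Z, L) | L] = delta_a(L) by the tower property
  kappa      <- fitted(gam(pihat2 ~ s(L), data = dat)) - delta^2
  mean((dat$pihat - delta) / kappa * dat$Ia * dat$Y)           # plug-in estimate of E[Y(a)]
}
```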
Theorem 10. Under Assumptions 6–10, suppose that for each \(t=0,\ldots,T\), \(Z_t\) serves as an AIV for \(A_t\). Then, the EIF for \(\psi_{\overline{a},ada}^o\) consists of \(\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)\), where \[\begin{align} &\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada},\beta_{\overline{a}}):= \prod_{t=0}^{T} \frac{\pi_{t,a_t}(Z_t,H_t) - \delta_{t,a_t}(H_t)}{\kappa_{t,a_t}(H_t)}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}\\ &+\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}(Z_s,H_s) - \delta_{s,a_s}(H_s)} {\kappa_{s,a_s}(H_s)}A_s^{(a_s)}\right)\times \dfrac{1}{\kappa_{t,a_t}(H_t)}\\ &\times \left\{\begin{array}{l} \{A_t^{(a_t)} - \pi_{t,a_t}(Z_t,H_t)\} \xi_{t,\underline{a}_t}(Z_t,H_t) -\{A_t^{(a_t)} - \delta_{t,a_t}(H_t)\} \eta_{t,\underline{a}_t}(H_t)\\ +\gamma_{t,\underline{a}_t}(H_t) \times\left(\kappa_t(H_t)+\{A_t^{(a_t)} - \pi_{t,a_t}(Z_t,H_t)\}^2-\{A_t^{(a_t)} - \delta_{t,a_t}(H_t)\}^2\right) \end{array}\right\}. \end{align}\]
Algorithm 5 presents the backward cross-fitting procedure for constructing a debiased estimator of \(\psi_{\overline{a},ada}^o\). Finally, we establish the asymptotic normality of the resulting estimator \(\hat{\psi}_{\overline{a},ada}^{(n)}\) of the functional in Equation 9.
Theorem 11 (Asymptotic normality of \(\hat{\psi}_{\overline{a},ada}^{(n)}\) in Algorithm 5). Under Assumptions 6–10, suppose that \(Z_t\) is an AIV for \(A_t\). Assume that for any \(k=1,\ldots,K\) and \(t=0,\ldots,T\), \(\mathbb{E}[\|\hat{\beta}_{t,\underline{a}_t}^{(n,k)}-\beta_{t,\underline{a}_t}^o\|_2^2]=o(1)\), and that \[\begin{align} \left\{ \begin{array}{l} \|\hat{\gamma}_{t,\underline{a}_t}^{(n,k)}-\gamma_{t,\underline{a}_t}^o\|_2 \times \|\hat{\kappa}_{t,a_t}^{(n,k)}-\kappa_{t,a_t}^o\|_2\\ +\|\hat{\delta}_{t,a_t}^{(n,k)}-\delta_{t,a_t}^o\|_2^2+\|\hat{\pi}_{t,a_t}^{(n,k)}-\pi_{t,a_t}^o\|_2^2\\ +\|\hat{\xi}_{t,\underline{a}_t}^{(n,k)}-\xi_{t,\underline{a}_t}^o\|_2\times \|\hat{\pi}_{t,a_t}^{(n,k)}-\pi_{t,a_t}^o\|_2\\ +\|\hat{\eta}_{t,\underline{a}_t}^{(n,k)}-\eta_{t,\underline{a}_t}^o\|_2\times \|\hat{\delta}_{t,a_t}^{(n,k)}-\delta_{t,a_t}^o\|_2 \end{array}\right\} =o_p(n^{-1/2}). \end{align}\] Then \(\sqrt{n}\left(\hat{\psi}_{\overline{a},ada}^{(n)}-\psi_{\overline{a},ada}^o\right)/\sigma_{\overline{a},ada}^o\) converges in distribution to \(\mathcal{N}(0,1)\), where the asymptotic variance is defined as \((\sigma_{\overline{a},ada}^o)^2:=\mathbb{E}[\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)^2]\). In addition, the variance estimator \((\hat{\sigma}_{\overline{a},ada}^{(n)})^2\) from Algorithm 5 converges in probability to \((\sigma_{\overline{a},ada}^o)^2\).
The proof of this theorem is omitted, since it is similar to the proof of Theorem 4.
The AIV condition is relatively strong and may fail to hold in some empirical settings. As an alternative, we consider an identification strategy based on a natural generalization of the multiplicative IV framework for binary instruments, originally proposed by [20]. This generalized framework relaxes the strict additive structure required by the AIV condition, while preserving the essential exclusion and relevance properties of a valid instrument. It highlights that the AIV condition is not the only means to ensure the well-posedness of the nonparametric IV problem in Equation 1 .
Definition 5 (Multiplicative IV). For each \(a\in\mathcal{A}\), we say that \(Z\) is a multiplicative IV (MIV) for \(A = a\) if there exist functions \(b(U,L)\) and \(c(Z,L)\) such that \[\Pr(A\neq a\mid Z,U,L) = b(U,L)\cdot c(Z,L).\]
The MIV condition implies that, conditional on the observed confounders \(L\), the instrument–treatment association on the multiplicative scale is unaffected by unmeasured confounding, effectively ruling out any \(U\)–\(Z\) interaction. Accordingly, it relies on the analyst’s ability to observe and adjust for a sufficiently rich set of covariates to ensure that the instrument’s effect on treatment remains stable across levels of the hidden confounder. The following proposition presents the resulting identification strategy under an MIV.
Proposition 12 (MIV identification). Under Assumptions 1–4, for \(a\in\mathcal{A}\), assume that \(Z\) is an MIV for \(A=a\), and that \(\pi(Z,L)\) is an RWF. Then there exists a unique solution \(f_a^o(A^{(a)}, L)\) to Equation 1 , which has explicit form \[\begin{align} &f_a^o(0,L)=\mathbb{E}[Y(a)\mid L]-\mathbb{E}[Y(a)\mid L,A\neq a],\\ &f_a^o(1,L)=\mathbb{E}[Y(a)\mid L]. \end{align}\] In particular, for any regular \(\pi(Z,L)\) for \(A=a\), it holds that \[\begin{align} \mathbb{E}[Y(a)]=\psi_{a,MIV}^o:=\mathbb{E}\left[(1-A^{(a)})\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, \pi(Z,L) \mid L\}}+A^{(a)}Y\right]. \end{align}\]
Example 1 (Binary IV and treatment). Without loss of generality, set \(\pi(Z,L)=Z\). Under Assumptions 1–4, if \(Z\) is an MIV for \(A\), one can recover a variation of the results in [20]. Specifically, Proposition 12 asserts that \[\begin{align} &\mathbb{E}[Y(1)] =\mathbb{E}\left[AY+(1-A)\frac{\mathbb{E}[AY\mid Z=1,L] - \mathbb{E}[AY\mid Z=0,L]}{\mathbb{E}[A\mid Z=1,L] - \mathbb{E}[A\mid Z=0,L]}\right]. \end{align}\]
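A minimal plug-in sketch of this display, with a binary instrument Z, binary treatment A, a single covariate L, and outcome Y (all hypothetical column names), is as follows; it uses simple mgcv regressions for the conditional means and is an illustration only.

```r
# Plug-in sketch of E[Y(1)] under the MIV condition; Z, A, L, Y are hypothetical column names.
library(mgcv)

miv_plugin <- function(dat) {
  dat$AY <- dat$A * dat$Y
  mAY <- gam(AY ~ Z + s(L), data = dat)                           # E[AY | Z, L]
  mA  <- gam(A  ~ Z + s(L), data = dat)                           # E[A  | Z, L]
  d1  <- transform(dat, Z = 1)
  d0  <- transform(dat, Z = 0)
  num <- predict(mAY, newdata = d1) - predict(mAY, newdata = d0)  # E[AY | Z=1, L] - E[AY | Z=0, L]
  den <- predict(mA,  newdata = d1) - predict(mA,  newdata = d0)  # E[A  | Z=1, L] - E[A  | Z=0, L]
  mean(dat$A * dat$Y + (1 - dat$A) * num / den)                   # plug-in estimate of E[Y(1)]
}
```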
At this stage, one might regard the MIV condition as a viable alternative to the AIV condition. However, as we demonstrate in the subsequent subsections, the AIV condition possesses several desirable properties that are not, in general, ensured under the MIV condition. This highlights the distinctive advantages of the AIV condition.
Without loss of generality, we just take \(\pi(Z,L)=Z\). Denote the nuisance functions as \[\begin{align} &\delta_a^o(L) := \mathbb{E}[A^{(a)} \mid L], && \eta_a^o(L) := \mathbb{E}[A^{(a)}Y \mid L],\\ &\kappa_{a}^o(L) := \mathrm{Cov}\!\{A^{(a)}, Z \mid L\}, && \zeta_{a}^o(L) := \mathbb{E}[A^{(a)}Y Z \mid L],\\ &\rho^o(L) := \mathbb{E}[Z \mid L], && \gamma_{a}^o(L) := \dfrac{\mathrm{Cov}\!\{A^{(a)}Y, Z \mid L\}}{\mathrm{Cov}\!\{A^{(a)}, Z \mid L\}}. \end{align}\] We can unify these nuisance functions into a nuisance vector as \[\begin{align} \alpha_{a,MIV}^o = [\delta_a^o,\eta_a^o,\kappa_{a}^o, \zeta_{a}^o,\rho^o,\gamma_{a}^o]. \end{align}\] Next, we derive the EIF for \(\psi_{a,MIV}^o\).
Theorem 12. Under Assumptions 1–4, for \(a\in\mathcal{A}\), assume that \(\pi(Z,L)=Z\) is an RWF for \(A=a\). Then, the EIF for \(\psi_{a,MIV}^o\) is \[\begin{align} \varphi_{a,MIV}(O;\psi_{a,MIV}^o,\alpha_{a,MIV}^o)=& (1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left(A^{(a)}Y-\eta_a^o(L)\right)(Z-\rho^o(L))\\ &-\dfrac{(1-\delta_a^o(L))}{\kappa_a^o(L)}\gamma_a^o(L) \left(A^{(a)}-\delta_a^o(L)\right)(Z - \rho^o(L)). \end{align}\]
As an extension of Proposition 3, we identify the parameter of interest in the parametric marginal structural mean model, specified for each \(a \in \mathcal{A}\), \[\label{eq:32MSM} \mathbb{E}[Y(a) \mid V] = g(a, V; \psi_{MSM}^o),\tag{10}\] where \(g(a, V; \psi_{MSM}^o)\) is a known function, \(V\) is a subset of the observed confounders \(L\), and \(\psi_{MSM}^o \in \mathbb{R}^q\) is the finite-dimensional parameter of interest. This type of model has been extensively studied in [16], [17], [60], [61]. Specifically, for any RWF \(\pi(Z,L)\) of \(A\), we denote the propensity score function as \[\begin{align} \omega_{\pi}^o(a,Z,L):=\dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}}. \end{align}\]
Proposition 13. Under Assumptions 1–4, suppose that \(Z\) is an AIV for \(A\), and that the potential outcome \(Y(a)\) follows the parametric marginal structural model in Equation 10 . Then, for any RWF \(\pi(Z, L)\) of \(A\), the parameter \(\psi_{MSM}^o \in\mathbb{R}^q\) satisfies \[\mathbb{E}\left[\omega_{\pi}^o(A, Z, L) \left\{Y - g(A, V; \psi_{MSM}^o)\right\}\mid A,V\right] = 0.\]
Notably, when \(Z\) is binary, Proposition 13 reduces to the identification result in Proposition 3. One may then construct an estimator for \(\psi_{MSM}^o\) using the GMM framework, as sketched below; a detailed exploration of such estimation procedures is beyond the scope of this article.
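For illustration, the following is a minimal GMM-style sketch under a hypothetical linear MSM \(g(a, V; \psi) = \psi_1 + \psi_2 a\) with hypothetical column names and a pre-computed weight vector; it is not a full GMM implementation with optimal weighting.

```r
# GMM-style sketch of the moment condition in Proposition 13 for a hypothetical
# linear MSM g(a, V; psi) = psi1 + psi2 * a; column names A, Y are hypothetical.
gmm_msm <- function(dat, omega) {
  # omega: vector of estimated weights omega_pi(A_i, Z_i, L_i)
  moments <- function(psi) {
    resid <- dat$Y - (psi[1] + psi[2] * dat$A)   # Y - g(A, V; psi)
    m <- cbind(1, dat$A) * omega * resid         # instruments h(A, V) = (1, A)
    colMeans(m)
  }
  # solve the sample moment equations by minimizing their squared norm
  optim(c(0, 0), function(psi) sum(moments(psi)^2))$par
}
```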
Rather than directly identifying the causal effect as in Proposition 3, some studies adopt the practice of dichotomizing a continuous instrument \(Z\) into a binary IV [14], [62]. A key observation is that if the original continuous \(Z\) satisfies the AIV condition for \(A\), then its discretized counterpart also satisfies the AIV condition. The following proposition formalizes this result, providing theoretical guarantees for identification based on discretized IVs and thereby enabling valid inference via the bounded IV approach of [11].
Proposition 14 (Discretized AIV identification). Let \(\mathcal{S}=\{S_1, \ldots, S_M\}\) be an arbitrary partition of \(\mathcal{Z}\) such that \(\Pr(Z\in S_M)>0\). For \(m=1,\ldots,M\), define \(Z_{\mathcal{S}} := m\) if \(Z \in S_m\). Assume that for any \(a\in\mathcal{A}\) and \(l\in \mathcal{L}\), there exist two distinct values \(z_1,z_2\) such that \[\label{eq:32IV32relevance32discrete} \Pr(A=a\mid Z_{\mathcal{S}}=z_1,L=l) \neq \Pr(A=a\mid Z_{\mathcal{S}}=z_2,L=l).\qquad{(7)}\] Under Assumptions 1–3, if \(Z\) is an AIV for \(A\), the nonparametric IV equation \[\mathbb{E}[A^{(a)}Y\mid Z_{\mathcal{S}},L]=\mathbb{E}[f_{a,\mathcal{S}}^o(A^{(a)},L)\mid Z_{\mathcal{S}},L]\] has the unique solution \(f_{a,\mathcal{S}}^o(A^{(a)},L)\) given by \[\begin{align} f_{a,\mathcal{S}}^o(0,L)=&\;\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z_{\mathcal{S}},U,L)\mid L\},\\ f_{a,\mathcal{S}}^o(1,L)=&\;\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z_{\mathcal{S}},U,L)\mid L\} +\mathbb{E}[Y(a)\mid L]. \end{align}\]
This constitutes a suboptimal identification strategy when the original \(Z\) already satisfies the AIV condition. The key drawback is that once \(Z\) is discretized into a binary instrument \(Z_{\mathcal{S}}\), the relevance condition in Proposition 14 becomes more restrictive than its counterpart in Assumption 4. Thus, although discretization may simplify estimation, we do not recommend this practice, as it can compromise estimator stability and reduce statistical efficiency.
So far, we have assumed that the treatment is either binary or multi-categorical. In practice, however, the treatment of interest may be continuous [59], [63]. For completeness, we now extend our theory and derive the identification strategy for a continuous treatment variable \(A\). Let \(p_{A\mid Z,U,L}(a \mid Z, U, L)\) denote the conditional probability density function of \(A\) given \((Z,U,L)\).
Definition 6 (Additive IV). For each \(a \in \mathcal{A}\), we say that \(Z\) is an additive IV (AIV) for \(A = a\) if there exist functions \(b(U,L)\) and \(c(Z,L)\) such that \[p_{A\mid Z,U,L}(a \mid Z, U, L) = b(U, L) + c(Z, L).\] Moreover, we say that \(Z\) is an AIV for \(A\) if, for every \(a\in\mathcal{A}\), \(Z\) is an AIV for \(A=a\).
Definition 7 (Regular weighting function). For each \(a\in\mathcal{A}\), we say that \(\pi(Z,L)\) is a regular weighting function (RWF) for \(A = a\) if there exists a positive constant \(\epsilon_0\) such that \[\big| \mathbb{E}[\pi(Z,L)\mid A=a,L] - \mathbb{E}[\pi(Z,L)\mid L] \big|\geq \epsilon_0 \quad \text{uniformly over } L.\]
We make several remarks. First, the definitions of continuous AIV and RWF are natural extensions of their multi-categorical counterparts introduced in Section 2. Second, a continuous AIV \(Z\) for a continuous treatment \(A\) can be constructed, for example, via a Gaussian mixture model; one hypothetical construction of this type is sketched below. We then present an identification strategy for the continuous treatment-response curve \(\mathbb{E}[Y(a)]\).
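For instance, one hypothetical construction is the two-component mixture \[p_{A\mid Z,U,L}(a \mid Z, U, L) = w\,\phi\{a; m(U,L), \sigma^2\} + (1-w)\,\phi\{a; \tilde{m}(Z,L), \sigma^2\},\] where \(\phi(\cdot;\mu,\sigma^2)\) denotes the normal density, \(w\in(0,1)\) is a fixed mixing weight, and \(m\), \(\tilde{m}\) are arbitrary mean functions. For every fixed \(a\), this density decomposes as \(b(U,L)+c(Z,L)\) with \(b(U,L)=w\,\phi\{a;m(U,L),\sigma^2\}\) and \(c(Z,L)=(1-w)\,\phi\{a;\tilde{m}(Z,L),\sigma^2\}\), so \(Z\) is an AIV for \(A\) in the sense of Definition 6.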
Proposition 15 (AIV identification). Under Assumptions 1–4, for each \(a\in\mathcal{A}\), if \(Z\) is an AIV for \(A=a\), then for any RWF \(\pi(Z,L)\), it holds that \[\begin{align} \mathbb{E}[Y(a)] = \mathbb{E}\left[ \dfrac{ \mathbb{E}[Y\,\pi(Z,L)\mid A=a,L] - \mathbb{E}[Y\mid A=a,L] \, \mathbb{E}[\pi(Z,L)\mid L]}{ \mathbb{E}[\pi(Z,L)\mid A=a,L] - \mathbb{E}[\pi(Z,L)\mid L]} \right]. \end{align}\]
Example 2 (Binary IV with a continuous treatment). Assume that \(Z\) takes values in \(\{0,1\}\) and the treatment variable \(A\) is continuous. Let \(\pi(Z,L) = Z\), which serves as an RWF for \(A = a\). Under Assumptions 1–4, if \(Z\) is an AIV for \(A\), Proposition 15 implies \[\begin{align} \mathbb{E}[Y(a)] = \mathbb{E}\left[\dfrac{\mathbb{E}[YZ \mid A=a,L] - \mathbb{E}[Y \mid A=a,L]\mathbb{E}[Z \mid L]} {\mathbb{E}[Z \mid A=a,L] - \mathbb{E}[Z \mid L]}\right]. \end{align}\]
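A minimal plug-in sketch of this display, with hypothetical column names Z, A, L, Y (binary Z and continuous A), is as follows; it illustrates the formula only and does not implement a semiparametric, multiply robust estimator.

```r
# Plug-in sketch of the dose-response formula in Example 2; Z, A, L, Y are hypothetical
# column names, with binary Z and continuous A.
library(mgcv)

dose_response <- function(dat, a) {
  dat$YZ <- dat$Y * dat$Z
  mYZ <- gam(YZ ~ s(A) + s(L), data = dat)     # E[YZ | A, L]
  mY  <- gam(Y  ~ s(A) + s(L), data = dat)     # E[Y  | A, L]
  mZA <- gam(Z  ~ s(A) + s(L), data = dat)     # E[Z  | A, L]
  mZ  <- gam(Z  ~ s(L),        data = dat)     # E[Z  | L]
  da  <- transform(dat, A = a)                 # evaluate the (A, L)-regressions at A = a
  num <- predict(mYZ, newdata = da) - predict(mY, newdata = da) * predict(mZ, newdata = dat)
  den <- predict(mZA, newdata = da) - predict(mZ, newdata = dat)
  mean(num / den)                              # plug-in estimate of E[Y(a)]
}
```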
Notably, a continuous IV can be discretized into a binary (or multi-categorical) IV while preserving the AIV property, as established in Proposition 14, even when the treatment variable is continuous. This allows the identification strategy to be naturally simplified in settings with discretized instruments. Moreover, semiparametric techniques, analogous to those developed in [59], can be applied to the result in Proposition 15 to construct a multiply robust and efficient estimator, which lies beyond the scope of this article.
| | Treated | Control | ATE |
|---|---|---|---|
| EST | .4291 | .1074 | .1450 |
| SE | .2610 | .1199 | .2184 |
| SD | .4572 | .1450 | .2276 |
The CollegeDistance dataset, available in the AER package in R, originates from the High School and Beyond survey conducted by the U.S. Department of Education in 1980, with a follow-up survey in
1986. The survey includes data on 4,739 students from approximately 1,100 high schools. This dataset was originally analyzed in [64], who studied the impact of community colleges on educational attainment.
The dataset contains demographic, socioeconomic, and geographic variables commonly used in applied econometrics, particularly in instrumental variable (IV) analyses of educational attainment. Key variables include parental education
(fcollege, mcollege), family characteristics (home, income), local economic conditions (unemp, wage), and measures of college accessibility such as the distance to
the nearest four-year college and average state tuition. We adaptively estimate the weighting function \(\pi^o(Z,L)\) and use the R package mgcv to estimate nuisance functions.
The treatment variable is education, measured as the number of years of schooling completed by 1986, ranging from 12 years (high school completion) to 18 years (graduate degree). For analytical convenience, we dichotomize
education into two groups: individuals with 14 or more years of schooling (treated group) and those with 13 years or fewer (control group). The outcome of interest is income, a binary indicator of whether the annual family income
in 1980 exceeded $25,000 (in 1980 U.S. dollars). We use distance and tuition as instrumental variables, and include fcollege and mcollege as baseline covariates \(L\).
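A minimal sketch, using the column names of the CollegeDistance data shipped with the AER package, of how the analysis variables described above could be constructed is as follows; the subsequent weighting-function and nuisance estimation follow the procedures described earlier.

```r
# Construct the treatment, outcome, instruments, and covariates from the AER data.
library(AER)
data("CollegeDistance", package = "AER")

dat <- within(CollegeDistance, {
  A <- as.numeric(education >= 14)       # treated: 14 or more years of schooling
  Y <- as.numeric(income == "high")      # family income above $25,000 in 1980
})
Z <- dat[, c("distance", "tuition")]     # instrumental variables
L <- dat[, c("fcollege", "mcollege")]    # baseline covariates
```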
Table 4 reports the bootstrap results based on 500 replications. The estimated mean outcome for the treated group is approximately 0.43, whereas that for the control group is around 0.11, and the estimated average treatment effect (ATE) is roughly 0.15. Reported standard errors range from 0.12 to 0.26, depending on the subgroup, while the empirical standard deviations are somewhat larger, particularly for the treated group. Overall, the results indicate a moderate positive effect of educational attainment on family income.
The proofs of Theorems 4, 5, 7, 9 and 11 are analogous; thus, we only provide the proof of Theorem 4.
Suppose there exist two solutions \(f_a^o\) and \(f_a'\) to Equation 1. Then it holds that \[\begin{align} 0=&\mathbb{E}[A^{(a)}Y-A^{(a)}Y\mid Z,L]=\mathbb{E}[f_a'(A^{(a)},L)-f_a^o(A^{(a)},L)\mid Z,L]\\ =&\{f_a'(1,L)-f_a^o(1,L)\}\mathbb{E}[A^{(a)}\mid Z,L] +\{f_a'(0,L)-f_a^o(0,L)\}\mathbb{E}[1-A^{(a)}\mid Z,L]\\ =&\{f_a'(1,L)-f_a'(0,L)-f_a^o(1,L)+f_a^o(0,L)\}\mathbb{E}[A^{(a)}\mid Z,L] +f_a'(0,L)-f_a^o(0,L). \end{align}\] From Assumption 4, for any \(l\), there exist two distinct values \(z_1,z_2\) such that \[\mathbb{E}[A^{(a)}\mid Z=z_1,L=l]\neq \mathbb{E}[A^{(a)}\mid Z=z_2,L=l].\] Taking the difference, we get \[\begin{align} &\{f_a'(1,l)-f_a'(0,l)-f_a^o(1,l)+f_a^o(0,l)\}\\ &\cdot \{\mathbb{E}[A^{(a)}\mid Z=z_1,L=l]-\mathbb{E}[A^{(a)}\mid Z=z_2,L=l]\}=0. \end{align}\] It follows that \[\begin{align} &f_a'(1,L)-f_a'(0,L)-f_a^o(1,L)+f_a^o(0,L)\equiv 0,\\ &f_a'(0,L)-f_a^o(0,L)\equiv 0. \end{align}\] Therefore \(f_a^o(A^{(a)},L)\equiv f_a'(A^{(a)},L)\). This is equivalent to saying that the solution to Equation 1 is unique. Second, \[\begin{align} &\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}\\ =&\mathbb{E}[A^{(a)}Y \pi(Z,L)\mid L]-\mathbb{E}[A^{(a)}Y\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}[\mathbb{E}[A^{(a)}Y\mid Z,L] \pi(Z,L)\mid L]-\mathbb{E}[\mathbb{E}[A^{(a)}Y\mid Z,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}[\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L] \pi(Z,L)\mid L]-\mathbb{E}[\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}[f_a^o(A^{(a)},L)\pi(Z,L)\mid L]-\mathbb{E}[f_a^o(A^{(a)},L)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&f_a^o(1,L)\mathbb{E}[A^{(a)}\pi(Z,L)\mid L]+f_a^o(0,L)\mathbb{E}[(1-A^{(a)})\pi(Z,L)\mid L]\\ &-\{f_a^o(1,L)\mathbb{E}[A^{(a)}\mid L]+f_a^o(0,L)\mathbb{E}[(1-A^{(a)})\mid L]\}\mathbb{E}[\pi(Z,L)\mid L]\\ =&f_a^o(1,L)\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\} +f_a^o(0,L)\mathrm{Cov}\!\{1-A^{(a)},\pi(Z,L)\mid L\}\\ =&\{f_a^o(1,L)-f_a^o(0,L)\}\cdot\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}. \end{align}\] This shows that \[\begin{align} f_a^o(1,L)-f_a^o(0,L)=\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}}. \end{align}\] In addition, by Equation 1, we know that \[\begin{align} &\mathbb{E}[A^{(a)}Y\mid Z,L]=\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L]\\ =&\{f_a^o(1,L)-f_a^o(0,L)\}\mathbb{E}[A^{(a)}\mid Z,L]+f_a^o(0,L)\\ =&\dfrac{\mathrm{Cov}\!\{A^{(a)}Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A^{(a)},\pi(Z,L)\mid L\}}\mathbb{E}[A^{(a)}\mid Z,L]+f_a^o(0,L). \end{align}\] Now we can deduce the explicit form of \(f_a^o\) stated in Theorem 1, finishing the proof for Theorem 1.
For any pathwise differentiable parameterization \(p_\theta(O)\), we denote \(\require{physics} s(O):=\nabla_{\theta}\log p_\theta(O)\eval_{\theta=0}\) as the score function, and \(\mathbb{E}_\theta\) as the expectation taken with respect to \(p_\theta(O)\). Denote \[\begin{align} \psi_{\pi;\theta}=& \mathbb{E}_\theta\left[\dfrac{\mathbb{E}_\theta[Y\pi(Z,L)\mid L]-\mathbb{E}_\theta[Y\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}{\mathbb{E}_\theta[A\pi(Z,L)\mid L]-\mathbb{E}_\theta[A\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}\right]. \end{align}\] We calculate the path-wise derivative as \[\require{physics} \begin{align} \nabla_{\theta}\psi_{\pi;\theta}\eval_{\theta=0}=&\nabla_{\theta} \mathbb{E}_\theta\left[\dfrac{\mathbb{E}_\theta[Y\pi(Z,L)\mid L]-\mathbb{E}_\theta[Y\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}{\mathbb{E}_\theta[A\pi(Z,L)\mid L]-\mathbb{E}_\theta[A\mid L]\mathbb{E}_\theta[\pi(Z,L)\mid L]}\right]\eval{\theta=0}\\ =&\mathbb{E}[\{\gamma_{\pi}^o(L)-\psi_{\pi}^o\}s(O)]\\ &+\mathbb{E}\left[\dfrac{\{Y\pi(Z,L)-\zeta_{\pi}^o(L)\}s(O)}{\mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L]}\right]\\ &-\mathbb{E}\left[\dfrac{\rho_{\pi}^o(L)\{Y-\eta^o(L)\}s(O)}{\mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L]}\right]\\ &-\mathbb{E}\left[\dfrac{\eta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\}s(O)}{\mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L]}\right]\\ &-\mathbb{E}\left[\dfrac{\gamma_{\pi}^o(L)}{\kappa_{\pi}^o(L)}\left\{ \begin{array}{l} A\pi(Z,L)-\kappa_{\pi}^o(L)-\delta^o(L)\rho_{\pi}^o(L)\\ -\rho_{\pi}^o(L)\{A-\delta^o(L)\}\\ -\delta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array} \right\}s(O)\right]. \end{align}\] Thus, one influence function for \(\psi_{\pi}^o\) consists of \[\begin{align} &\varphi_{\pi}(O;\psi_{\pi}^o,\alpha_{\pi}^o):=\gamma_{\pi}^o(L)-\psi_{\pi}^o\\ &+\dfrac{1}{\kappa_{\pi}^o(L)} \cdot\left\{\begin{array}{l} Y\pi(Z,L)-\zeta_{\pi}^o(L)\\ -\rho_{\pi}^o(L)\{Y-\eta^o(L)\}\\ -\eta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array}\right\}\\ &-\dfrac{\gamma_{\pi}^o(L)}{\kappa_{\pi}^o(L)}\left\{ \begin{array}{l} A\pi(Z,L)-\kappa_{\pi}^o(L)-\delta^o(L)\rho_{\pi}^o(L)\\ -\rho_{\pi}^o(L)\{A-\delta^o(L)\}\\ -\delta^o(L)\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array} \right\}\\ =&\dfrac{\left\{\pi(Z,L)-\rho_{\pi}^o(L)\right\}} {\kappa_{\pi}^o(L)} Y-\psi_{\pi}^o \\&+ \left(1-\dfrac{\{A-\delta^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}}{\kappa_{\pi}^o(L)} \right)\gamma_{\pi}^o(L) \\ &-\dfrac{\{\pi(Z,L)-\rho_{\pi}^o(L)\}\eta^o(L)}{\kappa_{\pi}^o(L)}. \end{align}\] Since the tangent space spanned by the score function \(s(O)\) equals \(L_2(O)\), any influence function is automatically the EIF.
Recall that \[\begin{align} \gamma^o(L):=\dfrac{\zeta^o(L) - \eta^o(L)\delta^o(L)}{\kappa^o(L)},\qquad \gamma(L):=\dfrac{\zeta(L) - \eta(L)\delta(L)}{\kappa(L)}. \end{align}\] For any pathwise differentiable parameterization \(p_\theta(O)\) (probability density function), we denote \(\require{physics} s(O):=\nabla_{\theta}\log p_\theta(O)\eval_{\theta=0}\) as the score function, \(\mathbb{E}_\theta\) as taking expectation with respect to \(p_\theta(O)\). For \(\zeta^o(L;\theta)=\mathbb{E}_\theta[Y\mathbb{E}_\theta[A\mid Z,L]\mid L]\), we calculate \[\require{physics} \begin{align} &\nabla_{\theta}\zeta(L;\theta)\eval_{\theta=0}=\nabla_{\theta}\mathbb{E}_\theta[Y\mathbb{E}_\theta[A\mid Z,L]\mid L]\eval_{\theta=0}\\ =&\nabla_{\theta}\mathbb{E}_\theta[Y\mathbb{E}[A\mid Z,L]\mid L]\eval_{\theta=0} +\mathbb{E}\left[Y\nabla_{\theta}\mathbb{E}_\theta[A\mid Z,L]\eval_{\theta=0}\mid L\right]\\ =&\mathbb{E}[\{Y\pi^o(Z,L)-\zeta^o(L)\}s(O)\mid L] +\mathbb{E}[Y\mathbb{E}[\{A-\pi^o(Z,L)\}s(O)\mid Z,L]\mid L]\\ =&\mathbb{E}[\{Y\pi^o(Z,L)-\zeta^o(L)\}s(O)\mid L] +\mathbb{E}[\xi^o(Z,L)\{A-\pi^o(Z,L)\}s(O)\mid L]. \end{align}\] Similarly, for \(\kappa(L;\theta)=\mathbb{E}_\theta[A\mathbb{E}_\theta[A\mid Z,L]\mid L]-\mathbb{E}_{\theta}[A\mid L]^2\), it holds that \[\require{physics} \begin{align} &\nabla_{\theta}\kappa(L;\theta)\eval_{\theta=0}= \nabla_{\theta}\mathbb{E}_\theta[A\mathbb{E}_\theta[A\mid Z,L]\mid L]\eval_{\theta=0} -\nabla_{\theta}\mathbb{E}_\theta[A\mid L]^2\eval_{\theta=0}\\ =&\mathbb{E}\left[\{A\pi^o(Z,L)-\kappa^o(L)-\delta^o(L)^2\}s(O)\middle| L\right]\\ &+\mathbb{E}\left[\pi^o(Z,L)\{A-\pi^o(Z,L)\}s(O)\middle| L\right]\\ &-\mathbb{E}[2\delta^o(L)(A-\delta^o(L))s(O)\mid L]. \end{align}\] For \(\eta(L;\theta)=\mathbb{E}_\theta[Y\mid L]\), we calculate \[\require{physics} \begin{align} &\nabla_{\theta}\eta(L;\theta)\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta[Y\mid L]\eval_{\theta=0}\\ =&\mathbb{E}[\{Y-\mathbb{E}[Y\mid L]\}s(O)\mid L]\\ =&\mathbb{E}\left[\{Y-\eta^o(L)\}s(O)\middle| L\right]. \end{align}\] For \(\delta(L;\theta)=\mathbb{E}_\theta[A\mid L]\), we calculate \[\require{physics} \begin{align} &\nabla_{\theta}\delta(L;\theta)\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta[A\mid L]\eval_{\theta=0} =\mathbb{E}[\{A-\delta^o(L)\}s(O)\mid L]. \end{align}\] Now we calculate the EIF as follows. 
For \[\psi_{ada,\theta}=\mathbb{E}_\theta\left[ \dfrac{\zeta(L;\theta) - \eta(L;\theta)\delta(L;\theta)}{\kappa(L;\theta)} \right],\] we deduce the path-wise derivative as \[\require{physics} \begin{align} &\nabla_{\theta}\psi_{ada,\theta}\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta\left[ \dfrac{\zeta(L;\theta) - \eta(L;\theta)\delta(L;\theta)}{\kappa(L;\theta)} \right]\eval_{\theta=0}\notag\\ =&\mathbb{E}[(\gamma^o(L)-\psi_{ada}^o)s(O)]\notag\\ &+\mathbb{E}\left[\dfrac{\nabla_{\theta}\zeta(L;\theta)\eval_{\theta=0} -\nabla_{\theta}\{\eta(L;\theta)\delta(L;\theta)\}\eval_{\theta=0}}{ \kappa^o(L)}\right]\notag\\ &-\mathbb{E}\left[\gamma^o(L)\dfrac{ \nabla_{\theta}\kappa(L;\theta)\eval_{\theta=0} }{ \kappa^o(L)}\right]\notag\\ =&\mathbb{E}[(\gamma^o(L)-\psi_{ada}^o)s(O)]\\ &+\mathbb{E}\left[\dfrac{1}{\kappa^o(L)} \left\{\begin{array}{l} \{Y\pi^o(Z,L)-\zeta^o(L)\}\\ +\xi^o(Z,L)\{A-\pi^o(Z,L)\} \end{array}\right\}s(O) \right]\\ &-\mathbb{E}\left[\dfrac{1}{\kappa^o(L)}\left\{ \begin{array}{l} \delta^o(L)\{Y-\eta^o(L)\}\\ +\eta^o(L)\{A-\delta^o(L)\} \end{array} \right\}s(O)\right]\\ &+\mathbb{E}\left[\dfrac{\gamma^o(L)}{\kappa^o(L)}2\delta^o(L)\left\{ A-\delta^o(L) \right\}s(O)\right]\\ &-\mathbb{E}\left[\dfrac{\gamma^o(L)}{\kappa^o(L)}\left\{ \begin{array}{l} \{A\pi^o(Z,L)-\kappa^o(L)-\delta^o(L)^2\}\\ +\pi^o(Z,L)\{A-\pi^o(Z,L)\} \end{array} \right\}s(O)\right]. \end{align}\] The efficient influence function finally consists of \[\begin{align} &\varphi(O;\psi_{ada}^o,\beta^o)=\gamma^o(L)-\psi_{ada}^o\\ &+\dfrac{1}{\kappa^o(L)} \left\{\begin{array}{l} \{Y\pi^o(Z,L)-\zeta^o(L)\}\\ +\xi^o(Z,L)\{A-\pi^o(Z,L)\}\\ -\delta^o(L)\{Y-\eta^o(L)\}\\ -\eta^o(L)\{A-\delta^o(L)\} \end{array}\right\}\\ &-\dfrac{\gamma^o(L)}{\kappa^o(L)}\left\{ \begin{array}{l} A\pi^o(Z,L)-\kappa^o(L)-\delta^o(L)^2\\ +\pi^o(Z,L)\{A-\pi^o(Z,L)\}\\ -2\delta^o(L)\left\{A-\delta^o(L)\right\} \end{array} \right\}\\ =&\dfrac{\pi^o(Z,L)-\delta^o(L)}{\kappa^o(L)}Y-\psi_{ada}^o\\ &+\dfrac{1}{\kappa^o(L)} \left\{ \xi^o(Z,L)(A-\pi^o(Z,L)) -\eta^o(L)(A-\delta^o(L)) \right\}\\ &+\gamma^o(L)\left\{ 1 - \dfrac{(\pi^o(Z,L)-\delta^o(L))(2A-\pi^o(Z,L)-\delta^o(L))}{\kappa^o(L)} \right\}. \end{align}\]
Define the conditional expectation on the \(k\)-th fold as \(\mathbb{E}_k[O]=\mathbb{E}[O\mid I_{-k}]\). From the definition of \(\psi_{\pi}^o\) in Equation 5, \[\begin{align} 0=&\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})\right]\notag\\ =&\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})\right]+\sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}\notag\\ =&\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]+\sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}\notag\\ &+\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{k}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\tag{11}\\ &+\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\{\mathbb{E}_{nk}-\mathbb{E}_{k}\}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right].\tag{12} \end{align}\] From Proposition 5, we can deduce that the quantity in Equation 11 is \(o_p(1)\), since \[\begin{align} &\mathbb{E}_{k}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\\ \lesssim& \left\{\begin{array}{l} \|\hat{\kappa}_{\pi}^{(n,k)}(L)-\kappa_{\pi}^o(L)\|_2\times \|\hat{\gamma}_{\pi}^{(n,k)}(L)-\gamma_{\pi}^o(L)\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}(L)-\rho_{\pi}^o(L)\|_2\times \|\hat{\delta}^{(n,k)}(L)-\delta^o(L)\|_2\\ +\|\hat{\rho}_{\pi}^{(n,k)}(L)-\rho_{\pi}^o(L)\|_2\times \|\hat{\eta}^{(n,k)}(L)-\eta^o(L)\|_2 \end{array}\right\} =o_p(n^{-1/2}). \end{align}\] Define the empirical process as \(\mathbb{G}_{nk}[f(O)]:=\sqrt{n_k}\{\mathbb{E}_{nk}-\mathbb{E}_{k}\}[f(O)].\) By Chebyshev's inequality, \[\begin{align} &\Pr\left(\mathbb{G}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\geq \epsilon_0\middle| I_{-k}\right)\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathrm{Var}\left[\mathbb{G}_{nk}\left\{\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right\}\middle| I_{-k}\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathrm{Var}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\middle| I_{-k}\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathbb{E}\left[\left\{\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right\}^2\middle| I_{-k}\right]\\ \lesssim&\|\hat{\rho}_{\pi}(L)-\rho_{\pi}^o(L)\|_2^2+ \|\hat{\delta}^{(n,k)}(L)-\delta^o(L)\|_2^2+ \|\hat{\eta}^{(n,k)}(L)-\eta^o(L)\|_2^2\\&+ \|\hat{\kappa}^{(n,k)}(L)-\kappa^o(L)\|_2^2+ \|\hat{\gamma}_{\pi}^{(n,k)}(L)-\gamma_{\pi}^o(L)\|_2^2\\\lesssim& \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^o\|_2^2. \end{align}\] We can now integrate \(I_{-k}\) out to deduce that \[\begin{align} \Pr\left(\mathbb{G}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]\geq \epsilon_0\right)\lesssim \mathbb{E}[\|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^o\|_2^2]=o(1). \end{align}\] This is equivalent to saying that the quantity in Equation 12 is \(o_p(1)\).
Next, we can see that \[\begin{align} \sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}=&-\dfrac{\sqrt{n}}{K}\sum_{k=1}^K\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right]+o_p(1)\\ =&-\dfrac{1}{\sqrt{n}}\sum_{i = 1}^n\varphi_{\pi}(O_i;\psi_{\pi}^{o},\alpha_{\pi}^{o})+o_p(1). \end{align}\] From the central limit theorem and Slutsky's lemma, we know that \(\sqrt{n}\{\hat{\psi}_{\pi}^{(n)}-\psi_{\pi}^{o}\}\) converges in distribution to \(\mathcal{N}(0,(\sigma_{\pi}^o)^2)\). Next, we prove the consistency of the variance estimator \[\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2:=\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2\right].\] First, we define \(\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2:=\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right]=\mathbb{E}_n\left[ \varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right].\) By the law of large numbers, \(\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\) converges to \(\left(\sigma_{\pi}^{o}\right)^2\) in probability. Simultaneously, by the boundedness of \(\varphi_{\pi}(O;\psi_{\pi},\alpha_{\pi})\), we know that \[\begin{align} &\Pr\left(\left|\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2-\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\right|\geq \epsilon_0\middle| I_{-k}\right) \leq \dfrac{1}{\epsilon_0^2}\mathbb{E}_k\left[\left|\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2-\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\right|^2\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2} \mathbb{E}_{k}\left[\left|\mathbb{E}_{nk}\left[\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right]\right|\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathbb{E}_{k}\left[\left|\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})^2-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})^2\right|\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\left(\mathbb{E}_{k}\left[\left|\varphi_{\pi}(O;\hat{\psi}_{\pi}^{(n)},\hat{\alpha}_{\pi}^{(n,k)})-\varphi_{\pi}(O;\psi_{\pi}^{o},\alpha_{\pi}^{o})\right|^2\right]\right)^{1/2}\\ \lesssim&\dfrac{1}{\epsilon_0^2}\left\{\left|\hat{\psi}_{\pi}^{(n,k)}-\psi_{\pi}^{o}\right|^2+ \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^{o}\|_2^2\right\}^{1/2}. \end{align}\] Now we can integrate \(I_{-k}\) out to deduce that \[\begin{align} &\Pr\left(\left|\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2-\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2\right|\geq \epsilon_0\right)\\ \lesssim&\dfrac{1}{\epsilon_0^2}\mathbb{E}\left[\left\{\left|\hat{\psi}_{\pi}^{(n,k)}-\psi_{\pi}^{o}\right|^2+ \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^{o}\|_2^2\right\}^{1/2}\right]\\ \lesssim&\dfrac{1}{\epsilon_0^2}\left\{\mathbb{E}\left[\left|\hat{\psi}_{\pi}^{(n,k)}-\psi_{\pi}^{o}\right|^2+ \|\hat{\alpha}_{\pi}^{(n,k)}-\alpha_{\pi}^{o}\|_2^2\right]\right\}^{1/2}=o(1). \end{align}\] Now \(\left(\hat{\sigma}_{\pi}^{(n,k)}\right)^2=\left(\tilde{\sigma}_{\pi}^{(n,k)}\right)^2+o_p(1)=\left(\sigma_{\pi}^{o}\right)^2+o_p(1)\), finishing the proof for Theorem 4.
For any pathwise differentiable parameterization \(p_\theta(O)\), we denote \(\require{physics} s(O):=\nabla_{\theta}\log p_\theta(O)\eval_{\theta=0}\) as the score function, \(\mathbb{E}_\theta\) as taking expectation with respect to \(p_\theta(O)\). Recall that the true parameter of interest is \[\begin{align} \psi_{\overline{a}}^o:=\mathbb{E}\left[ \prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y \right]. \end{align}\] Concretely, \[\require{physics} \begin{align} &\nabla_{\theta}\psi_{\overline{a},\theta}\eval_{\theta=0} \\=&\nabla_{\theta} \mathbb{E}_{\theta}\left[ \prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}_{\theta}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}_{\theta}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}_{\theta}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}_{\theta}\left[Z_t\middle| H_t\right]} \times Y \right]\eval_{\theta=0}\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\times Y\\ \times\left\{\begin{array}{l} \nabla_{\theta}\dfrac{\left(Z_t-\mathbb{E}_{\theta}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]}\displaystyle\eval_{\theta=0}\\ -\dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2}\\ \times\nabla_{\theta}\left(\begin{array}{l} \mathbb{E}_{\theta}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}_{\theta}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}_{\theta}\left[Z_t\middle| H_t\right] \end{array}\right)\eval_{\theta=0} \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\times Y\\ \times\left\{\begin{array}{l} -\dfrac{\mathbb{E}[(Z_t-\rho_t^o(H_t))s(O)\mid H_t]A_t^{(a_t)}} 
{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]}\\ -\dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2}\\ \times\left(\begin{array}{l} \mathbb{E}\left[(A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))s(O)\middle| H_t\right]\\ -\mathbb{E}\left[(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t))s(O)\middle| H_t\right] \rho_t^o(H_t)\\ -\mathbb{E}\left[(Z_t-\rho_t^o(H_t))s(O)\middle| H_t\right] \delta_{t,a_t}^o(H_t) \end{array}\right) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\mathbb{E}\left[A_t^{(a_t)}\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[ A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}Y\middle| H_{t}\right]\\ \times\left\{\begin{array}{l} \dfrac{(Z_t-\rho_t^o(H_t))s(O)} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \end{array}\right\} \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\mathbb{E}\left[\begin{array}{l} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2}\\ \times\displaystyle\prod_{s=t+1}^{T}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\times Y \end{array}\middle| H_{t}\right]\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))s(O)\\ -(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)s(O)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t)s(O) \end{array}\right) \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\mathbb{E}\left[A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\middle| H_{t}\right]\\ \times\left\{\begin{array}{l} 
\dfrac{(Z_t-\rho_t^o(H_t))s(O)} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \end{array}\right\} \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\begin{array}{l} \dfrac{\mathbb{E}\left[\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}\times\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\middle| H_{t}\right]} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2} \end{array}\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))s(O)\\ -(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)s(O)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t)s(O) \end{array}\right) \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\mathbb{E}[Z_t\mid H_t]\right)A_t^{(a_t)}} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\left\{\begin{array}{l} \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)s(O)} {\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]} \end{array}\right\} \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\mathbb{E}[Z_s\mid H_s]\right)A_s^{(a_s)}} {\mathbb{E}\left[A_s^{(a_s)}Z_s\middle| H_s\right] -\mathbb{E}\left[A_s^{(a_s)}\middle| H_s\right] \mathbb{E}\left[Z_s\middle| H_s\right]}\\ \times\begin{array}{l} \dfrac{\zeta_{t,\underline{a}_t}^o(H_t)-\rho_t^o(H_t)\eta_{t,\underline{a}_t}^o(H_t)} {\left\{\mathbb{E}\left[A_t^{(a_t)}Z_t\middle| H_t\right] -\mathbb{E}\left[A_t^{(a_t)}\middle| H_t\right] \mathbb{E}\left[Z_t\middle| H_t\right]\right\}^2} \end{array}\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))\\ -(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t) \end{array}\right)s(O) \end{array} \right]\\ =&\mathbb{E}\left[ \left(\prod_{t=0}^{T} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)A_t^{(a_t)}} {\kappa_{s,a_s}^o(H_s)} \times Y-\psi_{\overline{a}}^o\right)s(O) \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_t^o(H_t)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}\\ \times\left\{\begin{array}{l} \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}^o(H_t)} \end{array}\right\}s(O) \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[ \begin{array}{l} \displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_t^o(H_t)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}\\ \times\begin{array}{l} \dfrac{\zeta_{t,\underline{a}_t}^o(H_t)-\rho_t^o(H_t)\eta_{t,\underline{a}_t}^o(H_t)} {\left\{\kappa_{t,a_t}^o(H_t)\right\}^2} \end{array}\\ \times\left(\begin{array}{l} (A_t^{(a_t)}Z_t-\kappa_{t,a_t}^o(H_t))\\ 
-(A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)) \rho_t^o(H_t)\\ -(Z_t-\rho_t^o(H_t)) \delta_{t,a_t}^o(H_t) \end{array}\right)s(O) \end{array} \right]\\ =&\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o)s(O)\right], \end{align}\] where \[\begin{align} &\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}}^o):= \prod_{t=0}^{T} \dfrac{\left(Z_t-\rho_t^o(H_t)\right)A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \times Y-\psi_{\overline{a}}^o\\ &-\sum_{t=0}^T\displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_s^o(H_s)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}\times \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)}{\kappa_{t,a_t}^o(H_t)}\\ &+\sum_{t=0}^T\displaystyle\prod_{s=0}^{t-1}\dfrac{\left(Z_s-\rho_s^o(H_s)\right)A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)} \times\gamma_{t,\underline{a}_t}^o(H_t) \\&\times \left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)\}\{Z_t-\rho_t^o(H_t)\}}{\kappa_{t,a_t}^o(H_t)} \right). \end{align}\] This finishes the proof for Theorem 6.
Analogous to the proof for Theorem 6, we calculate the path-wise derivative for \[\begin{align} \psi_{\overline{a},\theta} := \mathbb{E}_\theta\left[ \prod_{t=0}^{T} \frac{\pi_{t,a_t,\theta}^o(Z_t,H_t) - \delta_{t,a_t,\theta}^o(H_t)}{\mathrm{Var}_\theta\!\{\pi_{t,a_t,\theta}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \right]. \end{align}\] Concretely, \[\require{physics} \begin{align} &\nabla_{\theta}\psi_{\overline{a},\theta}\eval_{\theta=0} =\nabla_{\theta}\mathbb{E}_\theta\left[ \prod_{t=0}^{T} \frac{\pi_{t,a_t,\theta}^o(Z_t,H_t) - \delta_{t,a_t,\theta}^o(H_t)}{\mathrm{Var}_\theta\!\{\pi_{t,a_t,\theta}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \right]\eval_{\theta=0}\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}_\theta\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}_\theta\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times\nabla_{\theta} \dfrac{\pi_{t,a_t,\theta}^o(Z_t,H_t) - \delta_{t,a_t,\theta}^o(H_t)} {\mathrm{Var}_\theta\!\{\pi_{t,a_t,\theta}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)}\displaystyle\eval_{\theta=0} \times Y \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\mathbb{E}[\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)\mid Z_t,H_t] - \mathbb{E}[\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)\mid H_t]} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}^2}A_t^{(a_t)} \times Y\\ \times\mathbb{E}\left[(\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t))s(O)\middle| H_t\right] \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} 
{\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\mathbb{E}[\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)\mid Z_t,H_t]} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\mathbb{E}[\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)\mid H_t]} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}} \times A_t^{(a_t)}Y \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times \dfrac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}^2} \times A_t^{(a_t)}Y\\ \times\mathbb{E}\left[(\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t))s(O)\middle| H_t\right] \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times \dfrac{\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}\\ \mathbb{E}\left[\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\times A_t^{(a_t)} Y\middle| Z_t,H_t\right] \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\dfrac{\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}\\ \mathbb{E}\left[\displaystyle\prod_{s=t+1}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\times A_t^{(a_t)} Y\middle| H_t\right] \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\\ \times\mathbb{E}\left[\displaystyle\prod_{s=t}^{T} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times Y\middle| H_t\right]\\ \times\dfrac{\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}s(O) \end{array} \right]\\ =&\mathbb{E}\left[ \left\{\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} 
\times Y - \psi_{\overline{a},ada}^o\right\}s(O) \right]\\ &+\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times \dfrac{\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}} \xi_{t,\underline{a}_t}^o(Z_t,H_t) \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\dfrac{\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}s(O)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}} \eta_{t,\underline{a}_t}^o(H_t) \end{array} \right]\\ &-\sum_{t=0}^T\mathbb{E}\left[\begin{array}{l} \displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)} \times\gamma_{t,\underline{a}_t}^o(H_t)\\ \times\dfrac{\{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\}\{2A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)\} - \kappa_t^o(H_t)} {\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}s(O) \end{array} \right] \end{align}\] In summary, we see that the influence function of \(\psi_{\overline{a}_t}\) consists of \[\begin{align} &\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\\ &+\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\mathrm{Var}\!\{\pi_{s,a_s}^o(Z_s,H_s) \mid H_s \}}A_s^{(a_s)}\right)\times \dfrac{1}{\mathrm{Var}\!\{\pi_{t,a_t}^o(Z_t,H_t) \mid H_t \}}\\ &\times \left\{\begin{array}{l} \{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\} \xi_{t,\underline{a}_t}^o(Z_t,H_t) -\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)\\ +\gamma_{t,\underline{a}_t}^o(H_t) \times\left(\kappa_t^o(H_t)+\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}^2-\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}^2\right) \end{array}\right\}\\ =&\prod_{t=0}^{T} \frac{\pi_{t,a_t}^o(Z_t,H_t) - \delta_{t,a_t}^o(H_t)}{\kappa_{t,a_t}^o(H_t)}A_t^{(a_t)} \times Y - \psi_{\overline{a},ada}^o\\ &+\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1} \dfrac{\pi_{s,a_s}^o(Z_s,H_s) - \delta_{s,a_s}^o(H_s)} {\kappa_{s,a_s}^o(H_s)}A_s^{(a_s)}\right)\times \dfrac{1}{\kappa_{t,a_t}^o(H_t)}\\ &\times \left\{\begin{array}{l} \{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\} \xi_{t,\underline{a}_t}^o(Z_t,H_t) -\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)\\ +\gamma_{t,\underline{a}_t}^o(H_t) \times\left(\kappa_t^o(H_t)+\{A_t^{(a_t)} - \pi_{t,a_t}^o(Z_t,H_t)\}^2-\{A_t^{(a_t)} - \delta_{t,a_t}^o(H_t)\}^2\right) \end{array}\right\}. \end{align}\] This finishes the proof that \(\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)\) is the influence function for \(\psi_{\overline{a},ada}^o\). Since the tangent space spanned by the score functions equals \(L_2(O)\), we know that the EIF is just \(\varphi_{\overline{a},ada}(O;\psi_{\overline{a},ada}^o,\beta_{\overline{a}}^o)\), finishing the proof for Theorem 10.
Analogous to the proof for Theorem 6, we calculate the path-wise derivative for \[\begin{align} \psi_{a,MIV}^o:= \mathbb{E}\left[ (1-A^{(a)})\dfrac{\mathrm{Cov}\{A^{(a)}Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid L\}} +A^{(a)}Y \right]. \end{align}\] The EIF can be derived as \[\require{physics} \begin{align} &\nabla_{\theta} \mathbb{E}_\theta\left[ (1-A^{(a)})\dfrac{\mathrm{Cov}_\theta\{A^{(a)}Y,Z\mid L\}}{\mathrm{Cov}_\theta\{A^{(a)},Z\mid L\}} +A^{(a)}Y \right]\eval_{\theta=0}\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +(1-A^{(a)})\nabla_{\theta}\left(\dfrac{\mathbb{E}_\theta[A^{(a)}Y(Z-\mathbb{E}_\theta[Z\mid L])\mid L]}{\mathrm{Cov}\{A^{(a)},Z\mid L\}}\right)\displaystyle\eval_{\theta=0}\\ +(1-A^{(a)})\nabla_{\theta}\left(\dfrac{\mathrm{Cov}\{A^{(a)}Y,Z\mid L\}} {\mathrm{Cov}_\theta\{A^{(a)},Z\mid L\}}\right)\displaystyle\eval_{\theta=0} \end{array} \right]\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +\dfrac{(1-A^{(a)})} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} \mathbb{E}[\left\{A^{(a)}Y (Z - \mathbb{E}[Z\mid L]) - \gamma_{a}^o(L)\kappa_{a}^o(L)\right\} s(O)\mid L]\\ -\mathbb{E}[A^{(a)}Y \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ]\mid L] \end{array} \right\}\\ -\dfrac{(1-A^{(a)})\gamma_a^o(L)}{\kappa_{a}^o(L)} \left\{\begin{array}{l} \mathbb{E}[\left\{A^{(a)} (Z - \mathbb{E}[Z\mid L]) - \kappa_{a}^o(L)\right\} s(O)\mid L]\\ -\mathbb{E}[A^{(a)} \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ]\mid L] \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} \left\{A^{(a)}Y (Z - \mathbb{E}[Z\mid L]) - \gamma_{a}^o(L)\kappa_{a}^o(L)\right\} s(O)\\ -A^{(a)}Y \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ] \end{array} \right\}\\ -\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} \left\{A^{(a)} (Z - \mathbb{E}[Z\mid L]) - \kappa_{a}^o(L)\right\} s(O)\\ -A^{(a)} \mathbb{E}[ (Z-\rho^o(L))s(O)\mid L ] \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[ \begin{array}{l} \left\{ (1-A^{(a)})\gamma_{a}^o(L) +A^{(a)}Y-\psi_{a,MIV}^o \right\}s(O)\\ +\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} \left\{A^{(a)}Y (Z - \mathbb{E}[Z\mid L]) - \gamma_{a}^o(L)\kappa_{a}^o(L)\right\} s(O)\\ -\eta_a^o(L) (Z-\rho^o(L))s(O) \end{array} \right\}\\ -\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} \left\{A^{(a)} (Z - \mathbb{E}[Z\mid L]) - \kappa_{a}^o(L)\right\} s(O)\\ -\delta_a^o(L) (Z-\rho^o(L))s(O) \end{array}\right\} \end{array} \right]. 
\end{align}\] The EIF corresponds to be \[\begin{align} &(1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} A^{(a)}Y (Z - \rho^o(L)) - \gamma_{a}^o(L)\kappa_{a}^o(L) \\ -\eta_a^o(L) (Z-\rho^o(L)) \end{array} \right\}\\ &-\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} A^{(a)} (Z - \rho^o(L)) - \kappa_{a}^o(L) \\ -\delta_a^o(L) (Z-\rho^o(L)) \end{array}\right\}\\ =&(1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left\{ \begin{array}{l} A^{(a)}Y (Z - \rho^o(L)) \\ -\eta_a^o(L) (Z-\rho^o(L)) \end{array} \right\}\\ &-\dfrac{(1-\delta_a^o(L))\gamma_a^o(L)}{\kappa_a^o(L)} \left\{\begin{array}{l} A^{(a)} (Z - \rho^o(L)) \\ -\delta_a^o(L) (Z-\rho^o(L)) \end{array}\right\}\\ =&(1-A^{(a)})\gamma_{a}^o(L)+A^{(a)}Y-\psi_{a,MIV}^o\\ &+\dfrac{(1-\delta_a^o(L))} {\kappa_{a}^o(L)} \left(A^{(a)}Y-\eta_a^o(L)\right)(Z-\rho^o(L))\\ &-\dfrac{(1-\delta_a^o(L))}{\kappa_a^o(L)}\gamma_a^o(L) \left(A^{(a)}-\delta_a^o(L)\right)(Z - \rho^o(L)). \end{align}\] This finishes the proof for Theorem 12.
If \(Z\) is an AIV for \(A = a\), then there exist \(b(U,L)\) and \(c(Z,L)\) such that \[\begin{align} \mathbb{E}[A^{(a)}\pi(Z,L)\mid U,L] =& \mathbb{E}[\mathbb{E}[A^{(a)}\mid Z,U,L]\pi(Z,L)\mid U,L]\\ =&\mathbb{E}[\{b(U,L)+c(Z,L)\}\pi(Z,L)\mid U,L]\\ =&b(U,L)\mathbb{E}[\pi(Z,L)\mid U,L] + \mathbb{E}[c(Z,L)\pi(Z,L)\mid U,L]\\ =&b(U,L)\mathbb{E}[\pi(Z,L)\mid L] + \mathbb{E}[c(Z,L)\pi(Z,L)\mid L];\\ \mathbb{E}[A^{(a)}\mid U,L]=&b(U,L) + \mathbb{E}[c(Z,L)\mid L];\\ \mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid U,L\}=& \mathbb{E}[c(Z,L)\pi(Z,L)\mid L]- \mathbb{E}[c(Z,L)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathrm{Cov}\{c(Z,L),\pi(Z,L)\mid L\}. \end{align}\] Analogously, one can verify that \[\begin{align} \label{exp:100} \mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid L\} = \mathrm{Cov}\{c(Z,L),\pi(Z,L)\mid L\} =\mathrm{Cov}\{A^{(a)},\pi(Z,L)\mid U,L\}. \end{align}\tag{13}\] Conversely, if Equation 13 holds for any \(\pi(Z,L)\), then, since \(\mathbb{E}[\pi(Z,L)\mid U,L]=\mathbb{E}[\pi(Z,L)\mid L]\) by Assumption 3, we know that \[\begin{align} &\mathbb{E}[A^{(a)}\pi(Z,L)\mid U,L]-\mathbb{E}[A^{(a)}\pi(Z,L)\mid L]\\ =&\{\mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]\}\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\]
For any subset \(\mathcal{S}\subseteq \mathcal{Z}\) with \(\Pr(Z\in\mathcal{S}\mid L)>0\), we take \(\pi(Z,L)=I(Z\in \mathcal{S})\) and obtain \[\begin{align} &\mathbb{E}[A^{(a)}I(Z\in \mathcal{S})\mid U,L]-\mathbb{E}[A^{(a)}I(Z\in \mathcal{S})\mid L]\\ =&\{\mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]\}\mathbb{E}[I(Z\in \mathcal{S})\mid L].\\ \Rightarrow\quad & \mathbb{E}[A^{(a)}\mid Z\in \mathcal{S}, U,L]-\mathbb{E}[A^{(a)}\mid Z\in \mathcal{S},L]= \mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]. \end{align}\] Since \(\mathcal{S}\) is arbitrary, this implies that \[\begin{align} \mathbb{E}[A^{(a)}\mid Z, U,L]-\mathbb{E}[A^{(a)}\mid Z,L]= \mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]. \end{align}\] We can take \(b(U,L) = \mathbb{E}[A^{(a)}\mid U,L]-\mathbb{E}[A^{(a)}\mid L]\) and \(c(Z,L) = \mathbb{E}[A^{(a)}\mid Z,L]\), and verify that \[\begin{align} \mathbb{E}[A^{(a)}\mid Z, U,L] = b(U,L)+c(Z,L), \end{align}\] finishing the proof for Proposition 1.
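To make the additive structure in Proposition 1 concrete, the following is a minimal Monte Carlo sketch (our own illustration; the discrete design, variable names, and constants below are assumptions made purely for this example). It generates data with \(\Pr(A=1\mid Z,U,L)=b(U,L)+c(Z,L)\) and \(Z\perp\!\!\!\perp U\mid L\), and checks numerically that the conditional covariance in Equation 13 is unchanged after further conditioning on \(U\).

```python
# Minimal Monte Carlo sketch (illustration only; design and constants are assumptions).
# Under the additive model Pr(A = 1 | Z, U, L) = b(U, L) + c(Z, L) with Z independent of
# U given L, Cov{A, pi(Z, L) | L} should agree with Cov{A, pi(Z, L) | U, L} (Equation 13).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

L = rng.integers(0, 2, n)                     # observed covariate
U = rng.integers(0, 2, n)                     # unmeasured confounder
Z = rng.integers(0, 3, n)                     # instrument, independent of U given L

b = 0.15 + 0.20 * U + 0.05 * L                # b(U, L)
c = 0.10 * Z + 0.05 * L * Z                   # c(Z, L)
A = rng.binomial(1, b + c)                    # additive treatment model, probabilities in (0, 1)

pi = (Z >= 1) * (1.0 + L)                     # an arbitrary bounded weighting function pi(Z, L)

def cond_cov(x, y, groups):
    """Within-group sample covariances Cov{x, y | groups}."""
    return {g: np.cov(x[groups == g], y[groups == g])[0, 1] for g in np.unique(groups)}

# Keys 0/1 index L; pairs (L, U) are encoded as L + 2 * U.  Up to Monte Carlo error,
# the cells (L = l, U = 0) and (L = l, U = 1) both reproduce the value for L = l.
print(cond_cov(A, pi, L))
print(cond_cov(A, pi, L + 2 * U))
```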
Without loss of generality, we only consider the case that \(A=1\). If Assumption 5 holds, we can take \(\pi^{o}(Z,L) = \Pr(A=1 \mid Z,L)\) to verify that \(\pi^{o}(Z,L)\) is an RWF. Indeed, \(\pi^{o}(Z,L)\) is uniformly bounded by one, and \[\mathrm{Cov}\!\{A,\pi^{o}(Z,L) \mid L\} = \mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\}\] is uniformly bounded below by some positive constant \(\epsilon_{0} > 0\).
Conversely, suppose there exists an RWF \(\pi(Z,L)\). By the Cauchy–Schwarz inequality, we have \[\mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\} \ge \frac{\left|\mathrm{Cov}\!\{\Pr(A=1 \mid Z,L), \pi(Z,L) \mid L\}\right|^2}{\mathrm{Var}\!\{\pi(Z,L) \mid L\}}.\] Noting that \(\mathrm{Cov}\!\{\Pr(A=1 \mid Z,L), \pi(Z,L) \mid L\} = \mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}\), the RWF property yields \[\mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\} \ge \frac{\epsilon_{0}^2}{\sup_{L \in \mathcal{L}} \mathrm{Var}\!\{\pi(Z,L) \mid L\}}.\]
The uniform boundedness of \(\pi(Z,L)\) implies that \(\mathrm{Var}\!\{\Pr(A=1 \mid Z,L) \mid L\}\) is uniformly bounded away from zero. Thus, \(\pi^o(Z,L):=\Pr(A=1 \mid Z,L)\) is itself an RWF for \(A\). This completes the proof of the statement concerning Assumption 5.
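As a concrete illustration (our own, not part of the proof), consider a binary instrument \(Z\in\{0,1\}\) and write \(e(L)=\Pr(Z=1\mid L)\) and \(p_z(L)=\Pr(A=1\mid Z=z,L)\). A direct calculation gives \[\mathrm{Var}\!\{\Pr(A=1\mid Z,L)\mid L\} = e(L)\{1-e(L)\}\{p_1(L)-p_0(L)\}^2,\] so the required lower bound holds, for example, whenever both \(e(L)\{1-e(L)\}\) and \(|p_1(L)-p_0(L)|\) are uniformly bounded away from zero, that is, the instrument is non-degenerate and uniformly relevant.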
Define \(g_a(U,L):=\mathbb{E}[Y(a)\mid U,L]\). We can observe that the left side of Equation 1 equals to \[\begin{align} &\mathbb{E}[A^{(a)}Y\mid Z,L]=\mathbb{E}[\mathbb{E}[A^{(a)}Y\mid U,Z,L]\mid Z,L]\notag\\ =&\mathbb{E}\left[\mathbb{E}[Y\mid U,L,Z,A=a]\Pr(A=a\mid Z,U,L)\middle| Z,L\right]\notag\\ =&\mathbb{E}\left[\mathbb{E}[Y(a)\mid U,L,Z,A=a]\Pr(A=a\mid Z,U,L)\middle| Z,L\right]\tag{14}\\ =&\mathbb{E}\left[g_a(U,L)\Pr(A=a\mid Z,U,L)\middle| Z,L\right].\tag{15} \end{align}\] Equation 14 follows from Assumption 1 that \(Y=Y(a)\). Equation 15 follows from Assumption 2 that \(Y(a)\perp\!\!\!\perp\{Z,A\}\mid L,U\). Similarly, the right side of Equation 1 equals to \[\begin{align} &\mathbb{E}[f_a^o(A^{(a)},L)\mid Z,L]=\mathbb{E}[\mathbb{E}[f_a^o(A^{(a)},L)\mid U,Z,L]\mid Z,L]\\ =&\mathbb{E}\left[f_a^o(1,L)\Pr(A=a\mid Z,U,L)\middle| Z,L\right]+\mathbb{E}\left[f_a^o(0,L)\{1-\Pr(A=a\mid Z,U,L)\}\middle| Z,L\right]\\ =&\mathbb{E}\left[\{f_a^o(1,L)-f_a^o(0,L)\}\Pr(A=a\mid Z,U,L)\middle| Z,L\right]+f_a^o(0,L). \end{align}\] Next, from Assumption 3 that \(Z\perp\!\!\!\perp U\mid L\), we see that for any \(z\in \mathcal{Z}\), \[\label{exp:6} \begin{align} &\mathbb{E}\left[g_a(U,L)\Pr(A=a\mid Z=z,U,L)\middle| L\right]\\ =&\mathbb{E}\left[\{f_a^o(1,L)-f_a^o(0,L)\}\Pr(A=a\mid Z=z,U,L)\middle| L\right] +f_a^o(0,L). \end{align}\tag{16}\] Then from Equation 16 and Assumption 1 that there exists \(b(U,L)\) and \(c(Z,L)\), such that \[\Pr(A=a\mid Z,U,L)=b(U,L)+c(Z,L).\] Then for any \(z\), it holds that \[\begin{align} f_a^o(0,L) =&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\{b(U,L)+c(z,L)\}\middle| L\right]\notag \\=&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L\right]\label{exp:1}\\ &+\mathbb{E}\left[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\middle| L\right]c(z,L).\notag \end{align}\tag{17}\] From Assumption 4, for any \(l\), there exists \(z_1,z_2\), such that \(c(z_1,l)\neq c(z_2,l)\), and we can take difference to get \[\begin{align} &\mathbb{E}[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\mid L=l]\{c(z_1,l)-c(z_2,l)\}= 0.\\ \Rightarrow& \mathbb{E}[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\mid L=l]= 0. \end{align}\] Then we see that \(\mathbb{E}[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\mid L]\equiv 0.\) Next, note that for \(a\in\mathcal{A}\), \[\mathbb{E}[g_a(U,L)\mid L]=\mathbb{E}[\mathbb{E}[Y(a)\mid L,U]\mid L]=\mathbb{E}[Y(a)\mid L].\] Now we know that \(f_a^o(1,L)-f_a^o(0,L)=\mathbb{E}[Y(a)\mid L].\) We can now substitute this into Equation 17 to deduce that \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[\{\mathbb{E}[Y(a)\mid U,L]-\mathbb{E}[Y(a)\mid L]\}b(U,L)\middle| L\right]\\ =&\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\},\\ f_a^o(1,L)=&\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L], \Pr(A=a\mid Z,U,L)\mid L\} +\mathbb{E}[Y(a)\mid L]. \end{align}\] This step follows from Assumption 3 that \(\mathrm{Cov}\!\{\mathbb{E}[Y(a)\mid U,L],c(Z,L)\mid L\}=0\). This is equivalent to say that there only exist one solution \(f_a^o(A^{(a)},L)\) to Equation 1 , finishing the proof for Proposition 3.
From the proof for Proposition 3, we can deduce that \[\begin{align} &\mathbb{E}[Y\mid Z=z,L]=\mathbb{E}[\mathbb{E}[AY\mid Z,U,L]+\mathbb{E}[(1-A)Y\mid Z,U,L]\mid Z=z,L]\\ =&\mathbb{E}\left[g_1(U,L)\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[g_0(U,L)\Pr(A=0\mid Z=z,U,L)\middle| L\right]\\ =&\mathbb{E}\left[\{g_1(U,L)-g_0(U,L)\}\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[g_0(U,L)\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]. \end{align}\] Then from Equation ?? , we know that \[\label{exp:3} \begin{align} &\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}\left[\{f^o(1,L)-f^o(0,L)\}\Pr(A=1\mid Z=z,U,L)\middle| L\right] +f^o(0,L). \end{align}\tag{18}\] First, we consider the case when \(Z\) is an AIV for \(A\). In this case, there exist \(b(U,L)\) and \(c(Z,L)\) such that \(\Pr(A=1\mid Z,U,L)=b(U,L)+c(Z,L)\); then \[\begin{align} f^o(0,L)=&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &-\mathbb{E}\left[\{f^o(1,L)-f^o(0,L)\}\Pr(A=1\mid Z=z,U,L)\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\{b(U,L)+c(z,L)\}\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &-\mathbb{E}\left[\{f^o(1,L)-f^o(0,L)\}\{b(U,L)+c(z,L)\}\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid U,L]b(U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &+c(z,L)\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid U,L]\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid U,L]b(U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ &+c(z,L)\mathbb{E}\left[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\middle| L\right]. \end{align}\] Under Assumption 4, for any \(l\in\mathcal{L}\), there exist \(z_1,z_2\in\mathcal{Z}\) such that \[c(z_1,l)-c(z_2,l)=\Pr(A=1\mid Z=z_1,U,L=l)- \Pr(A=1\mid Z=z_2,U,L=l)\neq 0,\] then we know that \[\begin{align} &\{c(z_1,l)-c(z_2,l)\}\mathbb{E}\left[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\middle| L=l\right]\equiv 0.\\ \Rightarrow &f^o(1,L)-f^o(0,L)= \mathbb{E}\left[Y(1)-Y(0)\middle|L\right]. \end{align}\] Now we can substitute \(\mathbb{E}\left[Y(1)-Y(0)\middle|L\right]\) for \(f^o(1,L)-f^o(0,L)\) to deduce that \[\begin{align} f^o(0,L)=&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)-\mathbb{E}\left[Y(1)-Y(0)\middle|L\right]\mid U,L]b(U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],b(U,L)\mid L\right\}+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(0)\middle| L\right],\\ f^o(1,L)=&f^o(0,L)+\mathbb{E}\left[Y(1)-Y(0)\middle|L\right]\\ =&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(1)\middle| L\right]. \end{align}\] Second, we consider the case when \(Y(1)-Y(0)\perp\!\!\!\perp U\mid L\). In this case, we only need to show that \[\begin{align} f^o(0,L)=&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}\left[Y(0)\middle| L\right],\\ f^o(1,L)=&\mathrm{Cov}\left\{\mathbb{E}[Y(1)-Y(0)\mid U,L],\Pr(A=1\mid Z,U,L)\mid L\right\}+\mathbb{E}\left[Y(1)\middle| L\right]\\ =&\mathbb{E}\left[Y(1)\middle| L\right].
\end{align}\] From Equation 18, we see that \[\begin{align} &\mathbb{E}[Y(1)-Y(0)\mid L]\mathbb{E}\left[\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(1)-Y(0)\mid U,L]\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\{f^o(1,L)-f^o(0,L)\}\mathbb{E}\left[\Pr(A=1\mid Z=z,U,L)\middle| L\right] +f^o(0,L). \end{align}\] Now we deduce that \[\begin{align} f^o(0,L)=&\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid L]\\ &\times\mathbb{E}\left[\Pr(A=1\mid Z=z,U,L)\middle| L\right]+\mathbb{E}\left[Y(0)\middle| L\right]\\ =&\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid L]\\ &\times\Pr(A=1\mid Z=z,L)+\mathbb{E}\left[Y(0)\middle| L\right]. \end{align}\] Under Assumption 4, for any \(l\in\mathcal{L}\), there exist \(z_1,z_2\in\mathcal{Z}\) such that \(\Pr(A=1\mid Z=z_1,L=l)- \Pr(A=1\mid Z=z_2,L=l)\neq 0\). Then we know that \[\begin{align} &\{\Pr(A=1\mid Z=z_1,L=l)-\Pr(A=1\mid Z=z_2,L=l)\}\\ &\times\mathbb{E}[Y(1)-Y(0)-f^o(1,L)+f^o(0,L)\mid L=l]=0,\text{ for any } l\in\mathcal{L}.\\ \Rightarrow&f^o(1,L)-f^o(0,L) = \mathbb{E}[Y(1)-Y(0)\mid L]. \end{align}\] Substituting this back into the previous display, the first factor vanishes, which establishes that \(f^o(0, L) = \mathbb{E}[Y(0) \mid L]\) and hence \(f^o(1, L) = \mathbb{E}[Y(1) \mid L]\). The proof of Equation ?? follows analogously to the arguments in Proposition 3, and is therefore omitted for brevity. This concludes the proof of Proposition 4.
Finally, when the AIV condition does not hold, we prove that \[\mathbb{E}\left[\dfrac{\mathrm{Cov}\{Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\{A,\pi(Z,L)\mid L\}}\right] =\mathbb{E}\left[\dfrac{\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}}{\mathrm{Cov}\{A,\pi(Z,L)\mid L\}}\mathbb{E}[Y(1)-Y(0)\mid U,L]\right],\] which is a weighted average of the conditional ATE \(\mathbb{E}[Y(1)-Y(0)\mid U,L]\). In fact, \[\begin{align} &\mathrm{Cov}\{Y,\pi(Z,L)\mid L\} = \mathbb{E}\left[\mathbb{E}[Y|Z,L]\pi(Z,L)\middle| L\right] - \mathbb{E}[Y\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\mathbb{E}[A\{Y(1)-Y(0)\}+Y(0)|Z,L]\pi(Z,L)\middle| L\right] \\&- \mathbb{E}[A\{Y(1)-Y(0)\}+Y(0)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\left\{\mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]+\mathbb{E}[Y(0)\mid U,L]\mid Z,L]\right\}\pi(Z,L)\middle| L\right] \\&- \mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]+\mathbb{E}[Y(0)\mid U,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[A\mathbb{E}[Y(1)-Y(0)\mid U,L]\pi(Z,L)\middle| L\right] +\mathbb{E}\left[\mathbb{E}[Y(0)\mid L]\pi(Z,L)\middle| L\right] \\ &- \mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L] - \mathbb{E}[Y(0)\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\mathbb{E}[A\pi(Z,L)\mid U,L]\mathbb{E}[Y(1)-Y(0)\mid U,L]\middle| L\right] \\ &- \mathbb{E}[A\mathbb{E}[Y(1)-Y(0)\mid U,L]\mid L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\mathbb{E}\left[\mathbb{E}[A\pi(Z,L)\mid U,L]\mathbb{E}[Y(1)-Y(0)\mid U,L]\middle| L\right] \\ &- \mathbb{E}[\mathbb{E}[\pi(Z,L)\mid L]\mathbb{E}[A\mid U,L]\mathbb{E}[Y(1)-Y(0)\mid U,L]\mid L]\\ =&\mathbb{E}\left[\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}\mathbb{E}[Y(1)-Y(0)\mid U,L]\middle| L\right]. \end{align}\] Then we can show that \[\begin{align} \dfrac{\mathrm{Cov}\{Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\{A,\pi(Z,L)\mid L\}} =\mathbb{E}\left[\dfrac{\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}\mathbb{E}[Y(1)-Y(0)\mid U,L]} {\mathrm{Cov}\{A,\pi(Z,L)\mid L\}}\middle| L\right], \end{align}\] and the stated identity follows by taking expectations over \(L\).
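For completeness, we add a short supporting calculation (our own) showing that the weights above average to one within levels of \(L\): since \(Z\perp\!\!\!\perp U\mid L\) implies \(\mathbb{E}[\pi(Z,L)\mid U,L]=\mathbb{E}[\pi(Z,L)\mid L]\), \[\mathbb{E}\left[\mathrm{Cov}\{A,\pi(Z,L)\mid U,L\}\middle| L\right] = \mathbb{E}[A\pi(Z,L)\mid L]-\mathbb{E}[A\mid L]\mathbb{E}[\pi(Z,L)\mid L] = \mathrm{Cov}\{A,\pi(Z,L)\mid L\},\] so the display is a genuine weighted average of \(\mathbb{E}[Y(1)-Y(0)\mid U,L]\), although the weights need not be nonnegative without further conditions.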
The proof of this theorem follows the same approach as Proposition 9, which establishes a parallel result in the longitudinal setting.
Recall that \[\begin{align} &\pi^o(Z,L)=\Pr(A=1\mid Z,L), \\ &\gamma^o(L):=\dfrac{\mathrm{Cov}\!\{Y, \pi(Z,L) \mid L\}}{\mathrm{Cov}\!\{A, \pi(Z,L) \mid L\}}=\dfrac{\zeta_{\pi}^o(L)-\eta^o(L)\rho_{\pi}^o(L)} {\kappa_{\pi}^o(L)}. \end{align}\] We can see from Proposition 4 that \(\gamma^o(L)\) does not depend on the choice of \(\pi(Z,L)\) and that it solves the nonparametric IV problem in Equation ?? , namely \[\begin{align} \mathbb{E}[Y\mid Z,L]=\mathbb{E}[f^o(A,L)\mid Z,L]=\mathbb{E}[f^o(0,L)+A\gamma^o(L)\mid Z,L]. \end{align}\] From this equality, we can see that \(\mathbb{E}[Y-A\gamma^o(L)\mid Z,L]=f^o(0,L)\). First, \[\begin{align} \mathrm{Var}\!\{\mathbb{E}[\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\mid L]\}= \mathrm{Var}\left\{\dfrac{\zeta_{\pi}^o(L)-\eta^o(L)\rho_{\pi}^o(L)} {\kappa_{\pi}^o(L)}\right\} =\mathrm{Var}\left\{\gamma^o(L)\right\}, \end{align}\] which is a constant that does not depend on \(\pi(Z,L)\). Next, \[\begin{align} &\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o) \mid L\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} Y\{\pi(Z,L)-\rho_{\pi}^o(L)\}-\eta^o(L)\pi(Z,L)\\ -\gamma^o(L)\cdot \left\{A(\pi(Z,L)-\rho_{\pi}^o(L))-\delta^o(L)\pi(Z,L)\right\} \end{array}\middle| L \right\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} \{Y-A\gamma^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\\ +\{\gamma^o(L)\delta^o(L)-\eta^o(L)\}\pi(Z,L) \end{array}\middle| L \right\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} \{Y-A\gamma^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\\ -\mathbb{E}[Y-A\gamma^o(L)\mid L]\pi(Z,L) \end{array}\middle| L \right\}\\ =&\dfrac{1}{\{\kappa_{\pi}^o(L)\}^2}\mathrm{Var}\left\{ \begin{array}{l} \left(\{Y-A\gamma^o(L)\}-\mathbb{E}[Y-A\gamma^o(L)\mid L]\right)\\ \cdot\{\pi(Z,L)-\rho_{\pi}^o(L)\} \end{array}\middle| L \right\}. \end{align}\] Moreover, \(\mathbb{E}[Y-A\gamma^o(L)\mid L]=\mathbb{E}[\mathbb{E}[Y-A\gamma^o(L)\mid Z,L]\mid L]=f^o(0,L)\). We denote \[\begin{align} W=&Y-A\gamma^o(L)-\mathbb{E}[\{Y-A\gamma^o(L)\}\mid L]\\ =&Y-A\gamma^o(L)-f^o(0,L)=Y-f^o(A,L). \end{align}\] Since \(\mathbb{E}[W\mid Z,L]=\mathbb{E}[Y-f^o(A,L)\mid Z,L]=0\), and from the assumption that \(\mathbb{E}[W^2\mid Z,L]\) does not depend on \(Z\), we apply the Cauchy–Schwarz inequality to deduce that \[\begin{align} &\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o) \mid L\} =\dfrac{\mathbb{E}[W^2\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]}{\{\kappa_{\pi}^o(L)\}^2}\\ =&\dfrac{\mathbb{E}[W^2\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]} {\{\mathbb{E}[\{A-\delta^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\mid L]\}^2}\\ =&\dfrac{\mathbb{E}[\mathbb{E}[W^2\mid Z,L]\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]} {\{\mathbb{E}[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\mid L]\}^2}\\ =&\dfrac{\mathbb{E}[W^2\mid L]\mathbb{E}[\{\pi(Z,L)-\rho_{\pi}^o(L)\}^2\mid L]} {\{\mathbb{E}[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}\{\pi(Z,L)-\rho_{\pi}^o(L)\}\mid L]\}^2}\\ \geq&\mathbb{E}[W^2\mid L]\left\{\mathbb{E}\left[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}^2\middle| L\right]\right\}^{-1}. \end{align}\] The inequality holds with equality only when there exists \(f(L)\) such that \[\begin{align} \pi(Z,L)-\rho_{\pi}^o(L)=f(L)\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}. \end{align}\] In particular, when \(\pi(Z,L)=\pi^o(Z,L)\), the lower bound is attained.
Now we verify the fact that \[\begin{align} &\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\}\\ =&\mathrm{Var}\!\{\mathbb{E}[\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\mid L]\} +\mathbb{E}\left[\mathrm{Var}\!\{\varphi_{\pi}(O;\psi_{ada}^o,\alpha_{\pi}^o)\mid L\}\right]\\ \geq&\mathrm{Var}\left\{\gamma^o(L)\right\}+\mathbb{E}\left[ \mathbb{E}[W^2\mid L]\left\{\mathbb{E}\left[\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}^2\middle| L\right]\right\}^{-1}\right]\\ =&\mathrm{Var}\!\{\varphi_{\pi^o}(O;\psi_{ada}^o,\alpha_{\pi^o}^o)\}, \end{align}\] finishing the proof for Proposition 6.
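A small remark of ours, which follows directly from the equality condition in the Cauchy–Schwarz step above: under the same condition on \(\mathbb{E}[W^2\mid Z,L]\), the optimal weighting function is unique only up to conditionally affine transformations, since any \[\tilde{\pi}(Z,L) = f(L)\,\pi^o(Z,L) + g(L),\qquad \inf_{l}|f(l)|>0,\quad \sup_{l}\{|f(l)|+|g(l)|\}<\infty,\] satisfies \(\tilde{\pi}(Z,L)-\mathbb{E}[\tilde{\pi}(Z,L)\mid L]=f(L)\{\pi^o(Z,L)-\rho_{\pi^o}^o(L)\}\) and therefore attains the same variance bound.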
We finish the proof by calculating \[\begin{align} &\mathbb{E}\left[\varphi(O;\psi_{ada}^o,\beta)\right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\pi(Z,L)\xi^o(Z,L)-\delta(L)\eta^o(L)}{\kappa(L)}-\psi_{ada}^o\\ +\gamma(L)\left\{ 1 - \dfrac{(A-\delta(L))^2-(A-\pi(Z,L))^2}{\kappa(L)} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\xi(Z,L)(\pi^o(Z,L)-\pi(Z,L))-\eta(L)(\delta^o(L)-\delta(L))\right\} \end{array} \right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\pi^o(Z,L)\xi^o(Z,L)-\delta^o(L)\eta^o(L)}{\kappa(L)}-\psi_{ada}^o\\ +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \begin{array}{l} \kappa(L) - (A-\delta^o(L)+\delta^o(L)-\delta(L))^2\\ +(A-\pi^o(Z,L)+\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\begin{array}{l} (\xi(Z,L)-\xi^o(Z,L))(\pi^o(Z,L)-\pi(Z,L))\\ -(\eta(L)-\eta^o(L))(\delta^o(L)-\delta(L)) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\kappa^o(L)}{\kappa(L)}\gamma^o(L)-\psi_{ada}^o\\ +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \begin{array}{l} \kappa(L) - \kappa^o(L) - (\delta^o(L)-\delta(L))^2\\ +(\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\begin{array}{l} (\xi(Z,L)-\xi^o(Z,L))(\pi^o(Z,L)-\pi(Z,L))\\ -(\eta(L)-\eta^o(L))(\delta^o(L)-\delta(L)) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{\kappa^o(L)}{\kappa(L)}\gamma^o(L) +\dfrac{\gamma(L)}{\kappa(L)}\{\kappa(L) - \kappa^o(L)\} -\psi_{ada}^o\\ +\dfrac{\gamma(L)}{\kappa(L)}\left\{ \begin{array}{l} - (\delta^o(L)-\delta(L))^2\\ +(\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ +\dfrac{1}{\kappa(L)} \left\{\begin{array}{l} (\xi(Z,L)-\xi^o(Z,L))(\pi^o(Z,L)-\pi(Z,L))\\ -(\eta(L)-\eta^o(L))(\delta^o(L)-\delta(L)) \end{array}\right\} \end{array} \right]\\ =&\mathbb{E}\left[\dfrac{1}{\kappa(L)}\left\{\begin{array}{l} \{\gamma(L)-\gamma^o(L)\}\{\kappa(L) - \kappa^o(L)\}\\ +\gamma(L)\left\{ \begin{array}{l} - (\delta^o(L)-\delta(L))^2\\ +(\pi^o(Z,L)-\pi(Z,L))^2 \end{array} \right\}\\ -(\xi(Z,L)-\xi^o(Z,L))(\pi(Z,L)-\pi^o(Z,L))\\ +(\eta(L)-\eta^o(L))(\delta(L)-\delta^o(L)) \end{array}\right\} \right]. \end{align}\]
We provide a proof for the following result, which is stronger than the results in Equation ?? . For any \(s\) with \(0 \leq s \leq T+1\) and \(r\) with \(0 \leq r \leq T+1-s\), we have \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s}) \mid H_s]\notag\\ =&\mathbb{E}\left[\prod_{t=s}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r})\middle| H_s \right]\label{exp:13}. \end{align}\tag{19}\] which is clearly a stronger version of Equation ?? . We establish the result by induction. When \(s = T+1\), Equation 19 reduces to \(\mathbb{E}[Y(\overline{A}) \mid H_{T+1}] = \mathbb{E}[Y \mid H_{T+1}]\), which follows directly from the consistency assumption (Assumption 6). Now, assume that Equation 19 holds for all \(s_c = T+1, T, \ldots, s+1\) and \(r_c\) satisfying \(s_c + r_c \leq T+1\). We verify that the equation also holds for \(s_c = s\) and all \(r_c = 0, \ldots, T+1 - s\). This completes the proof by induction. In particular, we begin by verifying the case \(r_c = T+1 - s\). By the definition of \(\gamma_{s,\underline{a}_s}^o\), this is equivalent to verifying that \[\label{exp:11} \begin{align} &\mathbb{E}[Y(\underline{a}_{s})|H_s] =\gamma_{s,\underline{a}_s}^o(H_{s}) :=\frac{\mathrm{Cov}\bigl\{ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}), \pi_s(Z_s, H_s) \mid H_s \bigr\}}{\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}. \end{align}\tag{20}\] By induction, \(\mathbb{E}[Y(\underline{a}_{s+1})|H_{s+1}]=\gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\). Then \[\begin{align} &\mathbb{E}\left[ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ I\{A_s = a_s\} \mathbb{E}[Y(\underline{a}_{s+1})\mid H_{s+1}]\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ I\{A_s = a_s\} Y(\underline{a}_{s+1})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ I\{A_s = a_s\} Y(\underline{a}_{s})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}\left[ \mathbb{E}[Y(\underline{a}_{s})\mid A_s=a_s,Z_s,H_s]\times \mathbb{E}[I\{A_s = a_s\}\mid Z_s,H_s] \pi_t(Z_s, H_s)\middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times\mathbb{E}\left[ \Pr(A_s = a_s\mid Z_s,H_s) \pi_t(Z_s, H_s)\middle| H_s \right]. \end{align}\] Since \(A_s\) is an AIV for \(H_s\), there exists functions \(b_{s,a_s}(\overline{U}_s,H_s)\) and \(c_{s,a_s}(Z_s,H_s)\) such that, for any \(a_s \in \mathcal{A}_s\), \[\Pr(A_s = a_s \mid Z_s, \overline{U}_s, H_s) = b_{s,a_s}(\overline{U}_s, H_s) + c_{s,a_s}(Z_s, H_s).\] Then we can deduce that \[\begin{align} &\mathbb{E}\left[ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\pi_t(Z_s, H_s) \middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times\mathbb{E}\left[ \left\{b_{s,a_s}(\overline{U}_s, H_s) + c_{s,a_s}(Z_s, H_s)\right\} \pi_t(Z_s, H_s)\middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \left\{\begin{array}{l} \mathbb{E}\left[ b_{s,a_s}(\overline{U}_s, H_s)\middle| H_s\right]\times\mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right]\\ +\mathbb{E}\left[ c_{s,a_s}(Z_s, H_s) \pi_t(Z_s, H_s)\middle| H_s \right] \end{array}\right\}. 
\end{align}\] Similarly, we can calculate \[\begin{align} &\mathbb{E}\left[ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}) \middle| H_s \right] \times\mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right]\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \left\{\begin{array}{l} \mathbb{E}\left[ b_{s,a_s}(\overline{U}_s, H_s)\middle| H_s\right]\\ +\mathbb{E}\left[ c_{s,a_s}(Z_s, H_s) \middle| H_s \right] \end{array}\right\}\times\mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right];\\\\ &\mathrm{Cov}\bigl\{ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \left\{\begin{array}{l} \mathbb{E}\left[ c_{s,a_s}(Z_s, H_s) \pi_t(Z_s, H_s)\middle| H_s \right]\\ -\mathbb{E}\left[ c_{s,a_s}(Z_s, H_s)\middle| H_s \right]\times \mathbb{E}\left[\pi_t(Z_s, H_s)\middle| H_s \right] \end{array}\right\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{c_{s,a_s}(Z_s, H_s), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{ b_{s,a_s}(\overline{U}_s, H_s) +c_{s,a_s}(Z_s, H_s), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{\Pr(A_s = a_s \mid Z_s, \overline{U}_s, H_s), \pi_s(Z_s, H_s) \mid H_s \bigr\}\\ =&\mathbb{E}[Y(\underline{a}_{s})\mid H_s]\times \mathrm{Cov}\bigl\{I\{A_s = a_s\} , \pi_s(Z_s, H_s) \mid H_s \bigr\}. \end{align}\] This finishes the proof for Equation 20 , and we finish the proof for \(s_c=s\) and \(r_c=T+1-s\). For \(r_c=T-s\), \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s})|H_s] =\frac{\mathrm{Cov}\bigl\{ I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1}), \pi_s(Z_s, H_s) \mid H_s \bigr\}}{\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}\notag\\ =&\mathbb{E}\left[\frac{ \left\{\pi_s(Z_s, H_s)-\mathbb{E}\left[\pi_s(Z_s, H_s)\middle| H_s\right]\right\} I\{A_s = a_s\} \gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})}{\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}\middle| H_s\right].\label{exp:12} \end{align}\tag{21}\] By induction, \(\gamma_{s+1,\underline{a}_{s+1}}^o(H_{s+1})\) equals to \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s+1}) \mid H_{s+1}]\notag\\ =&\mathbb{E}\left[\prod_{t=s+1}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r})\middle| H_{s+1} \right]. \end{align}\] We can substitute this quantity into Equation 21 to deduce that \[\begin{align} &\mathbb{E}[Y(\underline{a}_{s})|H_s]\\ =&\mathbb{E}\left[\begin{array}{l} \dfrac{ \left\{\pi_s(Z_s, H_s)-\mathbb{E}\left[\pi_s(Z_s, H_s)\middle| H_s\right]\right\} I\{A_s = a_s\}} {\mathrm{Cov}\bigl\{ I\{A_s = a_s\}, \pi_s(Z_s, H_s) \mid H_s \bigr\}}\\ \times \displaystyle\prod_{t=s+1}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r}) \end{array}\middle| H_s\right]\\ =&\mathbb{E}\left[\prod_{t=s}^{T-r} \dfrac{\left(\pi_t(Z_t,H_t)-\mathbb{E}[\pi_t(Z_t,H_t)\mid H_t]\right)I\{A_t=a_t\}}{\mathrm{Cov}\!\{I\{A_t=a_t\},\pi_t(Z_t,H_t)\mid H_t\}} \gamma_{T+1-r,\underline{a}_{T+1-r}}^o(H_{T+1-r})\middle| H_{s} \right]. \end{align}\] This finishes the proof for Equation 19 .
We first calculate that \[\begin{align} &\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t,\overline{a}},\underline{\alpha}_{t+1,\overline{a}}^o) -\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t-1,\overline{a}},\underline{\alpha}_{t,\overline{a}}^o)\\ =&\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)} \times\left( \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} -\dfrac{\left\{Z_t-\rho_t^o(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \right)\\ &\times\left\{ \begin{array}{l} \displaystyle\prod_{s=t+1}^{T}\dfrac{\left\{Z_s-\rho_t^o(H_t)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}^o(H_s)}Y-\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\\ +\displaystyle\sum_{s=t+1}^T\left(\displaystyle\prod_{r=t+1}^{s-1} \dfrac{\left\{Z_r-\rho_r^o(H_r)\right\}A_r^{(a_r)}} {\kappa_{r,a_r}^o(H_r) - \delta_{r,a_r}^o(H_r)\rho_r^o(H_r)}\right)\\ \times\left\{ \left(1-\dfrac{\{A_s^{(a_s)}-\delta_{s,a_s}^o(H_s)\}\{Z_s-\rho_s^o(H_s)\}}{\kappa_{s,a_s}^o(H_s)} \right)\gamma_{s,\underline{a}_s}^o(H_s) - \dfrac{(Z_s-\rho_s^o(H_s))\times\eta_{s,\underline{a}_s}^o(H_s)}{\kappa_{s,a_s}^o(H_s)}\right\} \end{array} \right\}\\ &+\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\\ &\times\left\{\begin{array}{l} \left( \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} -\dfrac{\left\{Z_t-\rho_t^o(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \right)\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\\ +\left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}\{Z_t-\rho_t(H_t)\}}{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\times\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\\ -\left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}^o(H_t)\}\{Z_t-\rho_t^o(H_t)\}}{\kappa_{t,a_t}^o(H_t)} \right)\gamma_{t,\underline{a}_t}^o(H_t) - \dfrac{(Z_t-\rho_t^o(H_t))\times\eta_{t,\underline{a}_t}^o(H_t)}{\kappa_{t,a_t}^o(H_t)} \end{array}\right\}\\ =&\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)} \times\left( \dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}(H_t)} -\dfrac{\left\{Z_t-\rho_t^o(H_t)\right\}A_t^{(a_t)}} {\kappa_{t,a_t}^o(H_t)} \right)Q_{11}\\ &+\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)} \left\{ Q_{12}+Q_{13}-Q_{14} \right\}. \end{align}\] Since the nuisance functions in \(Q_{11}\) all equal to the corresponding true form, it is straight forward to verify that \(\mathbb{E}[Q_{11}\mid H_{t+1}]\equiv 0\). Similarly, one can verify that \(\mathbb{E}[Q_{14}\mid H_t]\equiv 0\). 
From \[\begin{align} &\mathbb{E}\left[\dfrac{\left\{Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})} {\kappa_{t,a_t}(H_t)}\middle| H_t\right]\\ =&\dfrac{\mathbb{E}\left[\left\{ Z_t-\rho_t(H_t)\right\}A_t^{(a_t)}\gamma_{t+1,\underline{a}_{t+1}}^o(H_{t+1})\middle| H_t\right]} {\kappa_{t,a_t}(H_t)}\\ =&\dfrac{\zeta_{t,\underline{a}_t}^o(H_t)-\rho_t(H_t)\eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}(H_t)} =\dfrac{\gamma_{t,\underline{a}_t}^o(H_t)\kappa_{t,a_t}^o(H_t)+\{\rho_t^o(H_t)-\rho_t(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}(H_t)}, \end{align}\] we know that \[\begin{align} &\mathbb{E}[Q_{12}\mid H_t]=\gamma_{t,\underline{a}_t}^o(H_t)\left(\dfrac{\kappa_{t,a_t}^o(H_t)}{\kappa_{t,a_t}(H_t)}-1\right) +\dfrac{\{\rho_t^o(H_t)-\rho_t(H_t)\} \eta_{t,\underline{a}_t}^o(H_t)} {\kappa_{t,a_t}(H_t)};\\ &\mathbb{E}[Q_{13}\mid H_t]\\=& \mathbb{E}\left[\left(1-\dfrac{\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}\{Z_t-\rho_t(H_t)\}}{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(Z_t-\rho_t(H_t))\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\middle| H_t\right]\\=& \left(1-\dfrac{ \mathbb{E}\left[\{A_t^{(a_t)}-\delta_{t,a_t}(H_t)\}\{Z_t-\rho_t(H_t)\}\middle| H_t\right] }{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) - \dfrac{(\rho_t^o(H_t)-\rho_t(H_t))\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)}\\=& \left(1-\dfrac{ \kappa_{t,a_t}^o(H_t) }{\kappa_{t,a_t}(H_t)} \right)\gamma_{t,\underline{a}_t}(H_t) +\dfrac{1}{\kappa_{t,a_t}(H_t)}(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t)\\&- \dfrac{(\rho_t^o(H_t)-\rho_t(H_t))\times\eta_{t,\underline{a}_t}(H_t)}{\kappa_{t,a_t}(H_t)};\\ &\mathbb{E}[Q_{12}+Q_{13}\mid H_t]\\=& \dfrac{1}{\kappa_{t,a_t}(H_t)}\left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ &+\dfrac{1}{\kappa_{t,a_t}(H_t)}(\rho_t(H_t)-\rho_t^o(H_t))\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ &+\dfrac{1}{\kappa_{t,a_t}(H_t)}(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t). \end{align}\] In summary, we can deduce that \[\begin{align} &\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t,\overline{a}},\underline{\alpha}_{t+1,\overline{a}}^o) -\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t-1,\overline{a}},\underline{\alpha}_{t,\overline{a}}^o)\middle| H_t\right] =\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\\ &\times \dfrac{1}{\kappa_{t,a_t}(H_t)}\left\{\begin{array}{l} \left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\rho_t(H_t)-\rho_t^o(H_t))\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t). \end{array}\right\}. 
\end{align}\] Finally, we deduce that \[\begin{align} &\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\alpha_{\overline{a}})\right]\\ =&\sum_{t=0}^T\mathbb{E}\left[\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t,\overline{a}},\underline{\alpha}_{t+1,\overline{a}}^o) -\varphi_{\overline{a}}(O;\psi_{\overline{a}}^o,\overline{\alpha}_{t-1,\overline{a}},\underline{\alpha}_{t,\overline{a}}^o)\right]\\ =&\mathbb{E}\left[\begin{array}{l} \displaystyle\sum_{t=0}^T\left(\displaystyle\prod_{s=0}^{t-1}\dfrac{\left\{Z_s-\rho_s(H_s)\right\}A_s^{(a_s)}} {\kappa_{s,a_s}(H_s)}\right) \times \dfrac{1}{\kappa_{t,a_t}(H_t)}\\ \times\left\{\begin{array}{l} \left\{\kappa_{t,a_t}(H_t)- \kappa_{t,a_t}^o(H_t)\right\} \left\{\gamma_{t,\underline{a}_t}(H_t)-\gamma_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\rho_t(H_t)-\rho_t^o(H_t))\left\{\eta_{t,\underline{a}_t}(H_t)-\eta_{t,\underline{a}_t}^o(H_t)\right\}\\ +(\delta_{t,a_t}(H_t)-\delta_{t,a_t}^o(H_t))(\rho_t(H_t) - \rho_t^o(H_t)) \gamma_{t,\underline{a}_t}(H_t) \end{array}\right\} \end{array}\right]. \end{align}\] This finishes the proof for Proposition 9.
From Equation 16 , we see that \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[g_a(U,L)\{ 1-\Pr(A\neq a\mid Z=z,U,L)\}\middle| L\right]\\ &-\mathbb{E}\left[\{f_a^o(1,L)-f_a^o(0,L)\}\{1-\Pr(A\neq a\mid Z=z,U,L)\}\middle| L\right]. \end{align}\] If there exists \(b(U,L)\) and \(c(Z,L)\), such that \(\Pr(A\neq a\mid Z=z,U,L)=b(U,L)c(Z,L)\), and \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\{1-b(U,L)c(z,L)\}\middle| L\right] \notag\\ =&\mathbb{E}\left[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\middle| L\right]\label{exp:2}\\ &-c(z,L)\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L\right].\notag \end{align}\tag{22}\] If for any \(l\), there exists \(z_1,z_2\), such that \(c(z_1,l)\neq c(z_2,l)\), then we can take difference to deduce that \[\begin{align} &\{c(z_1,l)-c(z_2,l)\}\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L=l\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}b(U,L)\middle| L\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\Pr(A\neq a\mid U,L)\middle| L\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}I(A\neq a)\middle| L\right]=0\\ \Rightarrow&\mathbb{E}\left[\{g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\}\middle| L,A\neq a\right]=0. \end{align}\] Now we know that \[\begin{align} &f_a^o(1,L)-f_a^o(0,L)=\mathbb{E}\left[g_a(U,L)\middle| L,A\neq a\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(a)\mid U,L]\middle| L,A\neq a\right]\\ =&\mathbb{E}\left[\mathbb{E}[Y(a)\mid U,L,A\neq a]\middle| L,A\neq a\right]\\ =&\mathbb{E}\left[Y(a)\middle| L,A\neq a\right]. \end{align}\] We can now substitute this expression into Equation 22 to deduce that \[\begin{align} f_a^o(0,L)=&\mathbb{E}\left[g_a(U,L)-f_a^o(1,L)+f_a^o(0,L)\middle| L\right]\\ =&\mathbb{E}[Y(a)\mid L]-\mathbb{E}[Y(a)\mid L,A\neq a].\\ f_a^o(1,L)=&f_a^o(0,L)+\mathbb{E}[Y(a)\mid L,A\neq a]=\mathbb{E}[Y(a)\mid L]. \end{align}\] This finishes the proof for Proposition 12.
For any measurable and bounded \(f(A,V)\), we directly calculate \[\begin{align} &\mathbb{E}\left[\omega_{\pi}^o(A, Z, L) f(A, V) \left\{Y - g(A, V; \psi_{MSM}^o)\right\}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\mathbb{E}\left[I\{A=a\}\omega_{\pi}^o(a, Z, L) \left\{Y - g(a, V; \psi_{MSM}^o)\right\}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\mathbb{E}\left[I\{A=a\} \dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} \left\{Y - g(a, V; \psi_{MSM}^o)\right\}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\mathbb{E}\left[ \dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} A^{(a)}Y\middle| L\right]\\ &-\sum_{a\in\mathcal{A}}f(a, V)g(a, V; \psi_{MSM}^o)\mathbb{E}\left[ \dfrac{\pi(Z,L)-\mathbb{E}[\pi(Z,L)\mid L]}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} A^{(a)}\middle| L\right]\\ =&\sum_{a\in\mathcal{A}}f(a, V)\dfrac{\mathrm{Cov}\!\{I\{A=a\}Y,\pi(Z,L)\mid L\}}{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} \\ &-\sum_{a\in\mathcal{A}}f(a, V)g(a, V; \psi_{MSM}^o)\dfrac{\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}} {\mathrm{Cov}\!\{I\{A=a\},\pi(Z,L)\mid L\}}\\ =&\mathbb{E}\left[\sum_{a\in\mathcal{A}}f(a, V)\{Y(a)-g(a,V;\psi_{MSM}^o)\}\middle| L\right]. \end{align}\] The last step follows from Proposition 3. Now we can take expectation with respect to \(L\) and get \[\begin{align} &\mathbb{E}\left[\omega_{\pi}^o(A, Z, L) f(A, V) \left\{Y - g(A, V; \psi_{MSM}^o)\right\}\right]\\ =&\mathbb{E}\left[\sum_{a\in\mathcal{A}}f(a, V)\{Y(a)-g(a,V;\psi_{MSM}^o)\}\right] \\ =&\mathbb{E}\left[\sum_{a\in\mathcal{A}}f(a, V)\{\mathbb{E}[Y(a)\mid V]-g(a,V;\psi_{MSM}^o)\}\right]=0. \end{align}\] The final step follows from the definition of \(g(a,V;\psi_{MSM}^o)\) in Equation 10 . Since this equality holds for any \(f(A,V)\), we conclude that \[\begin{align} \mathbb{E}\left[\omega_{\pi}^o(A, Z, L)\left\{Y - g(A, V; \psi_{MSM}^o)\right\}\mid A,V\right]=0. \end{align}\]
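The display above suggests a simple way to solve the weighted estimating equation in practice. Below is a self-contained sketch (our own illustration, not the authors' implementation): it takes a binary treatment, an empty effect-modifier set \(V\), the saturated working model \(g(a;\psi)=\psi_0+\psi_1 a\), \(f(A,V)=(1,A)^{\top}\), and \(\pi(Z,L)=Z\), with all nuisance quantities estimated by stratifying on a discrete \(L\). The data-generating design and constants are assumptions made purely for illustration; under this additive treatment model the solution should be close to \((\mathbb{E}[Y(0)],\,\text{ATE})\).

```python
# Sketch (illustration only): solve sum_i w_i f(A_i) {Y_i - psi0 - psi1 A_i} = 0, where
# w_i = {pi(Z_i, L_i) - E[pi | L_i]} / Cov{I(A = A_i), pi | L_i} is a plug-in estimate of
# the weighting function omega_pi, with pi(Z, L) = Z and nuisances estimated per L stratum.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

L = rng.integers(0, 2, n)
U = rng.integers(0, 2, n)                             # unmeasured confounder
Z = rng.integers(0, 3, n)                             # instrument, independent of U given L
p = 0.15 + 0.25 * U + 0.05 * L + 0.10 * Z             # additive treatment model (AIV holds)
A = rng.binomial(1, p)
Y = 1.0 + 2.0 * A + 1.0 * L + 1.5 * U + rng.normal(0, 1, n)   # true ATE = 2, E[Y(0)] = 2.25

pi = Z.astype(float)                                  # weighting function pi(Z, L) = Z

# Stratum-wise nuisances: E[pi | L] and Cov{I(A = a), pi | L}.
w = np.empty(n)
for l in np.unique(L):
    m = L == l
    pi_bar = pi[m].mean()
    for a in (0, 1):
        cov_a = np.cov((A[m] == a).astype(float), pi[m])[0, 1]
        idx = m & (A == a)
        w[idx] = (pi[idx] - pi_bar) / cov_a

# Solve the two-equation linear system with f(a) = (1, a)'.
F = np.column_stack([np.ones(n), A])
lhs = F.T @ (w[:, None] * F)
rhs = F.T @ (w * Y)
psi = np.linalg.solve(lhs, rhs)
print(psi)   # approximately (E[Y(0)], ATE) = (2.25, 2.0) up to Monte Carlo error
```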
We first verify an important fact: if \(Z\) is an AIV for \(A=a\), then \(Z_\mathcal{S}\) is also an AIV for \(A=a\). Since \(Z\) is an AIV for \(A=a\), there exist \(b(U,L)\) and \(c(Z,L)\) such that \[\Pr(A = a \mid Z, U, L) = b(U, L) + c(Z, L).\] We define \(\tilde{c}(Z_{\mathcal{S}},L)=\int c(z,L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid Z_\mathcal{S},L)\). Then we can use the law of total probability to verify that \[\begin{align} &\Pr(A = a \mid Z_{\mathcal{S}}=z_\mathcal{S}, U, L)\notag\\ =&\int\Pr(A = a \mid Z=z, Z_{\mathcal{S}}=z_\mathcal{S}, U, L)\text{d} P_{Z\mid Z_{\mathcal{S}},U,L}(z\mid z_\mathcal{S},U,L)\notag\\ =&\int\Pr(A = a \mid Z=z, Z_{\mathcal{S}}=z_\mathcal{S}, U, L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\tag{23}\\ =&\int\Pr(A = a \mid Z=z, U, L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\tag{24}\\ =&\int \{b(U,L)+c(z,L)\}\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\notag\\ =&b(U,L)+\int c(z,L)\text{d} P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)\notag\\ =&b(U,L)+\tilde{c}(z_{\mathcal{S}},L).\notag \end{align}\] Equation 23 follows from Assumption 3 that \(Z\perp\!\!\!\perp U\mid L\). Equation 24 follows because, whenever \(Z=z\) and \(P_{Z\mid Z_{\mathcal{S}},L}(z\mid z_\mathcal{S},L)>0\), it must hold that \(Z_{\mathcal{S}}=z_\mathcal{S}\).
This is equivalent to saying that \(Z_\mathcal{S}\) is an AIV for \(A=a\). Furthermore, Assumptions 1–4 still hold with \(Z_\mathcal{S}\) in place of \(Z\). We can therefore apply Proposition 3 to \(Z_\mathcal{S}\) to deduce the result.
For any fixed \(a\in\mathcal{A}\), denote \(g(u,L):=\mathbb{E}[Y(a)\mid U,L]\). We first observe that \[\begin{align} &\mathbb{E}[Y\pi(Z,L)\mid A=a,L]=\mathbb{E}[\mathbb{E}[Y\mid Z,A=a,U,L]\pi(Z,L)\mid A=a,L]\\ =&\mathbb{E}[\mathbb{E}[Y(a)\mid U,L]\pi(Z,L)\mid A=a,L]\\ =&\int g(u,L)\pi(z,L)p_{Z,U\mid A,L}(z,u\mid a,L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)p_{A,Z,U\mid L}(a,z,u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)p_{A\mid Z,U,L}(a\mid z,u,L)p_{Z\mid L}(z\mid L)p_{U\mid L}(u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)b(z,L)p_{Z\mid L}(z\mid L)p_{U\mid L}(u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)\pi(z,L)c(u,L)p_{Z\mid L}(z\mid L)p_{U\mid L}(u\mid L)\text{d}\mu_Z(z)\text{d}\mu_U(u)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u)\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)c(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u)\int \pi(z,L)p_{Z\mid L}(z\mid L)\text{d}\mu_Z(z)\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)c(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u)\cdot\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\] Similarly, we can just set \(\pi(Z,L)\equiv 1\) to verify that \[\begin{align} \mathbb{E}[Y\mid A=a,L]=&\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &+\dfrac{1}{p_{A\mid L}(a\mid L)}\int g(u,L)c(u,L)p_{U\mid L}(u\mid L)\text{d}\mu_U(u). \end{align}\] Finally, we conclude that \[\begin{align} &\mathbb{E}[Y\pi(Z,L)\mid A=a,L]-\mathbb{E}[Y\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &-\dfrac{1}{p_{A\mid L}(a\mid L)}\mathbb{E}[Y(a)\mid L]\cdot\int b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\] Similarly, we can verify that \[\begin{align} &\mathbb{E}[1\cdot\pi(Z,L)\mid A=a,L]-\mathbb{E}[1\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\dfrac{1}{p_{A\mid L}(a\mid L)}\cdot\int \pi(z,L)b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\\ &-\dfrac{1}{p_{A\mid L}(a\mid L)}\cdot\int b(z,L)p_{Z\mid L}(z\mid L) \text{d}\mu_Z(z)\mathbb{E}[\pi(Z,L)\mid L]. \end{align}\] Now we get the result that \[\begin{align} &\mathbb{E}[Y\pi(Z,L)\mid A=a,L]-\mathbb{E}[Y\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]\\ =&\{\mathbb{E}[\pi(Z,L)\mid A=a,L]-\mathbb{E}[\pi(Z,L)\mid L]\}\mathbb{E}[Y(a)\mid L]. \end{align}\] From this equality, we know that \[\begin{align} \mathbb{E}[Y(a)]=\mathbb{E}\left[\dfrac{\mathbb{E}[Y\pi(Z,L)\mid A=a,L]-\mathbb{E}[Y\mid A=a,L]\mathbb{E}[\pi(Z,L)\mid L]} {\mathbb{E}[\pi(Z,L)\mid A=a,L]-\mathbb{E}[\pi(Z,L)\mid L]}\right]. \end{align}\]
The authors were partially supported by the National Key R&D Program of China (2024YFA1015600) and the National Natural Science Foundation of China (12471266 and U23A2064). Correspondence to cuiyf@zju.edu.cn.