Measuring the Redundancy of Information from a Source Failure Perspective


Abstract

In this paper, we define a new measure of the redundancy of information from a fault tolerance perspective. The partial information decomposition (PID) emerged last decade as a framework for decomposing the multi-source mutual information \(I(T;X_1, ..., X_n)\) into atoms of redundant, synergistic, and unique information. It built upon the notion of redundancy/synergy from McGill’s interaction information [1]. Separately, the redundancy of system components has served as a principle of fault tolerant engineering, for sensing, routing, and control applications. Here, redundancy is understood as the level of duplication necessary for the fault tolerant performance of a system. With these two perspectives in mind, we propose a new PID-based measure of redundancy \(I_{\texttt{ft}}\), based upon the presupposition that redundant information is robust to individual source failures. We demonstrate that this new measure satisfies the common PID axioms from [2]. In order to do so, we establish an order-reversing correspondence between collections of source-fallible instantiations of a system, on the one hand, and the PID lattice from [2], on the other.

Keywords: redundancy, partial information decomposition, fault tolerance, interaction information

1 Introduction

There are two notions of redundancy explored in this paper. Within the fault tolerance literature, redundancy describes the desirable property of a system to be able to replicate its component services in the event of a fault [3]. This redundancy is typically implemented through physical or logical copies of the system’s fallible components. Fault tolerance describes the system’s ability to continue operating despite faults. Classic results from the field prescribe the minimal level of redundancy needed to withstand a given number of faults, for a given task and fault type. For instance, one classic result considers \(n\) independent sensors measuring the same continuous value within a specified interval [4]. Depending on the nature of the faults, the centralized algorithm provided there can tolerate up to \(f={(n-1)/k}\) faults, where \(k=1,2,3\) for the different fault types. Besides sensing, similar results exist for routing and control [3]. Redundancy, in the form of duplicated information assets, provides guarantees on system performance up to a certain level of failure.

Separately, there is a recurrent interest in information theory in measuring the redundant or common information content among multiple variables. Both total correlation [5] and interaction information [1], for instance, are multivariate extensions of mutual information, and both can be considered measures of redundancy or common randomness among many variables. When one of the variables is designated as a target, this question of redundancy can be interpreted as asking about the interactions among multiple sources with respect to the target (see McGill’s original [1]). For three variables \(X,Y,Z\), interaction information takes the form: \[\label{eq:intxn_info} I(X ; Y ; Z) = I(X; Y | Z) - I(X; Y)\tag{1}\] The signed nature of the interaction suggests a rich structure: a positive interaction indicates synergy, while a negative one indicates redundancy among the variables. Interaction information found favor in the theoretical neuroscience community in the 90s/00s, both to measure synergy [6], [7] and redundancy [8] (see also [7]).

The partial information decomposition (PID) was introduced in the last decade to separate redundancy and synergy as distinct informational quantities [2]. Dispensing with the target-source symmetry of interaction information, PID seeks to decompose the information between a collection of source variables \(X_1, X_2, ..., X_n\) and a target \(T\) into those components that can be uniquely, redundantly, and/or synergistically attributed to each source. For instance, when \(n=2\), the so-called bivariate PID decomposes \(I(T;X_1,X_2)\) into four information atoms \(R+S+U_1+U_2\), satisfying: \[\label{eqs:bivariate_PID} \begin{gather} I(T;X_1) = R + U_1 \\ I(T;X_2) = R + U_2 \\ I(T;X_1,X_2) = R + U_1 + U_2 + S \end{gather}\tag{2}\] These equations are known as the PID equations, and these atoms are referred to as redundant (\(R\)), unique (\(U_1\), \(U_2\)), and synergistic (\(S\)) information. Applying the chain rule for mutual information to (2a-c) gives \(I(T;X_1|X_2) = I(T;X_1,X_2) - I(T;X_2) = U_1 + S\), and so from (1) we have \[\label{eq:intxn_info_PID} I(T; X_1 ; X_2) = I(T;X_1|X_2) - I(T;X_1) = S - R\tag{3}\] recovering the traditional interpretation of interaction information as measuring a synergy/redundancy trade-off. As (2a-c) is an underdetermined system (three equations in four unknowns), one of the atoms needs to be specified in order to provide a proper PID definition. Many different PIDs have been proposed, satisfying different, often mutually exclusive axioms [2], [9]–[11].

PID has drawn great interest from researchers at the intersection of neuroscience and complex systems [11]. However, its application outside of biological and social science is less common, largely limited to theoretical machine learning [12], [13]. To our knowledge, there is no technical work examining what PID might be able to suggest about an information system’s robustness from a fault tolerance perspective. This work addresses itself to that question.

In this paper, we will develop a novel PID redundancy function \(I_{\texttt{ft}}\) that quantifies the minimal average information in the presence of crash faults — i.e. faults in which a source unambiguously fails and provides a null signal. We begin by presenting a simple distribution in Section 2 that demonstrates the type of redundancy we would like to capture. In Sec. 3, we review the fundamentals of the partial information decomposition, and introduce relevant notation and terminology. In Sec. 4, we introduce our fault tolerance-inspired PID redundancy function \(I_{\texttt{ft}}\), and examine its mathematical properties. In the course of demonstrating that \(I_{\texttt{ft}}\) satisfies the common PID axioms from [2], [9], we show an insightful result: there is an order-reversing correspondence between the collections of sources on the PID lattice, over which PID redundancy is measured, and collections of source-fallible realizations of the system — i.e. garbled, fault-prone copies of \(X_1, ..., X_n\).

2 Motivating Example

Consider a scenario in which we need to know a one-bit variable \(T\) of great importance — say, who is entering a building tonight, an invited or an uninvited guest. The building has two entrances the visitor may use: a front and a back. In ideal circumstances, both are monitored by a security camera, and so we have source variables \(X_1\) and \(X_2\) for the front and back cameras, respectively. We code the meanings derived from each video feed using the alphabet \(\mathcal{A}_{X_i} = \{ 1, 2, 3\}\), where \(1\) means that we observe no activity at that entrance, while \(2\) and \(3\) signify the invited and uninvited guest entering, respectively. Our target variable only takes values \(2\) and \(3\), since we have prior information that some visitor will be entering the building. Assuming our observation of the entrances is perfect and the visitor cannot use both entrances, the distribution for the system \((X_1,X_2,T)\) is given by Table 1A.

It is clear that, in this ideal scenario, \(I(T;X_1,X_2) = H(T) = 1\), i.e. \((X_1,X_2)\) contain all the information about \(T\). Moreover, each source \(X_i\) gives this full bit of information half of the time, i.e. \(I(T; X_i | X_i \neq 1) = 1\). However, when \(X_i=1\), we gain no information from it, as \(H(T | X_i=1) = H(T) = 1\). Thus each pairwise mutual information is \(I(T;X_i) = 1/2\).
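Concretely, conditioning on whether the first camera registers any activity, \[I(T;X_1) = H(T) - H(T \mid X_1) = 1 - \left[ \tfrac{1}{2}\, H(T \mid X_1 = 1) + \tfrac{1}{4}\, H(T \mid X_1 = 2) + \tfrac{1}{4}\, H(T \mid X_1 = 3) \right] = 1 - \tfrac{1}{2} = \tfrac{1}{2}\] and symmetrically for \(X_2\).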

(A)

\(X_1\) \(X_2\) \(T\) \(p\)
1 2 2 \(1/4\)
1 3 3 \(1/4\)
2 1 2 \(1/4\)
3 1 3 \(1/4\)

(B)

\(\tilde{X}_1\) \(\tilde{X}_2\) \(T\) \(p\)
1 0 2 \(1/4\)
1 0 3 \(1/4\)
0 1 2 \(1/4\)
0 1 3 \(1/4\)

Table 1: (A) Our leading example, a trivariate, \(n=2\) system with a 1-bit target. As can be seen, each predictor provides 1 bit of information half the time. (B) A garbled, source-fallible copy of the same system, in which the fallible sources provide no information about \(T\). This demonstrates the kind of circumstance that fault tolerant engineering aims to control for.
How, then, should we decompose this bit of mutual information into \(R+U_1+U_2+S\), the bivariate PID of this system (2a-c)? Let us assume that we want each atom to be non-negative, which would exclude several recent PIDs (e.g. [10], [14], [15]). Since the bivariate PID has one degree of freedom, we may consider the two edge cases, corresponding to \(R=1/2\) and \(R=0\), respectively.

First, we might have \(R=S=1/2\), which would in turn imply \(U_1=U_2=0\). This decomposition would indicate that, on average, half a bit of information is redundant between the variables, and half a bit is synergistic. This is the PID that would be given by both the original \(I_{\texttt{min}}\) redundancy-based PID from [2] and the popular \(I_{\texttt{BROJA}}\) PID from [9]. However, does this decomposition make sense? In every realization \((x_1,x_2,t)\), only one of our two predictors is giving any pointwise information regarding the target, while the other is giving none. For instance, for the first two rows of Table 1A, \[\begin{align} \tag{4} \log \frac{p(x_1,t)}{p(x_1)p(t)} &= 0\\ \tag{5} \log \frac{p(x_2,t)}{p(x_2)p(t)} &= 1 \end{align}\] Thus, it is unclear in what sense there is redundancy (or synergy) between the information provided pointwise by these outcomes — at least if we take redundancy to mean information that can be provided reliably by either variable in the event that the other source fails.

Let us instead consider the other option, setting \(R=S=0\), which would in turn give us \(U_1=U_2=1/2\). This tells us that the information provided by \((X_1,X_2)\) about \(T\) can be divided into unique contributions from each, and that none of the information can be considered redundantly provided by both, or synergistically provided by their combination. One intuitive justification for this allocation is that the supports of the pointwise informations given by the left-hand sides of Eqs. (4)-(5) are mutually exclusive: our sample space is partitionable between the events “only \(X_1\) is informative” and “only \(X_2\) is informative.”

We favor this latter assignment. Taking our inspiration from the notion of redundancy as used in the fault tolerance literature, we conceptualize ‘redundant information’ as the expected bits that one can guarantee if at least one of the sources is available in every realization — or, equivalently, if we allow that all but one source may fail arbitrarily. With respect to this example, let us consider the garbled source tuple \((\tilde{X}_1, \tilde{X}_2)\), defined by \(\tilde{X}_i = \delta_{1}(X_i) \, X_i\), where \(\delta_1(X_i)\) is the indicator that \(X_i = 1\). This new distribution is given by Table 1B. We can imagine this distribution as the worst-case failure mode in every instance, while still guaranteeing that one source will be available. Any decision-maker who has access to \((\tilde{X}_1,\tilde{X}_2)\) always has access to the ground truth of either \(X_1\) or \(X_2\), and thus, we argue, to the information redundant to both of them. As we see, \(I(T; \tilde{X}_1,\tilde{X}_2) = 0\). Thus, it stands to reason that \(R=0\).
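As a quick numerical check of the values quoted above, the following sketch computes the relevant mutual informations directly from the joint distributions in Table 1 (the helper `mutual_information` and the variable names are ours, for illustration only):

```python
from collections import defaultdict
from math import log2

def mutual_information(joint, x_idx, t_idx):
    """I(T; X) in bits, where X is the sub-tuple of each outcome at positions x_idx."""
    p_x, p_t, p_xt = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        x, t = tuple(outcome[i] for i in x_idx), outcome[t_idx]
        p_x[x] += p; p_t[t] += p; p_xt[(x, t)] += p
    return sum(p * log2(p / (p_x[x] * p_t[t])) for (x, t), p in p_xt.items() if p > 0)

# Table 1A: outcomes (x1, x2, t) and their probabilities.
base = {(1, 2, 2): 0.25, (1, 3, 3): 0.25, (2, 1, 2): 0.25, (3, 1, 3): 0.25}

# Table 1B: the garbled copy, with x~_i = x_i if x_i == 1 and 0 otherwise.
garbled = defaultdict(float)
for (x1, x2, t), p in base.items():
    garbled[(x1 if x1 == 1 else 0, x2 if x2 == 1 else 0, t)] += p

print(mutual_information(base, (0, 1), 2))     # I(T; X1, X2)   = 1.0
print(mutual_information(base, (0,), 2))       # I(T; X1)       = 0.5
print(mutual_information(base, (1,), 2))       # I(T; X2)       = 0.5
print(mutual_information(garbled, (0, 1), 2))  # I(T; X~1, X~2) = 0.0
```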

Our proposed redundancy function \(I_{\texttt{ft}}\) will be defined using such garblings, which we think of as source-fallible instantiations of the underlying ground-truth system. First, we review the partial information decomposition for the general case.

3 Partial Information Decomposition

PID concerns the system of random variables composed of a target \(T\) and a collection of source or predictor variables \(\boldsymbol{X} = (X_i)_{i=1}^n\). In this work we consider all variables, including the target \(T\) and predictors \(X_i\), to be discrete, with finite alphabets \(\mathcal{A}_T, \mathcal{A}_{X_i} \subset \mathbb{N}\). Note that we exclude ‘0’ from these alphabets, as we reserve it for source failure later. This collection of variables represents the ‘ground truth’ of both sensor and target variables, assuming no failures. We refer to it as the base system, to distinguish it from the source-fallible systems introduced later.

Definition 1 (Base System). For a given \(n \in \mathbb{N}\), a predictor-target base system is the collection of \(n+1\) finite random variables \(\mathfrak{s} = (\boldsymbol{X}, T) = (X_1, ..., X_n, T)\), characterized by the joint pmf \(p_{\mathfrak{s}} = p_{\boldsymbol{X},T}\). For any index set \(I = \{i_1, ..., i_m\} \subseteq [n]\), we may denote the subsystem \(\mathfrak{s}_{I} = (\boldsymbol{X}_I, T) = (X_{i_1}, ..., X_{i_m}, T)\).

The PID framework was put forward as a method of decomposing the information in such systems — that is, decomposing the mutual information \(I(T;\boldsymbol{X})\) into constitutive parts attributable to each \(X_i\) and their combinations [2]. This attribution does not follow obviously from the pairwise informations \(I(T;X_i)\), since information is non-additive in general.

In the Introduction, we presented the bivariate PID in (2a-c), in which \(I(T;X_1, X_2)\) is decomposed into atoms \(R+U_1+U_2+S\). These atoms are typically given the following names and interpretations [11]:

  • Redundant or shared information (\(\boldsymbol{R}\)). The information regarding \(T\) common to both predictors, and in some sense available from either.

  • Unique information (\(\boldsymbol{U_i}\)). The information regarding \(T\) available from only one of the variables, independently of the other.

  • Synergistic or complementary information (\(\boldsymbol{S}\)). The information regarding \(T\) that is only available when both variables are known.

PID extends to arbitrary \(n \geq 2\), though the number of distinct information atoms scales super-exponentially with \(n\). Moreover, the interpretation of higher-order information atoms is not straightforward. As in the bivariate case above, in which there are 3 independent constraints for 4 unknowns, the PID equations for any \(n\) present us with an underdetermined system. Thus, a definition for one of the quantities must be provided. Typically, this is done by defining a redundancy, synergy, or unique information function, and allowing the other atoms to follow. Ours is a redundancy function \(I_{\texttt{ft}}\), and thus the recursive derivation of the full PID lattice follows exactly as it did for \(I_{\texttt{min}}\) in [2].

We will now review the construction of the PID for an arbitrary number of predictors \(n\). From here on, sources are subcollections of source variables, denoted here \(\boldsymbol{X}_I \subseteq \boldsymbol{X}\) for \(I \subseteq [n]\). PID can be summarized as a framework for measuring the redundancy and synergy among collections of information sources — specifically, their information with respect to the designated target. Note that we will sometimes treat both the full set \(\boldsymbol{X}\) and sources \(\boldsymbol{X}_{I} \subseteq \boldsymbol{X}\) as sets, and other times as tuples/vectors.

For any set \(A\), allow \(\mathcal{P}'(A) = \mathcal{P}(A) \setminus \{ \emptyset \}\) to denote the collection of all non-empty subsets, where \(\mathcal{P}\) is the standard powerset. The collection \(\mathcal{P}'(\boldsymbol{X})\) forms a partially ordered set (‘poset’) under inclusion, denoted \((\mathcal{P}'(\boldsymbol{X}), \subseteq)\). We let \(I(T; \boldsymbol{X}_I)\) denote source-target mutual information. For a fixed base system \(\mathfrak{s}\), \(I(T; \cdot)\) can be thought of as a functional on \(\mathcal{P}'(\boldsymbol{X})\) that is monotonically increasing with respect to the partial order ‘\(\subseteq\).’ PID assigns information value to collections of sources, interpretable as redundant, synergistic, and/or unique contributions from the underlying variables. In order to avoid trivial cases, only the antichains of \((\mathcal{P}'(\boldsymbol{X}), \subseteq)\) are considered. An antichain of a poset is a collection of its elements such that no two are directly comparable. Let \(\mathcal{A}(\boldsymbol{X})\) denote the antichains of \((\mathcal{P}'(\boldsymbol{X}), \subseteq)\), defined formally by \[\begin{align} \nonumber \mathcal{A}(\boldsymbol{X}) \triangleq \bigg\{ & \alpha \in \mathcal{P}'(\mathcal{P}'(\boldsymbol{X})) \; \bigg| \\ & \forall \boldsymbol{A},\boldsymbol{A}' \in \alpha, \; \boldsymbol{A} \neq \boldsymbol{A}' \Rightarrow \boldsymbol{A} \not\subseteq \boldsymbol{A}' \bigg\} \end{align}\] The idea here is that any \(\alpha \in \mathcal{A}(\boldsymbol{X})\) is a collection of sources within which no source trivially dominates another by inclusion. As a counter-example, in any system \(\mathfrak{s}\), \((X_1,X_2)\) will always provide no less information pointwise than \(X_1\) alone, and so any sensible redundancy function will identify the redundancy of the non-antichain \(\{ \{X_1, X_2\}, \{X_1\} \}\) with \(I(T;X_1)\). There are multiple partial orderings that may be assigned to the antichains themselves [16]. In PID, the following is used [2]: \[\begin{align} \label{eq:PID_lattice_order} \forall \alpha, \beta \in \mathcal{A}(\boldsymbol{X}), \quad \alpha \preceq \beta \Leftrightarrow \forall \boldsymbol{B} \in \beta, \exists \boldsymbol{A} \in \alpha, \boldsymbol{A} \subseteq \boldsymbol{B} \end{align}\tag{6}\] Although every antichain \(\alpha\) is a set of sets of indexed predictors, we will typically omit the outer brackets where possible, i.e. \(\{X_1\}\{X_2\}\) in place of \(\{\{X_1\},\{X_2 \}\}\). Moreover, we will let \(\mathop{\mathrm{ind}}(\alpha)\) denote the collection of index sets for \(\alpha\), i.e. \(\mathop{\mathrm{ind}}(\alpha) = \{ I_j \}_j\) when \(\alpha = \{\boldsymbol{X}_{I_j}\}_j\).
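For concreteness, the antichains and the order (6) are straightforward to enumerate by brute force for small \(n\). The following sketch (with helper names of our own choosing) does exactly this, representing each source by its index set:

```python
from itertools import combinations

def nonempty_subsets(n):
    """P'([n]): all non-empty subsets of {1, ..., n}, as frozensets of indices."""
    idx = range(1, n + 1)
    return [frozenset(c) for r in range(1, n + 1) for c in combinations(idx, r)]

def antichains(n):
    """All antichains of (P'(X), ⊆): non-empty collections of sources, no two comparable."""
    sources = nonempty_subsets(n)
    result = []
    for r in range(1, len(sources) + 1):
        for alpha in combinations(sources, r):
            if all(not (A < B or B < A) for A in alpha for B in alpha if A != B):
                result.append(set(alpha))
    return result

def precedes(alpha, beta):
    """alpha ⪯ beta in the sense of Eq. (6): every B in beta contains some A in alpha."""
    return all(any(A <= B for A in alpha) for B in beta)

print(len(antichains(2)))  # 4 lattice nodes for n = 2: {X1}{X2}, {X1}, {X2}, {X1,X2}
print(len(antichains(3)))  # 18 lattice nodes for n = 3
```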

A PID function \(\Pi\) assigns an information value to each \(\alpha\) in \(\mathcal{A}(\boldsymbol{X})\), while satisfying the PID equations: \[\forall \boldsymbol{X}_I \in \mathcal{P}'(\boldsymbol{X}), \quad I(T;\boldsymbol{X}_I) = \sum_{\alpha \preceq \{\boldsymbol{X}_I\}} \Pi(\alpha)\] In the bivariate case where \(\boldsymbol{X} = (X_1, X_2)\), these resolve to Eqs. (2a-c), which in our more general notation take the form: \[\begin{align} I(T;X_1) & = \Pi(\{ X_1\} \{X_2\}) + \Pi(\{ X_1\}) \\ I(T;X_2) &= \Pi(\{ X_1\} \{X_2\}) + \Pi(\{ X_2\}) \\ \nonumber I(T;X_1, X_2) & = \Pi(\{ X_1\} \{X_2\}) + \Pi(\{ X_1\}) \\ & \quad + \Pi(\{ X_2\}) + \Pi(\{ X_1, X_2\}) \end{align}\] We refer the reader to [2] for visualizations of the PID lattice for \(n \leq 3\).

The original [2] and most common approach to defining \(\Pi\) is by first defining a redundancy function \(I_{\cap}\) on \(\mathcal{A}(\boldsymbol{X})\), which measures the redundant information among all the sources in \(\alpha\). The PID function \(\Pi\) is then derived as the Möbius inverse of \(I_{\cap}\) over the lattice: \[\Pi(\alpha) = I_{\cap}(\alpha) - \sum_{\beta \prec \alpha} \Pi(\beta)\] where \(\beta \prec \alpha\) means \(\beta \preceq \alpha\) and \(\beta \neq \alpha\).
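Given any redundancy function on the antichains, the atoms thus follow by a simple recursion. A minimal sketch (reusing `antichains` and `precedes` from the snippet above, with the \(I_{\texttt{ft}}\) values argued for in Sec. 2 plugged in by hand as an assumed example):

```python
def pid_atoms(I_cap, lattice):
    """Pi(alpha) = I_cap(alpha) - sum of Pi(beta) over all beta strictly below alpha."""
    Pi = {}
    def atom(alpha):
        key = frozenset(alpha)
        if key not in Pi:
            below = [b for b in lattice if frozenset(b) != key and precedes(b, alpha)]
            Pi[key] = I_cap(alpha) - sum(atom(b) for b in below)
        return Pi[key]
    for alpha in lattice:
        atom(alpha)
    return Pi

# Bivariate example from Sec. 2, with the I_cap values asserted there:
I_cap_values = {frozenset({frozenset({1}), frozenset({2})}): 0.0,  # {X1}{X2}: R
                frozenset({frozenset({1})}): 0.5,                  # {X1}: I(T;X1)
                frozenset({frozenset({2})}): 0.5,                  # {X2}: I(T;X2)
                frozenset({frozenset({1, 2})}): 1.0}               # {X1,X2}: I(T;X1,X2)
atoms = pid_atoms(lambda a: I_cap_values[frozenset(a)], antichains(2))
# Yields R = 0, U1 = U2 = 0.5, S = 0, as argued in Sec. 2.
```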

It has often been considered desirable for a proposed PID to satisfy a set of properties first mentioned in [2], which have sometimes been referred to as the Williams-Beer (WB) axioms [10]:

  1. Symmetry. \(I_{\cap}(\boldsymbol{X}_{I_1}, ..., \boldsymbol{X}_{I_k}) = I_{\cap}(\boldsymbol{X}_{I_{\sigma(1)}}, ..., \boldsymbol{X}_{I_{\sigma(k)}})\) for any permutation \(\sigma\).

  2. Self-Redundancy. \(I_{\cap}(\boldsymbol{X}_I) = I(T;\boldsymbol{X}_I)\)

  3. Monotonicity. \(\alpha \preceq \beta \Rightarrow I_{\cap}(\alpha) \leq I_{\cap}(\beta)\)

The symmetry axiom is usually trivial, and we ignore it in this work. The monotonicity axiom is typically stated in the weaker, inclusion-based form \(\alpha \subseteq \beta \Rightarrow I_{\cap}(\beta) \leq I_{\cap}(\alpha)\), i.e. adding sources to an antichain cannot increase its redundancy; this follows from the lattice form above, since \(\alpha \subseteq \beta\) implies \(\beta \preceq \alpha\) under (6). The stronger, lattice-based form was demonstrated for the original PID in [2], and follows naturally from our main theorem in the next section.

4 A Measure of Redundant Information in Source-Fallible Systems

In Sec. 2, we gave an example of the kind of redundancy we aim to quantify. In the fault tolerance literature, redundancy is quantified as the number of nodes \(n\) needed in order to maintain system functionality if we allow for up to \(\ell\) failures [3]. Here, we ask a slightly different question. Suppose we have \(n\) sources of information \(X_1, ..., X_n\), and we want to know whether we may tolerate \(\ell\) source failures while still having enough information regarding \(T\). That determination may be made by considering \(I(T;\tilde{\boldsymbol{X}})\), for some \(\tilde{\boldsymbol{X}} = (\tilde{X}_i)_i\) analogous to that in Table 1B, earlier. We begin by formally defining such source-fallible instantiations for a system of interest.

Definition 2 (Source-Fallible System). Let a base system \(\mathfrak{s}\) be given. A source-fallible instantiation of this system will be given by the binary sensor failure variables \(\boldsymbol{F} = (F_1, ..., F_n)\), \(\mathcal{A}_{\boldsymbol{F}}=\{0,1\}^n\), which are fully characterized by the conditional distribution \(p_{\boldsymbol{F} | \boldsymbol{X}, T}\). The event \(F_i = 0\) is interpreted as a ‘failure’ of the sensor for \(X_i\). From a given \(\boldsymbol{F}\), we define the induced source-fallible system (SFS) as the tuple of random variables \(\mathfrak{f}= ( \tilde{\boldsymbol{X}}, T) = ( \tilde{\boldsymbol{X}}(\boldsymbol{F},\boldsymbol{X}), T)\), where \(T\) is the same target variable as in \(\mathfrak{s}\) and \(\tilde{\boldsymbol{X}}\) is given by \[\begin{align} \label{eq:SFS} \tilde{X}_i = \begin{cases} 0, & F_i = 0\\ X_i, & F_i = 1\\ \end{cases} \end{align}\qquad{(7)}\] The collection of all SFS’s associated to base system \(\mathfrak{s}\) will be denoted \(\mathfrak{F} = \mathfrak{F}(\mathfrak{s})\). For any subsystem \(\mathfrak{s}_I \subset \mathfrak{s}\), we may use the short-hand \(\mathfrak{F}_I = \mathfrak{F}(\mathfrak{s}_I)\) for the set of source-fallible systems \((\tilde{\boldsymbol{X}}_I, T)\), with each \(\tilde{X}_i\), \(i \in I\), satisfying (7).

The idea here is that our notion of redundancy deals with censored copies of our predictors, where for fixed \(\mathfrak{s}\) there is a one-to-one correspondence between the failure distribution \(\boldsymbol{F}\) and the censored source vector \(\boldsymbol{\tilde{X}}\). Clearly, the least informative \(\tilde{\boldsymbol{X}}\) in \(\mathfrak{F}\) is given by \(\boldsymbol{F} \equiv \boldsymbol{0}\), i.e. when all sources fail all the time; then \(\tilde{\boldsymbol{X}} \equiv \boldsymbol{0}\). The SFS’s we are interested in, though, are those that have limited failures while still providing one of the sources of information, as in Table 1B. We now have the notation to make this condition precise in the general case. We will allow for a fairly unconstrained distribution of source failures \(\boldsymbol{F}\), i.e. we need not assume \(\boldsymbol{F}\) has any particular independence with respect to \(T\) and/or \(\boldsymbol{X}\).

Definition 3. Let a system \(\mathfrak{s}\) be given. We say that an SFS \(\mathfrak{f}(\boldsymbol{F})\) redundantly satisfies a source antichain \(\alpha \in \mathcal{A}(\boldsymbol{X})\) if \[\label{eq:defn.redSat} P \left( \bigvee_{I \in \mathop{\mathrm{ind}}(\alpha)} \boldsymbol{F}_I = \boldsymbol{1}_{|I|} \right) = 1\qquad{(8)}\] where \(\boldsymbol{1}_{|I|}\) is the length \(|I|\) vector of ones. The collection of all \(\mathfrak{f}\in \mathfrak{F}(\mathfrak{s})\) redundantly satisfying \(\alpha\) is denoted \(\mathfrak{R}(\alpha)\).
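As an illustration of Defs. 2 and 3 (a simplified sketch: here the failure pattern is a deterministic function of the outcome, whereas Def. 2 allows arbitrary stochastic \(p_{\boldsymbol{F} | \boldsymbol{X}, T}\); all helper names are ours):

```python
from collections import defaultdict

def induce_sfs(base, failure):
    """Induced SFS joint over (X~, T): X~_i = X_i if F_i = 1 and 0 otherwise (Eq. 7).
    `base` maps (x, t) to probability; `failure` maps (x, t) to a 0/1 failure vector."""
    sfs = defaultdict(float)
    for (x, t), p in base.items():
        f = failure(x, t)
        sfs[(tuple(xi if fi else 0 for xi, fi in zip(x, f)), t)] += p
    return dict(sfs)

def redundantly_satisfies(base, failure, alpha):
    """Check Eq. (8): in every outcome, some source in alpha has all of its sensors working."""
    return all(any(all(failure(x, t)[i - 1] == 1 for i in I) for I in alpha)
               for (x, t) in base)

# Motivating example (Table 1A), with the worst-case pattern that always fails the informative camera.
base = {((1, 2), 2): 0.25, ((1, 3), 3): 0.25, ((2, 1), 2): 0.25, ((3, 1), 3): 0.25}
worst = lambda x, t: tuple(1 if xi == 1 else 0 for xi in x)
print(redundantly_satisfies(base, worst, [{1}, {2}]))  # True: the antichain {X1}{X2} is satisfied
print(redundantly_satisfies(base, worst, [{1, 2}]))    # False: the single source {X1,X2} is not
print(induce_sfs(base, worst))                         # recovers the garbled system of Table 1B
```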

Intuitively, (8) means that, in every realization, at least one source \(\boldsymbol{X}_I \in \alpha\) is fully available from observing \(\tilde{\boldsymbol{X}}\). We posit that the information redundant among the sources in an antichain should be available from any SFS that redundantly satisfies that antichain. From this postulate, the definition of redundant information follows naturally: we merely minimize over all SFS’s redundantly satisfying the given antichain.

Definition 4. Let a base system \(\mathfrak{s}\) be given. For any source antichain \(\alpha \in \mathcal{A}(\boldsymbol{X})\), the \(I_{\texttt{ft}}\) redundancy function is defined by \[\begin{align} \label{eq:defn.Ift} I_{\texttt{ft}}(\alpha) = \min_{(\tilde{\boldsymbol{X}},T) \in \mathfrak{R}(\alpha)} I(T; \tilde{\boldsymbol{X}}) \end{align}\qquad{(9)}\]

There is one aspect of this definition that may seem a little strange. For any antichain, we are minimizing over SFS’s constructed from the full system \((X_1, ..., X_n, T)\). For the redundancy of \(\alpha = \{ X_1\} \{X_2\}\), surely we should only be concerned with \((X_1, X_2, T)\). In fact, the minimum in (9) for such a case can be achieved by an SFS where \(F_i \equiv 0\) for \(i>2\), which is indistinguishable from considering only the system \((X_1,X_2, T)\) from the start. Although we retain Def. 4 as our preferred formulation, in the following two propositions we make precise the equivalence of our definition to this reduced form.

Proposition 1. For any base system \(\mathfrak{s}\) and subsystem \(\mathfrak{s}_I\), \(I \subseteq [n]\), there is an injective embedding \(g_I: \mathfrak{F}_I \hookrightarrow \mathfrak{F}\), given by \[\begin{gather} \nonumber g_I: \mathfrak{f}_I = (\tilde{\boldsymbol{X}}_I,T) \mapsto g_I(\mathfrak{f}_{I}) = (\tilde{\boldsymbol{Y}},T) \\ \label{eq:embedding_map} \tilde{Y}_i = \begin{cases} \tilde{X}_i, & i \in I \\ 0, & i \in [n] \setminus I \end{cases} \end{gather}\qquad{(10)}\] This embedding is information preserving, in the sense that \[\label{eq:embedding.info_equiv} I(T;\tilde{\boldsymbol{X}}_I) = I(T;\tilde{\boldsymbol{Y}})\qquad{(11)}\]

Proof. The injectivity of \(g_I\) is straightforward from its definition. For (11), we observe that \(I(T;\tilde{\boldsymbol{X}}_I) = I(T;\tilde{\boldsymbol{Y}}_I)\) and \(I(T;\tilde{\boldsymbol{Y}}_{\bar{I}} | \tilde{\boldsymbol{Y}}_I) \leq H(\tilde{\boldsymbol{Y}}_{\bar{I}} |\tilde{\boldsymbol{Y}}_I) = 0\) (as \(\tilde{\boldsymbol{Y}}_{\bar{I}}\) is constant). Thus, (11) follows from the chain rule \(I(T;\tilde{\boldsymbol{Y}}) = I(T;\tilde{\boldsymbol{Y}}_I) + I(T;\tilde{\boldsymbol{Y}}_{\bar{I}} | \tilde{\boldsymbol{Y}}_I)\). ◻

We may thus identify \(\mathfrak{F}_I\) with its image \(g_I(\mathfrak{F}_I)\) as a subset of \(\mathfrak{F}\). The following proposition shows that we may compute \(I_{\texttt{ft}}(\alpha)\) within the reduced space \(\mathfrak{F}_I\) for the smallest \(\boldsymbol{X}_I\) capturing all sources in \(\alpha\).

Proposition 2. For any \(\alpha \in \mathcal{A}(\boldsymbol{X})\), we let \(\mathfrak{s}_{\alpha} = \mathfrak{s}_{\cup \mathop{\mathrm{ind}}(\alpha)}\) be the subsystem formed by the union of the sources in \(\alpha\). The source-fallible realizations of this subsystem are similarly denoted \(\mathfrak{F}_{\alpha}\). Using the embedding \(\mathfrak{F}_{\alpha} \hookrightarrow \mathfrak{F}\) from Prop. 1, we denote the set of SFS’s in \(\mathfrak{F}_{\alpha}\) that redundantly satisfy \(\alpha\) by: \[\mathfrak{R}_{\alpha}(\alpha) = \mathfrak{R}(\alpha) \cap \mathfrak{F}_{\alpha}.\] Then the \(I_{\texttt{ft}}\) function from Def. 4 is equivalently defined by: \[\begin{align} \label{eq:defn.Ift.subsystem} I_{\texttt{ft}}(\alpha) = \min_{(\tilde{\boldsymbol{X}},T) \in \mathfrak{R}_{\alpha}(\alpha)} I(T; \tilde{\boldsymbol{X}}). \end{align}\qquad{(12)}\]

Proof. Let \((\tilde{\boldsymbol{X}},T) \in \mathfrak{R}(\alpha)\) be given. Set \(I^{\cup} = \cup \mathop{\mathrm{ind}}(\alpha)\). Define \((\tilde{\boldsymbol{Y}},T)\) by \(\tilde{Y}_i \equiv \tilde{X}_i\) for \(i \in I^{\cup}\), and \(\tilde{Y}_i \equiv 0\) otherwise. Clearly, \((\tilde{\boldsymbol{Y}},T) \in \mathfrak{R}_{\alpha}(\alpha)\), satisfying (8) for \(\alpha\). Moreover, \(I(T;\tilde{\boldsymbol{X}}) \geq I(T;\tilde{\boldsymbol{X}}_{I^{\cup}}) = I(T;\tilde{\boldsymbol{Y}})\). Thus, for any \((\tilde{\boldsymbol{X}},T) \in \mathfrak{R}(\alpha)\) we can find a \((\tilde{\boldsymbol{Y}},T) \in \mathfrak{R}_{\alpha}(\alpha)\) with no greater information, and since \(\mathfrak{R}_{\alpha}(\alpha) \subseteq \mathfrak{R}(\alpha)\), the minima in (9) and (12) coincide. ◻
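Taken together, Props. 1 and 2 let one evaluate (12) directly on the subsystem \(\mathfrak{s}_{\alpha}\). For tiny systems, a brute-force search over deterministic failure patterns then gives a quick upper bound on (9); it is only an upper bound in general, since Def. 2 also allows stochastic failures, which this sketch (with assumed helper names) does not search. In the motivating example it is nonetheless tight, since 0 is also a trivial lower bound:

```python
from collections import defaultdict
from itertools import product
from math import log2

def mi(joint):
    """I(T; X~) in bits, for a joint stored as {(x_tilde, t): p}."""
    px, pt = defaultdict(float), defaultdict(float)
    for (x, t), p in joint.items():
        px[x] += p; pt[t] += p
    return sum(p * log2(p / (px[x] * pt[t])) for (x, t), p in joint.items() if p > 0)

def ift_upper_bound(base, alpha, n):
    """Minimize I(T; X~) over deterministic failure patterns redundantly satisfying alpha."""
    outcomes = list(base)
    # Failure vectors permitted at any single outcome: those leaving some source in alpha intact.
    allowed = [f for f in product((0, 1), repeat=n)
               if any(all(f[i - 1] for i in I) for I in alpha)]
    best = float("inf")
    for assignment in product(allowed, repeat=len(outcomes)):
        sfs = defaultdict(float)
        for (x, t), f in zip(outcomes, assignment):
            sfs[(tuple(xi if fi else 0 for xi, fi in zip(x, f)), t)] += base[(x, t)]
        best = min(best, mi(sfs))
    return best

base = {((1, 2), 2): 0.25, ((1, 3), 3): 0.25, ((2, 1), 2): 0.25, ((3, 1), 3): 0.25}
print(ift_upper_bound(base, [{1}, {2}], n=2))  # 0.0: R = 0, as argued in Sec. 2
print(ift_upper_bound(base, [{1, 2}], n=2))    # 1.0 = I(T; X1, X2), i.e. self-redundancy
```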

From the preceding two propositions, the demonstration of the self-redundancy PID axiom for \(I_{\texttt{ft}}\) reduces to one line.

Proposition 3 (Self-Redundancy Axiom). The \(I_{\texttt{ft}}\) function satisfies the self-redundancy axiom from Sec. 3 — that is, it is an extension of mutual information in the sense that \[\label{eq:prop.SR} I_{\texttt{ft}}(\{ \boldsymbol{X}_I \}) = I(T;\boldsymbol{X}_I)\qquad{(13)}\] for all \(\boldsymbol{X}_I \in \mathcal{P}'(\boldsymbol{X})\).

Proof. For \(\alpha = \{ \boldsymbol{X}_I \}\) we have \(I = \cup \mathop{\mathrm{ind}}(\alpha)\), so \(\mathfrak{R}_{\alpha}(\alpha)\) is a singleton whose only element is \((\boldsymbol{X}_I,T)\), and (13) follows from (12). ◻

Our main result in this section elucidates the structure of the \(I_{\texttt{ft}}\) redundancy function, including its satisfaction of the monotonicity PID axiom, by establishing a close correspondence between source antichains \(\alpha \in \mathcal{A}(\boldsymbol{X})\), on the one hand, and these collections of source-fallible systems \(\mathfrak{R}(\alpha) \subset \mathfrak{F}(\mathfrak{s})\) that we have been examining. Let \(\mathcal{F}(\mathfrak{s}) = \{ \mathfrak{R}(\alpha) \}_{\alpha \in \mathcal{A}(\boldsymbol{X})}\). Insofar as each family \(\mathfrak{R}(\alpha)\) is a subset of \(\mathfrak{F}\), it follows that we can consider \(\mathcal{F}(\mathfrak{s})\) as a finite poset under inclusion, i.e. \((\mathcal{F}(\mathfrak{s}), \subseteq)\). As we will now demonstrate, this poset is the mirror image of the PID lattice \((\mathcal{A}(\boldsymbol{X}), \preceq)\).

Theorem 1. For any base system \(\mathfrak{s}= (\boldsymbol{X}, T)\), the lattices \((\mathcal{A}(\boldsymbol{X}), \preceq)\) and \((\mathcal{F}(\mathfrak{s}), \subseteq)\) are anti-isomorphic. Namely, the map \(\alpha \mapsto \mathfrak{R}(\alpha)\) is an order-reversing bijection, in the sense that \[\label{eq:poset_antiisomorphism} \alpha \preceq \beta \Leftrightarrow \mathfrak{R}(\beta) \subseteq \mathfrak{R}(\alpha)\qquad{(14)}\]

Proof. Suppose \(\alpha \preceq \beta\). Let \((\tilde{\boldsymbol{X}},T) \in \mathfrak{R}(\beta)\) be given, induced from \(\boldsymbol{F}\). We will show that \((\tilde{\boldsymbol{X}},T) \in \mathfrak{R}(\alpha)\). Let \(\boldsymbol{f}\) be any realization of \(\boldsymbol{F}\) with positive probability. By Def. 3, there is some index set \(I \in \mathop{\mathrm{ind}}(\beta)\), corresponding to a source \(\boldsymbol{X}_I \in \beta\), with \(\boldsymbol{f}_I = \boldsymbol{1}\). In other words, for the event \(\boldsymbol{F}=\boldsymbol{f}\), the source \(\boldsymbol{X}_I\) is available. By (6), we know that there is some \(\boldsymbol{X}_J \in \alpha\) such that \(\boldsymbol{X}_J \subseteq \boldsymbol{X}_I\), i.e. \(J \subseteq I\) for some \(J \in \mathop{\mathrm{ind}}(\alpha)\). Therefore, \(\boldsymbol{f}_J = \boldsymbol{1}\) and \(\boldsymbol{X}_J\) is available. Since our choice of \(\boldsymbol{f}\) was arbitrary, it follows that (8) holds for \(\alpha\) as well as \(\beta\), and so \((\tilde{\boldsymbol{X}},T)\) redundantly satisfies \(\alpha\).

Now, suppose instead \(\alpha \not \preceq \beta\). Then there is some \(\boldsymbol{X}_I \in \beta\) such that for all \(\boldsymbol{X}_J \in \alpha\), \(\boldsymbol{X}_J \not\subseteq \boldsymbol{X}_I\), i.e. \(J \not\subseteq I\). This means that for each \(J \in \mathop{\mathrm{ind}}(\alpha)\), there is some \(j^\star \in J \cap \bar{I}\). Let \(\boldsymbol{F}\) be the constant binary vector with \(\boldsymbol{F}_I = \boldsymbol{1}\) and \(\boldsymbol{F}_{\bar{I}} = \boldsymbol{0}\). This induces the SFS \((\tilde{\boldsymbol{X}}, T)\), which redundantly satisfies \(\beta\), since \(P(\boldsymbol{F}_I = \boldsymbol{1}) = 1\). However, \(P(\boldsymbol{F}_{J} = \boldsymbol{1}) = 0\) for every \(J \in \mathop{\mathrm{ind}}(\alpha)\), since \(F_{j^\star} \equiv 0\). Thus, \((\tilde{\boldsymbol{X}}, T) \in \mathfrak{R}(\beta) \setminus \mathfrak{R}(\alpha)\). ◻

An immediate consequence of this result is the monotonicity PID axiom. As one adds sources to an antichain, the set of fallible instantiations of the system grows: there are more ways for things to go wrong. Conversely, by removing sources or concatenating them, we reduce that range of possibilities.

Corollary 1 (Monotonicity Axiom). The \(I_{\texttt{ft}}\) function is (monotone) increasing on the PID lattice \((\mathcal{A}(\boldsymbol{X}), \preceq)\), i.e. \[\label{eq:cor.monotonicity} \alpha \preceq \beta \Rightarrow I_{\texttt{ft}}(\alpha) \leq I_{\texttt{ft}}(\beta)\qquad{(15)}\]

Proof. By (14), we have \[\begin{align} \nonumber \alpha \preceq \beta & \Leftrightarrow \quad \mathfrak{R}(\alpha) \supseteq \mathfrak{R}(\beta) \\ & \Rightarrow \quad \min_{\mathfrak{f}\in \mathfrak{R}(\alpha)} f(\mathfrak{f}) \leq \min_{\mathfrak{f}\in \mathfrak{R}(\beta)} f(\mathfrak{f}) \end{align}\] for any function \(f\). Taking \(f(\mathfrak{f}) = I(T;\tilde{\boldsymbol{X}})\), (15) then follows from the definition (9). ◻

To conclude, let us revisit the scenario we introduced at the start of this section. Given \(n\) sources of information (predictor variables) for a system regarding its target \(T\), we asked how that system might tolerate up to \(\ell\) failures at a time. Let us further add that these failures may be arbitrarily (even adversarially) distributed. Let \(\mathcal{P}_{n - \ell}'(\boldsymbol{X})\) be the set of all sources \(\boldsymbol{X}_I\) with \(|I| = n - \ell\). There will be \({n \choose \ell}\) of these sources. Then, letting \(\alpha_{\ell} = \mathcal{P}_{n - \ell}'(\boldsymbol{X})\), we can answer that the expected amount of information we can guarantee regarding \(T\) is given by \(I_{\texttt{ft}}(\alpha_{\ell})\), as defined in this section. Moreover, as a special case of the monotonicity axiom demonstrated above, we have that \[\begin{align} \nonumber 0 &= I_{\texttt{ft}}(\alpha_n) \leq I_{\texttt{ft}}(\alpha_{n-1}) \leq ... \\ &\leq I_{\texttt{ft}}(\alpha_{1}) \leq I_{\texttt{ft}}(\alpha_{0}) = I(T; \boldsymbol{X}) \end{align}\] where we extended \(I_{\texttt{ft}}\) naturally to take \(\alpha_n = \emptyset\) as an argument.
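Connecting this back to the computational sketches above, the antichain \(\alpha_{\ell}\) is simply the collection of all \((n-\ell)\)-element index sets, e.g.:

```python
from itertools import combinations

def alpha_ell(n, ell):
    """The antichain of all sources of size n - ell: guards against ell simultaneous failures."""
    return [set(c) for c in combinations(range(1, n + 1), n - ell)]

print(alpha_ell(3, 1))  # [{1, 2}, {1, 3}, {2, 3}]
# e.g. ift_upper_bound(base, alpha_ell(2, 1), n=2) recovers R = 0 for the example of Sec. 2.
```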

References

[1]
W. McGill, “Multivariate information transmission,” Transactions of the IRE Professional Group on Information Theory, vol. 4, no. 4, pp. 93–111, 1954.
[2]
P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv preprint arXiv:1004.2515, 2010.
[3]
A. Rullo, E. Serra, and J. Lobo, “Redundancy as a measure of fault-tolerance for the internet of things: A review,” Policy-Based Autonomic Data Governance, pp. 202–226, 2019.
[4]
K. Marzullo, “Tolerating failures of continuous-valued sensors,” ACM Transactions on Computer Systems (TOCS), vol. 8, no. 4, pp. 284–304, 1990.
[5]
S. Watanabe, “Information theoretical analysis of multivariate correlation,” IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.
[6]
E. Schneidman, W. Bialek, and M. J. Berry II, “Synergy, redundancy, and independence in population codes,” Journal of Neuroscience, vol. 23, no. 37, pp. 11539–11553, 2003.
[7]
N. Timme, W. Alford, B. Flecker, and J. M. Beggs, “Synergy, redundancy, and multivariate information measures: an experimentalist’s perspective,” Journal of Computational Neuroscience, vol. 36, pp. 119–140, 2014.
[8]
G. Chechik, A. Globerson, M. Anderson, E. Young, I. Nelken, and N. Tishby, “Group redundancy measures reveal redundancy reduction in the auditory pathway,” Advances in Neural Information Processing Systems, vol. 14, 2001.
[9]
N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, “Quantifying unique information,” Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
[10]
C. Finn and J. T. Lizier, “Pointwise partial information decomposition using the specificity and ambiguity lattices,” Entropy, vol. 20, no. 4, 2018.
[11]
J. T. Lizier, N. Bertschinger, J. Jost, and M. Wibral, “Information decomposition of target effects from multi-source interactions: Perspectives on previous, current and future work,” 2018.
[12]
D. A. Ehrlich, A. C. Schneider, V. Priesemann, M. Wibral, and A. Makkeh, “A measure of the complexity of neural representations based on partial information decomposition,” Transactions on Machine Learning Research, vol. 5, 2023.
[13]
P. P. Liang, Y. Cheng, X. Fan, C. K. Ling, S. Nie, R. J. Chen, Z. Deng, N. Allen, R. Auerbach, F. Mahmood, et al., “Quantifying & modeling multimodal interactions: An information decomposition framework,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[14]
R. A. Ince, “Measuring multivariate redundant information with pointwise common change in surprisal,” Entropy, vol. 19, no. 7, p. 318, 2017.
[15]
A. Makkeh, A. J. Gutknecht, and M. Wibral, “Introducing a differentiable measure of pointwise shared information,” Physical Review E, vol. 103, no. 3, p. 032149, 2021.
[16]
J. Crampton and G. Loizou, “Two partial orders on the set of antichains,” Research note, September 2000.