Information Plane Analysis Visualization in Deep Learning via Transfer Entropy

April 01, 2024

In a feedforward network, Transfer Entropy (TE) can be used to measure the influence that one layer has on another by quantifying the information transfer between them during training. According to the Information Bottleneck principle, a neural model’s internal representation should compress the input data as much as possible while still retaining sufficient information about the output. Information Plane analysis is a visualization technique used to understand the trade-off between compression and information preservation in the context of the Information Bottleneck method by plotting the amount of information in the input data against the compressed representation. The claim that there is a causal link between information-theoretic compression and generalization, measured by mutual information, is plausible, but results from different studies are conflicting. In contrast to mutual information, TE can capture temporal relationships between variables. To explore such links, in our novel approach we use TE to quantify information transfer between neural layers and perform Information Plane analysis. We obtained encouraging experimental results, opening the possibility for further investigations.

transfer entropy; information bottleneck; information plane analysis; deep learning visualization

Transfer Entropy (TE) is a statistical measure that is commonly used to quantify the degree of coherence between events, usually those represented as time series. This measure was introduced by Schreiber [1] and has been linked by some authors [2], [3] to Granger’s causality. Using the term "causality" alone is a misnomer. To avoid further confusion, Granger himself used in 1977 the term "temporally related" [4]. Causality is concerned with whether interventions on a source can have an impact on the target, while information transfer relates to whether observations of the source can aid in predicting state transitions on the target [5]. While TE may indicate temporal relationships between two variables, it is not a definitive test for causality, and care should be taken when interpreting the results of TE analysis in this context.

In the case of a multilayered neural network, we can use TE to measure the volume of information that is transferred between the layers as they are successively computed during the network’s training process. The temporal relationship between the layers is straightforward: the output of one layer can be seen as the "cause" of the subsequent layer’s input, or "effect". This directional information transfer can be measured using TE, which provides a way to analyze the interactions between the different layers of the network.

The Information Bottleneck (IB) principle [6] is another information-theoretical tool that has been used recently to analyze neural models. This principle is founded on the notion that an effective internal representation created by a neural network should maximize compression of the input data and yet preserve correct information about the output.

Shwartz-Ziv and Tishby [7] drew a visualization of the IB using the Information Plane (IP), where IB consists of \(I(X ; T)\) and \(I(Y ; T)\) as coordinates. \(X\) stands for the input, \(T\) for the compression, \(Y\) represents the output, and \(I(\, \cdot \,;\, \cdot \,)\) is the Mutual Information (MI). They analyzed the IP of a deep neural network over time and found that the network undergoes a
transition of phases during training, from fitting to compression, where the flow of information becomes more focused and efficient. The two successive distinct phases, *fitting* and *compression* are characterized by \(I(X ; T)\), respectively \(I(Y ; T)\).

Model discovery, development, verification, and interpretation has been recently enriched with new visual methods having numerous advantages over other alternatives [8]. The human component can be a significant part of this activity since visuals can naturally support explanations of neural models. Up until now, most visualization techniques used to gain insight into deep neural models have focused on visualizing features or generating heatmaps. IP visualization uses a new approach and integrates well into the emerging field know as Visual Knowledge Discovery, combining advances in artificial intelligence, machine learning, and visual analytics [9].

IP analysis can be used to study deep neural networks to understand the dynamics of information flow and the relationship between input, hidden, and output layers [10], [11]. It has promising potential for deep learning visualization because it can help identify how a model is learning and where it may be struggling. Scatter plots are typically used to visualize an IP. The shape and distribution of the points in the IP can provide insights into the behavior of the model.

Several authors have performed IP analyses, applied to different NN architectures. For instance, Cheng *et al.* [10] used IP analysis to
evaluate and select better deep neural networks, with respect to accuracy and training time, for image classification. The outcomes of their study indicate that the IP is a more informative metric than the loss curve, as models exhibiting similar loss
curves, which decrease over training epochs, exhibit differing behavior in the IP.

It has been hypothesized that the evolution of the IP trajectories respects the following dynamics: the generalization performance^{1} of deep networks is determined in the
compression phase. Stochastic gradient descent diffusion-like behavior [11], [12] controls the compression phase. This behavior is known to be effective for improved generalization when training neural networks as also shown in Zhu [13]. Saxe *et al.* [11] showed that these claims are not generally
true, and reflect instead assumptions made to compute a finite MI metric in deterministic networks. According to [11], [14], there is no clear causal relationship between compression and generalization in neural networks. Networks that do not compress can still achieve good generalization, and conversely, networks that do compress are not
necessarily better at generalization.

Although the IB principle and the idea of a causal link between information-theoretic compression and generalization appear convincing, the results of many experiments conflict with this view. Geiger’s recent study [14] suggested that compression in the IP does not necessarily indicate the learning of a minimum sufficient statistic, and that this concept may not be essential for good generalization performance. Therefore, the suitability of MI as a measure of a causal relationship is in question, which prompted our current research.

MI and TE capture different aspects of the relationship between two variables. MI measures the amount of information shared between two variables, whereas TE measures the direction and strength of information flow between two variables. TE can capture temporal relationships between variables, whereas MI cannot. Since a temporal connection between information-theoretic compression and generalization is plausible, using the TE instead of the MI is a promising approach.

The main contribution of our work is to use the TE measure to quantify information transfer between neural layers in IP analyses of neural networks. Our goal is to gain a deeper understanding of the learning dynamics and investigate if there is a connection between compression in the IP and generalization performance. To the extent of our knowledge this is the first time that the TE measure has been applied to demonstrate the IB principle.

The rest of the paper is structured as follows. Section 2 summarizes essential results related to the application of TE and IB in neural networks. Section 3 introduces our novel model, whereas Section 4 presents experimental results and their interpretations. Section 5 contains the final remarks.

In this section, we provide a summary of important findings related to the use of TE and IB in artificial neural networks.

TE is an asymmetric measure of data connectivity and interchange between time series \(I\) and \(J\), and it is calculated with Schreiber’s formula [1]: \[\label{eq:TEcond} TE_{J\rightarrow I}=\sum_{t=1}^{n-1}{p(i_{t+1},i_{t}^{(k)},j_{t}^{(l)}) \: log \frac{p(i_{t+1}|i_{t}^{(k)},j_{t}^{(l)})}{p(i_{t+1}|i_{t}^{(k)})}}\tag{1}\]

Recently, we can observe a growing interest in applying TE in quantifying the effective connectivity between artificial neurons [15]–[19]. Some results report applications of TE in training neural networks [20]–[23].

For instance, in [21], TE is applied to structure the feedback between layers (adjacent or distant) of Convolutional Neural Networks (CNNs) with reduced number of layers. The TE was used to adjust the feedback and was calculated by averaging the values collected from connected neurons.

Herzog *et al.* [22] aimed to provide clear guidelines for computing TE-based neural feedback connectivity in feedforward neural networks,
with the goal of improving classification performance. Because of the large number of possible connections, structuring feedback connections in a deep neural model can be very challenging. The authors reduced feedback connections on an AlexNet network down
to 3.5% of all connections, then combined a genetic algorithm to moderately increment the strengths of the retained connections, in similar ways to how the brain amplifies existing feedforward paths. The algorithmic steps involved training the network with
standard backpropagation, fine-tuning the network with feedback TE connections, and generating the best performing network using a genetic algorithm. This method improved classification performance.

In previous work [24], we incorporated the computed TE into the backpropagation algorithm of simple feedforward networks as a scaling factor for gradients. As a result, we reduced the training time and improved the stability of the model. Afterward, we extended the application of TE to include CNNs for the purpose of accelerating the training phase and achieving the desired level of accuracy in fewer epochs [25]. TE served as a smoothing factor that fosters stability by only becoming active periodically, rather than after processing each input sample.

IB is a subset of the information distortion metrics [26]–[28].
Information distortion is a qualitative metric of the quantization success or of the information loss. Various experiments and theoretical initiatives attempted to investigate the compressed content *T* as a measurable proof of the information that
is being encoded during the training processes. Others have simply looked at the distributions of \(X\), \(Y\) and \(T\), either individually, in various
pairs or as Markov chains. Pattern recognition problems are predicting the \(Y\) outputs by looking at a training set \(X\) - the inputs. IB concentrates on the "bottleneck" variable \(T\) that aims to maximize the prediction of \(Y\) by \(I(Y ; T)\) while learning \(X\).

The first significant work that scrutinized IB in the machine learning field was done by Tishby *et al.* [29], for discrete values of \(X\), \(Y\) and \(T\). It was followed by the application of IB objective functions to neural networks [30]. In essence, minimizing the following expression aims on maintaining a qualitative compression while improving the accuracy of the decoded variable \(Y\):
\[\label{eq:ib} min_{P_{T|X}}(I(X;T)-\beta I(Y;T))\tag{2}\] [29], where
\(\beta\) is a Lagrange multiplier that controls the importance of the compressed information captured by \(T\).

Goldfeld *et al.* [31] and Muşat *et al.* [32] are a few of the recent works that attempted to understand deep learning models using the IB principle. Overviews of IB theory and IP analyses and applications in neural networks were developed by Goldfeld *et
al.* [33] and Geiger [14].

The work of Tishby *et al.* [30] was extended by Shwartz-Ziv *et al.* [7] with new findings of the IP on deep networks. They showed that the SGD optimization consists of a first shorter fitting phase and a longer compression phase.

Saxe *et al.* [11] reduced the boundaries stated by Tishby *et al.* in [30] to a smaller subset of neural networks, simply by providing eloquent counterexamples. They presented scenarios where generalization is correlated with the activation function and not to IB (i.e.,
*tanh* activation function provides better compression rates than *ReLU* w.r.t. IB). Another finding of Saxe *et al.* was that SGD produces the same accuracy in the context of IB when compared with regular GD. Probably one of the most
important findings from Saxe *et al.* [11] is that the compression is observed only in the classification (fully connected) layers. Similar
observations were made by Amjad *et al.* [34].

Goldfeld *et al.* [35] showed that the reduction (compression) of \(I(X ; T)\) is a clustering of
training samples based on class.

Wieczorek *et al.* [36] created class-level visualization of MI for each layer, revealing the implications of IB for various
datasets.

Later, Elad *et al.* [37] used a modified IB loss function to sequentially train each layer. The authors attempted to validate the claim that the
maximum IB Lagrangian on each layer offers similar performance with the regular end to end DNN training using cross entropy loss.

Enforcing a lower bound for the IB on a maximization objective function by using variational inference has been successfully experimented by Alemi *et al.* [38], by generating models more robust to overfitting. Kolchinsky *et al.* [39] improved Alemi’s *et
al.* [38] lower bound, making it non-parametric and allowing both discrete and continuous \(X\) and \(Y\).

Gabri’e *et al.* [40] did not discovered a connection between IB and generalization, when analyzing entropy and MI throughout the training
process.

The IB method provides several techniques for determining latent variables that measure the model’s comprehension. Saxe *et al.* [11] measured
the variances of the MI between inputs, labels and compression. They considered binning neurons’ activations as the compression variable when computing the MI. In their investigation of an uncalibrated fully connected neural architecture, they discovered
that MI exhibited lower variance and higher values in the initial layers, with decreasing values and greater variance in the later layers.

Creating models that generalize well may necessitate large datasets and many training iterations. While some efforts have been made to determine the rate at which a network learns, such as assessing the optimal progression of the training process or detecting whether generalization decreases during training, these attempts are limited in number. We describe in this section our TE-based method to quantify information transfer between neural layers.

Multiple metrics can be used to evaluate a neural network’s performance, the most common being accuracy and number of parameters. Focusing only on accuracy and number of parameters can lead to a suboptimal training path or a less robust model, because we may omit important features of the training process.

Our investigation is based on the assumption that it is feasible to quantify compression by calculating TE across neighboring layers. TE can be used to improve the latter but also as an adaptive parameter that can require less training epochs, as we shown in [24], [25].

We calculate TE using equation 1 , where \(i_t^{(k)}\) and \(j_t^{(l)}\) refer to the series obtained from the binarized neuron activations, and \(k\) and \(l\) denote the neural network layers, in sequential order. Each training sample produces a single entry in the time series. By employing an optimized network architecture, we show that
TE values exhibit higher values in the last layers and also higher variance. Within each epoch, TE gradually decreases, as depicted in Figures 3 and 6, confirming that compression
of irrelevant information is higher at the beginning of each epoch. Thereafter, the network progressively learns generic features and *forgets* extraneous information.

The overall compressed information evolves incrementally during training and is unevenly distributed among the layers, with TE having highest values in the final layers. Previous research has typically constructed IP at the intersection of MI between \(X\), \(T\) respectively \(Y\) and \(T\), where \(X\) and \(Y\) denote the activations and outputs, we suggest that TE serves as a comparable representation for IP since it is obtained from two adjacent layers. By indirectly measuring compression using TE between adjacent layers, we obtain more concrete evidence of the fitting and compression phases at the layer level.

The experiments conducted in this study employ the neural architectures outlined in [24], [25], namely shallow feedforward fully connected networks and CNNs. Prior to recording the TE values, we designed optimized architectures for each of the used datasets to obtain close to state-of-the-art accuracy values. To calculate TE during training, we bin the neuron activations on the observed layers, resulting time series for each neuron. For CNNs, the length of the time series is equivalent to the batch size, and it is limited for performance reasons; for the shallow networks TE is calculated for each training sample up to the epoch length.

For the glass, ionosphere, seeds, divorce, and liver disorders datasets from UCI [41], we utilize an optimized single hidden layer network with the same
structure and layer widths as described in Moldovan *et al.* in [24]. We compute TE for all adjacent layers and neurons, resulting two TE
sets: "TE input" calculated between input and hidden layers and "TE output" between hidden and output layers. In order to allow neuron activation stabilization, we do not record activations for the initial 5% training samples in each epoch. The
binarization procedure uses a 95% threshold percentile from each examined activation matrix. We emphasize the significance of utilizing a dynamic threshold since mean activation values change considerably during training. Our general layout for the
feedforward network is illustrated in Figure 1.

Through analysis of the TE values for each training sample in every epoch, we observe a consistent trend where initial epochs exhibit higher TE values that gradually decrease over time. This pattern persists across all epochs, as illustrated in Figure 3, where the TE values show a continuous decrease from their initial values in subsequent epochs.

In Figure 3, each line is associated with a specific epoch, where on average, higher TE lines correspond to lower epoch indexes. The \(x\) axis represents the number of training
samples available for a complete epoch. To gain an overall understanding of the TE trend, we average the TE array for each training sample across all epochs and reduce it to a single \(y\) value as shown in Figure 2. We believe that this provides a more precise representation of the fitting phase, as discussed by Shwartz-Ziv *et al.* [7]. Towards the end of each epoch, the TE values show lower variation, which we associate with the *compression* phase.

We conduct another experiment on a toy MLP fully connected architecture with 4, 8, 16, 8, respectively 3 neurons per layer. The first and last layers are the input and output layers. Due to computational constraints, we use the Iris dataset from UCI [41]. To calculate TE, we employ a window of 20 training items. In each epoch, we skip the first 10 training items to allow for neuron activations to stabilize before using them in TE calculations. We observe that the averaged inter-layer TE lines match the IP from [11], with the same sequencing: last layers produce larger TE values than the ones from the initial layers (Figure 5). This confirms that the last layers contain generic features, and receive most of the variations during training. Interestingly, even after overtraining the network up to 200 epochs, we do not observe a significant variation of the averaged TE values, despite achieving 97% accuracy in just 10 epochs.

In the second group of experiments, we utilize an optimized CNN, based on the Alexnet architecture [42], for an image classification task on the following datasets: FashionMNIST [43], STL-10 [44], SVHN [45], and USPS [46]. We focus only on the last pair of fully connected layers (including the softmax layer), following the findings from [11], that IB is present only in the fully connected layers. Another reason for the layer selection is the computational cost. However, the behavior of TE in these experiments is similar to that observed in shallow networks. For larger datasets, we also observe steeper but smoother trajectories in the averaged TE slopes, as shown in Figure 4.

In all experiments, we notice a direct correlation between the primary metrics of the network, such as accuracy and loss, and the changes in TE. Specifically, we observe that during the fitting phase there is a quick reduction of the TE values - that is similar to loss evolution and inverse to the accuracy, followed by an extended compression phase with consistently low TE variance.

In our CNN experiments, TE follows the observations from [7], [11], [30]. The highest compression is found in the last fully connected layers. Since we evaluate TE after each training sample on shallow networks, we observe an immediate feedback in the TE value. The
variability in TE’s behavior is explainable by the used CNN architecture and the size of the datasets. Since Saxe *et al.* [11] mentioned that SGD
training does not yield better compression, we did not used it for shallow nets, but only for CNNs, due to performance reasons.

TE decreases during and after each epoch because the compression rate ‘slows down’ as the patterns emerge and stabilize in each layer. Variances of TE during the fitting phase are mostly related to shuffling training samples for each epoch, as it is
usually done in iterative training. At the same time, we empirically observe that TE is a sensitive metric to a network’s architecture efficiency, determining various slopes and amplitudes during training. While choosing network parameters for experiments,
we noticed that optimal architectures tend to produce smoother TE lines with a specific evolution pattern. Due to this, we hypothesize that *good* inter-layer compression is achieved only in efficient neural network structures. Studying TE at epoch
level, we can visually depict on which epoch the network has reached a stable accuracy. In a similar fashion as Saxe’s IP illustrates each layer’s compression proficiency, TE dynamic visualization (at layer, epoch or training batch level) can depict
possible hurdles during training. Figure 4 shows that there is a strong inverse relation between train accuracy and TE, and also a close evolution of TE and calculated loss during training. On larger datasets, TE and
training loss are closely following the same slope and relative magnitudes. Even if [7] correctly pointed out that measuring individual weights
and neurons is meaningless, we see that averaging all TE values for a layer creates coherent TE lines during training.

We can conclude that there is a strong connection between accuracy and loss on one side, and TE evolution on the other. The observations from our CNN experiments apply to smaller datasets and shallow architectures, but the latter are more susceptible to deficient network architectures and biased datasets.

Elad *et al.* [37] demonstrated that training a deep neural model layer-by-layer with an IB loss function based on MI results in comparable
accuracy to end-to-end training using cross-entropy loss. However, some researchers have suggested using alternative cost functions instead of the IB loss, as it may not always behave optimally [34], [39]. In our future work, we aim to explore the potential use of a TE-based loss function to enhance
the training process of deep neural networks.

Dataset |
Top 1 Accuracy |
Epochs |
---|---|---|

fashionMNIST | 88.96 | 22 |

USPS | 99.62% | 10 |

SVHN | 92.30% | 10 |

STL10 | 100.00% | 25 |

glass | 35.38% | 80 |

iris | 95.25% | 199 |

ionosphere | 95.28% | 30 |

seeds | 84.12% | 100 |

divorce | 98.03% | 28 |

liver | 64.42% | 57 |

All authors contributed equally to this article. This work was supported by Siemens SRL.

[1]

Thomas Schreiber. Measuring information transfer. *Phys. Rev. Lett.*, 85:461–464, Jul 2000.

[2]

Lionel Barnett, Adam B. Barrett, and Anil K. Seth. Granger causality and transfer entropy are equivalent for gaussian variables. *Phys. Rev. Lett.*, 103:238701, Dec 2009.

[3]

Kateřina Hlaváčková-Schindler. Equivalence of granger causality and transfer entropy: A generalization. *Appl. Math. Sci.*,
5(73):3637–3648, 2011.

[4]

C. W. J. Granger and Paul Newbold. *Forecasting economic time series*. Academic Press New York, 1977.

[5]

Joseph T. Lizier and Mikhail Prokopenko. Differentiating information transfer and causal effect. *The European Physical Journal B*, 73:605–615, 2010.

[6]

Naftali Tishby, Cicero Pereira, and William Bialek. The information bottleneck method. *Proceedings of the 37th Allerton Conference on Communication, Control and Computation*, 49,
07 2001.

[7]

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. *CoRR*, abs/1703.00810, 2017.

[8]

Boris Kovalerchuk, Răzvan Andonie, Nuno Datia, Kawa Nazemi, and Ebad Banissi. Visual knowledge discovery with artificial intelligence: Challenges and future directions. In
Boris Kovalerchuk, Kawa Nazemi, Răzvan Andonie, Nuno Datia, and Ebad Banissi, editors, *Integrating Artificial Intelligence and Visualization for Visual Knowledge Discovery*, pages 1–27. Springer International Publishing, Cham,
2022.

[9]

Boris Kovalerchuk, Kawa Nazemi, Răzvan Andonie, Nuno Datia, and Ebad Banissi. *Integrating Artificial Intelligence and Visualization for Visual Knowledge Discovery*.
Springer, 2022.

[10]

Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Utilizing information bottleneck to evaluate the capability of deep neural networks for image classification. *Entropy*,
21(5), 2019.

[11]

Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox. On the information bottleneck theory of deep learning. *Journal of
Statistical Mechanics: Theory and Experiment*, 2019(12):124020, dec 2019.

[12]

Carmina Fjellström and Kaj Nyström. Deep learning, stochastic gradient descent and diffusion maps. *Journal of Computational Mathematics and Data Science*, 4:100054, 2022.

[13]

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects.
*arXiv preprint arXiv:1803.00195*, 2018.

[14]

Bernhard C. Geiger. On information plane analyses of neural network classifiers—a review. *IEEE Transactions on Neural Networks and Learning Systems*, 33(12):7039–7051,
2022.

[15]

Raphael Féraud and Fabrice Clérot. A methodology to explain neural network classification. *Neural Networks*, 15(2):237 – 246, 2002.

[16]

Joseph T. Lizier, Jakob Heinzle, Annette Horstmann, John-Dylan Haynes, and Mikhail Prokopenko. Multivariate information-theoretic measures reveal directed information structure and task
relevant changes in fmri connectivity. *Journal of Computational Neuroscience*, 30(1):85–107, Feb 2011.

[17]

Raul Vicente, Michael Wibral, Michael Lindner, and Gordon Pipa. Transfer entropy—a model-free measure of effective connectivity for the neurosciences. *Journal of Computational
Neuroscience*, 30(1):45–67, Feb 2011.

[18]

Masanori Shimono and John M. Beggs. Functional clusters, hubs, and communities in the cortical microconnectome. *Cerebral cortex (New York, N.Y. : 1991)*, 25(10):3743–3757, Oct
2015. 25336598[pmid].

[19]

Hui Fang, Victoria Wang, and Motonori Yamaguchi. Dissecting deep learning networks—visualizing mutual information. *Entropy*, 20(11), 2018.

[20]

Josh Patterson and Adam Gibson. *Deep Learning: A Practitioner’s Approach*. O’Reilly Media, Inc., 1st edition, 2017.

[21]

Sebastian Herzog, Christian Tetzlaff, and Florentin Wörgötter. Transfer entropy-based feedback improves performance in artificial neural networks.
*CoRR*, abs/1706.04265, 2017.

[22]

Sebastian Herzog, Christian Tetzlaff, and Florentin Wörgötter. Evolving artificial neural networks with feedback. *Neural Networks*, 123:153 – 162, 2020.

[23]

Oliver Obst, Joschka Boedecker, and Minoru Asada. Improving recurrent neural network performance using transfer entropy. In *Proceedings of the 17th International Conference on Neural
Information Processing: Models and Applications - Volume Part II*, ICONIP’10, pages 193–200, Berlin, Heidelberg, 2010. Springer-Verlag.

[24]

Adrian Moldovan, Angel Caţaron, and Răzvan Andonie. Learning in feedforward neural networks accelerated by transfer entropy. *Entropy*, 22(1):102,
2020.

[25]

Adrian Moldovan, Angel Caţaron, and Răzvan Andonie. Learning in convolutional neural networks accelerated by transfer entropy. *Entropy*, 23(9),
2021.

[26]

Claude E Shannon. Coding theorems for a discrete source with a fidelity criterion (international convention record), vol. 7. *New York, NY, USA: Institute of Radio Engineers*,
1959.

[27]

Alexander G Dimitrov and John P Miller. Neural coding and decoding: communication channels and quantization. *Network: Computation in Neural Systems*, 12(4):441, aug 2001.

[28]

W.H.R. Equitz and T.M. Cover. Successive refinement of information. *IEEE Transactions on Information Theory*, 37(2):269–275, 1991.

[29]

Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method, 2000.

[30]

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In *2015 IEEE Information Theory Workshop (ITW)*, pages 1–5, 2015.

[31]

Ziv Goldfeld, Ewout Van Den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. Estimating information flow in deep neural networks. In Kamalika
Chaudhuri and Ruslan Salakhutdinov, editors, *Proceedings of the 36th International Conference on Machine Learning*, volume 97 of *Proceedings of Machine Learning Research*, pages 2299–2308. PMLR, 09–15 Jun 2019.

[32]

Bogdan Musat and Razvan Andonie. Information bottleneck in deep learning-a semiotic approach. *International Journal of Computers Communications & Control*, 17(1), 2022.

[33]

Ziv Goldfeld and Yury Polyanskiy. The information bottleneck problem and its applications in machine learning. *IEEE Journal on Selected Areas in Information Theory*, 1(1):19–38,
2020.

[34]

R. Amjad and B. C. Geiger. Learning representations for neural network-based classification using the information bottleneck principle. *IEEE Transactions on Pattern Analysis and
Machine Intelligence*, 42(09):2225–2239, sep 2020.

[35]

Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. Estimating information flow in deep neural networks, 2018.

[36]

Aleksander Wieczorek and Volker Roth. On the difference between the information bottleneck and the deep information bottleneck. *Entropy*, 22(2), 2020.

[37]

Adar Elad, Doron Haviv, Yochai Blau, and Tomer Michaeli. Direct validation of the information bottleneck principle for deep nets. In *2019 IEEE/CVF International Conference on
Computer Vision Workshop (ICCVW)*, pages 758–762, 2019.

[38]

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck, 2016.

[39]

Artemy Kolchinsky, Brendan D. Tracey, and Steven Van Kuyk. Caveats for information bottleneck in deterministic scenarios. In *International Conference on Learning
Representations*, 2019.

[40]

Marylou Gabrié, Andre Manoel, Clément Luneau, Jean Barbier, Nicolas Macris, Florent Krzakala, and Lenka Zdeborová. Entropy and mutual information in models of deep neural networks*.
*Journal of Statistical Mechanics: Theory and Experiment*, 2019(12):124014, dec 2019.

[41]

Dheeru Dua and Casey Graff. machine learning repository, 2017.

[42]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In *Advances in Neural Information Processing Systems*,
2012.

[43]

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

[44]

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. *Journal of Machine Learning Research - Proceedings Track*,
15:215–223, 01 2011.

[45]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. *NIPS*, 01 2011.

[46]

J. J. Hull. A database for handwritten text recognition research. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 16(5):550–554, 1994.

Generalization refers here to the performance of the classifier on the test data and measures the model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.↩︎