A Neuro-Fuzzy System for Interpretable Long-Term Stock Market Forecasting¹


Abstract

In the complex landscape of multivariate time series forecasting, achieving both accuracy and interpretability remains a significant challenge. This paper introduces the Fuzzy Transformer (Fuzzformer), a novel architecture that combines recurrent neural networks, multi-head self-attention, and fuzzy inference systems to analyze multivariate stock market data and produce long-term time series forecasts. The method leverages LSTM networks and temporal attention to condense multivariate data into interpretable features suitable for fuzzy inference systems. The resulting architecture offers forecasting performance comparable to conventional models such as ARIMA and LSTM while providing meaningful information flow within the network. The method was evaluated on the real-world S&P 500 stock market index. Initial results show potential for interpretable forecasting and identify current performance tradeoffs, suggesting practical applicability to understanding and forecasting stock market behavior.

Keywords: Stock Market Prediction, LSTM, Multi-Head Attention, Fuzzy Systems, Interpretability, Deep Clustering

1 Introduction

In stock price forecasting, model interpretability is desirable to explain the predictions and to extract semantic meaning that investors can understand. While black-box models often achieve greater accuracy, their predictions are hard to explain, and explainability is particularly important in economics. Attention-based architectures are among the most influential advancements in deep neural networks; the attention weights over the input signals [1] can be examined for technical analysis of patterns in historical data. The interpretability of the relation between inputs and outputs has been one of the cornerstones of fuzzy systems [2]. We propose to combine recent advances in deep neural network time series forecasting [3], deep clustering [4], and evolving fuzzy systems [5] into a complex model, while maintaining enough transparency in the final layers to be interpreted by humans.

Neuro-Fuzzy Systems (NFS) are models that can combine fuzzy logic rules as neurons in a network structure. Fuzzy and neuro-fuzzy models are commonly used for time series forecasting, specifically stock price prediction [6]. Examples include a Hammerstein-Wiener linguistic model [2], a fuzzy granular predictor for Bitcoin [7], and an ensemble approach for S&P500 prediction [8], though these typically focus on univariate or short-term forecasting.

Interpretability in fuzzy systems results from their structural separation between rule antecedents and functional or linguistic consequents [9]. Antecedents are typically described using Gaussian clusters, obtained via unsupervised or supervised clustering. In this study, we use deep clustering to jointly learn the model with backpropagation. Various deep clustering approaches exist, such as autoencoders with k-means (e.g., DEKM [10], FAE [11]), or VAE-based generative models [4]. LSTM networks are a staple of sequence modeling in tasks such as regression and time series forecasting. In stock prediction, they are often combined with attention for long-term dependency modeling. Examples include the Temporal Fusion Transformer (TFT) [3], multi-horizon LSTM forecasting [12], Indian stock market modeling [13], NASDAQ-focused attention mechanisms [14], and transformer-only models [15]. Day-trading directional prediction with LSTM has also been explored [16].

We propose a similar methodology by combining these advances with interpretable fuzzy systems. The main contributions of this paper are as follows: First, combining the Long Short-Term Memory (LSTM) recurrent neural network architecture and the Multi-Head self-Attention (MHA) mechanism with Fuzzy Inference Systems (FIS) for multivariate multi-horizon time series forecasting (multi-step-ahead prediction). Second, an approach to deep unsupervised multivariate Gaussian clustering of the low-dimensional latent space.

2 Methods

Let \({\underline{X}}(k) {=} [\underline{x}(k{-}N), \ldots, \underline{x}(k)] {\in} \mathbb{R}^{N \times D_X}\) be the multivariate input data with \(D_X\) channels or input features, containing a main time series \(\underline{Y}(k) {=} [y(k{-}N), \ldots, y(k)] {\in} \mathbb{R}^{N}\), and other multivariate input data up to a discrete time step \(k\). Our objective is to predict the next \(H\) time steps of the main time series \(\underline{\hat{Y}}(k) {=} [\hat{y}(k{+}1), \ldots, \hat{y}(k{+}H)] {\in} \mathbb{R}^{H}\).

Figure 1: The Fuzzformer architecture. The orange dots at the outputs of the LSTM and Dense layers represent the dropout mechanisms. The red and blue connections illustrate the data flow of the antecedent latent space clustering samples and the local model regression variables, respectively. The variable B is the batch size.

The proposed neuro-fuzzy transformer system, called the Fuzzformer, is a sequence-to-sequence model composed of the following main layers: 1) An LSTM network [17] to encode long-term dependencies between time steps; 2) A Multi-Head Self-Attention network [1] for long-term information retention; 3) Fully connected layers to reduce the encoded data into two low-dimensional latent space representations; 4) A fuzzy local model network composed of multivariate Gaussian clusters [5] and ARIX local models [18] to generate the sequence forecast. The Fuzzformer architecture is presented in Figure 1.

2.1 Temporal Multi-Head Self-Attention

The Multi-Head Attention (MHA) mechanism, as proposed in [1], allows the model to focus on different parts of a sequence by splitting the hidden features into several sub-spaces, computing an attention mechanism for each in parallel, and combining the outputs. Specifically, the input sequence \(\underline{S} {=} [\underline{s}(1),...,\underline{s}(N)] {\in}\mathbb{R}^{N\times D_{in}}\) with length \(N\) and input dimension \(D_{in}\) is transformed with trainable parameter matrices \(\underline{W}_Q{\in} \mathbb{R}^{D_{in}\times D_h}\), \(\underline{W}_K{\in} \mathbb{R}^{D_{in}\times D_h}\), and \(\underline{W}_V{\in} \mathbb{R}^{D_{in}\times d_{out}}\) into queries \(\underline{Q}{=} \underline{S}\,\underline{W}_Q{\in} \mathbb{R}^{N\times D_h}\), keys \(\underline{K}{=} \underline{S}\,\underline{W}_K{\in} \mathbb{R}^{N\times D_h}\), and values \(\underline{V}{=} \underline{S}\,\underline{W}_V{\in} \mathbb{R}^{N\times d_{out}}\). The Scaled Dot-Product Attention is then computed as [1]: \[{A}(\underline{Q},\underline{K},\underline{V}) = \mathrm{softmax} \Big(\frac{\underline{Q}\underline{K}^\top}{\sqrt{D_h}}\Big) \underline{V}\] In multi-head attention, this is done multiple times in parallel, where each parallel layer is called a head \(h {=} 1,...,N_h\). The outputs of all heads are concatenated and projected with a linear transformation: \[{M} = [\underline{H}^1,...,{\underline{H}^{N_h}}] \underline{W}_O\] \[\underline{H}^h= A (\underline{S}\,\underline{W}_Q^h,\underline{S}\,\underline{W}_K^h,\underline{S}\,\underline{W}_V^h)\] where \(\underline{W}_O {\in} \mathbb{R}^{(N_h d_{out})\times D_{out}}\) is the output projection. We used the same hidden dimensions as proposed in [1], i.e., \(D_h {=} d_{out} {=} D_{in}/N_h\), which results in the \(D_{in}\) features of the input sequence being distributed across the \(N_h\) sub-spaces.

2.2 Fuzzy Inference System

A fuzzy Takagi-Sugeno rule with one membership function in the antecedent and an affine linear function in the consequent can be described as follows: \[R_i :\quad \mathrm{IF} \quad \big(\underline{Z}(k) \sim \mathcal{Z}_i \big) \quad \mathrm{THEN} \quad \underline{\hat{Y}}_i(k),\] where \(\sim\) denotes a soft membership to the fuzzy set \(\mathcal{Z}_i\), \(i{=}1,2,...,C\) is the index of the fuzzy rule, \(j {=} 1,...,H\) is the future time step, \(\underline{Z}(k) {\in}\mathbb{R}^{D_Z}\) is the antecedent input vector, and \(\underline{\hat{Y}}_i(k) {=} [\hat{y}_i({k {+}1}),...,\hat{y}_i(k{+}H)]\) is the output of the fuzzy rule. The antecedent fuzzy sets \(\mathcal{Z}_i\) can be represented by multivariate Gaussian clusters, which can be rotated arbitrarily to capture correlations between variables and can therefore approximate a variety of data distributions [19]: \[d_i^2(k) =(\underline{Z}(k) {-}\underline{\mu}_i)^{\top}\underline{\Sigma}_i^{-1}(\underline{Z}(k){-}\underline{\mu}_i),\] where \(d_i^2(k)\) is the Mahalanobis distance, and \(\underline{\mu}_i {\in}\mathbb{R}^{D_Z}\) and \(\underline{\Sigma}_i {\in}\mathbb{R}^{D_Z \times D_Z}\) are the center and the covariance matrix of the cluster associated with the fuzzy rule \(R_i\). The membership functions are normalized with a softmax function \(\Psi_{i}(k) {=} \mathrm{e}^{-d_i^2(k)}/\sum_{l=1}^C\mathrm{e}^{-d_l^2(k)}{\in}[0,1]\), so that a unit partition is obtained, \(\sum_{i=1}^C\Psi_{i}(k) {=} 1\). Finally, the output of the neuro-fuzzy model is aggregated from all activated rules as \(\underline{\hat{Y}}(k){=}\sum_{i{=}1}^C\Psi_{i}(k) \underline{\hat{Y}}_{i}(k)\).

In our case, we used the neural network encoder to generate the (non-)exogenous inputs \(\underline{U}(k) {=} [u_1,...,u_p] {\in}\mathbb{R}^{p}\) for the ARIX model and the latent space vector \(\underline{Z}(k)\) for the antecedent membership functions. The forecast of the ARIX local model of the rule \(R_i\) for the sequence \(\underline{Y}(k)\) is then defined as: \[\hat{y}_i(k + j) = \left(1 - \mathrm{q}^{-1} \right)^{-d} \frac{B_i(\mathrm{q}^{-1})}{A_i(\mathrm{q}^{-1})} u(k + j)\] where \(A_i\left(\mathrm{q}^{-1}\right) {=} 1 {+} a_1 \mathrm{q}^{-1} {+} ... {+} a_p \mathrm{q}^{-p}\) is the auto-regressive (AR) polynomial, \(B_i\left(\mathrm{q}^{-1}\right) {=} b_1 \mathrm{q}^{-1} {+} ... {+} b_q \mathrm{q}^{-q}\) is the exogenous polynomial, \(\mathrm{q}^{-1}\) is a discrete delay operator, i.e., \({y(k{-}p)} {=} \mathrm{q}^{{-}p}y(k)\), \(p\) is the order of the auto-regressive model, \(q\) is the order of the exogenous input, and \(d\) is the order of integration. The model starts from the known input time series values \(\underline{X}(k)\) and then feeds back the recursively computed predictions \(\hat{y}_i(k+j)\) in a sliding-window fashion for all \(j\), resulting in a combined forecast \(\underline{\hat{Y}}_i(k) {=} [\hat{y}_i({k {+}1}),...,\hat{y}_i(k{+}H)]\).

2.3 Training losses

To train the multi-horizon time series forecast, we employ the common Mean Squared Error (MSE) loss function, defined as: \[\mathcal{L}_{\mathrm{MSE}} = \sum_k{||\underline{Y}(k)-\hat{\underline{Y}}(k)||^2}.\] We used a "winner-takes-all" local optimization for the consequent local models to improve the interpretability of each local model, as opposed to the standard global optimization. This was done by computing the forward pass during training only for the rule with the highest activation, which forces each rule to produce a good output locally.

The encoder neural network generates latent features that are then clustered in an unsupervised way with multivariate Gaussian clusters. The Fuzzy C-means (FCM) clustering loss was used for deep unsupervised clustering: \[\mathcal{L}_{\mathrm{FCM}} = \sum_{i=1}^C\sum_k \Psi_i(k)\, ||\underline{Z}(k) - \underline{\mu}_i||^2.\] Using only a clustering loss does not ensure separation of the clusters, as all clusters may converge to similar mean values and start to overlap. In order to keep the clusters distinguishable, a regularization loss is used to ensure cluster separation. We formulate an overlapping regularization loss as: \[\label{eq:loss_B} \mathcal{L}_O = \sum_m^C\sum_{n\neq m}^C\frac{1}{d_B(m,n)}\tag{1}\] with the Bhattacharyya distance [9]: \[\label{eq:bhattacharyya_distance} \begin{align} d_B(m,n) =\, & \frac{1}{8} \big(\underline{\mu}_m{-} \underline{\mu}_n\big)^{\!\top}\bigg(\frac{\underline{\Sigma}_{m} + \underline{\Sigma}_{n}}{2}\bigg)^{\!\!-1}\! \big(\underline{\mu}_m{-} \underline{\mu}_n\big)+ \\ & \frac{1}{2}\ln\Bigg( \frac{\mathrm{det}\big(\frac{1}{2}(\underline{\Sigma}_{m} {+} \underline{\Sigma}_{n})\big)}{\sqrt{\mathrm{det}\underline{\Sigma}_m\mathrm{det}\underline{\Sigma}_n}}\Bigg), \end{align}\tag{2}\] where \(m{=}1,2,...,C\) and \(n{=}1,2,...,C\) are the indices of the two compared clusters for \(m{\neq}n\), and \(d_B(m,n){=}d_B(n,m)\). A second regularization loss based on the Kullback–Leibler divergence is applied to encourage balanced soft assignments of latent space data points to fuzzy rules, following the approach proposed in [11]: \[\mathcal{L}_{B} = \sum_{i=1}^C \frac{\sum_k\Psi_i(k)}{N} \log{\frac{{\sum_k\Psi_i(k)}/{N}}{1/C}}\] This balanced assignment loss encourages all clusters to have a uniform probability of being assigned a data point.

3 Experimentation

The proposed methodology was applied to the multi-horizon time series prediction of the closing prices of the Standard & Poor’s 500 stock market index. We examined multivariate data from other related market indicators: the VIX Volatility Index, a commonly referenced measure of the stock market’s expectation of volatility based on S&P 500 index options, often referred to as the "fear index"; the Gold commodity price; and the 5-year U.S. Treasury Yield, which tracks the performance of U.S. dollar-denominated domestic sovereign debt. Data were gathered from January 1st, 2001, to January 1st, 2023, then normalized using Min-Max scaling and split into training (80%), validation (10%), and testing (10%) sets.

As a baseline we compared the Fuzzformer with the classic Autoregressive Integrated Moving Average (ARIMA) model and the Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) for time series forecasting². The ARIMA meta-parameters were selected with the Auto ARIMA method. The following parameters were used for the LSTM model: hidden dimensions were 32 and 128, and the number of layers was 3. A linear layer was also used to transform the output into \(\mathbb{R}^{N\times 1}\), and the last time steps of the output, equal in number to the forecast horizon, were taken as the model forecast. The meta-parameters of the Fuzzformer network were selected experimentally as: 2 LSTM layers, 2 MHA layers, hidden dimension \(D_h {=} 128\), latent space dimension \(D_Z {=} 2\), number of attention heads \(N_h {=} 4\), order of auto-regression \(p{=}2,4,30\), differencing order \(d{=}1\), exogenous input order \(q{=}1\), and number of fuzzy rules \(C = 16\). The methods were evaluated with the Root Mean Squared Error (RMSE) measure. Importantly, the ARIMA model is not directly comparable with the deep learning methods, as it does not retain long-term information but re-estimates its coefficients from each input data sequence.

Table 1: Comparison of the proposed Fuzzformer neuro-fuzzy system with classic multi-horizon time series forecasting methods (RMSE; column groups denote look-back N / forecast horizon H)

                                  N/H = 60/30                N/H = 150/30               N/H = 150/60
Model                         Train   Valid   Test       Train   Valid   Test       Train   Valid   Test
ARIMA (p = 4, d = 1, q = 1)   0.0096  0.0288  0.0313     0.0091  0.0290  0.0320     0.0123  0.0424  0.0441
ARIMA (p = 30, d = 1, q = 1)  0.0091  0.0303  0.0311     0.0100  0.0326  0.0355     0.0133  0.0472  0.0462
LSTM (D_h = 256, 3 layers)    0.0124  0.0407  0.1436     0.0128  0.0416  0.1536     0.0163  0.0492  0.1683
LSTM (D_h = 256, 1 layer)     0.0121  0.0386  0.0797     0.0118  0.0395  0.0981     0.0162  0.0549  0.1181
Fuzzformer (ours, p = 2)      0.0100  0.0303  0.0324     0.0098  0.0308  0.0333     0.0128  0.0428  0.0452
Fuzzformer (ours, p = 4)      0.0107  0.0320  0.0369     0.0098  0.0314  0.0338     0.0183  0.0450  0.0462
Fuzzformer (ours, p = 30)     0.0095  0.0300  0.0321     0.0094  0.0301  0.0336     0.0130  0.0455  0.0442

The results of the case study, presented in Table 1, demonstrate that the Fuzzformer model is less prone to overfitting than the LSTM model when tested on the given data. An example of the prediction is presented in Fig. 2.

Figure 2: A multi-horizon forecast of the Fuzzformer with p = 30, test length N = 60, and horizon H = 30. The model predicts a bounce after a market drop.

In contrast, the LSTM network exhibits higher variance in its predictions and, as a consequence, much larger prediction errors on the test data. Although the LSTM network performs well on the training data, the quality of its forecasts declines sharply on the test dataset, an indication of overfitting. Conversely, the Fuzzformer model does not suffer from this problem, thanks to the use of simple ARIX local models, but it requires more epochs to train effectively due to the attention layer and the individual recursive computation for each rule, which makes training slower. From the middle column group of Table 1 (N/H = 150/30), it is apparent that extending the look-back period to \(N{=}150\) yields no benefit for either the LSTM or the Fuzzformer network.

The ARIMA models with \(p=4\) generally forecast a constant value without a trend, except when a pronounced trend is present over the entire look-back window. This results in forecasts with low RMSE, as the error is never too substantial. ARIMA models of higher orders may suffer from decomposition singularity issues during identification. However, our proposed ARIX model, which uses a similar autoregressive structure and is trained with the Adam optimization method, can use much higher orders without difficulty; training an autoregressive model with a gradient descent method is more stable for higher system orders. The neural network encoder allows the fuzzy system head to use multivariate data, whereas AR/ARX/ARIX/ARIMAX-type models can handle only univariate data. The fuzzy model combines multiple local models, in a manner similar to how samples are distributed among multiple models in a SARIMAX model, but with the model membership based on the fuzzy antecedent clusters.

4 Conclusion

In this study, we examined how a recurrent neural network architecture with multi-head self-attention can condense multivariate stock market data into features that can be utilized for an interpretable fuzzy multi-horizon time series forecast. The proposed approach demonstrates comparable performance with established ARIMA and LSTM networks that are commonly used for multi-horizon time series forecasting while offering valuable insights into the model structure and information flow through the network. Nonetheless, there remain opportunities for incorporating further interpretable mechanisms into the proposed fuzzy system that were not implemented in this study. These initial results are promising, and with continued research, they could contribute to a deeper understanding of stock market behavior.

References

[1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[2]
C. Xie, D. Rajan, and Q. Chai, “An interpretable neural fuzzy hammerstein-wiener network for stock price prediction,” Information Sciences, vol. 577, pp. 324–335, 10 2021.
[3]
B. Lim, S. Arık, N. Loeff, and T. Pfister, “Temporal fusion transformers for interpretable multi-horizon time series forecasting,” International Journal of Forecasting, vol. 37, pp. 1748–1764, 10 2021.
[4]
S. Chang, “Deep clustering with fusion autoencoder,” no. arXiv:2201.04727, Jan 2022, arXiv:2201.04727 [cs].
[5]
M. Ožbot, E. Lughofer, and I. Škrjanc, “Evolving neuro-fuzzy systems based design of experiments in process identification,” IEEE Transactions on Fuzzy Systems, pp. 1–11, 2022.
[6]
M. Pratama, S. G. Anavatti, P. P. Angelov, and E. Lughofer, “Panfis: A novel incremental learning machine,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, pp. 55–68, 1 2014.
[7]
C. Garcia, A. Esmin, D. Leite, and I. Škrjanc, “Evolvable fuzzy systems from data streams with missing values: With application to temporal pattern recognition and cryptocurrency prediction,” Pattern Recognition Letters, vol. 128, pp. 278–282, 12 2019.
[8]
S. Jahandari, A. Kalhor, and B. N. Araabi, “Online forecasting of synchronous time series based on evolving linear models,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, pp. 1865–1876, 5 2020.
[9]
E. Lughofer, “On-line assurance of interpretability criteria in evolving fuzzy systems – achievements, new concepts and open issues,” Information Sciences, vol. 251, pp. 22–46, 12 2013.
[10]
W. Guo, K. Lin, and W. Ye, “Deep embedded k-means clustering,” no. arXiv:2109.15149, Sep 2021, arXiv:2109.15149 [cs].
[11]
E. Aljalbout, V. Golkov, Y. Siddiqui, M. Strobel, and D. Cremers, “Clustering with deep learning: Taxonomy and new methods,” Jan 2018.
[12]
S. Aryal, D. Nadarajah, P. L. Rupasinghe, C. Jayawardena, and D. Kasthurirathna, “Comparative analysis of deep learning models for multi-step prediction of financial time series,” Journal of Computer Science, vol. 16, no. 10, p. 1401–1416, Oct 2020.
[13]
A. Yadav, C. K. Jha, and A. Sharan, “Optimizing lstm for time series prediction in indian stock market,” Procedia Computer Science, vol. 167, pp. 2091–2100, 1 2020.
[14]
T. Guo, T. Lin, and N. Antulov-Fantulin, “Exploring interpretable lstm neural networks over multi-variable data,” no. arXiv:1905.12034, May 2019, arXiv:1905.12034 [cs, stat].
[15]
A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” no. arXiv:2205.13504, Aug 2022, arXiv:2205.13504 [cs].
[16]
P. Ghosh, A. Neufeld, and J. K. Sahoo, “Forecasting directional movements of stock prices for intraday trading using lstm and random forests,” Finance Research Letters, vol. 46, p. 102280, 5 2022.
[17]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, p. 1735–1780, Nov 1997.
[18]
G. Prasad, E. Swidenbank, and B. W. Hogg, “A local model networks based multivariable long-range predictive control strategy for thermal power plants,” Automatica, vol. 34, no. 10, pp. 1185–1204, Oct 1998.
[19]
Igor Škrjanc, “Cluster-volume-based merging approach for incrementally evolving fuzzy gaussian clustering-egauss+,” IEEE Transactions on Fuzzy Systems, vol. 28, pp. 2222–2231, 9 2020.

  1. The authors acknowledge the financial support from the Slovenian Research and Innovation Agency – ARIS (program funding number P2-0219).

  2. https://github.com/mihaozbot/Fuzzy-transformer