Spiking Neural Networks Need High-Frequency Information

Yuetong Fang1,  Deming Zhou1,  Ziqing Wang4,  Hongwei Ren1,
Zecui Zeng3,  Lusong Li3,  Shibo Zhou2,  Renjing Xu1

1The Hong Kong University of Science and Technology (Guangzhou) 
2Brain Mind Innovation INC  3JD Explore Academy  4Northwestern University

yfang870@connect.hkust-gz.edu.cn, bob@brain-mind.com.cn, renjingxu@hkust-gz.edu.cn


Abstract

Spiking Neural Networks promise brain-inspired and energy-efficient computation by transmitting information through binary (0/1) spikes. Yet, their performance still lags behind that of artificial neural networks, often assumed to result from information loss caused by sparse and binary activations. In this work, we challenge this long-standing assumption and reveal a previously overlooked frequency bias: spiking neurons inherently suppress high-frequency components and preferentially propagate low-frequency information. This frequency-domain imbalance, we argue, is the root cause of degraded feature representation in SNNs. Empirically, on Spiking Transformers, adopting Avg-Pool (low-pass) for token mixing lowers performance to 76.73% on CIFAR-100, whereas replacing it with Max-Pool (high-pass) pushes the top-1 accuracy to 79.12%. Accordingly, we introduce Max-Former, which restores high-frequency signals through two frequency-enhancing operators: (1) extra Max-Pool in patch embedding, and (2) Depth-Wise Convolution in place of self-attention. Notably, Max-Former attains 82.39% top-1 accuracy on ImageNet using only 63.99M parameters, surpassing Spikformer (74.81%, 66.34M) by +7.58%. Extending our insight beyond transformers, our Max-ResNet-18 achieves state-of-the-art performance on convolution-based benchmarks: 97.17% on CIFAR-10 and 83.06% on CIFAR-100. We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks. Code is available: https://github.com/bic-L/MaxFormer.

1 Introduction↩︎

Spiking neural networks (SNNs) are emerging as an energy-efficient alternative to conventional artificial neural networks (ANNs) [1], [2]. Their efficiency arises from spiking neurons that utilize spatiotemporal dynamics to mimic biological computation in the human brain [3]. In ANNs, all neurons within the same layer must await the complete processing of real-valued, dense tensors before any information can flow to the subsequent layer. SNNs, however, transmit information asynchronously, with spiking neurons consuming energy only when receiving or emitting spikes (“1”), otherwise remaining inactive [4], [5]. This binary activation pattern enables SNNs to replace the energy-intensive multiply-and-accumulate (MAC) operations that are essential in ANNs with much simpler spike-based accumulation. Leveraging the energy-efficiency benefits, modern SNN variants, such as Spiking Transformers that integrate the powerful Transformer architecture with spike-based computing, have gained growing attention [6], [7].

Figure 1: Comparison between ReLU and spiking neuron (S-Neuron): (a) Input images; (b) Fourier spectrum analysis of output features processed as input \(\rightarrow\) activation \(\rightarrow\) weighting, with high-frequency regions marked (red dashed boxes: regions \(> 0.55\times\) max amplitude) and (c) the corresponding relative log amplitude; (d) GradCAM comparison with identical architectural settings following [7], with the converted Spiking Transformer using 256 timesteps. Spiking neurons cause the rapid dissipation of high-frequency components, which consequently leads to the degradation of feature representations.

Despite their energy efficiency, the discrete nature of spike-based computation presents both opportunities and challenges. A major obstacle for SNNs remains their performance gap relative to ANNs. This gap is often attributed to the so‑called “representation error” [8]–[10], which argues that binary spike trains inherently limit the precision of feature representations compared to continuous activations. However, this seems inconsistent with the established consensus in the standard deep learning literature, where low-bit and even binary networks can still achieve comparable accuracy [11], [12]. Further, it should be noted that SNNs operate on temporal sequences: while spiking neurons strictly transmit binary signals at each individual time step, a train of spikes spanning \(n\) simulation timesteps can encode activation values with at least \(\log(n)\)-bit precision [13], [14]. These conflicting observations reveal an unexplored dimension in understanding SNN performance limitations.

It is natural to think from the frequency domain. Spiking neurons produce discrete, pulse-like activations, fundamentally distinguishing their frequency response from continuous activation functions commonly used in standard networks (e.g., ReLU [15]). Prior works have suggested that spiking neurons may enrich signals with local details (high frequencies) [16], [17]. However, in Figure 1 (b-c), we observed a surprising phenomenon: when examining the end-to-end information flow of input \(\rightarrow\) activation \(\rightarrow\) weighting, rather than focusing solely on the property of activation functions, spiking neurons tend to propagate low-frequency information more prominently than ReLU.

Feature degradation observed in SNNs may instead originate from the rapid dissipation of high-frequency components, which prevents the network from effectively capturing local, fine-grained information (Figure 1 (d)). To support this finding, we perform a simple experiment in which non‑parametric pooling operators, i.e., Max-Pool and Avg-Pool, serve as token mixers in Spiking Transformers, as shown in Figure [fig:fig1_cmp]. From the frequency-domain perspective, Max-Pool excels at capturing local high-frequency details (e.g., local edges/textures), whereas Avg-Pool favors global low-frequency patterns. Intriguingly, Spiking Transformers exhibit the opposite preference to ANNs in token mixing: while standard Transformers typically employ Avg-Pool for token mixing [18], [19], replacing it with Max-Pool in Spiking Transformers yields a +2.39% improvement on CIFAR-100, making it surpass the well-tuned Spikformer [6] baseline by 0.97%.

Overall, this work provides further theoretical and empirical evidence supporting the view that high‑frequency information is essential for SNNs:

  • We provide the first theoretical proof that spiking neurons inherently act as low‑pass filters at the network level, revealing their tendency to suppress high‑frequency features.

  • We propose Max-Former, which restores high‑frequency information in Spiking Transformers via two lightweight modules: extra Max-Pool in patch embedding and Depth-Wise Convolution (DWC) in place of early-stage self-attention.

  • Restoring high-frequency information significantly improves performance while saving energy cost. On ImageNet, Max-Former achieves 82.39% top-1 accuracy (+7.58% over Spikformer) with 30% lower energy consumption and a smaller parameter count (63.99M vs. 66.34M).

  • Extending the insight beyond transformers, Max-ResNet-18 achieves state-of-the-art performance on convolution-based benchmarks: 97.17% on CIFAR-10 and 83.06% on CIFAR-100.

We believe this straightforward yet powerful solution will motivate future research to explore the unique properties of SNNs, beyond the established practice in standard deep learning.

2 Preliminary and Related Works↩︎

2.1 Spiking Neuron Models↩︎

SNNs implement spike-driven processing through biologically-inspired neuron models for non-linear activations. The Leaky Integrate-and-Fire (LIF) model is a widely adopted abstraction of this behavior, offering an effective balance between biological plausibility and computational efficiency [20]. The discretized LIF model under each simulation timestep \(n\) can be formulated as:

\[\begin{align} U[n] &= f\bigl(V[n-1],\, I[n]\bigr) \\ S[n] &= H\bigl(U[n] - V_{\mathrm{th}}\bigr) \end{align}\]

where \(\beta\) is the decay factor, \(V_{\mathrm{th}}\) is the firing threshold, and \(H(\cdot)\) refers to the Heaviside step function that determines spike generation: \(S[n] = H\bigl(U[n] - V_{\mathrm{th}}\bigr) = 1\) when \(U[n] \geq V_{\mathrm{th}}\), otherwise the neuron remains inactive (\(S[n] = 0\)). The charging process of the LIF neuron is determined by \(f(\cdot)\): \[\begin{align} f(V[n-1], I[n]) &= \beta V[n-1] + (1 - \beta) I[n] \label{eq:lif} \end{align}\tag{1}\] At each timestep \(n\), the current membrane potential \(U[n]\) is updated by integrating the time-domain signal \(I[n]\), corresponding to the input data or intermediate operations such as Conv and MLP. If \(U[n]\) exceeds the threshold \(V_{\text{th}}\), the neuron fires a spike (\(S[n] = 1\)). \(V[n]\) records the membrane potential over time given the decay factor \(\beta\) and the output spike activity. If the neuron does not fire, then \(V[n] = U[n]\). Notably, the LIF model simplifies to the integrate-and-fire (IF) neuron when the membrane potential decay between timesteps is eliminated. Its charging process can be formulated as: \[\begin{align} f(V[n-1], I[n]) = V[n-1] + I[n] \label{eq:if} \end{align}\tag{2}\]
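For readers who prefer an operational view, the sketch below simulates the discretized LIF dynamics of Equ. 1 in plain NumPy. The hard reset-to-zero after firing and the toy input current are illustrative assumptions, not details prescribed by the formulation above.

```python
# Minimal NumPy sketch of the discretized LIF neuron (Equ. 1 + Heaviside firing).
# Assumption: hard reset to zero after a spike; otherwise V[n] = U[n].
import numpy as np

def lif_forward(current, beta=0.25, v_th=1.0):
    """Simulate one LIF neuron over the timesteps of an input current I[n]."""
    v = 0.0                                  # membrane potential V[n-1]
    spikes, potentials = [], []
    for i_n in current:
        u = beta * v + (1.0 - beta) * i_n    # charging, Equ. 1
        s = 1.0 if u >= v_th else 0.0        # Heaviside step at the threshold
        v = u * (1.0 - s)                    # keep U[n] if silent, reset if fired
        spikes.append(s)
        potentials.append(u)
    return np.array(spikes), np.array(potentials)

rng = np.random.default_rng(0)
I = rng.uniform(0.0, 2.0, size=20)           # toy input current over 20 timesteps
S, U = lif_forward(I)
print("spike train:", S.astype(int))
```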

2.2 Spiking Neural Networks↩︎

Drawing inspiration from biological neurons, SNNs extend conventional ANNs by incorporating temporal dynamics and discrete spike-based communication [3]. Leveraging this spike-driven mechanism, neuromorphic chips implement computation through event-driven spike routing and accumulation, which substitutes energy-intensive matrix–vector multiplications [4], [5]. This facilitates high parallelism, scalability, and exceptional power efficiency, with power consumption typically in the range of tens to hundreds of milliwatts [21].

Recently, the development of modern SNNs, e.g., Spiking Transformer, has demonstrated both attractive performance and reduced energy consumption [6], [7], [22]. Spikformer [6] pioneered the spike-based self-attention mechanism called Spiking Self Attention (SSA) that utilizes sparse spike-form Query, Key, and Value vectors to eliminate the need for energy-intensive softmax operations. Following its success, many works endeavor to enhance Spiking Transformers by adapting advanced ANN Transformer architectures [23], [24] or devising complicated spike coding mechanisms to reduce representation error (e.g., multi-threshold [25]/ multi-spike neurons [26], [27]).

Here, we instead address a fundamental question: What truly limits SNNs’ performance compared to ANNs? Our investigation reveals that the answer lies in frequency properties: specifically, spiking neurons function as low-pass filters, impeding the propagation of high-frequency details within the network. In addition to theoretical proof, we probe the importance of high-frequency information through our Max-Former, which features two frequency-enhancing operators: Max-Pool in patch embedding and DWC in place of early-stage self-attention. We further validate this principle in convolutional architectures with our proposed Max‑ResNet.

3 Methods↩︎

In this section, we first present a theoretical analysis of the frequency properties of spiking neurons. We show that, although the raw output spike trains of spiking neurons appear spectrally all‑pass due to their impulse‑shaped spike waveform, the resulting high‑frequency components are merely superficial and cannot be propagated through the network. In fact, spiking neurons act as low‑pass filters at the network level. This is a fundamental problem that has been overlooked in previous works. Building on this insight, we probe the importance of high-frequency information in SNNs through Max-Former, which strategically employs high-pass operators (Max-Pool and DWC) to restore high-frequency details and avoid feature degradation.

Figure 4: Time-frequency analysis of ReLU and spiking neurons. (a) Time-domain signals: input \(x(t) = \frac{1}{3}(\sin(2\pi \cdot 100t) + \sin(2\pi \cdot 200t) + \sin(2\pi \cdot 300t))\) (blue), ReLU-processed \(r(t)\) (red), and the spiking output of a LIF neuron with \(\beta = 0.25\), \(s(t)\) (green). (b) Fourier analysis of \(x(t)\), \(r(t)\), and \(s(t)\). (c) Fourier analysis of linearly transformed (CONV/MLP) activations, where ReLU expands the frequency bandwidth of the input signal, while the spiking neuron shows high-frequency attenuation.

3.1 Spiking Neurons are Low-pass Filters↩︎

We begin with an intuitive time-frequency analysis using an input \(x(t) = \frac{1}{3}(\sin(2\pi \cdot 100t) + \sin(2\pi \cdot 200t) + \sin(2\pi \cdot 300t))\) as shown in Figure 4. The results reveal three key observations: (1) In the time domain, the ReLU output \(r(t)\) perfectly follows \(x(t) > 0\), while the spiking neuron selectively responds to the 100Hz component (Figure 4 (a)); (2) However, the spiking output’s spectral response \(|S(f)|\) still appears nearly all‑pass, which contradicts the low‑frequency behavior observed in the time domain. These spurious high‑frequency components actually arise from the impulse‑shaped spike waveform itself rather than from genuine signal content (Figure 4 (a-b)); (3) The waveform‑induced high‑frequency components cannot be propagated across layers, resulting in a network‑level low‑pass behavior. When considering the whole process of input \(\rightarrow\) activation \(\rightarrow\) linear transform, ReLU expands the frequency bandwidth of \(x(t)\) [28], whereas the spiking neuron exhibits strong high-frequency attenuation (Figure 4 (c)).
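The observation above can be checked with a few lines of NumPy. The sketch below regenerates the three-tone input, passes it through ReLU and through a LIF neuron with \(\beta = 0.25\), and prints the dominant frequency bins of each signal; the firing threshold and sampling rate are arbitrary choices for illustration, not the exact settings behind Figure 4.

```python
# Reproduce the spirit of Figure 4: compare spectra of x(t), ReLU(x), and LIF spikes.
import numpy as np

fs = 2000                                   # sampling rate (Hz), assumed
t = np.arange(0, 1.0, 1.0 / fs)
x = (np.sin(2*np.pi*100*t) + np.sin(2*np.pi*200*t) + np.sin(2*np.pi*300*t)) / 3

r = np.maximum(x, 0.0)                      # ReLU output r(t)

beta, v_th, v = 0.25, 0.1, 0.0              # LIF; low threshold chosen for visibility
s = np.zeros_like(x)
for n, i_n in enumerate(x):
    u = beta * v + (1 - beta) * i_n
    s[n] = float(u >= v_th)
    v = u * (1 - s[n])                      # hard reset to zero (assumption)

freqs = np.fft.rfftfreq(len(t), 1.0 / fs)
for name, sig in [("input x(t)", x), ("ReLU r(t)", r), ("LIF spikes s(t)", s)]:
    mag = np.abs(np.fft.rfft(sig))
    top = np.sort(freqs[np.argsort(mag[1:])[-3:] + 1])   # 3 strongest non-DC bins
    print(f"{name:16s} dominant frequencies (Hz): {top}")
```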

We first examine the charging process of spiking neurons to theoretically analyze their frequency-selective properties. Combining the membrane update above with the charging function in Equ. 1, the sub-threshold dynamics can be formulated as: \[V[n] = \beta V[n-1] + (1 - \beta)I[n] \label{eq:lif_update_t}\tag{3}\] Applying the Z–transform with \(\mathcal{Z}\{V[n-1]\}=z^{-1}V(z)\) yields: \[V(z) = \beta z^{-1}V(z) + (1 - \beta)I(z) \label{eq:lif_update_z}\tag{4}\] which can be rearranged to formulate the transfer function from input current to membrane potential: \[H(z) = \frac{V(z)}{I(z)} = \frac{1 - \beta}{1 - \beta z^{-1}} ,\quad 0 \leq \beta < 1 \label{eq:transfer_function}\tag{5}\] Equ. 5 is exactly the form of a first‐order infinite‐impulse‐response (IIR) low‐pass filter with a single pole at \(z=\beta\). Accordingly, as \(\beta\), i.e., the pole, approaches 1, the LIF neuron exhibits stronger low‐frequency selectivity. Notably, the decay factor \(\beta\) relates to the membrane time constant \(\tau\) as \(\beta = 1 - \tfrac{1}{\tau}\); with \(\tau\) ranging from \(1\) to \(+\infty\), a smaller \(\tau\) yields a smaller \(\beta\). From an intuitive standpoint, a shorter time constant allows the membrane potential to respond within narrower temporal windows, rendering the neuron more sensitive to higher‑frequency inputs.
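The magnitude response of Equ. 5 can be verified numerically. The short script below evaluates \(|H(e^{j\omega})|\) for several decay factors and for an \(L\)-layer cascade (cf. Equ. 7); it is a plain NumPy check of the algebra, with no SNN library assumed.

```python
# Evaluate |H(e^{jw})| = |(1 - beta) / (1 - beta * e^{-jw})| on [0, pi].
import numpy as np

def lif_freq_response(beta, n_points=5, layers=1):
    w = np.linspace(0.0, np.pi, n_points)            # normalized frequency
    h = (1 - beta) / (1 - beta * np.exp(-1j * w))    # first-order IIR low-pass, Equ. 5
    return w, np.abs(h) ** layers                    # cascading L identical layers

for beta in (0.25, 0.5, 0.9):
    _, mag = lif_freq_response(beta)
    print(f"beta={beta}: |H| from DC to Nyquist =", np.round(mag, 3))

# Stacking layers sharpens the low-pass selectivity (Equ. 7).
_, mag_deep = lif_freq_response(0.5, layers=8)
print("beta=0.5, 8 cascaded layers:", np.round(mag_deep, 4))
```

The printout shows unity gain at DC and a gain of \((1-\beta)/(1+\beta)\) at the Nyquist frequency, which shrinks further as layers are cascaded.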

Moving from the individual neuron to network-level information transmission, the average membrane potential is positively and consistently correlated with the probability of spike firing during operation. We approximate the inherently nonlinear spike‐generation process as linear around the firing threshold \(V_{\mathrm{th}}\). Denoting the firing rate by \(\mathit{fr}(V)\), we define the local gain \(k\) and approximate the Z–domain spike train \(S(z)\) by:

\[k \;=\; \left.\frac{\partial\, \mathit{fr}(V)}{\partial V}\right|_{V = V_{\mathrm{th}}}, \qquad S(z) \;\approx\; k\,V(z)\]

When output spike trains are weighted by a causal synaptic kernel \(w[n]\), the Z-transformed output current (\(y[n]=w[n]*s[n]\)) can be written as \(Y(z)=W(z)\,S(z)\). The overall input‐to‐output transfer function is obtained by combining this with Equ. 5 and the linear approximation above: \[H'(z) \;=\; \frac{Y(z)}{I(z)} \;=\; S(z)\,W(z)\,H(z) \;=\; k\,W(z)\,\frac{1-\beta}{1-\beta\,z^{-1}}. \label{eq:overall_transfer}\tag{6}\]

The first-order low-pass IIR characteristic of \(H(z)\) makes the overall response inherently favor low-frequency signal components, regardless of whether the synaptic kernel \(W(z)\) or the spike coding process \(S(z)\) changes the gain or phase response. The low-pass term \(\bigl(\frac{1-\beta}{1-\beta z^{-1}}\bigr)^L\) further amplifies the system’s frequency selectivity when the process \(H'(z)\) is cascaded over \(L\) layers. The complete formula is as follows: \[H'_L(z) =\frac{Y_L(z)}{I(z)} =\prod_{i=1}^L \bigl[S_i(z)\,W_i(z)\,H(z)\bigr] =\Bigl(\prod_{i=1}^L k_i\,W_i(z)\Bigr)\, \Bigl(\frac{1-\beta}{1-\beta z^{-1}}\Bigr)^L \label{eq:overall_transfer_N}\tag{7}\] In the special case of non‐leaky IF neurons, which obey the charging process in Equ. 2, \(H(z)\) becomes: \[H(z) \;=\; \frac{1}{1-z^{-1}}, \label{eq:if_transfer}\tag{8}\] a discrete-time integrator with a pole at \(z=1\), i.e., an extreme low-pass filter, which is consistent with our previous analysis.

3.2 Max-Former↩︎

It remains unclear whether high‑frequency information is truly important for SNNs and whether restoring it can improve performance. We therefore systematically investigate the low-pass filtering characteristics of spiking neurons through Max-Former. To decouple frequency effects from model complexity, we: (1) replace self-attention with high-frequency-preserving DWC in the early stages, and (2) add Max-Pool in patch embedding to compensate for the spiking neurons’ low-pass preference. Notably, compared to the quadratic computational complexity of self-attention, DWC and Max-Pool only require linear complexity with respect to the sequence length and are more parameter-efficient. We consistently adopt the LIF neuron model throughout this work.

Figure 7: (a) Overview of Max-Former: we restore high-frequency signals by using lightweight DWCs instead of self-attention in the early stages. Following the hierarchical design of [29], Max-Former adopts a 3-stage architecture. \(D_i\): feature dimension of stage \(i\). (b) In Max-Former’s patch embedding stage, we propose three configurations (Embed-Orig, Embed-Max, and Embed-Max+) to enhance high-frequency components.

3.2.1 Overall Architecture↩︎

Figure 7 (a) illustrates the overall framework of Max-Former. The architecture consists of 3 stages with \(\frac{H}{4}\times\frac{W}{4}\), \(\frac{H}{8}\times\frac{W}{8}\), and \(\frac{H}{16}\times\frac{W}{16}\) tokens respectively, where \(H\) and \(W\) denote the height and width of the input image. Critically, Max-Former processes information through discrete spikes over time. This spike-driven computing paradigm supports two types of input:

(1) Event Streams: Asynchronous events \(e = [x, y, t, p]\) containing spatial coordinates \((x,y)\), timestamp \(t\), and polarity \(p\) are converted to event frames through temporal binning. Given original resolution \(dt_o\) and target \(dt = \alpha dt_o\), events are aggregated over \(\alpha\) consecutive bins: \[I_t = \sum_{k=\alpha t}^{\alpha(t+1)-1} S_k \in \mathbb{R}^{2\times h \times w}\]

where \(S_k\) denotes raw event data. The whole process denoises raw events and converts them to frame sequences at target temporal resolutions.
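A minimal NumPy sketch of this binning step is given below; the function name, the uniform binning rule, and the toy event array are illustrative assumptions rather than the exact preprocessing pipeline.

```python
# Aggregate raw events e = [x, y, t, p] into 2-channel frames over alpha consecutive bins.
import numpy as np

def events_to_frames(events, h, w, n_frames, alpha):
    """events: (N, 4) array of [x, y, t, p] with polarity p in {0, 1}."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3].astype(int)
    n_fine = n_frames * alpha                          # fine-grained temporal bins
    fine = ((t - t.min()) / (t.max() - t.min() + 1e-9) * n_fine).astype(int)
    fine = np.clip(fine, 0, n_fine - 1)
    coarse = fine // alpha                             # target temporal resolution dt
    frames = np.zeros((n_frames, 2, h, w), dtype=np.float32)
    np.add.at(frames, (coarse, p, y, x), 1.0)          # accumulate events per polarity
    return frames

rng = np.random.default_rng(0)
ev = np.column_stack([rng.integers(0, 128, 1000), rng.integers(0, 128, 1000),
                      np.sort(rng.uniform(0.0, 1.0, 1000)), rng.integers(0, 2, 1000)])
print(events_to_frames(ev, 128, 128, n_frames=4, alpha=5).shape)   # (4, 2, 128, 128)
```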

(2) Static Images: Conventional images are converted to spike sequences by: 1) Repeating static frames \(T\) times, 2) Encoding pixel intensities to spikes using spiking neurons. The resulting input is formulated as: \(I = \text{Spiking\_Embed}(\{I_t\}_{t=1}^T)\), which contains identical information across timesteps.
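In code, the static-image path amounts to repeating the frame \(T\) times before the first spiking layer; the Bernoulli rate encoder below is only a stand-in to show the temporal layout, whereas Max-Former lets its first spiking patch-embedding layer perform the actual encoding.

```python
# Repeat a static image T times and turn it into a toy spike sequence.
import torch

T = 4
img = torch.rand(1, 3, 32, 32)                       # one RGB image in [0, 1]
seq = img.unsqueeze(0).repeat(T, 1, 1, 1, 1)         # (T, B, C, H, W), identical per step
spikes = (torch.rand_like(seq) < seq).float()        # stand-in Bernoulli rate encoding
print(seq.shape, f"mean firing rate: {spikes.mean().item():.3f}")
```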

3.2.2 Patch Embedding↩︎

To transform the input into a tokenized representation, given input \(\{S\} \in \mathbb{R}^{T \times C \times H \times W}\), the process of patch embedding is formulated as: \[\mathbf{Y} = \mathcal{G}_1(\{S\}) + \mathcal{G}_2(\{S\}),\quad \mathbf{Y} \in \mathbb{R}^{T \times C' \times H' \times W'} \label{eq:transformation}\tag{9}\] where \(C' = 2C\), and \(H' = \lfloor H/P \rfloor\), \(W' = \lfloor W/P \rfloor\) with the patch size \(P=4\). To address spiking neurons’ inherent frequency preference, we present three patch embedding configurations as shown in Figure 7 (b): \[\begin{align} &\text{Embed-Orig~~~:} (\mathcal{G}_1, \mathcal{G}_2) = \text{(Embed, Embed)} \\ &\text{Embed-Max~~~:} (\mathcal{G}_1, \mathcal{G}_2) = \text{(Max-Embed, Embed)} \\ &\text{Embed-Max+ :} (\mathcal{G}_1, \mathcal{G}_2) = \text{(Max-Embed, Max-Embed)} \end{align}\] where \(\text{Embed} \equiv \{\text{LIF - CONV - BN}\}\), and \(\text{Max-Embed} \equiv \{\text{LIF - CONV - BN - MaxPool}\}\).
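The sketch below spells out the Embed and Max-Embed units in PyTorch. The LIF module is a simplified forward-only stand-in (the released code relies on SpikingJelly neurons), and the convolution/pooling hyperparameters are illustrative rather than the exact Max-Former settings.

```python
# Embed = {LIF - CONV - BN}; Max-Embed = {LIF - CONV - BN - MaxPool}; applied per timestep.
import torch
import torch.nn as nn

class SimpleLIF(nn.Module):
    """Forward-only LIF over inputs shaped (T, B, C, H, W); hard reset (assumption)."""
    def __init__(self, beta=0.5, v_th=1.0):
        super().__init__()
        self.beta, self.v_th = beta, v_th

    def forward(self, x):
        v, spikes = torch.zeros_like(x[0]), []
        for x_t in x:                                  # iterate over timesteps
            u = self.beta * v + (1 - self.beta) * x_t
            s = (u >= self.v_th).float()
            v = u * (1 - s)
            spikes.append(s)
        return torch.stack(spikes)

class TimeWrap(nn.Module):
    """Apply a 2D layer independently at every timestep."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):                              # x: (T, B, C, H, W)
        T, B = x.shape[:2]
        y = self.layer(x.flatten(0, 1))
        return y.view(T, B, *y.shape[1:])

def embed(c_in, c_out):                                # Embed: LIF -> Conv -> BN
    return nn.Sequential(SimpleLIF(),
                         TimeWrap(nn.Conv2d(c_in, c_out, 3, stride=1, padding=1)),
                         TimeWrap(nn.BatchNorm2d(c_out)))

def max_embed(c_in, c_out):                            # Max-Embed: Embed -> MaxPool
    return nn.Sequential(embed(c_in, c_out), TimeWrap(nn.MaxPool2d(2)))

x = torch.rand(4, 2, 3, 32, 32)                        # (T, B, C, H, W)
print(max_embed(3, 64)(x).shape)                       # torch.Size([4, 2, 64, 16, 16])
```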

3.2.3 Token Mixing↩︎

In Transformers, lower layers typically require more high-frequency details, while higher layers benefit from more global information [30], [31]. As in biological vision, high-frequency details enable the early stages to learn low-level features while progressively building local-to-global representations. Accordingly, we replace early-stage self-attention with DWC to preserve the high frequencies essential for local feature learning. Given the input embedding \(\mathbf{Y} \in \mathbb{R}^{T \times C \times H \times W}\), the spiking DWC is defined as: \[\begin{align} \mathbf{Z}_c(\mathbf{Y})[i] &= \text{LIF}(\sum_{j \in \Omega(i)} w_{c,j} \cdot \mathbf{Y}_c[j]) \end{align}\]

where \(\Omega(i)\) denotes the local neighborhood of position \(i\), \(w_{c,j}\) represents the learnable convolution weights for channel \(c\), and \(\mathbf{Y}_c, \mathbf{Z}_c \in \mathbb{R}^{T \times H \times W}\) are the input and output slices for channel \(c\). For the final stage, we implement token mixing via Spiking Self-Attention (SSA) [6]. The SSA computation follows: \[\begin{align} &\mathbf{Z} = \text{LIF}(\text{BN}(\mathbf{Y}\mathbf{W})), \quad \mathbf{Z} \in \{Q, K, V\} \tag{10} \\ &\text{SSA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{LIF}(\mathbf{QK^T}\mathbf{V} \cdot s) \tag{11} \end{align}\]

where \(\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times N \times H \times W}\) are spike-form tensors generated by learnable linear layers, \(s\) is a scaling factor. SSA eliminates floating-point multiplications, ensuring spike-driven compatibility.
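A compact PyTorch sketch of the SSA block in Equ. 10–11 is given below; the thresholding function is a stand-in for the LIF layer and the scaling factor is an arbitrary example value. For the early stages, the token mixer would instead be a per-timestep depth-wise convolution (e.g., nn.Conv2d(C, C, 3, padding=1, groups=C) followed by BN and a spiking neuron), which is what Max-Former substitutes for SSA.

```python
# Spiking Self-Attention on binary Q, K, V: no softmax, only accumulations and a scale.
import torch
import torch.nn as nn

class SpikingSelfAttention(nn.Module):
    def __init__(self, dim, scale=0.125, v_th=1.0):
        super().__init__()
        self.q_proj = nn.Sequential(nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim))
        self.k_proj = nn.Sequential(nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim))
        self.v_proj = nn.Sequential(nn.Linear(dim, dim, bias=False), nn.BatchNorm1d(dim))
        self.scale, self.v_th = scale, v_th

    def spike(self, x):                        # stand-in for a LIF neuron layer
        return (x >= self.v_th).float()

    def forward(self, y):                      # y: (T, B, N, D) spike-form tokens
        T, B, N, D = y.shape
        flat = y.reshape(T * B * N, D)
        q = self.spike(self.q_proj(flat)).reshape(T, B, N, D)
        k = self.spike(self.k_proj(flat)).reshape(T, B, N, D)
        v = self.spike(self.v_proj(flat)).reshape(T, B, N, D)
        attn = q @ k.transpose(-2, -1)         # (T, B, N, N), integer-valued counts
        out = (attn @ v) * self.scale          # Equ. 11, before the output neuron
        return self.spike(out)                 # spike-form output

y = (torch.rand(4, 2, 16, 32) > 0.8).float()   # toy binary token embeddings
print(SpikingSelfAttention(32)(y).shape)       # torch.Size([4, 2, 16, 32])
```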

Figure 8: Shortcut connection in SNNs. (Left) Vanilla Shortcut that combines spike and membrane potential. (Middle) Pre-Spike Shortcut that adds spike signals before neuron charging. (Right) Membrane Shortcut that directly connects membrane potentials, ensuring identical potential mapping while strictly preserving the spike-driven computing paradigm throughout the network.

3.3 Membrane Shortcut↩︎

Residual learning and shortcuts enable very deep networks to be trained effectively by providing identity shortcuts that preserve information flow and mitigate vanishing gradients [32]. For SNNs, a crucial consideration is maintaining compatibility with the spike-driven computing paradigm throughout all operations. As Fig. 8 shows, existing SNN shortcut implementations fall into three categories: (1) Vanilla Shortcut [33], (2) Pre-Spike Shortcut [34], and (3) Membrane Shortcut [35].

The Vanilla Shortcut scheme [33] directly connects spikes (binary) to membrane potentials (continuous), leading to a distribution mismatch that inherently violates the identity mapping principle. The Pre-Spike Shortcut [34] adds spike signals before neuron charging, resulting in summation values that range from 0 to 2. This disrupts the binary spike representation and spike-driven flow in SNNs.

In this work, we adopt the Membrane Shortcut [35] for its dual advantages: it preserves identity mapping by directly connecting membrane potentials while maintaining binary spike outputs that remain inherently compatible with spike-driven computation. Unlike Vanilla or Pre-Spike Shortcuts, this approach ensures both mathematical consistency with residual learning principles and seamless compatibility with spike-driven operations. We provide a detailed analysis of its impact on model performance and energy costs in Appendix 10.
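The three shortcut placements can be summarized in a few schematic lines; the block below uses a single Conv-BN path and a bare Heaviside firing rule purely for illustration, so it reflects the structure of Figure 8 rather than any specific published implementation.

```python
# Schematic comparison of the three SNN shortcut schemes (Figure 8).
import torch
import torch.nn as nn

conv_bn = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16))

def heaviside(u, v_th=1.0):
    return (u >= v_th).float()

def vanilla_shortcut(spike_in):
    u = conv_bn(spike_in)                 # continuous membrane-potential-like tensor
    return heaviside(u + spike_in)        # mixes binary spikes with potentials (mismatch)

def pre_spike_shortcut(spike_in):
    s = heaviside(conv_bn(spike_in))
    return s + spike_in                   # values in {0, 1, 2}: no longer binary

def membrane_shortcut(spike_in, u_prev):
    u = conv_bn(spike_in) + u_prev        # add membrane potentials directly
    return heaviside(u), u                # binary spikes out; potential carried onward

x = (torch.rand(1, 16, 8, 8) > 0.7).float()
s, u = membrane_shortcut(x, torch.zeros(1, 16, 8, 8))
print("membrane shortcut spike values:", s.unique().tolist())
```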

4 Experiment↩︎

As shown in Figure 9, Max-Former restores high-frequency information by grafting the merits of frequency-enhancing operators: Max-Pool and DWC. To empirically probe the importance of high-frequency information in Spiking Transformers, we evaluate Max-Former through comprehensive experiments on static datasets (CIFAR-10 [36], CIFAR-100 [37] and ImageNet [38]) and neuromorphic datasets (CIFAR10-DVS [39], DVS128 Gesture [40]), with architecture configurations detailed in Table 1. In addition, we design Max‑ResNet to further investigate the effect of high‑frequency restoration in convolutional architectures (model implementation is detailed in Appendix 7.3). Experimental settings and energy analysis methods are detailed in Appendix 7 and 8.

Figure 9: Fourier spectrum of Spiking Neurons, Spiking Max-Pool, Spiking Depth-Wise Convolution and Spiking Self-attention.


Table 1: Max-Former architecture configurations for different classification tasks. Notation DWC-\(N\) denotes depth-wise convolution with kernel size \(N \times N\). For block settings: CIFAR-10/100: 3 stages (1/1/2 blocks); ImageNet: 3 stages (1/3/7 blocks); Neuromorphic: 2 stages (1/1 blocks).
Dataset | Stage 1 (Patch Embed / Token Mix) | Stage 2 (Patch Embed / Token Mix) | Stage 3 (Patch Embed / Token Mix)
CIFAR-10 [36] / CIFAR-100 [37] | Embed-Orig / Identity | Embed-Max / DWC-3 | Embed-Max / SSA
ImageNet [38] | Embed-Orig / DWC-7 | Embed-Max / DWC-5 | Embed-Max / SSA
Neuromorphic [39], [40] | Embed-Max+ / DWC-3 | Embed-Max / SSA | –

4.1 Results on CIFAR and Neuromorphic Datasets↩︎


Table: Comparison on CIFAR-10, CIFAR-100, DVS128 Gesture, and CIFAR10-DVS. Each entry lists Param (M) / timestep T / top-1 accuracy (%). * denotes models using the membrane shortcut.
Model | CIFAR-10 | CIFAR-100 | DVS128 Gesture | CIFAR10-DVS
ResNet-19 (ANN) [24] | 12.63 / 1 / 94.97 | 12.63 / 1 / 75.35 | – | –
Max-Former (ANN) | 6.57 / 1 / 96.82 | 6.60 / 1 / 82.41 | – | –
Spikformer [6] | 9.32 / 4 / 95.51 | 9.32 / 4 / 78.21 | 2.57 / 16 / 98.3 | 2.57 / 16 / 80.9
S-Transformer [41] | 10.28 / 4 / 95.60 | 10.28 / 4 / 78.40 | 2.57 / 16 / 99.3 | 2.57 / 16 / 80.0
SWformer [17] | 7.51 / 4 / 96.10 | 7.51 / 4 / 79.30 | – | 2.05 / 16 / 83.9
QKFormer [24] | 6.74 / 4 / 96.18 | 6.74 / 4 / 81.15 | 1.50 / 16 / 98.6 | 1.50 / 16 / 84.0
MS-QKFormer* | 6.74 / 4 / 96.84 | 6.74 / 4 / 81.57 | 1.50 / 16 / 98.6 | 1.50 / 16 / 82.3
Max-Former* | 6.57 / 4 / 97.04 | 6.60 / 4 / 82.65 | 1.45 / 16 / 98.6 | 1.45 / 16 / 84.2


Table: Comparison on ImageNet-1K. Param (M), estimated energy (mJ), timestep T, top-1 accuracy (%). * denotes models using the membrane shortcut.
Model | Type | Param (M) | Energy (mJ) | T | Acc (%)
ViT-L/16 [42] | ANN | 304.3 | 80.96 | 1 | 79.70
DeiT-B [43] | ANN | 86.6 | 80.50 | 1 | 81.80
PVT-Large [44] | ANN | 61.4 | 45.08 | 1 | 81.70
MST (Swin Transformer-T) [7] | A2S | 28.50 | – | 512 | 78.51
Spikformer-8-384 [6] | SNN | 16.81 | 7.73 | 4 | 70.24
Spikformer-8-768 [6] | SNN | 66.34 | 21.48 | 4 | 74.81
S-Transformer-8-384 [41] | SNN | 16.81 | 3.90 | 4 | 72.28
S-Transformer-8-768 [41] | SNN | 66.34 | 6.10 | 4 | 76.30
Meta-Spikformer | SNN | 31.3 | 7.80 | 1 | 75.4
Meta-Spikformer | SNN | 31.3 | 32.80 | 4 | 77.2
SWformer-8-512 [17] | SNN | 27.6 | 5.08 | 4 | 75.43
QKFormer (HST-10-384) [24] | SNN | 16.47 | 15.13 | 4 | 78.80
QKFormer (HST-10-768) [24] | SNN | 64.96 | 8.52 | 1 | 81.69
MS-QKFormer* (HST-10-384) | SNN | 16.47 | 5.52 | 4 | 76.48
MS-QKFormer* (HST-10-768) | SNN | 64.96 | 6.79 | 1 | 77.78
Max-Former* (Max-10-384) | SNN | 16.23 | 4.89 | 4 | 77.82
Max-Former* (Max-10-512) | SNN | 28.65 | 2.50 | 1 | 75.47
Max-Former* (Max-10-512) | SNN | 28.65 | 7.49 | 4 | 79.86
Max-Former* (Max-10-768) | SNN | 63.99 | 5.27 | 1 | 78.60
Max-Former* (Max-10-768) | SNN | 63.99 | 14.87 | 4 | 82.39

As shown in Table [tab:comparison], Max-Former delivers performance improvements across both static datasets (CIFAR10/CIFAR100) and neuromorphic datasets (DVS128/CIFAR10-DVS). Notably, for CIFAR10/100 classification, its first stage only uses identity mapping for token mixing (Table 1), yet it still attains competitive results. Max-Former achieves 97.04% accuracy on CIFAR10 with only 6.57M parameters at T=4, surpassing Spikformer (95.51%, 9.32M), S-Transformer (95.60%, 10.28M), and QKFormer (96.18%, 6.74M). Similarly, on CIFAR100, Max-Former attains 82.65% accuracy, significantly outperforming Spikformer (78.21%), S-Transformer (78.40%), and QKFormer (81.57%).

Max-Former and QKFormer share a similar hierarchical architecture, though QKFormer originally employs pre-spike shortcuts [24]. For a fair comparison, we additionally implemented QKFormer with the Membrane Shortcut (denoted as MS-QKFormer in the table) using identical training configurations. Max-Former still outperforms MS-QKFormer by 0.2% on CIFAR10 (97.04% vs. 96.84%) and by 1.08% on CIFAR100 (82.65% vs. 81.57%), while requiring slightly fewer parameters (6.57M vs. 6.74M). For neuromorphic datasets, Max-Former maintains this performance advantage. On DVS128, it achieves 98.6% accuracy, matching MS-QKFormer with the membrane shortcut. On CIFAR10-DVS, Max-Former reaches 84.2% accuracy, exceeding MS-QKFormer (82.3%) by 1.9% and surpassing other spike-driven models like S-Transformer (80.0%) and SWformer (83.9%).

4.2 Results on ImageNet Classification↩︎

Table [tab:big_comparison] shows Max-Former’s performance on ImageNet classification, demonstrating its effectiveness for complex visual tasks. Max-Former-10-768 (T=4) achieves 82.39% accuracy (+7.58% over Spikformer) with 30% lower energy (14.87mJ vs 21.48mJ), despite using only lightweight DWC for early-stage token mixing. It also outperforms the ANN-to-SNN MST model (78.51%), which requires 512 timesteps. Training/inference speed and memory usage are analyzed in Appendix 9.

Our analysis focuses on models using the Membrane Shortcut, which eliminates the energy-inefficient ternary spike transmissions (\(\{0,1,2\}\) in the Pre-Spike Shortcut) while maintaining full compatibility with standard neuromorphic hardware without additional hardware overhead (see Appendix 10). For fair comparison, we implemented a membrane-shortcut variant of QKFormer (denoted MS-QKFormer), which shows 64% lower energy (5.52mJ vs 15.13mJ) on HST-10-384.

Max-Former-10-384 (16.23M, T=4) achieves 77.82% accuracy, outperforming MS-QKFormer-10-384 (16.47M, 76.48%), S-Transformer-8-768 (66.34M, 76.3%), and Meta-Spikformer (31.3M, 77.2%). For energy efficiency under identical settings, Max-Former-10-384 consumes 4.89mJ, significantly lower than MS-QKFormer-10-384 (5.52mJ), S-Transformer (6.10mJ), and Meta-Spikformer (32.8mJ). Compared to conventional ANN models, Max-Former demonstrates concrete advantages in energy efficiency while maintaining competitive accuracy. Specifically, when compared to PVT-Large (a representative hierarchical ANN), Max-Former-10-768 (T=4) achieves comparable accuracy (82.39% vs. 81.70%) with 67% lower energy consumption (14.87mJ vs. 45.08mJ). These results confirm the importance of high-frequency information in Spiking Transformers: replacing energy-intensive self-attention with lightweight DWC in early stages actually produces better performance.


Table 2: Ablation of Patch-Embedding/ Token-Mixing Strategies on CIFAR100 and CIFAR10‐DVS.
Patch Embed (CIFAR100) | Token Mix (CIFAR100) | Acc (%) | Patch Embed (CIFAR10-DVS) | Token Mix (CIFAR10-DVS) | Acc (%)
Orig/Max/Max | Identity/DWC-3/SSA | 82.65 | Max+/Max | DWC-3/SSA | 84.2
Orig/Orig/Orig | Identity/DWC-3/SSA | 81.63 | Orig/Orig | DWC-3/SSA | 79.2
Orig/Max/Orig | Identity/DWC-3/SSA | 81.88 | Orig/Max | DWC-3/SSA | 81.5
Orig/Max/Max | Identity/Identity/SSA | 81.28 | Max+/Max | DWC-1/SSA | 81.2
Orig/Max/Max | Identity/DWC-5/SSA | 82.02 | Max+/Max | DWC-5/SSA | 82.7
Orig/Max/Max | DWC-7/DWC-5/SSA | 82.42 | Max+/Max | DWC-7/SSA | 82.1
Orig/Max/Max | SSA/SSA/SSA | 82.23 | Max+/Max | SSA/SSA | 83.9
Orig/Orig/Orig | SSA/SSA/SSA | 81.43 | Orig/Orig | SSA/SSA | 79.8

4.3 Ablation Study↩︎

We provide direct evidence for the critical role of high-frequency information in Spiking Transformers through in-depth ablation studies.

(1) Ablation on Patch Embedding Strategies: Proper patch embedding strategies help to unlock the performance limit of Spiking Transformers. On CIFAR100, changing Max-Former’s patch embedding strategy from the proposed Embed-Orig/Embed-Max/Embed-Max to the default Embed-Orig/Embed-Orig/Embed-Orig configuration reduces performance from 82.65% to 81.63%. Neuromorphic datasets exhibit a stronger dependence on high-frequency components. On CIFAR10-DVS, replacing the first stage’s Embed-Max+ with Embed-Orig causes a significant performance drop from 84.2% to 81.5%. Incorporating high-frequency information through patch embedding proves effective: when purely using SSA token mixing, optimizing the patch embedding strategy improves accuracy by +0.8% on CIFAR-100 and +4.1% on CIFAR10-DVS, highlighting its critical role in spiking architectures.

(2) Ablation on Token Mixing Strategies: Max-Former further restores high-frequency information through the token mixing strategy. Although SSA incurs higher energy and parameter costs, Max-Former achieves better performance by simply replacing early-stage self-attention with lightweight DWC. On CIFAR-100, this substitution leads to a +0.42% performance gain (all-SSA: 82.23% vs. Max-Former: 82.65%). On CIFAR10-DVS, Max-Former with hybrid token mixing (DWC-3 + SSA) achieves 84.2% accuracy, outperforming the full-SSA variant (83.9%) by 0.3%. The kernel size selection for high-frequency preservation is also of great significance. Larger kernels (DWC-5/7) degrade performance (-0.63% on CIFAR100, -2.1% on CIFAR10-DVS) due to excessive feature smoothing. Insufficient filtering with DWC-1 also underperforms (81.2% vs 84.2% on CIFAR10-DVS). This highlights the necessity of balancing high-frequency and low-frequency components in Spiking Transformers.

Overall, Max-Former’s performance gains stem from effective high-frequency information propagation, independent of parameter count, which is evidenced by: (1) Max-pooling patch embedding (Embed-Max/Embed-Max+) consistently outperforms the original versions despite similar parameter budgets. (2) Larger DWC kernels (DWC-5/DWC-7) increase parameters but degrade accuracy (-0.63% on CIFAR100, -2.1% on CIFAR10-DVS vs. Max-Former with DWC-3). See Appendix 11 for visualizations and Appendix 12 for a discussion of limitations.

4.4 Generality across Convolutional Architectures↩︎

We further extend the effectiveness of Max-Former to convolutional architectures by proposing Max-ResNet. The key modification in our Max-ResNet lies in the inclusion of only two additional max-pooling operations compared to MS-ResNet [45]. The detailed implementation of Max-ResNet is provided in Appendix 7.3. Training settings are listed in Appendix 7.

High-frequency information is essential for SNNs. As shown in Table [Tab:resnet], Max-ResNet achieves a remarkable performance improvement over the baseline MS-ResNet, despite having identical model sizes. Specifically, with the block configuration [2, 2, 2, 2], Max-ResNet improves the CIFAR-10 accuracy by +2.41% (from 94.4% to 96.81%) and the CIFAR-100 accuracy by +6.48%. Similarly, under the [3, 3, 2] configuration, the accuracy increases by +2.25% and +6.65% on CIFAR-10 and CIFAR-100, respectively.

In short, Max‑ResNet‑18 sets state‑of‑the‑art benchmarks across convolutional baselines, with only a moderate model size and an extremely straightforward high-frequency restoration strategy. Therefore, preserving high‑frequency information is fundamental to effective feature representation in SNNs, regardless of architecture.

Table: Max-ResNet vs. convolution-based SNN baselines on CIFAR-10/100.
Architecture | Blocks | Param (M) | T | CIFAR-10 Acc (%) | CIFAR-100 Acc (%)
KDSNN-ResNet-18 [46] | [2, 2, 2, 2] | 11.22 | 4 | 95.72 | 78.46
MS-ResNet [45] | [2, 2, 2, 2] | 11.22 | 4 | 94.40 | 75.06
MS-ResNet [45] | [3, 3, 2] | 12.50 | 4 | 94.92 | 76.41
– | [2, 2, 2, 2] | 21.33 | 4 | 94.69 | 75.34
Max-ResNet (ours) | [2, 2, 2, 2] | 11.22 | 4 | 96.81 (+2.41) | 81.54 (+6.48)
Max-ResNet (ours) | [3, 3, 2] | 12.50 | 4 | 97.17 (+2.25) | 83.06 (+6.65)

5 Conclusion↩︎

This work challenges the prevailing assumption that binary activation constraints are the primary cause of SNNs’ performance gap. Through theoretical analysis and empirical validation, we demonstrate for the first time that spiking neurons inherently function as low-pass filters at the network level, resulting in the rapid attenuation of high-frequency components that critically degrade feature representation. We demonstrate that high‑frequency information is crucial for effective spiking computation: Max‑Former (63.99M parameters) achieves 82.39% top‑1 accuracy on ImageNet, outperforming Spikformer (74.81%, 66.34M parameters) by +7.58%, while reducing energy consumption by 30% at a comparable model size; Max‑ResNet‑18 further achieves state‑of‑the‑art performance among convolutional baselines – 97.17% on CIFAR‑10 and 83.06% on CIFAR‑100. Notably, all these improvements are achieved through extremely simple modifications, even slightly reducing the overall model size. We believe this simple yet effective solution will motivate future research to explore the unique properties of SNNs, beyond the established practices in ANN studies.

6 Acknowledgements↩︎

This work is supported by the Major Science and Technology Innovation 2030 “Brain Science and Brain-like Research” key project (No. 2021ZD0201405), the Guangzhou-HKUST(GZ) Joint Funding Program (Grant No. 2023A03J0682), the National Natural Science Foundation of China (Grant No. 62405255), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515110679), and partially supported by a collaborative project with Brain Mind Innovation, Inc.

7 Analysis of High-Frequency Information in Spiking Transformers↩︎

To validate our approach, we conduct experiments across three static and two neuromorphic datasets. In this section, we first provide the detailed experimental setup used to obtain the results presented in our paper. We then perform additional analysis on the importance of high-frequency information in Spiking Transformers. For complete parameter configurations, please refer to our public code repository at https://github.com/bic-L/MaxFormer.

Table 3: Hyperparameters for image classification on different datasets.
Hyperparameter | ImageNet | CIFAR-10 | CIFAR-100 | Neuromorphic
Model Size | 10–384 / 10–512 / 10–768 | 4–384 | 4–384 | 2–256
Epochs | 200 | 400 | 400 | 106
Resolution | \(224\times224\) | \(32\times32\) | \(32\times32\) | \(128\times128\)
Batch Size | 512 (8 \(\times\) 64) | 128 | 64 | 16
Optimizer | AdamW | AdamW | AdamW | AdamW
Learning rate | \(1.35\times10^{-3}\,(T=4)\)
Learning rate decay | Cosine | Cosine | Cosine | Cosine
Warmup epochs | 5 | 20 | 20 | 10
Weight decay | 0.05 | 0.06 | 0.06 | 0.06
Rand Augment | 9 / 0.5 | 9-n1 / 0.4 | 9-n1 / 0.4 | –
Mixup | 0.25 / 0.4 / 0.8 | 0.5 | 0.75 | 0.5
CutMix | 1 | 0.5 | 0.5 | –
Mixup prob | 0.5 | 1 | 1 | 0.5
Erasing prob | 0.0 | 0.25 | 0.25 | –
Label smoothing | 0.1 | 0.1 | 0.1 | 0.1

7.1 Experimental Details↩︎

Datasets: We evaluate Max-Former through comprehensive experiments on static datasets (CIFAR-10 [36], CIFAR-100 [37] and ImageNet [38]) and neuromorphic datasets (CIFAR10-DVS [39], DVS128 Gesture [40]). The training and inference pipeline are implemented in SpikingJelly [47].

Static Datasets: For static image classification, we evaluate on three standard benchmarks. ImageNet-1k [38] is one of the most widely used datasets in computer vision. It contains 1.28 million training, 50,000 validation, and 100,000 test images covering the common 1K classes. Both CIFAR-10 [36] and CIFAR-100 [37] include 50,000 training images and 10,000 testing images with 32\(\times\)​32 resolution. The main difference between them is that CIFAR-10 has 10 categories for classification, while CIFAR-100 has 100 categories.

Neuromorphic Datasets: For event-based vision tasks, we evaluate on two standard benchmarks. CIFAR10-DVS [39] is an event-based version of the CIFAR-10 dataset, created by capturing moving image samples using the Dynamic Vision Sensor (DVS). It includes 10,000 event-based images (128\(\times\)​128 pixels) spread across 10 classes, with 9,000 samples for training and 1,000 for testing. The DVS128 Gesture dataset [40] contains 1,342 event-based recordings of 11 different hand gesture types performed by 29 people under 3 different lighting conditions. Each gesture recording lasts about 6 seconds on average.

Hyper Parameters: Our training scheme mainly follows [24] and  [41]. Specifically, MixUp [48], CutMix [49] and RandAugment [50] are used for data augmentation. The models are trained using AdamW optimizer [51] with the weight decay of 0.05 for ImageNet-1K classification tasks and the weight decay of 0.06 for all other datasets. Label Smoothing [52] is set as 0.1. Detailed training hyperparameters are shown in Table 3. For our ImageNet experiments, we used 8 NVIDIA A30 GPUs to train most models. However, for the MaxFormer-10-512 (T=4) and MaxFormer-10-768 (T=4) models, we used 8 NVIDIA H20 GPUs instead. For the smaller datasets (CIFAR10, CIFAR100, DVS128 Gesture, and CIFAR10-DVS), we used a single A30 GPU for training.

7.2 Impact of High-Frequency Information on Model Performance↩︎

Table 4: Patch embedding and token mixing schemes on CIFAR-100. Acc.:Top-1 Accuracy (%). DWC-\(N\): spiking depth-wise convolution with kernel size \(N \times N\). SSA: Spiking Self-Attention.
Patch Embed Token Mix Acc. (%)
(1) Embed-Orig/Embed-Orig/Embed-Orig Avg-Pool/Avg-Pool/Avg-Pool 76.73
Max-Pool/Max-Pool/Max-Pool 79.12
(2) Embed-Orig/Embed-Orig/Embed-Max Identity/Identity/Identity 80.11
Embed-Orig/Embed-Max/Embed-Max Identity/Identity/Identity 80.46
(3) Embed-Orig/Embed-Max/Embed-Max Avg-Pool/Avg-Pool/Avg-Pool 77.61
Max-Pool/Max-Pool/Max-Pool 79.78
Identity/Max-Pool/Identity 79.99
Identity/Identity/Max-Pool 80.12
(4) Embed-Orig/Embed-Max/Embed-Max SSA/SSA/SSA 82.13
DWC-3/DWC-5/SSA 82.36
DWC-5/DWC-3/SSA 82.46
DWC-5/DWC-7/SSA 82.45
DWC-7/DWC-5/SSA 82.42
DWC-3/DWC-3/SSA 82.59
Identity/DWC-3/SSA 82.65
(5) Embed-Orig/Embed-Max/Embed-Max SSA+DWC-5/SSA+DWC-5/SSA+DWC-5 82.09
Embed-Orig/Embed-Orig/Embed-Orig SSA+DWC-3/SSA+DWC-3/SSA+DWC-3 82.56
Embed-Orig/Embed-Orig/Embed-Orig SSA+DWC-5/SSA+DWC-5/SSA+DWC-5 82.73

In Table 4, we provide additional analysis on the impact of high-frequency information. We conducted these experiments on CIFAR-100 [37], using the experimental settings detailed in Table 3. We discuss the ablation below according to the following aspects:

(a) Extra Max-Pool in Patch Embedding/Token Mixing:

High-frequency information plays a critical role in the performance of Spiking Transformers due to the inherent low-pass filter characteristics of spiking neurons. The experimental results shown in Table 4 (1) reveal that strategically preserving these frequencies through max-pooling operations significantly enhances model accuracy, with a 2.39% improvement when replacing average pooling with max-pooling across all stages (76.73% to 79.12%). In Table 4 (2), the performance of the Spiking Transformer increases progressively when extending Embed-Max from the last patch embedding block (80.11%) to also include the middle block (80.46%).

However, excess high-frequency information will instead impair model performance. For instance, in Table 4 (3), switching from avg-pool to max-pool for all token mixing improves top-1 accuracy from 77.61% to 79.78%. Yet we found even better results with a more targeted setting: employing max-pooling exclusively in the middle stage increases accuracy to 79.99%, while restricting it to only the last stage further pushes performance to 80.12%.

This happens because spiking neurons act like low-pass filters that naturally reduce high-frequency components as information moves deeper through the network. Therefore, strategically adding back high-frequency components at specific points in the network is crucial for pushing the performance limit of Spiking Transformers.

(b) Spiking Transformers Benefit from High-Frequency Information:

In biological vision, high-frequency details help early processing stages learn elementary features, which are then gradually built from local to global representations. Similarly, in standard non-spiking Transformers, the lower layers typically need more high-frequency details, while higher layers work better with global information. Spiking Transformers follow the same design philosophy, but with an important difference: they need additional frequency-enhancing operations (e.g., max-pooling and depth-wise convolution) to restore high-frequency information that would otherwise be lost.

In Table 4 (4), we show that a proper token mixing strategy can effectively restore high-frequency information in the Spiking Transformer, resulting in significant performance gains. By replacing Spiking Self-Attention (SSA) with depth-wise convolution (DWC) for token mixing, we improved performance from 82.13% to 82.65%. Importantly, this improvement does not come from adding more parameters or computational burden. For example, the Identity/DWC-3/SSA combination works 0.29% better than DWC-3/DWC-5/SSA, even though the former has a lower computational cost. Our further experiments in Table 4 (4-5) confirm that these findings hold true in the full-SSA network: restoring high-frequency components can significantly optimize the performance from 82.13% to 82.73% (+0.6%) on CIFAR-100. A proper high-frequency enhancement strategy is essential to unlocking the full potential of Spiking Transformers.

7.3 Max-ResNet Implementation↩︎

As shown in Figure 10, Max‑ResNet introduces only a minor architectural change to MS‑ResNet [45]: all layers are replaced with Max‑ResNet layers, while the first layer remains unchanged. Code implementation is available at https://github.com/bic-L/MaxFormer.

Figure 10: Overview of Max-ResNet. A single Max‑Pool operation is added per block and layer.
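A purely illustrative sketch of one residual block is shown below: it keeps the membrane shortcut of MS-ResNet and inserts a stride-1 Max-Pool on the convolutional path. The exact number and placement of the added pooling operations in Max-ResNet follow Figure 10 and the released code; this block only conveys the idea.

```python
# Hypothetical Max-ResNet-style block: membrane shortcut + an extra high-pass Max-Pool.
import torch
import torch.nn as nn

class MaxResBlockSketch(nn.Module):
    def __init__(self, channels, v_th=1.0):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)   # added high-pass operator
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.v_th = v_th

    def spike(self, u):                     # stand-in for the LIF layers of MS-ResNet
        return (u >= self.v_th).float()

    def forward(self, u_in):                # u_in: membrane-potential tensor
        u = self.conv1(self.spike(u_in))    # spike-driven convolutional path
        u = self.conv2(self.spike(self.pool(u)))
        return u_in + u                     # membrane shortcut: add potentials directly

x = torch.randn(2, 64, 32, 32)
print(MaxResBlockSketch(64)(x).shape)       # torch.Size([2, 64, 32, 32])
```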

8 Energy Analysis↩︎

To estimate the theoretical energy consumption of Spiking Transformers, we follow the methodology used in previous studies [6], [17], [41], [53]. It is worth noting that the batch normalization (BN) layers and linear scaling transformations that follow convolution layers can be combined directly into the convolution layers themselves with added bias terms during deployment. Thus, in common practice [6], [17], [41], [53], the energy consumption of BN is typically excluded when calculating theoretical energy usage. For fair comparison, our work adopts the same strategy. In Spiking Transformers, energy consumption is directly proportional to synaptic operations (SOPs), which can be calculated as: \[\text{SOPs}(l) = fr \times \text{T} \times \text{FLOPs}(l)\]

where \(l\) represents a specific block or layer in the Spiking Transformer architecture, \(fr\) refers to the firing rate of the input spike train for that particular block or layer, and \(\text{T}\) is the simulation time step of spiking neurons. Assuming the multiply-accumulate (MAC) and accumulate (AC) operations are implemented on the 45nm neuromorphic chip described in [5], where each MAC operation uses \(E_\text{MAC}\) = 4.6pJ of energy and each AC operation uses \(E_\text{AC}\) = 0.9pJ, we can estimate the total energy consumption of a Spiking Transformer by adding up the energy used by all MAC and AC operations across all layers: \[\begin{align} E_\text{SNN} &= E_\text{MAC} \times \text{FLOP}^{{1}}_\text{CONV} + E_\text{AC} \times (\sum_{n=2}^{N}\text{SOP}^{n}_\text{SNN Conv} + \sum_{j=1}^{M}\text{SOP}^{j}_\text{SNN FC}) \end{align}\]

\(\text{FLOP}^{{1}}_\text{CONV}\) represents the floating-point operations in the first layer, which converts non-spike inputs into spike form for static image classification tasks. Since this layer performs floating-point computations, we estimate its energy consumption using \(E_\text{MAC}\). For all subsequent layers, which process spike data, we estimate energy consumption using \(E_\text{AC}\). For mainstream non-spiking Transformers, the energy consumption is estimated through:

\[\begin{align} E_\text{ANN} &= E_\text{MAC} \times \text{FLOPs} \end{align}\]
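The energy model above is easy to wrap in a helper; the layer list, firing rates, and FLOP counts in the usage example are made-up placeholders, while the per-operation energies follow the 45 nm figures quoted above.

```python
# Energy estimate: SOPs(l) = fr * T * FLOPs(l); E_SNN = E_MAC * FLOPs_conv1 + E_AC * sum(SOPs).
E_MAC = 4.6e-12   # J per multiply-accumulate (45 nm)
E_AC = 0.9e-12    # J per accumulate (45 nm)

def snn_energy(first_layer_flops, spiking_layers, timesteps):
    """spiking_layers: iterable of (firing_rate, flops) for each spike-driven layer."""
    energy = E_MAC * first_layer_flops            # first layer handles non-spike input
    for fr, flops in spiking_layers:
        sops = fr * timesteps * flops             # synaptic operations of this layer
        energy += E_AC * sops
    return energy

def ann_energy(total_flops):
    return E_MAC * total_flops

# toy example: a 0.1 GFLOP input conv plus two spiking layers firing at 20%, T = 4
layers = [(0.2, 1.0e9), (0.2, 0.5e9)]
print(f"SNN: {snn_energy(1.0e8, layers, timesteps=4) * 1e3:.2f} mJ")
print(f"ANN: {ann_energy(1.0e8 + 1.5e9) * 1e3:.2f} mJ")
```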

9 Comparison on Train/Inference Time and Memory Consumption↩︎

Table 5: Comparison of training/inference time and memory usage between QKFormer-10-768 and Max-Former-10-768 models. All measurements were conducted with the simulation timestep of 1 and the batch size of 32. MS-QKFormer indicates the QKFormer variant with the membrane shortcut.
Model | Training Time (s) | Training Memory (MB) | Inference Time (s) | Inference Memory (MB)
QKFormer (64.96M, T=1, B=32) | 0.214 | 18227 | 0.053 | 5000
MS-QKFormer* (64.96M, T=1, B=32) | 0.208 | 17496 | 0.048 | 4822
Max-Former* (63.99M, T=1, B=32) | 0.179 | 15431 | 0.044 | 4354

Max-Former delivers faster training and inference speeds while consuming less memory. We compared the performance metrics of QKFormer [24], its membrane shortcut variant (MS-QKFormer), and Max-Former on ImageNet using 224 \(\times\) 224 input resolution. All tests were conducted on a CentOS 7.9 server equipped with the Intel Xeon Gold 6348 CPU (2.60GHz) and the Nvidia A30 GPU. As Table 5 shows, when compared to MS-QKFormer with the same hierarchical architecture and shortcut configuration, Max-Former reduces training time by 14% and both inference time and memory usage by 10%. Additionally, our results indicate that the pre-spike shortcut strategy used in the original QKFormer increases both processing time and memory demands.

10 Residual Connections in Spiking Neural Networks↩︎

(a) Performance and Energy Tradeoffs:

The unique asynchronous nature of spike-based computation makes implementing residual connections challenging. As a result, the research community of spiking neural networks (SNNs) has not yet reached a consensus on the standardized residual learning approach, either in terms of algorithm or hardware implementation. While the main focus of our work is not related to residual learning, we want to offer a detailed comparison between the pre-spike shortcut [54] and the membrane shortcut [53], the two most representative residual learning methods that emerged in recent years.

As explained in Section 3.3, the pre-spike shortcut [54] implements residual connections between spiking outputs, while the membrane shortcut [53] connects membrane potentials directly. In algorithm designs, membrane shortcuts have been reported to facilitate better performance in prior works [24], [41], especially on small datasets. However, our findings indicate this advantage is not universal across all scenarios. As shown in Table [tab:energy], the patch embedding stages of Spiking Transformers account for the majority of energy consumption. Consequently, the multiple patch embedding stages in hierarchical architectures like QKFormer  [24], while enabling efficient feature learning with fewer parameters, also come at higher energy usage. This makes the choice of shortcut scheme particularly impactful on overall energy efficiency. When processing ImageNet images at 224 × 224 resolution, the pre-spike QKFormer consumes three times more energy than its membrane shortcut variant, reflecting the substantially higher SOPs required. Nevertheless, this increased computational overhead does translate to notable performance gains (+2.32% accuracy when comparing QKFormer to MS-QKFormer).

From the hardware perspective, implementing either shortcut type on neuromorphic chips is technically feasible but presents significant challenges [55], [56]. Implementing the pre-spike shortcut specifically requires the chips to support multi-spike operations, as the ternary spike transmissions (values of 0, 1, or 2) can occur. This results in either higher energy consumption or increased hardware complexity [57]. Yao et al. [41] proposed implementing membrane shortcuts through the addressing function that passes the membrane potential to corresponding neurons in subsequent layers for merging. While membrane shortcuts do strictly adhere to the spike-driven computing paradigm and could possibly be supported by standard neuromorphic hardware, transmitting membrane potentials can create substantial communication overhead, making practical implementation non-trivial. Many discussions on hardware deployment still advocate avoiding shortcuts as the current preferred approach [58], [59]. However, given the critical role of residual learning in modern deep learning, avoiding shortcuts altogether is not a sustainable long-term strategy for advancing neuromorphic computing.

In our work, we primarily compare with MS-QKFormer using the membrane shortcut to ensure fair comparisons. We hope the SNN community can establish a consensus on standardized shortcut implementations in the near future, considering the significant impact of shortcut schemes on both energy efficiency and model performance.


Table: Layer-wise energy breakdown (mJ) of the three stages on ImageNet (224 × 224).
Model | Stage 1 (Embed / Token Mix / MLP) | Stage 2 (Embed / Token Mix / MLP) | Stage 3 (Embed / Token Mix / MLP) | Head | Total (mJ)
QKFormer (16.47M, T=4, 78.80%) | 1.26 / 0.16 / 0.41 | 2.68 / 0.35 / 0.77 | 2.97 / 2.71 / 3.81 | 0.006 | 15.13
MS-QKFormer* (16.47M, T=4, 76.48%) | 1.19 / 0.052 / 0.15 | 0.96 / 0.11 / 0.25 | 0.88 / 0.88 / 1.06 | 0.007 | 5.52
Max-Former* (16.23M, T=4, 77.82%) | 0.41 / 0.02 / 0.17 | 0.89 / 0.01 / 0.36 | 0.91 / 0.96 / 1.16 | 0.001 | 4.89

11 Visualization↩︎

We show GradCAM visualizations [60] of four similarly sized Spiking Transformers that use the membrane shortcut. Compared to the Spike-Driven Transformer [41] and SWFormer [17], the hierarchical architecture used in both QKFormer [24] and our Max-Former allows more precise focusing on target objects. Compared to MS-QKFormer, Max-Former shows more concentrated activation patterns. For instance, in the polar bear image, Max-Former completely skips the background and precisely focuses on the bear’s key features (the head, rather than the outline or fur).

Figure 11: GradCAM visualizations [60] comparing four Spiking Transformers of similar size: Spike-Driven Transformer-8-512 [41] (29.68M), SWFormer-8-512 [17] (27.6M), QKFormer-10-512 with membrane shortcut (MS-QKFormer) (29.08M), and our Max-Former-10-512 (28.65M).

12 Impact and Limitation↩︎

Our work serves as the theoretical foundation for many existing architectural choices in Spiking Transformers. Specifically, in  [41], the authors found that directly applying known practices of MetaFormer does not achieve good results in Spiking Transformers: employing average pooling operators to replace SDSA as the token mixer surprisingly results in substantial performance degradation from 61.0% to 41.2%. Similar phenomena have been discussed in earlier works. For instance, Spikformer v2 [61] discovered that removing the max-pooling operator in the original Spikformer [6] leads to a substantial performance drop, while adding convolution layers (which act as high-pass filters) in the patch embedding stage significantly improves the performance. Our work reveals the underlying principles of these architectural designs: spiking transformers need to enhance high-frequency components to alleviate the feature degradation caused by their inherent low-pass activation.

We are well aware that there is still much room to explore, and we hope that Max-Former can serve as a good starting point for future research. For instance, similar to [31], Max-Former requires manually balancing frequency components, which demands considerable expertise when adapting to different tasks. Incorporating direct frequency-learning approaches, such as Fourier-based [62] or Wavelet-based [17] methods, would offer more straightforward solutions. The main challenge, however, lies in developing efficient spike-based frequency representations. Overall, we believe our work will inspire more future research to advance neuromorphic computing through exploring the unique properties of spiking neurons, rather than expending excessive effort to adapt established practices from standard non-spiking neural networks.

NeurIPS Paper Checklist↩︎

  1. Claims

  2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

  3. Answer:

  4. Justification: The abstract and introduction reflect the paper’s contributions and scope clearly.

  5. Limitations

  6. Question: Does the paper discuss the limitations of the work performed by the authors?

  7. Answer:

  8. Justification: We discuss the limitations in Appendix 12.

  9. Theory assumptions and proofs

  10. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

  11. Answer:

  12. Justification: All the theorems, formulas, and proofs are clearly stated in the main paper.

  13. Experimental result reproducibility

  14. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

  15. Answer:

  16. Justification: We provide experiment details in Appendix 7 and 8.

  17. Open access to data and code

  18. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

  19. Answer:

  20. Justification: Code is available: https://github.com/bic-L/MaxFormer.

  21. Experimental setting/details

  22. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

  23. Answer:

  24. Justification: All training and test details have been revealed in the Appendix 7.

  25. Experiment statistical significance

  26. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

  27. Answer:

  28. Justification: Error bars are not reported because it would be too computationally expensive.

  29. Experiments compute resources

  30. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

  31. Answer:

  32. Justification: See Appendix 7.

  33. Code of ethics

  34. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

  35. Answer:

  36. Justification: This research is in every respect with the NeurIPS Code of Ethics.

  37. Broader impacts

  38. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

  39. Answer:

  40. Justification: This research belongs to foundational research and is not tied to particular applications.

  41. Safeguards

  42. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  43. Answer:

  44. Justification: The paper poses no such risks.

  45. Licenses for existing assets

  46. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  47. Answer:

  48. Justification: The paper properly credits creators and mentions the license and terms of use for existing assets.

  49. New assets

  50. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  51. Answer:

  52. Justification: The paper does not release new assets.

  53. Crowdsourcing and research with human subjects

  54. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  55. Answer:

  56. Justification: The paper does not involve crowdsourcing or research with human subjects.

  57. Institutional review board (IRB) approvals or equivalent for research with human subjects

  58. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  59. Answer:

  60. Justification: The paper does not involve crowdsourcing or research with human subjects.

  61. Declaration of LLM usage

  62. Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, declaration is not required.

  63. Answer:

  64. Justification: The paper does not involve LLMs as any important, original, or non-standard components.

References↩︎

[1]
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2]
C. He et al., “Diffusion models in low-level vision: A survey,” arXiv preprint arXiv:2406.11138, 2024.
[3]
K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machine intelligence with neuromorphic computing,” Nature, vol. 575, no. 7784, pp. 607–617, 2019.
[4]
F. Akopyan et al., “TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[5]
M. Davies et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[6]
Z. Zhou et al., “Spikformer: When spiking neural network meets transformer,” arXiv preprint arXiv:2209.15425, 2022.
[7]
Z. Wang, Y. Fang, J. Cao, Q. Zhang, Z. Wang, and R. Xu, “Masked spiking transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1761–1771.
[8]
Y. Li, Y. Guo, S. Zhang, S. Deng, Y. Hai, and S. Gu, “Differentiable spike: Rethinking gradient-descent for training spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 23426–23439, 2021.
[9]
Y. Guo et al., “Im-loss: Information maximization loss for spiking neural networks,” Advances in Neural Information Processing Systems, vol. 35, pp. 156–166, 2022.
[10]
J. Ding, Z. Yu, Y. Tian, and T. Huang, “Optimal ANN-SNN conversion for fast and accurate inference in deep spiking neural networks,” arXiv preprint arXiv:2105.11654, 2021.
[11]
Y. Li, S. Xu, B. Zhang, X. Cao, P. Gao, and G. Guo, “Q-vit: Accurate and fully quantized low-bit vision transformer,” Advances in neural information processing systems, vol. 35, pp. 34451–34463, 2022.
[12]
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision, Springer, 2016, pp. 525–542.
[13]
E. D. Adrian and Y. Zotterman, “The impulses produced by sensory nerve endings: Part 3. Impulses set up by touch and pressure,” The Journal of physiology, vol. 61, no. 4, p. 465, 1926.
[14]
Y. Li, S. Deng, X. Dong, R. Gong, and S. Gu, “A free lunch from ANN: Towards efficient, accurate spiking neural networks calibration,” in International Conference on Machine Learning, 2021, pp. 6316–6325.
[15]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[16]
J. Rombouts and S. Bohte, “Fractionally predictive spiking neurons,” Advances in Neural Information Processing Systems, vol. 23, 2010.
[17]
Y. Fang, Z. Wang, L. Zhang, J. Cao, H. Chen, and R. Xu, “Spiking wavelet transformer,” in European Conference on Computer Vision, Springer, 2024, pp. 19–37.
[18]
W. Yu et al., “MetaFormer is actually what you need for vision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10819–10829.
[19]
W. Yu et al., “Metaformer baselines for vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 2, pp. 896–912, 2023.
[20]
W. Maass, “Networks of spiking neurons: The third generation of neural network models,” Neural networks, vol. 10, no. 9, pp. 1659–1671, 1997.
[21]
A. Basu, L. Deng, C. Frenkel, and X. Zhang, “Spiking neural network integrated circuits: A review of trends and future directions,” in 2022 IEEE Custom Integrated Circuits Conference (CICC), IEEE, 2022, pp. 1–8.
[22]
Y. Fang, Z. Wang, D. Zhou, H. Ren, S. Zhou, and R. Xu, “Dynamic token masking in spiking neural network,” 2025.
[23]
M. Yao et al., “Spike-driven transformer,” Advances in neural information processing systems, vol. 36, 2024.
[24]
C. Zhou et al., “Qkformer: Hierarchical spiking transformer using qk attention,” arXiv preprint arXiv:2403.16552, 2024.
[25]
Z. Hao, X. Shi, Y. Liu, Z. Yu, and T. Huang, “LM-HT SNN: Enhancing the performance of SNN to ANN counterpart through learnable multi-hierarchical threshold model,” arXiv preprint arXiv:2402.00411, 2024.
[26]
Z. Wang, Y. Fang, J. Cao, H. Ren, and R. Xu, “Adaptive calibration: A unified conversion framework of spiking neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 1583–1591.
[27]
M. Yao et al., “Scaling spike-driven transformer with efficient spike firing approximation training,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[28]
C. Kechris, J. Dan, J. Miranda, and D. Atienza, “DC is all you need: Describing ReLU from a signal processing standpoint,” arXiv preprint arXiv:2407.16556, 2024.
[29]
Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[30]
M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy, “Do vision transformers see like convolutional neural networks?” Advances in neural information processing systems, vol. 34, pp. 12116–12128, 2021.
[31]
C. Si, W. Yu, P. Zhou, Y. Zhou, X. Wang, and S. Yan, “Inception transformer,” Advances in Neural Information Processing Systems, vol. 35, pp. 23495–23509, 2022.
[32]
K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision, Springer, 2016, pp. 630–645.
[33]
H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li, “Going deeper with directly-trained larger spiking neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 11062–11070.
[34]
W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, “Deep residual learning in spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 21056–21069, 2021.
[35]
Y. Hu, L. Deng, Y. Wu, M. Yao, and G. Li, “Advancing spiking neural networks toward deep residual learning,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
[36]
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, doi: 10.1109/5.726791.
[37]
A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
[38]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[39]
H. Li, H. Liu, X. Ji, G. Li, and L. Shi, “CIFAR10-DVS: An event-stream dataset for object classification,” Frontiers in Neuroscience, vol. 11, 2017.
[40]
A. Amir et al., “A low power, fully event-based gesture recognition system,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7243–7252.
[41]
M. Yao et al., “Spike-driven transformer,” arXiv preprint arXiv:2307.01694, 2023.
[42]
A. Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[43]
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning, PMLR, 2021, pp. 10347–10357.
[44]
W. Wang et al., “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.
[45]
Y. Hu, L. Deng, Y. Wu, M. Yao, and G. Li, “Advancing spiking neural networks towards deep residual learning,” arXiv preprint arXiv:2112.08954, 2021.
[46]
Q. Xu, Y. Li, J. Shen, J. K. Liu, H. Tang, and G. Pan, “Constructing deep spiking neural networks from artificial neural networks with knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7886–7895.
[47]
W. Fang et al., “SpikingJelly,” 2020. [Online]. Available: https://github.com/fangwei123456/spikingjelly (accessed Feb. 21, 2023).
[48]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018.
[49]
S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[50]
E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “RandAugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
[51]
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[52]
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[53]
Y. Hu, H. Tang, and G. Pan, “Spiking deep residual networks,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
[54]
W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, “Deep residual learning in spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 21056–21069, 2021.
[55]
P. Weidel and S. Sheik, “Wavesense: Efficient temporal convolutions with spiking neural networks for keyword spotting,” arXiv preprint arXiv:2111.01456, 2021.
[56]
I. Balafrej, S. Bahadi, J. Rouat, and F. Alibart, “Enhancing temporal learning in recurrent spiking networks for neuromorphic applications,” Neuromorphic Computing and Engineering, 2025.
[57]
A. Gautam and T. Kohno, “Adaptive STDP-based on-chip spike pattern detection,” Frontiers in Neuroscience, vol. 17, p. 1203956, 2023.
[58]
S. Deng, Y. Wu, K. Du, and S. Gu, “Spiking token mixer: An event-driven friendly former structure for spiking neural networks,” Advances in Neural Information Processing Systems, vol. 37, pp. 128825–128846, 2024.
[59]
S. M. Meyer et al., “A diagonal state space model on Loihi 2 for efficient streaming sequence processing.”
[60]
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[61]
Z. Zhou et al., “Spikformer v2: Join the high accuracy club on ImageNet with an SNN ticket,” arXiv preprint arXiv:2401.02020, 2024.
[62]
Z. Li et al., “Fourier neural operator for parametric partial differential equations,” 2020.

  1. Corresponding author↩︎