November 17, 2023
The influence of natural image transformations on receptive field responses is crucial for modelling visual operations in computer vision and biological vision. In this regard, covariance properties with respect to geometric image transformations in the earliest layers of the visual hierarchy are essential for expressing robust image operations, and for formulating invariant visual operations at higher levels.
This paper defines and proves a set of joint covariance properties for spatio-temporal receptive fields in terms of spatio-temporal derivative operators applied to spatio-temporally smoothed image data under compositions of spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations. Specifically, the derived relations show how the parameters of the receptive fields need to be transformed, in order to match the output from spatio-temporal receptive fields under composed spatio-temporal image transformations.
For this purpose, we also fundamentally extend the notion of scale-normalized derivatives to affine-normalized derivatives, that are computed based on spatial smoothing with affine Gaussian kernels, and analyze the covariance properties of the resulting affine-normalized derivatives for the affine group as well as for important subgroups thereof.
We conclude with a geometric analysis, showing how the derived joint covariance properties make it possible to relate or match spatio-temporal receptive field responses, when observing, possibly moving, local surface patches from different views, under locally linearized perspective or projective transformations, as well as when observing different instances of spatio-temporal events, that may occur either faster or slower between different views of similar spatio-temporal events. We do furthermore describe how the parameters in the studied composed spatio-temporal image transformation models directly relate to geometric entities in the image formation process and the 3-D scene structure.
In these ways, this paper presents a unified theory for the interaction between spatio-temporal receptive field responses and geometric image transformations, with generic implications for both: (i) designing computer vision systems that are to compute image features and image descriptors, to be robust under the variabilities in spatio-temporal image structures as caused by geometric image transformations, and (ii) understanding fundamental geometric constraints for interpreting and constructing models of biological vision.
When images, video sequences or video streams are acquired from the real world, they are subject to natural image transformations, as caused by variations in the positions, the relative orientations and the motions between the objects in the world and the observer:
Depending on the distance between the objects in the world and the observer, the perspective projections of objects onto the image surfaces may become smaller or larger, which to first-order of approximation can be modelled as local spatial scaling transformations.
Depending on the orientations of the surface normals of the objects in relation to the viewing direction, the image patterns may be compressed by different amounts in different directions (perspective foreshortening), which to first-order of approximation can be modelled as local spatial affine transformations.
Depending on how the objects in the world move relative to the (possibly time-dependent) viewing direction, the image patterns of objects may move in the image plane, which to first-order of approximation can be modelled as local Galilean transformations.
Depending on how fast the perspective projections of objects move in the image plane, or how fast spatio-temporal actions occur, the time-line along the temporal dimension may be compressed or expanded, which can be modelled as temporal scaling transformations.
These types of geometric image transformations will have a profound effect on the spatio-temporal receptive fields, that register and process the image information at the earliest stages in the visual hierarchy, in that the output from the receptive fields to particular image patterns may be strongly dependent on the imaging conditions. Specifically, if the interaction effects between the geometric image transformations and the receptive fields are not properly taken into account, then the robustness of the visual modules can be strongly affected in a negative manner. If, on the other hand, the interaction effects between the natural image transformations and the receptive fields are properly taken into account, then the robustness of the visual measurements may be substantially improved.
A particular way of handling the interaction effects between the geometric image transformations and the receptive fields, is by requiring the family of receptive fields to be covariant under the relevant classes of image transformations (Lindeberg [1], [2]). Covariance in this context means that the geometric image transformations essentially commute with the image operations induced by the receptive fields, and do in this way provide a way to propagate well-defined relationships between the geometric image transformations and the receptive fields. Specifically, covariance properties of the receptive fields at lower levels in the visual hierarchy make it possible to define invariant image measurements at higher levels in the visual hierarchy (Lindeberg [3], [4], Poggio and Anselmi [5]).
The subject of this paper is to describe and derive a set of joint covariance properties of receptive fields according to a specific model for spatio-temporal receptive fields, in terms of the generalized Gaussian derivative model for visual receptive fields (to be detailed below), under joint combinations of spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations; see Figure 1 for a visualized motivation regarding the importance of covariance under geometric image transformations for visual receptive fields.
Then, we will show with a geometric analysis how these derived joint geometric covariance properties make it possible to perfectly match, to first order of approximation, the spatio-temporal receptive field responses between different views of the same, possibly moving, local surface patch, in relation to a visual observer. In these ways, the resulting joint covariance properties make it possible for a vision system, biological or artificial, to perform more accurate inference to cues of the 3-D environment, compared to a vision system that does not obey such geometric covariance properties. Such a possibility, for geometrically accurate inference to the 3-D structure and motion in the environment, may, in turn, constitute an essential desirable property of a vision system for a biological agent that relies critically on a very well-developed vision system for its survival.
For the purpose of the theoretical analysis to be performed, we will build upon the regular Gaussian derivative model for visual receptive fields, proposed by Koenderink and van Doorn ([6]–[8]), which has been used for modelling biological receptive fields by Young ([9]), as well as used as a component in more developed models of biological vision by Lowe ([10]), May and Georgeson ([11]), Hesse and Georgeson ([12]), Georgeson et al. ([13]), Wallis and Georgeson ([14]), Hansen and Neumann ([15]), Wang and Spratling ([16]) and Pei et al. ([17]).
In this work, we will, however, consider a more developed generalized Gaussian derivative model for visual receptive fields, extended with a variability over affine image transformations (Lindeberg and Gårding [18]) as well as furthermore extended from being applied over a purely spatial image domain to being applied over a joint spatio-temporal image domain (Lindeberg [4], [19], [20]). Compared to the earlier spatio-temporal modelling work by Young et al. ([21], [22]), we will here specifically consider a more geometric way of parameterizing the degrees of freedom over the joint spatio-temporal domain, as will be further described in Section 2.
While Gabor filters have also been commonly used for modelling spatial receptive fields by Marcelja ([23]), Jones and Palmer ([24], [25]), Ringach ([26], [27]), Serre et al. ([28]), De and Horwitz ([29]) and others, the potential applicability of Gabor filters for modelling joint spatio-temporal receptive fields has, however, not been as extensively explored. For this reason, we will restrict ourselves to modelling visual receptive fields over the joint spatio-temporal domain in terms of the generalized Gaussian derivative theory for visual receptive fields in the following treatment.
Although we will in this treatment be mainly concerned with studying the properties of the generalized Gaussian derivative model for visual receptive fields, which has been proposed as a theoretically principled model for the simple cells in the primary visual cortex in Lindeberg ([4]), the implications of the combined analysis of spatio-temporal receptive field responses and geometric image transformations should also have more general applications in the area of computer vision.
For example, recent work has explored scale-covariant or scale-equivariant architectures for deep learning, which have the ability to properly handle the influence of spatial scaling transformations on the receptive field responses, see Worrall and Welling ([30]), Bekkers ([31]), Sosnovik et al. ([32], [33]), Lindeberg ([34], [35]), Jansson and Lindeberg ([36], [37]), Zhu et al. ([38]), Penaud et al. ([39]), Sangalli et al. ([40]), Zhan et al. ([41]), Yang et al. ([42]), Wimmer et al. ([43]), Barisin et al. ([44], [45]), and Perzanowski and Lindeberg ([46]).
Based on the theoretical analysis presented in this paper, we propose that it ought to be possible to extend similar ideas to more general approaches for geometric deep learning (see Bronstein et al. ([47]) and Gerken et al. ([48])) that are covariant under wider classes of geometric image transformations.
The main new contributions in this paper concern:
the formulation of a joint covariance property of the spatio-temporal smoothing transformation under the composition of (i) a spatial scaling transformation, (ii) a spatial affine transformation, (iii) a Galilean transformation and (iv) a temporal scaling transformation in Section 5.2 and
the formulation of joint transformation properties of both regular and scale-normalized spatio-temporal derivative responses under corresponding compositions of the same set of primitive geometric image transformations in Sections 5.3–5.5, as well as
the explicit geometric interpretations of the above joint covariance and transformation properties between multi-view image observations of dynamic scenes in Sections 6.2 and 6.3, which extend previous studies of multi-view geometry for static scenes to multi-view geometry for scenes with relative motions between the objects or events in the environment and the observer, with
the corresponding explicit expressions for the resulting transformation properties for the spatio-temporal smoothing operations and the spatio-temporal receptive field responses between different pairwise views in Sections 7.2 and 7.3.
Fundamentally, to be able to express the above covariance and transformation properties for scale-normalized derivatives under spatial affine transformations, we also:
In this way, we generalize the previous notion of scale-normalized spatial derivative operators over a more regular isotropic spatial scale-space representation, as reviewed in Sections 3.1–3.2.
Additionally, to put the presented theoretical results into a wider perspective of overall theoretical properties regarding an either artificial or biological vision system, we
interpret the presented theoretical results regarding joint covariance properties in terms of relationships between the variabilities of image structures under the studied classes of natural image transformations in relation to the degrees of freedom, that are spanned by the spatio-temporal receptive model, in Section 8, and
relate the parameters in the studied spatio-temporal deformation models of image patterns, resulting from the joint image transformations to geometric properties of local surface patterns in the environment, in Section 9.
To be able to present these results in a reasonably self-contained manner for a reader, for which essential components of the theoretical background may not be already fully known, we:
review the underlying spatio-temporal receptive field model in Section 2,
review the notion of scale-normalized temporal derivatives in Section 3.9 with its resulting temporal scale covariance property in Section 3.10, although in a generalized form, with the previous use of either non-causal 1-D temporal Gaussian kernels or the time-causal limit kernel replaced by a family of more general scale-covariant temporal kernels, and
review the notion of scale-normalized velocity-adapted temporal derivative operators in Section 3.11, with its associated (although not previously explicitly formulated) covariance properties under joint spatial and temporal scaling transformations, in Section 3.12.
Furthermore, to make the motivation clear for the in-depth treatment of joint covariance and transformation properties in Section 5, for a reader who may not be already familiar with the material in (Lindeberg [1]), we
In this context, one more added value of the treatment of the joint transformation properties in Section 5, beyond the fundamental extension from four different types of individual covariance properties to joint covariance properties, as well as beyond the also fundamental extensions to algebraically much simpler covariance and transformation properties in terms of scale-normalized derivatives, as also performed in Section 5, is that the proof for the joint transformation property in Section 5.2 also constitutes a general proof for each one of the individual transformation properties in Section 4.1. Such proofs were not provided in the previous treatment in (Lindeberg [1]), because of the there implied complexity of the need for then providing four individual proofs, according to the previous individual treatments for each one of the different types of geometric covariance properties. With the unified treatment of the joint covariance properties developed in this paper, the joint covariance properties can here instead be proved in an all encompassing single proof.
Beyond the purely review-oriented Section 2, the purpose of the underpinning theory-oriented Sections 3 and 4 is thus to provide the conceptual and theoretical foundations for formulating and deriving the main results regarding joint covariance and transformation properties in Sections 5–8.
In summary, the overall aim of this paper is to present a unified theory for the interaction between spatio-temporal receptive field responses and geometric image transformations, which comprises several previous contributions in the area as different special cases, while here also providing substantial generalizations to: (i) joint geometric image transformations, (ii) receptive fields in terms of richer and more explicit sets of both regular and scale-normalized spatio-temporal derivatives, as well as to (iii) covariant spatial derivatives defined from an anisotropic affine scale-space representation, to make the affine-extended notion of scale-normalized spatial derivatives essentially equal under the influence of spatial affine transformations.
Furthermore, by the presented (iv) geometric interpretations of the derived theoretical results, we show how (v) essential components of early visual perception can be expressed in terms of the studied class of composed geometric image transformations between multiple views, for either a monocular or a binocular observer, that observes a dynamic environment from possibly different viewing directions.
This paper is organized as follows: Section 2 begins by describing the model for spatio-temporal receptive fields, that we will build upon, in terms of an underlying joint spatio-temporal smoothing operation followed by the computation of spatio-temporal derivatives for different orders of spatial and temporal differentiation. We do also give a brief summary of how these spatio-temporal receptive field models can be used for modelling linear receptive fields in the retina, the lateral geniculate nucleus (LGN) and the primary visual cortex (V1).
Section 3 then describes the notions of scale-normalized spatial and temporal derivative operators, with their associated covariance properties under (individual) spatial and temporal scaling transformations, which constitute an important concept to use when matching receptive field responses that have been computed for different values of the scale parameters of the receptive fields. Specifically, we formulate a new notion of affine scale-normalized directional derivatives, to be applied in connection with anisotropic affine Gaussian smoothing kernels, and show that this concept leads to provable covariance properties, for two important subgroups of the group of more general spatial affine transformations. More generally, we do also formulate new notions of a scale-normalized affine gradient operator and a scale-normalized affine Hessian operator, and show that these concepts, up to possibly unknown low-dimensional perturbation operators applied to these entities, lead to full affine covariance.
Section 4 gives an overview of how the studied model for joint spatio-temporal receptive fields obeys specific (individual) covariance properties under either spatial scaling transformations, spatial affine transformations, Galilean transformations or temporal scaling transformations. Section 5 defines the class of joint compositions of those spatio-temporal image transformations that we will consider, and does then develop explicit proofs for how both the underlying spatio-temporal smoothing operation as well as the spatial and temporal derivative operators, in the spatio-temporal receptive field model that we study, are transformed under this class of composed spatio-temporal image transformations.
Section 6 then gives a geometric interpretation of the studied class of composed spatio-temporal image transformations, that we study the covariance properties for, in terms of the scaled orthographic projection from the tangent plane of a local surface patch, complemented by a local translation motion model, to account for relative motions between the surface patch and the observer, as well as a temporal scaling transformation, to account for spatio-temporal events that may occur either faster or slower relative to a reference view. We do also present extensions, showing how a slight modification of the composed spatio-temporal transformation model makes it possible to represent first-order linearized approximations of the projective transformations between pairwise views of the same local surface patch.
Section 7 then states explicit covariance properties for the underlying spatio-temporal smoothing transformation, as well as the underlying spatial and temporal derivative operators, in the composed model for spatio-temporal receptive fields, for locally linearized transformations between pairwise views of the same local surface patch.
In Section 8, we complement the geometric interpretation of the model, by describing how the degrees of freedom in the parameters in the spatio-temporal receptive field model studied in this treatment span a similar variability, as the degrees of freedom in the locally linearized scaled orthographic projection model complemented with a Galilean motion component, to account for possibly unknown relative motions between the observed object and the observer, as well as a temporal scaling transformation, to account for similarly looking spatio-temporal events that may occur either faster or slower relative to a previous view of a similar spatio-temporal event. Then, we use this connection for interpreting the functional properties of the receptive fields of simple cells in the primary visual cortex (V1), to provide complementary theoretical support for a previously formulated working hypothesis, that the receptive fields in the primary visual cortex can be regarded as very well adapted to handling the variability of image structures caused by observing a dynamic 3-D environment.
Section 9 then outlines how the parameters in the studied spatio-temporal image transformation models can be interpreted as constituting direct cues to the 3-D structure of the environment, provided that the parameters in these image deformation models can be computed with sufficient accuracy, based on combinations of receptive field responses.
Finally, Section 10 gives a summary and conclusions regarding some of the main results, as well as an outlook concerning more general applications of the presented theoretical results to computer vision and biological vision.
Given spatio-temporal image data of the form \(f(x, t)\) with \(f \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\) over the spatial coordinates \(x = (x_1, x_2)^T \in {\mathbb{R}}^2\) and time \(t \in {\mathbb{R}}\), in Lindeberg ([19], [20]) a principled model \(T \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \times {\mathbb{R}}_+ \times {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) for spatio-temporal receptive fields is derived and applied of the form (here, however, with slightly modified notation) \[\label{eq-spat-temp-RF-model} T(x, t;\; s, \Sigma, \tau, v) = g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau),\tag{1}\] where
\(s \in {\mathbb{R}}_+\) denotes a spatial scale parameter, corresponding to the spatial variance of a non-negative spatial smoothing kernel,
\(\Sigma \in {\mathbb{S}}_+^2\) denotes a symmetric and positive definite \(2 \times 2\) spatial covariance matrix, that describes the spatial shape of the spatial smoothing kernel,
\(\tau \in {\mathbb{R}}_+\) denotes a temporal scale parameter, corresponding to the temporal variance of a non-negative temporal smoothing kernel,
\(v = (v_1, v_2)^T \in {\mathbb{R}}^2\) denotes an image velocity vector,
\(g \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) denotes a 2-D affine Gaussian kernel of the form \[\label{eq-gauss-fcn-2D} g(x;\; s, \Sigma) = \frac{1}{2 \pi \, s \sqrt{\det \Sigma}} \, e^{-x^T \Sigma^{-1} x/2 s},\tag{2}\]
\(h \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) denotes a temporal smoothing kernel, that for any temporal scaling factor \(S_t \in {\mathbb{R}}_+\) obeys the temporal scale covariance property \[\label{eq-temp-sc-cov-temp-kernel} h(t';\; \tau') = \frac{1}{S_t} \, h(t;\; \tau)\tag{3}\] for \(t' = S_t \, t\) and \(\tau' = S_t^2 \, \tau\).
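As a minimal numerical sketch of the affine Gaussian kernel in (2), the following Python code samples \(g(x;\; s, \Sigma)\) on a grid and checks that the kernel has unit mass and that its second-moment matrix is \(s \, \Sigma\). The function name `affine_gaussian`, the parameter values and the grid are assumptions chosen only for this illustration.

```python
import numpy as np

def affine_gaussian(x1, x2, s, Sigma):
    """2-D affine Gaussian kernel g(x; s, Sigma) according to Eq. (2)."""
    Sigma = np.asarray(Sigma, dtype=float)
    P = np.linalg.inv(Sigma)
    # Quadratic form x^T Sigma^{-1} x, evaluated pointwise on the grid.
    q = P[0, 0] * x1**2 + 2.0 * P[0, 1] * x1 * x2 + P[1, 1] * x2**2
    norm = 2.0 * np.pi * s * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-q / (2.0 * s)) / norm

# Illustrative (assumed) parameter values.
s = 2.0
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])

grid = np.linspace(-12.0, 12.0, 481)
dx = grid[1] - grid[0]
x1, x2 = np.meshgrid(grid, grid, indexing="ij")
g = affine_gaussian(x1, x2, s, Sigma)

# Riemann-sum approximations of the total mass and the second moments.
mass = g.sum() * dx**2
second_moments = np.array([
    [(x1 * x1 * g).sum(), (x1 * x2 * g).sum()],
    [(x1 * x2 * g).sum(), (x2 * x2 * g).sum()],
]) * dx**2

print("mass:", mass)
print("second moments:\n", second_moments)
print("s * Sigma:\n", s * Sigma)
```

The check on the second moments makes explicit that the spatial scale parameter \(s\) and the shape matrix \(\Sigma\) only enter the kernel through the product \(s \, \Sigma\).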
Based on the treatments in Lindeberg ([19], [20]), the choice of the temporal smoothing operation as the convolution with a 1-D temporal Gaussian kernel \(g_{1D} \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) with \[\label{eq-def-temp-gauss-kern} h(t;\; \tau) = g_{1D}(t;\; \tau) = \frac{1}{\sqrt{2 \pi \tau}} \, e^{-t^2/2 \tau}\tag{4}\] stands out as a canonical choice over a non-causal temporal domain, where the relative future in relation to any time moment can be accessed, whereas the choice of the temporal kernel as the time-causal limit kernel (Lindeberg [49]) \(\Psi \colon {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{R}}_{>1} \rightarrow {\mathbb{R}}\) with \[\label{eq-time-caus-limit-kern} h(t;\; \tau) = \Psi(t;\; \tau, c),\tag{5}\] defined by having a Fourier transform \(\hat{\Psi} \colon {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{R}}_{>1} \rightarrow {\mathbb{C}}\) of the form \[\label{eq-FT-comp-kern-log-distr-limit} \hat{\Psi}(\omega;\; \tau, c) = \prod_{k=1}^{\infty} \frac{1}{1 + i \, c^{-k} \sqrt{c^2-1} \sqrt{\tau} \, \omega},\tag{6}\] and corresponding to an infinite number of truncated exponential kernels, with specially chosen time constants to obtain temporal scale covariance, stands out as a canonical choice over a time-causal temporal domain, where the future cannot be accessed. The distribution parameter \(c > 1\) of this time-causal limit kernel is for practical purposes often chosen as \(c = \sqrt{2}\) or \(c = 2\).
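The two temporal kernels can be probed numerically with a short sketch. The first part checks the temporal scale covariance property (3) for the Gaussian kernel (4). The second part uses the fact that each factor in (6) is the Fourier transform of a truncated exponential kernel with time constant \(\mu_k = c^{-k} \sqrt{c^2-1} \sqrt{\tau}\) and variance \(\mu_k^2\), so that the variances of the cascade should sum to \(\tau\). The function names, the truncation to 40 terms and the parameter values are assumptions made for this illustration.

```python
import numpy as np

def temporal_gaussian(t, tau):
    """Non-causal 1-D temporal Gaussian kernel according to Eq. (4)."""
    return np.exp(-t**2 / (2.0 * tau)) / np.sqrt(2.0 * np.pi * tau)

def limit_kernel_time_constants(tau, c, num_terms):
    """Time constants mu_k = c^{-k} sqrt(c^2 - 1) sqrt(tau), read off from Eq. (6).

    The infinite product is truncated to num_terms factors (an assumption
    made only to obtain a finite computation).
    """
    k = np.arange(1, num_terms + 1, dtype=float)
    return c ** (-k) * np.sqrt(c**2 - 1.0) * np.sqrt(tau)

# Scale covariance (3): h(t'; tau') = h(t; tau) / S_t
# for t' = S_t t and tau' = S_t^2 tau.
t = np.linspace(-3.0, 3.0, 13)
tau, S_t = 0.8, 2.5
lhs = temporal_gaussian(S_t * t, S_t**2 * tau)
rhs = temporal_gaussian(t, tau) / S_t
max_deviation = np.max(np.abs(lhs - rhs))

# Variances add under convolution, so the cascade of truncated exponential
# kernels has temporal variance sum(mu_k^2), which should approach tau.
mu = limit_kernel_time_constants(tau, c=np.sqrt(2.0), num_terms=40)
total_variance = np.sum(mu**2)

print("max deviation from (3):", max_deviation)
print("sum of mu_k^2:", total_variance, "tau:", tau)
```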

Figure 3: Variability of first-order directional spatial derivatives of Gaussian kernels \(T_{\varphi}(x;\; s, \Sigma) = \partial_{\varphi} (g(x;\; s, \Sigma))\) over a purely spatial domain, here shown in terms of a uniform distribution on a hemisphere, for different values of the orientation angle \(\varphi\), the spatial scale parameter \(s\) and the spatial covariance matrices \(\Sigma\), and in this way simulating the variability of spatial receptive field shapes, that will be the result by interpreting the purely spatial affine covariance property, such that the underlying spatial smoothing kernels are required to be rotationally symmetric in the tangent plane of a surface patch, while varying the slant and the tilt angles of the surface patch over all the angles on the visible hemisphere. Similar variabilities will result from directional derivatives of higher order. In this figure, the spatial scale parameters of the receptive fields have been normalized, such that the maximum eigenvalue of the spatial covariance matrix \(\Sigma\) is the same for all the receptive fields. (Horizontal and vertical axes: the spatial coordinates \(x_1\) and \(x_2\), for multiple spatial receptive fields shown within the same frame.).
The above purely spatio-temporal smoothing components of receptive fields are then to be combined with spatial and temporal derivative operations. Over the spatial domain, we can compute either partial spatial derivatives \[\label{eq-def-spat-part-der} \partial_{x^{\alpha}} = \partial_{x_1}^{\alpha_1} \, \partial_{x_2}^{\alpha_2}\tag{7}\] for different orders \(\alpha = (\alpha_1, \alpha_2)^T\) of spatial differentiation, or oriented directional derivatives in any direction \(\varphi \in {\mathbb{R}}\) \[\label{eq-dir-der-def} \partial_{\varphi}^m = (\cos \varphi \, \partial_{x_1} + \sin \varphi \, \partial_{x_2})^m = (e_{\varphi}^T \, \nabla_x)^m\tag{8}\] over different orientations \(\varphi\) and different orders \(m\) of spatial differentiation, where \(e_{\varphi} = (\cos \varphi, \sin \varphi)^T\) denotes the unit vector in the direction \(\varphi\) and \(\nabla_x\) denotes the spatial gradient operator according to \[\nabla_x = \left ( \begin{array}{c} \partial_{x_1} \\ \partial_{x_2} \end{array} \right).\] Over the temporal domain, we can, in turn, compute partial temporal derivatives \[\label{eq-temp-der-def} \partial_t^n\tag{9}\] for different orders \(n\) of temporal differentiation, or velocity-adapted temporal derivatives \[\label{eq-vel-adapt-der-def} \partial_{\bar t}^n = (v_1 \, \partial_{x_1} + v_2 \, \partial_{x_2} + \partial_t )^n = (v^T \, \nabla_x + \partial_t)^n\tag{10}\] for different image velocities \(v = (v_1, v_2)^T\) and orders \(n\) of temporal differentiation.
Specifically, in relation to the parameters \(\Sigma\) and \(v\) of the purely spatio-temporal smoothing component of the spatio-temporal receptive fields in (1), the image orientations \(\varphi\) in the directional derivative operators (8) should preferably be chosen in the directions of the eigendirections of the spatial covariance matrix \(\Sigma\), whereas the image velocities \(v\) in the velocity-adapted derivative operators \(\partial_{\bar t}^n\) should preferably be chosen equal to the image velocity \(v\) in the spatio-temporal smoothing kernel \(T(x, t;\; s, \Sigma, \tau, v)\).
In Lindeberg ([4]), it was demonstrated that the receptive fields of neurons in the lateral geniculate nucleus (LGN) as well as the receptive fields of simple cells in the primary visual cortex (V1), as measured by neurophysiological cell recordings by DeAngelis et al. ([50], [51]), Conway and Livingstone ([52]) and Johnson et al. ([53]), can be well modelled by idealized receptive fields derived from this generalized Gaussian derivative model for receptive fields.
According to the treatment in Lindeberg ([4]) Sections 4.1–4.2, the spatio-temporal receptive fields of “non-lagged neurons” and “lagged neurons”, which have rotationally symmetric response properties over the spatial domain, can be modelled by idealized receptive fields \(h_{\mathrm{LGN}} \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) of the form \[\label{eq-lgn-model-1} h_{\mathrm{LGN}}(x, t;\; s, \tau) = \pm \nabla_x^2 \, g(x;\; s) \, \partial_{t^n} \, h(t;\; \tau),\tag{11}\] where \(\nabla_x^2 = \partial_{x_1 x_1} + \partial_{x_2 x_2}\) represents the spatial Laplacian operator, and \(h(t;\; \tau)\) represents a temporal smoothing kernel, which in the most idealized situation may correspond to the time-causal limit kernel \(\Psi(t;\; \tau, c)\) according to (6).
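To illustrate the difference between the partial temporal derivative (9) and the velocity-adapted temporal derivative (10) for \(n = 1\), the following sketch considers a hypothetical pattern \(p\) that translates rigidly with image velocity \(v\), so that \(f(x, t) = p(x - v \, t)\). For such a signal, \((v^T \, \nabla_x + \partial_t) f\) vanishes, whereas \(\partial_t f\) in general does not. The pattern, the sample point and the central-difference step are assumptions made only for this illustration.

```python
import numpy as np

def pattern(x1, x2):
    """Hypothetical static image pattern p(x)."""
    return np.exp(-(x1**2 + 2.0 * x2**2) / 2.0)

def video(x1, x2, t, v):
    """Pattern translating with constant image velocity v: f(x, t) = p(x - v t)."""
    return pattern(x1 - v[0] * t, x2 - v[1] * t)

def first_order_derivatives(f, x1, x2, t, eps=1e-4):
    """Central-difference estimates of (f_x1, f_x2, f_t) at a single point."""
    f_x1 = (f(x1 + eps, x2, t) - f(x1 - eps, x2, t)) / (2.0 * eps)
    f_x2 = (f(x1, x2 + eps, t) - f(x1, x2 - eps, t)) / (2.0 * eps)
    f_t = (f(x1, x2, t + eps) - f(x1, x2, t - eps)) / (2.0 * eps)
    return f_x1, f_x2, f_t

v = np.array([1.5, -0.7])

def f(x1, x2, t):
    return video(x1, x2, t, v)

f_x1, f_x2, f_t = first_order_derivatives(f, 0.4, -0.3, 0.5)

plain_temporal = f_t                                  # Eq. (9), n = 1
velocity_adapted = v[0] * f_x1 + v[1] * f_x2 + f_t    # Eq. (10), n = 1

print("partial temporal derivative:", plain_temporal)
print("velocity-adapted derivative:", velocity_adapted)
```

When the velocity in the derivative operator matches the motion of the pattern, the velocity-adapted derivative responds only to genuine temporal changes of the pattern and not to the translation itself.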
In this context, “non-lagged neurons” correspond to first-order temporal derivatives, whereas “lagged neurons” correspond to second-order temporal derivatives, see also Ghodrati et al. ([54]) for a more extensive treatment of the properties of visual neurons in the lateral geniculate nucleus (LGN).
The spatio-temporal receptive fields of orientation selective simple cells in the primary visual cortex (V1) can, in turn, be modelled by idealized receptive fields \(T_{{\varphi}^{m} {\bar t}^n} \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \times {\mathbb{R}}_+ \times {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) of the form \[\label{eq-spat-temp-RF-model-der} T_{{\varphi}^{m} {\bar t}^n}(x, t;\; s, \Sigma, \tau, v) = \partial_{\varphi}^{m} \, \partial_{\bar t}^n \left( g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau) \right),\tag{12}\] where
\(\partial_{\varphi}\) denotes a directional derivative operator in one of the eigendirections of the spatial covariance matrix \(\Sigma\) according to (8),
\(\partial_{\bar t}\) denotes a velocity-adapted temporal derivative operator in the direction \(v\) according to (10), and
\(h(t;\; \tau)\) again represents a set of first-order truncated exponential kernels coupled in cascade, which in the most idealized situation may correspond to the time-causal limit kernel \(\Psi(t;\; \tau, c)\) according to (6);
see Lindeberg ([4]) Section 4.3 for explicit biological modelling results.
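A minimal sketch of the simple-cell model (12) for the lowest non-trivial orders \(m = 1\) and \(n = 0\) is given below. As a simplifying assumption, the temporal kernel \(h\) is here replaced by the non-causal Gaussian kernel (4) instead of a time-causal kernel, and the orientation \(\varphi\) is chosen along the eigendirection of \(\Sigma\) with the largest eigenvalue. The function names and parameter values are hypothetical. The closed-form directional derivative is compared against a central difference of the smoothing kernel (1).

```python
import numpy as np

def affine_gaussian(x1, x2, s, Sigma):
    """Affine Gaussian kernel g(x; s, Sigma) according to Eq. (2)."""
    P = np.linalg.inv(Sigma)
    q = P[0, 0] * x1**2 + 2.0 * P[0, 1] * x1 * x2 + P[1, 1] * x2**2
    return np.exp(-q / (2.0 * s)) / (2.0 * np.pi * s * np.sqrt(np.linalg.det(Sigma)))

def temporal_gaussian(t, tau):
    """Non-causal temporal Gaussian (4), used here as a stand-in for h."""
    return np.exp(-t**2 / (2.0 * tau)) / np.sqrt(2.0 * np.pi * tau)

def smoothing_kernel(x1, x2, t, s, Sigma, tau, v):
    """Velocity-adapted smoothing kernel T(x, t; s, Sigma, tau, v) of Eq. (1)."""
    return affine_gaussian(x1 - v[0] * t, x2 - v[1] * t, s, Sigma) * temporal_gaussian(t, tau)

def simple_cell_rf(x1, x2, t, s, Sigma, tau, v, phi):
    """Eq. (12) for m = 1 and n = 0, in closed form."""
    u1, u2 = x1 - v[0] * t, x2 - v[1] * t
    w = np.array([np.cos(phi), np.sin(phi)]) @ np.linalg.inv(Sigma)
    # Differentiating exp(-u^T Sigma^{-1} u / (2 s)) along e_phi
    # brings down the factor -(e_phi^T Sigma^{-1} u) / s.
    return -(w[0] * u1 + w[1] * u2) / s * smoothing_kernel(x1, x2, t, s, Sigma, tau, v)

# Illustrative (assumed) parameter values.
s, tau = 1.5, 0.8
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
v = np.array([0.6, -0.2])

# Orientation along the eigendirection of Sigma with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
e_phi = eigenvectors[:, -1]
phi = np.arctan2(e_phi[1], e_phi[0])

# Compare with a central difference of T along e_phi at one sample point.
x1, x2, t, eps = 0.7, -0.4, 0.5, 1e-4
closed_form = simple_cell_rf(x1, x2, t, s, Sigma, tau, v, phi)
finite_diff = (
    smoothing_kernel(x1 + eps * e_phi[0], x2 + eps * e_phi[1], t, s, Sigma, tau, v)
    - smoothing_kernel(x1 - eps * e_phi[0], x2 - eps * e_phi[1], t, s, Sigma, tau, v)
) / (2.0 * eps)

print("closed form:", closed_form, "finite difference:", finite_diff)
```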
Figures 2–5 show variabilities of idealized receptive fields according to this model under (i) spatial scaling transformations, (ii) spatial affine transformations, (iii) Galilean transformations and (iv) temporal scaling transformations.
This paper addresses the problem of modelling the effects, that joint compositions of these types of image transformations have upon receptive field responses, as well as how such joint compositions of these geometric image transformations can be interpreted geometrically, for multi-view observations of dynamic scenes.
| Section | Topic | Contribution |
|---|---|---|
| 3.1 | Spatial scale-normalized derivatives over isotropic scale space | Mainly review of Lindeberg ([55]) |
| 3.2 | Covariance properties of isotropic derivatives under pure spatial scaling transformations | Mainly review of Lindeberg ([55]) |
| 3.3 | Spatial affine scale-normalized directional derivatives over affine scale space | New |
| 3.4 | Covariance properties of affine directional derivatives under spatial affine transformations | New |
| 3.5 | Spatial scale-normalized affine gradient operator over affine scale space | New |
| 3.6 | Covariance properties of scale-normalized affine gradient under spatial affine transformations | New |
| 3.7 | Spatial scale-normalized affine Hessian operator over affine scale space | New |
| 3.8 | Covariance properties of scale-normalized affine Hessian under spatial affine transformations | New |
| 3.9 | Temporal scale-normalized derivatives over temporal scale space | Generalization of Lindeberg ([56]) |
| 3.10 | Covariance properties of temporal derivatives under temporal scaling transformations | Generalization of Lindeberg ([56]) |
| 3.11 | Spatio-temporal scale-normalized velocity-adapted derivatives over spatio-temporal scale space | Extension of Lindeberg ([2]) |
| 3.12 | Covariance property of velocity-adapted derivatives under joint scaling transformations | New |
When computing spatial and temporal derivatives from spatio-temporally smoothed video data, as obtained by convolution of the video data with the spatio-temporal smoothing kernel (1), a basic observation is that the magnitudes of the computed spatio-temporal derivatives will decrease with increasing values of the spatial and the temporal scale parameters. To handle this problem, and to enable the definition of spatial and temporal derivative operators that are truly covariant with regard to variations of the spatial and the temporal scale parameters, which occur as parameters in the models of the spatio-temporal receptive fields, we will make use of scale-normalized derivative operators.
In this section, we will state the definitions of such scale-normalized derivative operators, regarding both spatial and temporal derivatives over different types of spatial, temporal or spatio-temporal domains, and show how this notion leads to basic covariance properties under different types of individual spatial and temporal scaling transformations.
For the purely spatial scale-normalized derivatives of the isotropic scale-space representation, based on convolutions with rotationally symmetric Gaussian kernels, as well as for the purely temporal scale-space representations, defined from convolutions with either 1-D non-causal temporal Gaussian kernels or the time-causal limit kernel, the corresponding treatments in Sections 3.1–3.2 and 3.9–3.10 will largely be reviews of previously existing results. These treatments do, however, add explicit formulations of scale-normalized directional derivatives and gradient vectors over the spatial domain in Sections 3.1–3.2, as covered by the general theory in Lindeberg ([55]) but not explicitly stated there, as well as extensions to scale-normalized temporal derivatives for more general families of scale-covariant temporal smoothing kernels in Sections 3.9–3.10. These results are restated as well as extended here to constitute foundations for the further developments building on them, as well as to provide a balance in regard to the new theoretical formulations, as detailed further below.2
For the purely spatial scale-normalized derivative operators, based on convolutions with anisotropic Gaussian kernels, the treatments in Sections 3.3–3.8 will, on the other hand, present a set of novel theoretical constructions regarding affine-extended scale-normalized derivative operators, which will then allow for provable covariance properties over either different subgroups of the affine group, or the full affine group, depending on the different formulations of affine-extended scale-normalized derivative operators.
Additionally, regarding the notion of scale-normalized velocity-adapted derivatives over a joint spatio-temporal domain, Sections 3.11–3.12 will state corresponding covariance properties of such spatio-temporal derivative operations, which have not been sufficiently formulated in previous work.
The scale-normalized spatial and temporal derivative operators, defined in these ways, will then be essential to obtain true covariance properties with associated transformation properties of a particularly simple form, under the different classes of locally linearized joint spatio-temporal image transformations, that we will consider between the image data obtained from different views of the same scene.
Later in Section 5, we will then consider specific compositions of these primitive individual types of geometric image transformations studied in this section.
Table 1 provides a comprehensive overview of the different types of contributions, that will follow in this section.
Given 2-D spatial image data \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\), for a regular (isotropic) Gaussian scale-space representation \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\), defined by convolution with rotationally symmetric Gaussian kernels \(g(\cdot;\; s, I)\) at scale \(s \in {\mathbb{R}}_+\), for which the covariance matrix \(\Sigma \in {\mathbb{S}}_+^2\) in (2 ) is a unit matrix \(\Sigma = I\), \[L(\cdot;\; s) = g(\cdot;\; s, I) * f(\cdot),\] the notion of scale-normalized derivative operators corresponding to the regular partial derivative operators (7 ), to be used at the spatial scale level \(s \in {\mathbb{R}}_+\) in the corresponding spatial scale-space representation, can be defined according to Lindeberg ([55]) \[\label{eq-def-spat-part-der-sc-norm-basic} \partial_{x^{\alpha},\scriptsizenorm} = s^{(\alpha_1 + \alpha_2)/2} \, \partial_{x_1}^{\alpha_1} \, \partial_{x_2}^{\alpha_2}.\tag{13}\] The corresponding scale-normalized analogues of the directional derivative operators (8 ) in the direction \(\varphi\) will then be of the form \[\label{eq-dir-der-def-sc-norm-basic} \partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, (\cos \varphi \, \partial_{x_1} + \, \sin \varphi \, \partial_{x_2})^m = s^{m/2} \, (e_{\varphi}^T \, \nabla_x)^m,\tag{14}\] and the corresponding scale-normalized spatial gradient operator will be \[\label{eq-nabla-op} \nabla_{x,\scriptsizenorm} = s^{1/2} \, \nabla_x.\tag{15}\] By multiplying the regular spatial derivative operators by the scale parameter raised to a suitable power, proportional to the order of spatial differentiation, the scale-normalized spatial derivative concept will in this way compensate for the otherwise general decrease in the magnitude of the spatially smoothed spatial derivatives with increasing spatial scales, to enable truly scale-covariant spatial derivative operators, whose magnitudes can be perfectly matched under spatial scaling transformations, as will be 
detailed in the next section.
Consider next a spatial scaling transformation \[f'(x') = f(x) \quad\quad\text{for}\quad\quad x' = S_x \, x,\] with the spatial scaling factor \(S_x \in {\mathbb{R}}_+\), together with the corresponding isotropic Gaussian purely spatial scale-space representations \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) of \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) and \(f' \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\), respectively, defined according to \[\begin{aligned} L(\cdot;\; s) & = g(\cdot;\; s, I) * f(\cdot), \\ L'(\cdot;\; s') & = g(\cdot;\; s', I) * f'(\cdot). \end{aligned}\] As shown in Lindeberg ([55]) Equation (16), these scale-space representations obey spatial scale covariance over the underlying spatial smoothing transformation, such that \[L'(x';\; s') = L(x;\; s)\] holds for matching values of the spatial scale parameters according to \[s' = S_x^2 \, s.\] Given the definitions of the scale-normalized partial derivatives (13) and the directional derivatives (14) in the direction \(\varphi\) in the original domain, let us define corresponding scale-normalized spatial derivatives over the transformed spatial domain according to \[\begin{aligned} \partial_{{x'}^{\alpha},\scriptsizenorm} & = {s'}^{(\alpha_1 + \alpha_2)/2} \, \partial_{x'_1}^{\alpha_1} \, \partial_{x'_2}^{\alpha_2}, \\ \partial_{\varphi',\scriptsizenorm}^m & = {s'}^{m/2} \, (e_{\varphi'}^T \, \nabla_{x'})^m, \end{aligned}\] where
\(e_{\varphi'} = (\cos \varphi', \sin \varphi')^T\) denotes the corresponding unit vector in the transformed direction \(\varphi' \in {\mathbb{R}}\) after the image transformation,
the entity \[\nabla_{x'} = \left ( \begin{array}{c} \partial_{x'_1} \\ \partial_{x'_2} \end{array} \right)\] denotes the transformed gradient operator, and
here the angles for the directional derivatives are not affected by the uniform spatial scaling transformation, such that \[\varphi' = \varphi.\]
Let us also define the corresponding scale-normalized gradient operator over the transformed domain according to \[\nabla_{x',\scriptsizenorm} = {s'}^{1/2} \, \nabla_{x'}.\] Then, since the scale-normalized spatial derivative operators over the two respective domains will be related according to (see Lindeberg ([55]) Equation (20) for the specific choice of the scale normalization power \(\gamma = 1\) in that paper) \[\partial_{x'_i,\scriptsizenorm} = \partial_{x_i,\scriptsizenorm},\] it follows that the scale-normalized spatial derivatives of the transformed spatial scale-space representation \(L'\) will be equal to the scale-normalized spatial derivatives of the original spatial scale-space representation \(L\), such that \[L'_{{x'}^{\alpha},\scriptsizenorm}(x';\; s') = L_{{x}^{\alpha},\scriptsizenorm}(x;\; s),\] \[(\nabla_{x',\scriptsizenorm} L')(x';\; s') = (\nabla_{x,\scriptsizenorm} L)(x;\; s), \tag{16}\] \[L'_{{\varphi'}^m,\scriptsizenorm}(x';\; s') = L_{\varphi^m,\scriptsizenorm}(x;\; s), \tag{17}\] which thus constitute covariance properties for scale-normalized spatial derivatives of an isotropic purely spatial scale-space representation under spatial scaling transformations.
When interpreted geometrically, these spatial scale covariance properties mean that, if we observe the same local surface patch from different distances, while keeping the viewing direction constant, then the scale-normalized spatial derivative responses can, to first order of approximation, be perfectly matched between the different views.
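As a minimal numerical sketch of these covariance properties, consider the simplified 1-D case of a sine wave \(f(x) = \sin(\omega x)\), for which smoothing with a Gaussian kernel of variance \(s\) has the closed form \(L(x;\; s) = e^{-\omega^2 s/2} \sin(\omega x)\). The restriction to 1-D, the function name and the parameter values are assumptions made here for illustration only.

```python
import math

def smoothed_sine_derivative(x, s, omega, order, normalized):
    """Order-th derivative at x of sin(omega * x) smoothed by a Gaussian of variance s.

    Assumption for illustration: a 1-D sine wave, for which Gaussian smoothing
    has the closed form exp(-omega^2 s / 2) * sin(omega * x).
    """
    attenuation = math.exp(-omega ** 2 * s / 2.0)
    # Each differentiation multiplies by omega and shifts the phase by pi/2.
    value = attenuation * omega ** order * math.sin(omega * x + order * math.pi / 2.0)
    # Scale normalization multiplies by s^(order/2).
    return s ** (order / 2.0) * value if normalized else value

S_x = 2.5                      # spatial scaling factor (arbitrary choice)
omega, s, x = 1.3, 0.8, 0.4    # original frequency, scale and position

# Matching parameters in the transformed domain: f'(x') = f(x'/S_x), s' = S_x^2 s.
omega_p, s_p, x_p = omega / S_x, S_x ** 2 * s, S_x * x

for order in (1, 2, 3):
    raw = smoothed_sine_derivative(x, s, omega, order, normalized=False)
    raw_p = smoothed_sine_derivative(x_p, s_p, omega_p, order, normalized=False)
    norm = smoothed_sine_derivative(x, s, omega, order, normalized=True)
    norm_p = smoothed_sine_derivative(x_p, s_p, omega_p, order, normalized=True)
    # Raw derivatives differ by the factor S_x^(-order); normalized ones agree.
    print(order, raw_p / raw, norm_p - norm)
```

In this sketch, the unnormalized derivatives differ between the two domains by the factor \(S_x^{-m}\) for derivative order \(m\), whereas the scale-normalized derivatives agree up to rounding errors.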
For the later developments in this paper, we do, in addition to the above isotropic scale-normalized derivative concept, for spatial scale-space representations based on convolutions with rotationally symmetric Gaussian kernels, also need to define scale-normalized spatial derivative operators for spatial derivatives that are to be computed based on an affine Gaussian scale-space representation \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) of any 2-D image \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\), obtained by spatial smoothing with anisotropic affine Gaussian kernels \(g \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\), according to (2 ) \[\label{eq-def-aff-scsp} L(\cdot;\; s, \Sigma) = g(\cdot;\; s, \Sigma) * f(\cdot),\tag{18}\] that is, based on (symmetric and positive definite) \(2 \times 2\) spatial covariance matrices \(\Sigma \in {\mathbb{S}}_+^2\) that are not generally equal to a unit matrix \(I\). For this reason, we will in the following extend the above scale-normalized derivative concept from an isotropic Gaussian scale-space representation to an affine Gaussian scale-space representation in different ways.
In this section, we will first develop such a notion of scale-normalized derivatives for directional derivatives defined from an affine Gaussian scale-space representation. Later, we will then develop the other notions of the scale-normalized affine gradient operator (see Section 3.5) and the scale-normalized affine Hessian operator (see Section 3.7).
Given an affine spatial scale-space representation \(L(x;\; s, \Sigma)\), that has been computed according to (18 ), we define the affine scale-normalized directional derivative operator in the direction \(\varphi\), with the unit vector in this direction denoted \(e_{\varphi}\), according to \[\label{eq-aff-sc-norm-dir-der} \partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m.\tag{19}\] A general motivation for this definition, is that the entity \(e_{\varphi}^T \, \Sigma \, e_{\varphi}\) should, disregarding the effect of the complementary scalar scale parameter \(s\), reflect the amount of spatial smoothing, that convolution with an affine Gaussian kernel with spatial covariance matrix \(\Sigma\) corresponds to, when measured in the spatial direction \(e_{\varphi}\) only; see Figure 6 for an illustration.

Figure 6: The definition of the affine scale-normalized derivative operator according to (19) is based on extracting the amount of spatial smoothing in the direction \(e_{\varphi} = (\cos \varphi, \sin \varphi)\) of the spatial covariance matrix \(\Sigma\), here visualized in terms of an intersection of an ellipse representation of the spatial covariance matrix \(\Sigma\).
Specifically, if the coordinate system is rotated in such a way that the covariance matrix becomes a diagonal matrix \(\Sigma = {\operatorname{diag}}(\lambda_1, \lambda_2)\) with \(\lambda_1 \in {\mathbb{R}}_+\) and \(\lambda_2 \in {\mathbb{R}}_+\), then if the unit vector \(e_{\varphi}\) for computing the directional derivative is selected equal to one of the eigenvectors \(e_i\) of the spatial covariance matrix \(\Sigma\), then the entity \(e_{\varphi}^T \, \Sigma \, e_{\varphi}\) will select the eigenvalue \(\lambda_i\) of the spatial covariance matrix corresponding to that eigenvector \(e_i\). Since the affine Gaussian convolution operation in such a configuration will reduce to separable smoothing with 1-D Gaussian kernels with spatial scale parameters \(\lambda_1\) and \(\lambda_2\) along the spatial eigendirections \(e_1\) and \(e_2\), respectively, this implies that the resulting spatial scale normalization factor \(\lambda_i^{m/2}\), for spatial derivatives of orders \(m\) in the eigendirection \(e_i\) of the spatial covariance matrix \(\Sigma\), will, for the stated definition, correspond precisely to the regular spatial scale normalization factor \(\lambda_i^{m/2}\) for the separated 1-D scale-space representations along the respective orthogonal eigendirections of the spatial covariance matrix.
Furthermore, we can note that in the special case of choosing the spatial covariance matrix \(\Sigma\) equal to the unit matrix \(I\), the affine scale-normalized directional derivative operator \(\partial_{\varphi,\scriptsizenorm}^m\) according to (19 ) reduces to the previously defined isotropic scale-normalized directional derivative operator \(\partial_{\varphi,\scriptsizenorm}^m\) according to (14 ).
For the special case of a regular Gaussian scale-space representation, defined by convolution with rotationally symmetric Gaussian kernels for \(\Sigma = I\), this new definition of affine scale-normalized Gaussian directional derivatives is therefore consistent with the previously formulated notion of isotropic scale-normalized directional derivatives.
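The behaviour of the normalization factor \(s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2}\) in (19) can be sketched numerically as follows. The function name `affine_norm_factor` and the parameter values are hypothetical choices made for this illustration.

```python
import math

def affine_norm_factor(s, Sigma, phi, m):
    """Return s^(m/2) * (e^T Sigma e)^(m/2) for e = (cos phi, sin phi).

    Sigma is a symmetric positive definite 2x2 matrix given as nested tuples.
    """
    e1, e2 = math.cos(phi), math.sin(phi)
    (a, b), (_, c) = Sigma
    # Amount of smoothing measured along the direction e only.
    directional_variance = a * e1 * e1 + 2.0 * b * e1 * e2 + c * e2 * e2
    return (s * directional_variance) ** (m / 2.0)

s, m = 1.5, 2
identity = ((1.0, 0.0), (0.0, 1.0))
diagonal = ((4.0, 0.0), (0.0, 0.25))   # eigenvalues lambda_1 = 4, lambda_2 = 1/4

# Sigma = I: the factor reduces to the isotropic factor s^(m/2) for any direction.
isotropic = affine_norm_factor(s, identity, 0.7, m)
# Eigendirections of a diagonal Sigma: the factor picks out (s * lambda_i)^(m/2).
along_e1 = affine_norm_factor(s, diagonal, 0.0, m)
along_e2 = affine_norm_factor(s, diagonal, math.pi / 2.0, m)
print(isotropic, along_e1, along_e2)
```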
In the following, we will analyze and formulate explicit covariance properties for this notion of affine scale-normalized directional derivatives, for the cases of two important subgroups of the affine group. By necessity, the mathematical details may be somewhat technical. The reader more interested in the general overall results than the details of the derivations should, however, without major loss of continuity, be able to skip this treatment and continue with Section 3.9, while noting the three main theoretical results below, summarized under the boldface headers “Summary of main result”, with the corresponding geometric illustrations of the first two main results in Figures 7 and 8.
Consider a spatial affine transformation between two 2-D images \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) and \(f' \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) of the form \[\label{eq-spat-aff-transf-def-aff-sc-norm-ders} f'(x') = f(x) \quad\quad\text{for}\quad\quad x' = A \, x,\tag{20}\] where \(A\) is a non-singular \(2 \times 2\) affine transformation matrix, and with the corresponding affine Gaussian scale-space representations \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) of \(f\) and \(f'\), respectively, defined according to \[L(\cdot;\; s, \Sigma) = g(\cdot;\; s, \Sigma) * f(\cdot), \tag{21}\] \[L'(\cdot;\; s', \Sigma') = g(\cdot;\; s', \Sigma') * f'(\cdot), \tag{22}\] with matching values of the spatial scale parameters \(s \in {\mathbb{R}}_+\) and \(s' \in {\mathbb{R}}_+\) as well as the \(2 \times 2\) spatial covariance matrices \(\Sigma \in {\mathbb{S}}_+^2\) and \(\Sigma' \in {\mathbb{S}}_+^2\) over the two domains, such that3 \[\label{eq-transf-prop-sc-par-spat-cov-mat-pure-aff-scsp} s' \, \Sigma' = s \, A \, \Sigma \, A^T,\tag{23}\] which then implies that the affine scale-space representations will be equal for these matching values of the scale parameters and the covariance matrices (Lindeberg and Gårding ([18]) Equation (29)) \[\label{eq-equal-aff-scsp-repr-def-aff-sc-norm-ders} L'(x';\; s', \Sigma') = L(x;\; s, \Sigma).\tag{24}\] Let us, in analogy with the definition of affine scale-normalized derivatives along the direction \(\varphi\) in the original domain according to (19), define affine scale-normalized directional derivatives along the direction \(\varphi'\) in the transformed domain according to \[\label{eq-aff-sc-norm-dir-der-prim} \partial_{\varphi',\scriptsizenorm}^m = {s'}^{m/2} \, (e_{\varphi'}^T \, \Sigma' \, e_{\varphi'})^{m/2} \, \partial_{\varphi'}^m.\tag{25}\] Replacing \(s' \, \Sigma'\) in this expression by \(s \, A \, \Sigma \, A^T\) according to (23), and using the following transformation property of the unit vectors \[\label{eq-match-unit-vectors-spat-aff} e_{\varphi'} = \frac{A \, e_{\varphi}}{\| A \, e_{\varphi} \| },\tag{26}\] while additionally noting that \[\label{eq-norm-aff-transf-unit-vector} \| A \, e_{\varphi} \|^2 = e_{\varphi}^T \, A^T A \, e_{\varphi},\tag{27}\] implies that we can rewrite (25) as \[\label{eq-aff-sc-norm-dir-der-prim-rewrite} \partial_{\varphi',\scriptsizenorm}^m = {s}^{m/2} \, \left( \frac{e_{\varphi}^T \, A^T A \, \Sigma \, A^T A \, e_{\varphi}}{e_{\varphi}^T \, A^T A \, e_{\varphi}} \right)^{m/2} \partial_{\varphi'}^m.\tag{28}\] From the definitions of the regular directional derivative operators over the respective image domains \[\begin{aligned} \partial_{\varphi} & = e_{\varphi}^T \, \nabla_{x}, \\ \partial_{\varphi'} & = e_{\varphi'}^T \, \nabla_{x'}, \end{aligned}\] and the transformation property4 of the spatial gradient operator under the spatial affine transformation (20) \[\nabla_x = A^T \, \nabla_{x'} \quad\quad\text{implying that}\quad\quad \nabla_{x'} = A^{-T} \, \nabla_{x},\] we can after some simplifications obtain that the regular directional derivative operators in the two domains are related according to \[\partial_{\varphi'} = \frac{\partial_{\varphi}}{\| A \, e_{\varphi} \| }.\] To make a preliminary summary, while again making use of the relationship (27), this means that the affine scale-normalized directional derivative operators over the two image domains can be written as \[\partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m, \tag{29}\] \[\partial_{\varphi',\scriptsizenorm}^m = {s}^{m/2} \, \left( \frac{e_{\varphi}^T \, A^T A \, \Sigma \, A^T A \, e_{\varphi}}{(e_{\varphi}^T \, A^T A \, e_{\varphi})^2} \right)^{m/2} \partial_{\varphi}^m. \tag{30}\]
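The simplification step leading from (28) to (30) can be spelled out as follows. This is a sketch written out from the definitions (26) and (27) and the transformation property of the gradient operator, with the shorthand \(G = A^T A\) introduced here only for brevity. Inserting \(e_{\varphi'} = A \, e_{\varphi} / \| A \, e_{\varphi} \|\) and \(\nabla_{x'} = A^{-T} \, \nabla_{x}\) into the definition of \(\partial_{\varphi'}\) gives \[\partial_{\varphi'} = e_{\varphi'}^T \, \nabla_{x'} = \frac{e_{\varphi}^T \, A^T A^{-T} \, \nabla_{x}}{\| A \, e_{\varphi} \|} = \frac{e_{\varphi}^T \, \nabla_{x}}{\| A \, e_{\varphi} \|} = \frac{\partial_{\varphi}}{\| A \, e_{\varphi} \|},\] such that \(\partial_{\varphi'}^m = (e_{\varphi}^T \, G \, e_{\varphi})^{-m/2} \, \partial_{\varphi}^m\). Combining this with the factor in (28) then gives \[\left( \frac{e_{\varphi}^T \, G \, \Sigma \, G \, e_{\varphi}}{e_{\varphi}^T \, G \, e_{\varphi}} \right)^{m/2} (e_{\varphi}^T \, G \, e_{\varphi})^{-m/2} = \left( \frac{e_{\varphi}^T \, G \, \Sigma \, G \, e_{\varphi}}{(e_{\varphi}^T \, G \, e_{\varphi})^2} \right)^{m/2},\] which is the factor that multiplies \(s^{m/2} \, \partial_{\varphi}^m\) in (30).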
To analyse the possibility of the expressions (29) and (30) leading to the same result, for the special case of spatial similarity transformations, let us insert \[A = S_x \, R,\] where \(S_x \in {\mathbb{R}}_+\) is a uniform spatial scaling factor and \(R\) is a \(2 \times 2\) rotation matrix with \(R^T \, R = R \, R^T = I\), such that \(A^T A = S_x^2 \, I\), where \(I\) is the \(2 \times 2\) unit matrix, into the above expressions. Then, after simplification, (29) and (30) reduce to the identical expressions \[\partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m, \tag{31}\] \[\partial_{\varphi',\scriptsizenorm}^m = {s}^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m, \tag{32}\] implying that, when applied over their respective image domains, these two underlying affine scale-normalized directional derivative operators will lead to the same result \[\label{eq-aff-sc-norm-cov-sim-transf} \partial_{\varphi',\scriptsizenorm}^m L'(x';\; s', \Sigma') = \partial_{\varphi,\scriptsizenorm}^m L(x;\; s, \Sigma).\tag{33}\]
Summary of main result: To conclude, this result shows that, when applied to an affine Gaussian scale-space representation (18), defined by convolutions with arbitrary affine Gaussian kernels, the affine scale-normalized directional derivative concept, defined according to (25) \[\partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m,\] is for every direction \(e_{\varphi}\) in the image plane covariant under arbitrary combinations of uniform scaling transformations and rotations of the form \[f'(x') = f(x) \quad\quad\text{for}\quad\quad x' = S_x \, R \, x,\] such that for the matching image orientation \(e_{\varphi'} = R \, e_{\varphi}\), and with the transformed affine scale-normalized directional derivative operator in this direction defined according to \[\partial_{\varphi',\scriptsizenorm}^m = {s'}^{m/2} \, (e_{\varphi'}^T \, \Sigma' \, e_{\varphi'})^{m/2} \, \partial_{\varphi'}^m,\] then the relationship \[\label{eq-aff-norm-dir-der-sim-transf} \partial_{\varphi',\scriptsizenorm}^m L'(x';\; s', \Sigma') = \partial_{\varphi,\scriptsizenorm}^m L(x;\; s, \Sigma)\tag{34}\] will hold for all values of the spatial scale parameter \(s \in {\mathbb{R}}_+\) and the \(2 \times 2\) spatial covariance matrix \(\Sigma\) in the original domain, provided that the spatial scale parameter \(s' \in {\mathbb{R}}_+\) and the spatial covariance matrix \(\Sigma'\) over the transformed domain are related according to \[\label{eq-rel-sc-pars-cov-mats-cov-prop-aff-sc-norm-dir-der-sim-transf} s' \, \Sigma' = s \, A \, \Sigma \, A^T = s \, S_x^2 \, R \, \Sigma \, R^T.\tag{35}\]
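The equality of the normalization factors in (29) and (30) for similarity transformations, and their general inequality for other affine transformations, can be checked numerically along the following lines. This is a minimal sketch in plain Python; the helper names and all parameter values are assumptions made for illustration.

```python
import math

def matmul(P, Q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(P):
    return [[P[j][i] for j in range(2)] for i in range(2)]

def quad(e, M):
    """Quadratic form e^T M e."""
    return sum(e[i] * M[i][j] * e[j] for i in range(2) for j in range(2))

def factor_original(s, Sigma, e, m):
    """Normalization factor in (29)."""
    return (s * quad(e, Sigma)) ** (m / 2.0)

def factor_transformed(s, Sigma, A, e, m):
    """Normalization factor in (30), expressed in the original direction e."""
    G = matmul(transpose(A), A)                      # A^T A
    numerator = quad(e, matmul(matmul(G, Sigma), G))
    return (s * numerator / quad(e, G) ** 2) ** (m / 2.0)

s, m = 0.8, 2
Sigma = [[2.0, 0.3], [0.3, 0.5]]
e = [math.cos(0.7), math.sin(0.7)]

theta, S_x = 0.4, 1.7
similarity = [[S_x * math.cos(theta), -S_x * math.sin(theta)],
              [S_x * math.sin(theta), S_x * math.cos(theta)]]
general_affine = [[2.0, 0.5], [0.0, 1.0]]            # not a similarity transformation

reference = factor_original(s, Sigma, e, m)
gap_similarity = factor_transformed(s, Sigma, similarity, e, m) - reference
gap_general = factor_transformed(s, Sigma, general_affine, e, m) - reference
print(gap_similarity, gap_general)
```

For the similarity transformation, the two factors agree up to rounding errors, whereas for the chosen general affine matrix they differ, which is consistent with the covariance property (34) being stated for the similarity subgroup only.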

Figure 7: In terms of geometric interpretation, the covariance property (34) for the affine scale-normalized derivative operator (25) under similarity transformations means that, if two cameras view the same local surface patch along the same optical axis, while being at possibly different distances from the object and also having possibly rotated orientations of the image planes relative to each other around the optical axis, then the affine scale-normalized derivatives according to (31) and (32) can, to first order of approximation, be perfectly matched between such different views of the same local surface patch, provided that the scale parameters and the covariance matrices of the resulting spatial receptive fields are matched according to (35).
With regard to geometric interpretation of the above result for the special case of the similarity group, if we interpret the composed spatial image transformation as a locally linear approximation of the perspective mapping from the tangent plane of a local surface patch to the image plane, this covariance property under spatial similarity transformations implies that the affine scale-normalized directional derivative responses can, to first order of approximation, be perfectly matched between different views of the same local surface patch, when varying the distance between the viewed object and the observer, as well as when rotating either the camera or the object around the optical axis; see Figure 7 for an illustration.
Concerning the dimensionality of the manifold spanned by this covariance property, the spatial scale parameter \(s\) has dimensionality 1 and the spatial covariance matrix \(\Sigma\) has dimensionality 3. Due to the coupling of these parameters in the form \(s \, \Sigma\), they together do, however, only correspond to a variability over 3 effective dimensions in the parameter space. The affine transformation matrices according to the similarity group \(A = S_x \, R\) span 2 out of the 4 dimensions in the variability of the 2-D affine group. The degree of freedom in the direction \(\varphi\) adds 1 dimension to this space. Thus, this covariance result under spatial similarity transformations spans \(3 + 2 + 1 = 6\) out of the in total \(3 + 4 + 1 = 8\) dimensions in the variability of computing affine scale-normalized derivatives from an affine scale-space representation in different directions in the image plane, as spanned by the image transformations in the full affine group together with the possible parameter settings for the affine Gaussian spatial smoothing component in the spatial receptive fields.
Let us next assume that the eigenvalue decompositions of the spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\) are coupled, in such a way that \[\Sigma = U \, {\operatorname{diag}}(\lambda_1, \lambda_2) \, U^T, \tag{36}\] \[A = U \, {\operatorname{diag}}(S_1, S_2) \, U^T, \tag{37}\] where
\({\operatorname{diag}}(\lambda_1, \lambda_2)\) is a \(2 \times 2\) diagonal matrix with the eigenvalues \(\lambda_1 \in {\mathbb{R}}_+\) and \(\lambda_2 \in {\mathbb{R}}_+\) of the spatial covariance matrix \(\Sigma\),
\({\operatorname{diag}}(S_1, S_2)\) is a \(2 \times 2\) diagonal matrix with the eigenvalues \(S_1 \in {\mathbb{R}}_+\) and \(S_2 \in {\mathbb{R}}_+\) of the affine transformation matrix \(A\), and thus representing the two spatial scaling factors \(S_1\) and \(S_2\) of a non-uniform spatial scaling transformation, and
\(U\) is a \(2 \times 2\) real unitary matrix, such that \(U^T U = U\, U^T = I\), with its two columns \(e_1\) and \(e_2\) constituting the eigenvectors of both the spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\).
Note, however, that we do not here assume the eigenvalues to be ordered with respect to magnitude. Instead, we assume that the eigenvalues are ordered in such a way that the order of the eigenvectors is the same for both the spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\).
Given the expressions for the spatial covariance matrix \(\Sigma\) according to (36) and the affine transformation matrix \(A\) according to (37), the expression for the transformed spatial covariance matrix \(\Sigma'\) according to (23) does after simplification assume the form \[s' \, \Sigma' = s \, A \, \Sigma \, A^T = s \, U \, {\operatorname{diag}}(S_1^2 \, \lambda_1, S_2^2 \, \lambda_2) \, U^T,\] which does thus also share a coupled eigenvalue decomposition, in relation to the original spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\).
If we next choose the unit vector \(e_{\varphi}\), for defining the affine scale-normalized directional derivative operator \(\partial_{\varphi,\scriptsizenorm}\) in the original domain according to (19), equal to \(e_i\), where \(e_i\) is one of the unit vectors \(e_1\) or \(e_2\) in the above unitary matrix \(U\), which in turn for \(e_{\varphi} = e_i\) implies that also the transformed unit vector becomes \[e_{\varphi'} = \frac{A \, e_{\varphi}}{\| A \, e_{\varphi} \|} = \frac{U \, {\operatorname{diag}}(S_1, S_2) \, U^T e_i}{\| U \, {\operatorname{diag}}(S_1, S_2) \, U^T e_i \|} = e_i,\] and then make use of the result that \[A^T A = U \, {\operatorname{diag}}(S_1^2, S_2^2) \;U^T,\] we then obtain that the expression \[e_{\varphi}^T \, \Sigma \, e_{\varphi} = e_i^T \, U \, {\operatorname{diag}}(\lambda_1, \lambda_2) \;U^T e_i\] in the expression for the affine scale-normalized derivative operator \(\partial_{\varphi,\scriptsizenorm}\) in (29) for the unit direction \(e_{\varphi} = e_i\) reduces to \[e_{\varphi}^T \, \Sigma \, e_{\varphi} = \lambda_i.\] It also follows that the factor5 \[\frac{e_{\varphi}^T \, A^T A \, \Sigma \, A^T A \, e_{\varphi}}{(e_{\varphi}^T \, A^T A \, e_{\varphi})^2}\] in the expression for the affine scale-normalized derivative operator \(\partial_{\varphi',\scriptsizenorm}\) in (30) for the transformed unit direction \(e_{\varphi'} = e_i\) reduces to \[\frac{e_{\varphi}^T \, A^T A \, \Sigma \, A^T A \, e_{\varphi}}{(e_{\varphi}^T \, A^T A \, e_{\varphi})^2} = \frac{S_i^4 \, \lambda_i}{(S_i^2)^2} = \lambda_i.\] We therefore, for \(e_{\varphi} = e_{\varphi'} = e_i\), have that \[\partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, \lambda_i^{m/2} \, \partial_{\varphi}^m, \tag{38}\] \[\partial_{\varphi',\scriptsizenorm}^m = {s}^{m/2} \, \lambda_i^{m/2} \, \partial_{\varphi}^m, \tag{39}\] implying that, when applied over their respective image domains, these two underlying affine scale-normalized directional derivative operators will, for \(e_{\varphi} = e_{\varphi'} = e_i\), lead to the same result \[\partial_{\varphi',\scriptsizenorm}^m L'(x';\; s', \Sigma') = \partial_{\varphi,\scriptsizenorm}^m L(x;\; s, \Sigma).\]
Summary of main result: To summarize, this result shows that provided that we assume that the eigenvalue decompositions of the \(2 \times 2\) spatial covariance matrix \(\Sigma\) and the \(2 \times 2\) affine transformation matrix \(A\) are coupled according to (36) and (37) \[\Sigma = U \, {\operatorname{diag}}(\lambda_1, \lambda_2) \, U^T, \tag{40}\] \[A = U \, {\operatorname{diag}}(S_1, S_2) \, U^T, \tag{41}\] where \(U\) is some real \(2 \times 2\) unitary matrix, and provided that we then apply the affine scale-normalized directional derivative operator according to (25) \[\partial_{\varphi,\scriptsizenorm}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m\] in a unit direction \(e_{\varphi} = e_i\), chosen as either of the two columns \(\{e_1, e_2 \}\) in the unitary matrix \(U\) in the above eigendecompositions, to the affine Gaussian scale-space representation \(L(x;\; s, \Sigma)\) in the original domain, then for arbitrary choices of the non-uniform scaling matrix \({\operatorname{diag}}(S_1, S_2)\) in spatial image transformations of the form \[f'(x') = f(x) \quad\quad\text{for}\quad\quad x' = U \, {\operatorname{diag}}(S_1, S_2) \, U^T x,\] for the matched image orientation \(e_{\varphi'} = e_i\) corresponding to the same unit vector in the unitary matrix \(U\), and with the transformed affine scale-normalized directional derivative operator in this direction defined according to \[\label{eq-def-aff-sc-norm-ders-cov-prop-coupl-eigendecomp} \partial_{\varphi',\scriptsizenorm}^m = {s'}^{m/2} \, (e_{\varphi'}^T \, \Sigma' \, e_{\varphi'})^{m/2} \, \partial_{\varphi'}^m,\tag{42}\] the relationship \[\label{eq-cov-prop-aff-sc-norm-dir-ders-coupl-eigen-decomp} \partial_{\varphi',\scriptsizenorm}^m L'(x';\; s', \Sigma') = \partial_{\varphi,\scriptsizenorm}^m L(x;\; s, \Sigma)\tag{43}\] will hold over the affine Gaussian scale-space representations \(L(x;\; s, \Sigma)\) and \(L'(x';\; s', \Sigma')\) of \(f(x)\) and \(f'(x')\), respectively, for all values of the spatial scale parameter \(s \in {\mathbb{R}}_+\) and the \(2 \times 2\) spatial covariance matrix \(\Sigma\) in the original domain, provided that the spatial scale parameter \(s' \in {\mathbb{R}}_+\) and the \(2 \times 2\) spatial covariance matrix \(\Sigma'\) in the transformed domain are related according to \[s' \, \Sigma' = s \, U \, {\operatorname{diag}}(S_1^2 \, \lambda_1, S_2^2 \, \lambda_2) \, U^T.\]
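The algebraic steps behind this result for coupled eigendecompositions can be checked numerically with a small sketch in plain Python. The helper names, the rotation angle and the numerical values of the eigenvalues and scaling factors are arbitrary choices made for illustration.

```python
import math

def matmul(P, Q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(P):
    return [[P[j][i] for j in range(2)] for i in range(2)]

def quad(e, M):
    """Quadratic form e^T M e."""
    return sum(e[i] * M[i][j] * e[j] for i in range(2) for j in range(2))

def from_eigen(U, d1, d2):
    """U diag(d1, d2) U^T."""
    return matmul(matmul(U, [[d1, 0.0], [0.0, d2]]), transpose(U))

theta = 0.6                                   # orientation of the shared eigenvectors
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
lambdas, scalings, s = (3.0, 0.5), (0.4, 2.0), 1.2

Sigma = from_eigen(U, *lambdas)
A = from_eigen(U, *scalings)

# s' Sigma' = s A Sigma A^T should equal s U diag(S_i^2 lambda_i) U^T.
lhs = matmul(matmul(A, Sigma), transpose(A))
rhs = from_eigen(U, *(S ** 2 * lam for S, lam in zip(scalings, lambdas)))
cov_error = max(abs(s * (lhs[i][j] - rhs[i][j])) for i in range(2) for j in range(2))

G = matmul(transpose(A), A)                   # A^T A
direction_errors, factor_errors = [], []
for i in (0, 1):
    e_i = [U[0][i], U[1][i]]
    Ae = [A[0][0] * e_i[0] + A[0][1] * e_i[1], A[1][0] * e_i[0] + A[1][1] * e_i[1]]
    norm = math.hypot(Ae[0], Ae[1])
    # The transformed unit direction (26) should coincide with e_i.
    direction_errors.append(max(abs(Ae[k] / norm - e_i[k]) for k in range(2)))
    # The factor in (30) should reduce to the eigenvalue lambda_i.
    factor = quad(e_i, matmul(matmul(G, Sigma), G)) / quad(e_i, G) ** 2
    factor_errors.append(abs(factor - lambdas[i]))

print(cov_error, direction_errors, factor_errors)
```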
Interpreted geometrically, the above result for the special case of coupled eigenvalue decompositions has a special meaning when the spatial affine transformations constitute locally linearized projections from the tangent plane of a local surface patch to the image domain, for different viewing directions in relation to the local surface normal. Then, the special form \(x' = U \, {\operatorname{diag}}(S_1, S_2) \, U^T x\) of the image transformation corresponds to the unitary matrix \(U^T\) transforming the original coordinate frame to a new coordinate frame, in which the spatial affine image transformation \(A\) reduces to a pure non-uniform scaling transformation \({\operatorname{diag}}(S_1, S_2)\). The preferred choice of image orientation \(e_{\varphi} = e_{\varphi'}\) will then specifically correspond to either the tilt direction\(^{6}\) or its perpendicular direction. Varying the spatial scaling factor \(S_i\) that corresponds to the tilt direction then corresponds to varying the slant angle between the direction of the local surface normal at the observation point and the viewing direction.
This derived covariance property does, thus, mean that the affine scale-normalized directional derivative responses computed in the tilt direction can, to first order of approximation, be perfectly matched, when varying the slant angle of the surface, as well as when varying the distance between the object and the observer along the viewing direction; see Figure 8 for an illustration.

Figure 8: In terms of geometric interpretation, the covariance property (43) of affine scale-normalized derivatives according to (42), in the case of coupled eigendecompositions of the spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\) of the forms (40) and (41), means that if we consider a camera that views a smooth local surface patch, and then move the camera in the plane spanned by the backprojected tilt direction and the optical axis, then the affine scale-normalized directional derivative responses can, to first order of approximation, be perfectly matched between the resulting different views of the same local surface patch. In this way, we can thus, to first order of approximation, perfectly match the receptive field responses obtained for different slant angles relative to the viewing direction, provided that one of the eigendirections of the spatial covariance matrix \(\Sigma\), used for defining the spatial receptive fields, is parallel to the tilt direction, and that the relative motion of the observer between the different views also coincides with the backprojected tilt direction.
Concerning the dimensionality of the manifold spanned by this covariance property, with the coupled eigenvalue decompositions of the spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\) according to (40) and (41), the diagonal matrices \({\operatorname{diag}}(\lambda_1, \lambda_2)\) and \({\operatorname{diag}}(S_1, S_2)\) each span variabilities of dimensionality 2, to which the unitary matrix \(U\) adds another variability of dimensionality 1. The spatial scale parameter \(s\) does not add any effective dimensionality to the parameter space, because of its coupled occurrence in the product \(s \, \Sigma\). Similarly, the direction \(\varphi\) in the image plane does not add any dimensionality to the variability of the resulting manifold either, since the unit vector \(e_{\varphi}\) is determined from the unitary matrix \(U\). Thus, this covariance result, under coupled eigenvalue decompositions of the spatial covariance matrix \(\Sigma\) and the affine transformation matrix \(A\), spans 5 out of the in total 8 dimensions in the variability of computing affine scale-normalized derivatives \(\partial_{\varphi,\mathrm{norm}}\) from an affine scale-space representation \(L(x;\; s, \Sigma)\) in different directions \(\varphi\) in image space, as possible under the different types of image transformations spanned by the full affine group, as well as under the different parameter settings of the composed spatial covariance matrix \(s \, \Sigma\) that can be used for the affine Gaussian spatial smoothing component in the spatial receptive fields.
Out of these 5 dimensions, one dimension, corresponding to the ratio between the singular values of the affine transformation matrix \(A\), here manifested as the ratio \(S_1/S_2\) between the two spatial scaling factors, is new compared to the previously treated covariance result under spatial similarity transformations. In that respect, this covariance result adds important value to the previously formulated covariance result under spatial similarity transformations in Section 3.4.1.
In this section, we will analyze the abilities of the affine scale-normalized derivative operators according to (19) \[\label{eq-aff-sc-norm-dir-der-again} \partial_{\varphi,\mathrm{norm}}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m\tag{44}\] and (25) \[\label{eq-aff-sc-norm-dir-der-prim-again} \partial_{\varphi',\mathrm{norm}}^m = {s'}^{m/2} \, (e_{\varphi'}^T \, \Sigma' \, e_{\varphi'})^{m/2} \, \partial_{\varphi'}^m\tag{45}\] in the matching image directions \(\varphi\) and \(\varphi'\), respectively, according to the following relationship between their corresponding unit vectors (26) \[\label{eq-match-unit-vectors-spat-aff-again} e_{\varphi'} = \frac{A \, e_{\varphi}}{\| A \, e_{\varphi} \| },\tag{46}\] to allow for covariance under general affine transformations, where \(s \in {\mathbb{R}}_+\) and \(s' \in {\mathbb{R}}_+\) as well as \(\Sigma \in {\mathbb{S}}_+^2\) and \(\Sigma' \in {\mathbb{S}}_+^2\) are matching scale parameters and spatial covariance matrices, respectively, according to the relationship (23) \[\label{eq-transf-prop-sc-par-spat-cov-mat-pure-aff-scsp-again} s' \Sigma' = s \, A \, \Sigma \, A^T.\tag{47}\] To perform this analysis, we will start from the intermediate results in Equations (29) and (30), where we reduced the expressions for the affine scale-normalized directional derivative operator \(\partial_{\varphi,\mathrm{norm}}^m\) along the direction \(\varphi\) in the original domain according to (19) as well as the directional derivative operator \(\partial_{\varphi',\mathrm{norm}}^m\) along the direction \(\varphi'\) in the transformed domain according to (25) to the following forms: \[\begin{align} \partial_{\varphi,\mathrm{norm}}^m & = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m, \tag{48} \\ \partial_{\varphi',\mathrm{norm}}^m & = {s}^{m/2} \, \left( \frac{e_{\varphi}^T \, A^T A \, \Sigma \, A^T A \, e_{\varphi}}{(e_{\varphi}^T \, A^T A \, e_{\varphi})^2} \right)^{m/2} \partial_{\varphi}^m, \tag{49} \end{align}\] as defined over the affine Gaussian scale-space representations \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\), respectively, obtained by convolution with affine Gaussian kernels according to (18) of two 2-D images \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) and \(f' \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\), respectively, that are related according to a spatial affine transformation of the form (20) \[\label{eq-spat-aff-transf-def-aff-sc-norm-ders-again} f'(x') = f(x) \quad\quad\mbox{for}\quad\quad x' = A \, x,\tag{50}\] where \(\Sigma\) denotes any \(2 \times 2\) symmetric and positive definite covariance matrix and \(A\) denotes any non-singular \(2 \times 2\) affine transformation matrix.
From the expressions in Equations (48) and (49), we can therefore see that a necessary and sufficient condition for these affine scale-normalized derivatives to be fully affine covariant, such that \[\partial_{\varphi',\mathrm{norm}}^m L'(x';\; s', \Sigma') = \partial_{\varphi,\mathrm{norm}}^m L(x;\; s, \Sigma)\] would hold for all matching image points \(x' = A \, x\) according to (20), for all matching parameters of the receptive fields \(s' \Sigma' = s \, A \, \Sigma \, A^T\) according to (23), for all matching unit vectors \(e_{\varphi'} = A \, e_{\varphi}/\| A \, e_{\varphi} \|\) according to (26), as well as for all orders \(m\) of spatial differentiation, would be that the relationship \[\label{eq-cond-aff-cov-general-in-matrices} \frac{e_{\varphi}^T \, A^T A \, \Sigma \, A^T A \, e_{\varphi}}{(e_{\varphi}^T \, A^T A \, e_{\varphi})^2} = e_{\varphi}^T \, \Sigma \, e_{\varphi}\tag{51}\] would hold for all non-singular \(2 \times 2\) affine transformation matrices \(A\), for all \(2 \times 2\) symmetric and positive definite matrices \(\Sigma\), as well as for all 2-D unit vectors \(e_{\varphi}\).
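To make the condition (51) concrete, the following minimal sketch evaluates its two sides for a similarity transformation and for a shear transformation. It assumes NumPy; the function name `covariance_condition_sides` and the specific numerical values are hypothetical choices made only for this illustration.

```python
import numpy as np

def covariance_condition_sides(A, Sigma, phi):
    """Return (lhs, rhs) of condition (51) for the unit vector e_phi."""
    e = np.array([np.cos(phi), np.sin(phi)])
    M = A.T @ A
    lhs = (e @ M @ Sigma @ M @ e) / (e @ M @ e) ** 2
    rhs = e @ Sigma @ e
    return lhs, rhs

# Example covariance matrix (symmetric and positive definite).
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])

# A similarity transformation: uniform scaling times a rotation.
R = np.array([[np.cos(0.4), -np.sin(0.4)], [np.sin(0.4), np.cos(0.4)]])
sim_lhs, sim_rhs = covariance_condition_sides(1.7 * R, Sigma, 0.3)

# A shear transformation, which is outside the similarity group.
shear = np.array([[1.0, 0.8], [0.0, 1.0]])
shear_lhs, shear_rhs = covariance_condition_sides(shear, Sigma, 0.3)
```

For the similarity transformation, the two sides coincide up to rounding errors, whereas for the shear transformation they differ, which is consistent with the analysis that follows.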
In the following, we will explicitly show that such a relationship does not, however, hold generally, although we have previously shown, in Sections 3.4.1 and 3.4.3, that such a relationship holds for two subgroups of the full affine group that are very important for our purposes.
Since both the matrices \(\Sigma\) and \(A^T A\) are symmetric and positive definite, let us start by replacing these matrices with their eigenvalue decompositions \[\begin{align} \Sigma & = U \Lambda \, U^T, \tag{52} \\ A^T A & = V D \, V^T, \tag{53} \end{align}\] where \(U\) and \(V\) are real unitary \(2 \times 2\) matrices, and \(\Lambda\) and \(D\) are \(2 \times 2\) diagonal matrices with strictly positive elements. Then, the question of whether the relation (51) holds for all non-singular \(2 \times 2\) affine transformation matrices \(A\), all \(2 \times 2\) spatial covariance matrices \(\Sigma\) and all 2-D unit vectors \(e_{\varphi}\) can be reformulated into investigating whether \[\frac{e_{\varphi}^T \, V \, D \, V^T U \Lambda \, U^T V \, D \, V^T e_{\varphi}}{(e_{\varphi}^T \, V D \, V^T e_{\varphi})^2} = e_{\varphi}^T \, U \Lambda \, U^T e_{\varphi}\] holds for all \(2 \times 2\) unitary matrices \(U\) and \(V\), all \(2 \times 2\) diagonal matrices \(\Lambda\) and \(D\) and all 2-D unit vectors \(e_{\varphi}\). By, in turn, setting \[\begin{aligned} V^T e_{\varphi} & = e_{\psi}, \\ U^T V & = W, \end{aligned}\] which gives \[\begin{aligned} e_{\varphi} & = V^{-T} e_{\psi}, \\ U & = V^{-T} W^T, \end{aligned}\] this condition can, after a few simplifications, be reformulated into investigating whether the relationship \[\frac{e_{\psi}^T \, D \, W^T \Lambda \, W \, D \, e_{\psi}}{(e_{\psi}^T \, D \, e_{\psi})^2} = e_{\psi}^T \, W^T \Lambda \, W \, e_{\psi}\] holds for all \(2 \times 2\) unitary matrices \(W\), all \(2 \times 2\) diagonal matrices \(\Lambda\) and \(D\) and all 2-D unit vectors \(e_{\psi}\).
By further setting \[C = W^T \Lambda \, W,\] where \(C\) then becomes an arbitrary \(2 \times 2\) symmetric and positive definite matrix, the necessary and sufficient condition for affine covariance (51) can be reformulated as whether the relationship \[\label{eq-cond-aff-cov-general-in-matrices-transformed} e_{\psi}^T \, D \, C \, D \, e_{\psi} = (e_{\psi}^T \, C \, e_{\psi}) \, (e_{\psi}^T \, D \, e_{\psi})^2\tag{54}\] holds for all symmetric and positive definite \(2 \times 2\) matrices \(C\), all \(2 \times 2\) diagonal matrices \(D\) with positive elements and all 2-D unit vectors \(e_{\psi}\).
By further parameterizing these entities as \[\begin{aligned} C & = \left( \begin{array}{cc} c_{11} & c_{12} \\ c_{12} & c_{22} \end{array} \right), \\ D & = \left( \begin{array}{cc} d_1 & 0 \\ 0 & d_2 \end{array} \right), \\ e_{\psi} & = \left( \begin{array}{c} \cos \psi \\ \sin \psi \end{array} \right), \end{aligned}\] and then expanding (54) with respect to this parameterization, we can reduce the problem of investigating whether the necessary and sufficient condition for affine covariance (51) holds to investigating whether the following equation holds for all combinations of \(c_{ij} \in {\mathbb{R}}\), \(d_k \in {\mathbb{R}}_+\) and \(\psi \in {\mathbb{R}}\): \[\begin{aligned} c_{11} \, \cos^2 \psi & \left(d_1^2 \left(1 -\cos ^4\psi\right) \right. \\ & \left. \quad -2 \, d_1 \, d_2 \, \sin ^2\psi \, \cos ^2\psi \right. \\ & \left. \quad -d_2^2 \, \sin ^4\psi \right) \\ +c_{12} \, \cos \psi \, \sin \psi & \left(-2 \, d_1^2 \, \cos ^4\psi \right. \\ & \left. \quad +d_1 d_2 \left(2 -4 \, \sin ^2\psi \, \cos ^2\psi\right) \right. \\ & \left. \quad -2 \, d_2^2 \, \sin ^4\psi \right) \\ +c_{22} \, \sin^2 \psi \, & \left(-d_1^2 \, \cos ^4\psi \right. \\ & \left. \quad -2 \, d_1 \, d_2 \, \sin ^2 \psi \, \cos ^2 \psi \right. \\ & \left. \quad + d_2^2 \left(1 -\sin ^4\psi\right)\right) = 0. \end{aligned}\] Disregarding the singular cases when either \(\cos \psi = 0\) or \(\sin \psi = 0\), if this expression is to hold for all combinations of the parameters \(c_{ij}\), \(d_k\) and \(\psi\), then this specifically implies that the coefficients for each one of the matrix elements \(c_{ij}\) must be zero, implying that the following relations must hold for all combinations of \(d_k \in {\mathbb{R}}_+\) and \(\psi \in {\mathbb{R}}\): \[\begin{align} d_1^2 \left(1 -\cos ^4\psi\right) -2 \, d_1 \, d_2 \, \sin ^2\psi \, \cos ^2\psi - d_2^2 \, \sin ^4\psi & = 0, \tag{55} \\ -2 \, d_1^2 \, \cos ^4\psi +d_1 d_2 \left(2 -4 \sin ^2\psi \, \cos ^2\psi\right) -2 \, d_2^2 \, \sin ^4\psi & = 0, \notag \\ -d_1^2 \, \cos ^4\psi -2 \, d_1 \, d_2 \, \sin ^2 \psi \, \cos ^2 \psi +d_2^2 \left(1 -\sin ^4\psi\right) & = 0. \tag{56} \end{align}\] Subtracting Equation (56) from Equation (55) then leads to the following necessary condition for the affine scale-normalized directional derivatives (48) and (49) to be equal: \[d_1^2 - d_2^2 = 0.\] Given the restriction that the elements \(d_k\) of the diagonal matrix \(D = {\operatorname{diag}}(d_1, d_2)\) are to be non-negative, we thus have that the requirement, for the affine scale-normalized directional derivative operators according to (48) and (49) to be equal, implies that the affine covariance matrix \(\Sigma = U \Lambda \, U^T\) according to (52) must be an isotropic matrix, and that the matrix \(A^T A = V D \, V^T\) formed from the affine transformation matrix \(A\) according to (53) must also be an isotropic matrix, thus, in turn, implying that the affine transformation matrix \(A\) must be in the similarity group, corresponding to \(A = S_x \, R\), where \(S_x\) is a uniform spatial scaling factor and \(R\) is a rotation matrix.
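The expansion of (54) and the subtraction of (56) from (55) can be checked numerically with the following minimal sketch. It assumes NumPy; the function names `residual_54` and `coefficients` are hypothetical and introduced only for this check.

```python
import numpy as np

def residual_54(C, d1, d2, psi):
    """Left-hand side minus right-hand side of Eq. (54)."""
    e = np.array([np.cos(psi), np.sin(psi)])
    D = np.diag([d1, d2])
    return e @ D @ C @ D @ e - (e @ C @ e) * (e @ D @ e) ** 2

def coefficients(d1, d2, psi):
    """Bracketed coefficients multiplying c11, c12 and c22 in the expansion."""
    c2, s2 = np.cos(psi) ** 2, np.sin(psi) ** 2
    cross = 2 * d1 * d2 * s2 * c2
    k11 = d1**2 * (1 - c2**2) - cross - d2**2 * s2**2          # Eq. (55)
    k12 = (-2 * d1**2 * c2**2 + d1 * d2 * (2 - 4 * s2 * c2)
           - 2 * d2**2 * s2**2)
    k22 = -d1**2 * c2**2 - cross + d2**2 * (1 - s2**2)         # Eq. (56)
    return k11, k12, k22

# Example values: a generic C, unequal d1 and d2, and a generic angle.
C = np.array([[1.3, 0.4], [0.4, 0.9]])
d1, d2, psi = 2.0, 0.7, 0.6
k11, k12, k22 = coefficients(d1, d2, psi)
c, s = np.cos(psi), np.sin(psi)
expanded = C[0, 0] * c * c * k11 + C[0, 1] * c * s * k12 + C[1, 1] * s * s * k22
```

For these values, `expanded` agrees with `residual_54(C, d1, d2, psi)`, the difference `k11 - k22` equals \(d_1^2 - d_2^2\), and the residual vanishes when \(d_1 = d_2\).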
In this way, we have formally shown that the affine scale-normalized directional derivative operators according to (44 ) and (45 ) do not allow for covariance properties under general affine transformations.
Summary of main result: The affine scale-normalized derivative operators according to (44 ) and (45 ) do not allow for covariance under general affine transformations.
Thus, if we want to aim at full affine covariance, we have to consider some other definition of a scale-normalized derivative concept based on the affine scale-space representation, which we will now develop in the next section.
In this section, we will define another new type of scale-normalized spatial derivatives for an affine Gaussian scale-space representation, which, in contrast to the previous definition of affine scale-normalized derivatives, will lead to true covariance properties over the full group of spatial affine transformations.
Given any 2-D image \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\), let us again consider its affine Gaussian scale-space representation \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\), of the form \[\label{yzrdoeib} L(\cdot;\; s, \Sigma) = g(\cdot;\; s, \Sigma) * f(\cdot),\tag{57}\] generated by convolutions with anisotropic affine Gaussian kernels (2 ), based on \(2 \times 2\) spatial covariance matrices \(\Sigma \in {\mathbb{S}}_+^2\) that are not generally equal to a unit matrix.
Given an eigenvalue decomposition of the \(2 \times 2\) symmetric and positive definite spatial covariance matrix \(\Sigma\) of the form \[\label{eq-eigen-decomp-Sigma-aff-grad} \Sigma = U \Lambda \, U^T,\tag{58}\] where \(\Lambda = {\operatorname{diag}}(\lambda_1, \lambda_2)\) is a \(2 \times 2\) diagonal matrix with positive elements, and \(U\) is a \(2 \times 2\) real unitary matrix, let us first define the square root of the diagonal matrix \(\Lambda\) as \(\Lambda^{1/2} = {\operatorname{diag}}(\lambda_1^{1/2}, \lambda_2^{1/2})\), to rewrite (58) as \[\Sigma = U \, \Lambda^{1/2} \, \Lambda^{1/2}\, U^T = (U \, \Lambda^{1/2}) \, (U \, \Lambda^{1/2})^T.\] From this expression, let us then define the square root \(\Sigma^{1/2}\) of \(\Sigma\) as \[\label{eq-def-sqrt-of-Sigma} \Sigma^{1/2} = \Lambda^{1/2}\, U^T,\tag{59}\] such that \[\label{eq-Sigma-from-sqrt} \Sigma = (\Sigma^{1/2})^T (\Sigma^{1/2}).\tag{60}\] Note, however, that this definition is not unique. For a general square root of a matrix, the matrix \[\label{eq-general-sq-root-mat} \Sigma^{1/2} = \rho \, \Lambda^{1/2}\, U^T,\tag{61}\] where \(\rho\) is an arbitrary \(2 \times 2\) unitary matrix, would also\(^{7}\) satisfy (60), since then \[(\Sigma^{1/2})^T (\Sigma^{1/2}) = U \, \Lambda^{1/2} \, \rho^T \, \rho \, \Lambda^{1/2} \, U^T = U \Lambda \, U^T.\]
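The construction of the square root in (59), and its non-uniqueness according to (61), can be sketched as follows. The sketch assumes NumPy and uses `numpy.linalg.eigh` for the eigenvalue decomposition; the function name `sqrt_covariance` is a hypothetical helper.

```python
import numpy as np

def sqrt_covariance(Sigma, rho=None):
    """Square root rho Lambda^{1/2} U^T of Sigma, cf. Eqs. (59) and (61).

    With rho=None, the specially chosen square root (59) is returned.
    """
    eigvals, U = np.linalg.eigh(Sigma)        # Sigma = U diag(eigvals) U^T
    root = np.diag(np.sqrt(eigvals)) @ U.T    # Lambda^{1/2} U^T
    return root if rho is None else rho @ root

# Example covariance matrix and an arbitrary rotation matrix rho.
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
rho = np.array([[np.cos(1.1), -np.sin(1.1)], [np.sin(1.1), np.cos(1.1)]])
root = sqrt_covariance(Sigma)
alt_root = sqrt_covariance(Sigma, rho)
```

Both `root` and `alt_root` satisfy (60), although they are different matrices. Note that the ordering and the signs of the eigenvectors returned by the numerical routine are a further source of non-uniqueness of the same kind.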
Given the above specially chosen definition of the square root \(\Sigma^{1/2}\) of the spatial covariance matrix \(\Sigma\), and given that we have applied the regular spatial gradient operator \(\nabla_x\) to the affine Gaussian scale-space representation \(L(x;\; s, \Sigma)\) for the spatial scale parameter \(s \in {\mathbb{R}}_+\) and the \(2 \times 2\) spatial covariance matrix \(\Sigma\), let us define the scale-normalized affine gradient operator as \[\label{eq-def-sc-norm-aff-grad-op} \nabla_{x,\mathrm{affnorm}} = s^{1/2} \, \Sigma^{1/2} \, \nabla_x.\tag{62}\] The motivation for this definition of the scale-normalized affine gradient operator is, as for the previous definition of the affine scale-normalized directional derivative operator \(\partial_{\varphi,\mathrm{norm}}\) according to (19), first of all to compensate for the general decrease in the magnitude of spatial derivatives with increasing amount of spatial smoothing, and specifically also to take into account the different amounts of spatial smoothing in the different spatial directions in the image plane, as resulting from using anisotropic affine Gaussian kernels \(g(x;\; s, \Sigma)\), as opposed to rotationally symmetric Gaussian kernels \(g(x;\; s, I)\), for the spatial smoothing operation in the spatial receptive fields.
In contrast to the previous definition of the affine scale-normalized directional derivative operator \(\partial_{\varphi,\mathrm{norm}}\) according to (19), which only considers a single orientation in the image plane, and then takes into account the amount of spatial smoothing in that direction, for which the directional derivative is computed, the definition of the affine scale-normalized gradient operator \(\nabla_{x,\mathrm{affnorm}}\) according to (62) does instead consider the genuine 2-D image gradient as the conceptual object, and does then also take into account all the information about the spatial covariance matrix \(\Sigma\), when defining the corresponding scale-normalized object. As we will see in the next section, this new definition will therefore allow for full affine covariance, as opposed to covariance properties over smaller specific subgroups of the affine group, as obtained for the previously defined affine scale-normalized directional derivative operator \(\partial_{\varphi,\mathrm{norm}}\) according to (19).
Before turning to the details of the derivation of the general full affine covariance property, let us, however, note that in the special case when the spatial covariance matrix \(\Sigma\) is a diagonal matrix \(\Sigma = {\operatorname{diag}}(\lambda_1, \lambda_2)\), we obtain \(\Sigma^{1/2} = {\operatorname{diag}}(\lambda_1^{1/2}, \lambda_2^{1/2})\), which implies that the definition of the scale-normalized affine gradient operator (62) then reduces to the relation \[\left( \begin{array}{c} \partial_{x_1,\mathrm{affnorm}} \\ \partial_{x_2,\mathrm{affnorm}} \end{array} \right) = \left( \begin{array}{c} s^{1/2} \, \lambda_1^{1/2} \, \partial_{x_1} \\ s^{1/2} \, \lambda_2^{1/2} \, \partial_{x_2} \end{array} \right),\] which is the same result as we obtain for the previously defined affine scale-normalized directional derivative operator \(\partial_{\varphi,\mathrm{norm}}\) according to (19), if we choose the directions \(\varphi\) for computing the affine scale-normalized directional derivatives as the two coordinate directions \(e_1\) for \(\varphi_1 = 0\) and \(e_2\) for \(\varphi_2 = \pi/2\), for the spatial differentiation order \(m = 1\), such that \[\begin{aligned} \partial_{\varphi_1,\mathrm{norm}} & = s^{1/2} \, \lambda_1^{1/2} \, \partial_{\varphi_1} = s^{1/2} \, \lambda_1^{1/2} \, \partial_{x_1}, \\ \partial_{\varphi_2,\mathrm{norm}} & = s^{1/2} \, \lambda_2^{1/2} \, \partial_{\varphi_2} = s^{1/2} \, \lambda_2^{1/2} \, \partial_{x_2}. \end{aligned}\] Thus, in this very special case of the spatial covariance matrix \(\Sigma\) being a diagonal matrix \({\operatorname{diag}}(\lambda_1, \lambda_2)\), and with the directions for the directional derivative operators chosen along the coordinate directions, the definition of the scale-normalized affine gradient operator \(\nabla_{x,\mathrm{affnorm}}\) according to (62) is consistent with the previous definition of the affine scale-normalized directional derivative operator \(\partial_{\varphi,\mathrm{norm}}\) according to (19).
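This consistency for a diagonal spatial covariance matrix can be verified numerically with the following short sketch, which assumes NumPy and uses arbitrarily chosen example values for \(\lambda_1\), \(\lambda_2\) and \(s\).

```python
import numpy as np

# Hypothetical example values for a diagonal spatial covariance matrix.
lam1, lam2, s = 4.0, 0.25, 2.0
Sigma = np.diag([lam1, lam2])

# Normalization matrix s^{1/2} Sigma^{1/2} of the affine gradient operator (62).
gradient_factors = np.sqrt(s) * np.diag(np.sqrt([lam1, lam2]))

# Normalization factors s^{1/2} (e^T Sigma e)^{1/2} of the directional
# derivative operator for m = 1 along the coordinate directions e_1 and e_2.
directional_factors = [np.sqrt(s * e @ Sigma @ e) for e in np.eye(2)]
```

The diagonal elements of `gradient_factors` coincide with the entries of `directional_factors`, in agreement with the relations above.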
Furthermore, in the case when the spatial covariance matrix \(\Sigma\) is a unit matrix \(I\), the scale-normalized affine gradient operator \(\nabla_{x,\mathrm{affnorm}}\) according to (62) then reduces to the isotropic scale-normalized gradient operator \(\nabla_{x,\mathrm{norm}}\) according to (15).
Consider a spatial affine transformation of the form \[\label{eq-spat-aff-transf-def-aff-sc-norm-grad-op} f'(x') = f(x) \quad\quad\mbox{for}\quad\quad x' = S_x \, A \, x,\tag{63}\] where \(S_x \in {\mathbb{R}}_+\) is a spatial scaling factor and \(A\) is a non-singular \(2 \times 2\) affine transformation matrix, from which we define the respective affine Gaussian scale-space representations \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) according to \[\begin{align} L(\cdot;\; s, \Sigma) & = g(\cdot;\; s, \Sigma)* f(\cdot), \tag{64} \\ L'(\cdot;\; s', \Sigma') & = g(\cdot;\; s', \Sigma')* f'(\cdot), \tag{65} \end{align}\] for matching values of the spatial scale parameters \(s \in {\mathbb{R}}_+\) and \(s' \in {\mathbb{R}}_+\) as well as the \(2 \times 2\) spatial covariance matrices \(\Sigma \in {\mathbb{S}}_+^2\) and \(\Sigma' \in {\mathbb{S}}_+^2\) over the two domains, such that (23) \[\label{eq-transf-prop-sc-par-spat-cov-mat-pure-aff-scsp-full} s' \, \Sigma' = s \, (S_x \, A) \, \Sigma \, (S_x A)^T = s \, S_x^2 \, A \, \Sigma \, A^T,\tag{66}\] which then implies that the affine scale-space representations for these parameter values of the affine Gaussian smoothing kernels are equal (24) \[\label{eq-equal-aff-scsp-repr-aff-cov-proof} L'(x';\; s', \Sigma') = L(x;\; s, \Sigma).\tag{67}\] Given an eigenvalue decomposition of the spatial covariance matrix \(\Sigma'\) in the transformed domain \[\Sigma' = U' \Lambda' \, {U'}^T,\] where \(U'\) is a \(2 \times 2\) unitary matrix and \(\Lambda'\) is a \(2 \times 2\) diagonal matrix with positive elements, let us in a similar way as in (61) define the square root of \(\Sigma'\) as \[\label{eq-def-sqrt-of-Sigma-prim} {\Sigma'}^{1/2} = {\Lambda'}^{1/2}\, {U'}^T,\tag{68}\] while noting that any other definition of the square root of \(\Sigma'\) according to \[\label{eq-def-sqrt-of-Sigma-prim-perm} {\Sigma'}^{1/2} = \rho' \, {\Lambda'}^{1/2}\, {U'}^T,\tag{69}\] where \(\rho'\) is an arbitrary \(2 \times 2\) rotation matrix, would also satisfy \[\Sigma' = ( {\Sigma'}^{1/2})^T ({\Sigma'}^{1/2}).\] Inserting this expression, as well as \[\Sigma = ( \Sigma^{1/2})^T (\Sigma^{1/2})\] according to (60), into the coupled relationship (66) between the spatial scale parameters \(s\) and \(s'\) as well as the spatial covariance matrices \(\Sigma\) and \(\Sigma'\), then, with the added degree of freedom corresponding to different possible rotation matrices in (61) and (69), gives \[\begin{gather} s' \, ({\Sigma'}^{1/2})^T \, {\rho'}^T \, {\rho'} \, ({\Sigma'}^{1/2}) = \\ = s \, S_x^2 \, A \, (\Sigma^{1/2})^T \, \rho^T \, \rho \, (\Sigma^{1/2}) \, A^T. \end{gather}\] This relationship does then imply that the square roots \(\Sigma^{1/2}\) and \({\Sigma'}^{1/2}\) of the spatial covariance matrices \(\Sigma\) and \(\Sigma'\) must be related according to \[{s'}^{1/2} \, \rho' \, {\Sigma'}^{1/2} = s^{1/2} \, S_x \, \rho \, \Sigma^{1/2} \, A^T,\] which in turn implies that the following relationship must hold for some, possibly other, \(2 \times 2\) rotation matrix \(\tilde{\rho} = {\rho'}^T \rho\): \[\label{eq-rel-sqrt-cov-mat-aff-cov-proof} {s'}^{1/2} \, {\Sigma'}^{1/2} = \tilde{\rho} \, s^{1/2} \, S_x \, \Sigma^{1/2} \, A^T.\tag{70}\] Under the image transformation (63), the spatial gradient operators are related according to \[\label{eq-nabla-x-transf-aff-cov-proof} \nabla_x = (S_x \, A)^T \, \nabla_{x'},\tag{71}\] implying that \[\label{eq-nabla-x-transf-aff-cov-inv-proof} \nabla_{x'} = \frac{1}{S_x} \times A^{-T} \nabla_x.\tag{72}\] Inserting this expression (72), as well as the relationship (70) between the square roots of the spatial covariance matrices between the two domains, into the corresponding definition of the scale-normalized affine gradient operator over the transformed domain \[\nabla_{x',\mathrm{affnorm}} = {s'}^{1/2} \, {\Sigma'}^{1/2} \, \nabla_{x'},\] then implies that the expression for the scale-normalized affine gradient operator over the transformed domain reduces to \[\nabla_{x',\mathrm{affnorm}} = \tilde{\rho} \, s^{1/2} \, \Sigma^{1/2} \, \nabla_x.\] After comparison with the scale-normalized affine gradient operator over the original domain (62), it therefore holds that the scale-normalized affine gradient operators over the two different domains must be related according to \[\nabla_{x',\mathrm{affnorm}} = \tilde{\rho} \, \nabla_{x,\mathrm{affnorm}},\] for some rotation matrix \(\tilde{\rho}\). Thus, by, in turn, applying these operators to the affine scale-space representations \(L(x;\; s, \Sigma)\) and \(L'(x';\; s', \Sigma')\) over their respective domains, for matching parameter values of the affine Gaussian smoothing kernels according to (66), this implies that the scale-normalized affine gradient vectors \(\nabla_{x,\mathrm{affnorm}} L\) and \(\nabla_{x',\mathrm{affnorm}} L'\) over the two domains must be related according to \[\label{eq-equal-aff-scsp-repr-aff-cov-proof-again} (\nabla_{x',\mathrm{affnorm}} L')(x';\; s', \Sigma') = \tilde{\rho} \, (\nabla_{x,\mathrm{affnorm}} L)(x;\; s, \Sigma)\tag{73}\] for some rotation matrix \(\tilde{\rho}\). Specifically, if the affine transformation \(A\) is in the similarity group, i.e. if \(A = S_x \, R\) for some positive scaling factor \(S_x \in {\mathbb{R}}_+\) and some rotation matrix \(R\), then the rotation matrix \(\tilde{\rho}\) in (73) can be shown to be restricted to a unit matrix \(\tilde{\rho} = I\).
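The operator relation derived above can be illustrated by the following minimal numerical sketch, which assumes NumPy. The function names `sqrt_cov` and `relating_matrix` are hypothetical. Since the square roots are here computed with `numpy.linalg.eigh`, whose eigenvector signs and ordering are a convention of that routine, the sketch only checks that the relating matrix is orthogonal, which covers a rotation, possibly combined with a reflection.

```python
import numpy as np

def sqrt_cov(Sigma):
    """Square root Lambda^{1/2} U^T of a covariance matrix, cf. Eq. (59)."""
    eigvals, U = np.linalg.eigh(Sigma)
    return np.diag(np.sqrt(eigvals)) @ U.T

def relating_matrix(A, S_x, s, Sigma, s_prime):
    """Matrix relating the two normalized gradient operators.

    Returns the matrix M for which the normalized gradient operator in the
    transformed domain equals M times the one in the original domain.
    """
    T = S_x * A                                   # x' = T x, Eq. (63)
    Sigma_prime = s * T @ Sigma @ T.T / s_prime   # from Eq. (66)
    op = np.sqrt(s) * sqrt_cov(Sigma)             # s^{1/2} Sigma^{1/2}
    # grad_{x'} = T^{-T} grad_x according to Eq. (72)
    op_prime = np.sqrt(s_prime) * sqrt_cov(Sigma_prime) @ np.linalg.inv(T).T
    return op_prime @ np.linalg.inv(op)

# Example: a general non-singular affine matrix and example parameters.
A = np.array([[1.0, 0.8], [0.2, 1.5]])
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
rho_tilde = relating_matrix(A, S_x=1.3, s=2.0, Sigma=Sigma, s_prime=0.7)
```

For these example values, `rho_tilde.T @ rho_tilde` equals the identity matrix up to rounding errors, irrespective of how the product \(s' \, \Sigma'\) is split between \(s'\) and \(\Sigma'\).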
To realize why the rotation matrix \(\tilde{\rho}\) reduces to a unit matrix for the case of similarity transformations, let us insert \(A = S_x \, R\) into the relationship (66), which then gives \[\label{eq-transf-prop-sc-par-spat-cov-mat-pure-aff-scsp-full-sim} s' \, \Sigma' = s \, (S_x \, R) \, \Sigma \, (S_x R)^T = s \, S_x^2 \, R \, \Sigma \, R^T\tag{74}\] and into (72), which gives \[\label{loczrenf} \nabla_{x'} L' = \frac{1}{S_x} \times R^{-T} \nabla_x L.\tag{75}\] From the definition of the scale-normalized affine gradient vector over the transformed domain \[\nabla_{x',\mathrm{affnorm}} L' = {s'}^{1/2} \, {\Sigma'}^{1/2} \, \nabla_{x'} L',\] and inserting \(\Sigma = U \Lambda \, U^T\) into (74) \[\Sigma' = \frac{s}{s'} \times S_x^2 \, R \, U \Lambda \, U^T R^T,\] which gives \[{\Sigma'}^{1/2} = \left( \frac{s}{s'} \right)^{1/2} S_x \, \Lambda^{1/2} \, U^T R^T,\] we then obtain \[\begin{gather} \nabla_{x',\mathrm{affnorm}} L' = {s'}^{1/2} \, \left( \frac{s}{s'} \right)^{1/2} \, \Lambda^{1/2} \, U^T R^T R^{-T} \nabla_x L = \\ = s^{1/2} \, \Sigma^{1/2} \, \nabla_x L = \nabla_{x,\mathrm{affnorm}} L, \end{gather}\] thus showing that \(\tilde{\rho}\) in this case reduces to a unit matrix \(I\).
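The argument for the similarity case can be traced step by step in the following minimal sketch, which assumes NumPy. The function name `similarity_case` is hypothetical, and the square root of \(\Sigma'\) is constructed explicitly from the expression derived in the text, rather than from an independent numerical eigenvalue decomposition of \(\Sigma'\).

```python
import numpy as np

def similarity_case(angle, S_x, s, s_prime, Sigma):
    """Normalized gradient operator matrices in both domains for x' = S_x R x."""
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle), np.cos(angle)]])
    eigvals, U = np.linalg.eigh(Sigma)
    root = np.diag(np.sqrt(eigvals)) @ U.T                   # Sigma^{1/2}, Eq. (59)
    # Square root of Sigma' as given in the text for the similarity case.
    root_prime = np.sqrt(s / s_prime) * S_x * root @ R.T
    Sigma_prime = (s / s_prime) * S_x**2 * R @ Sigma @ R.T   # from Eq. (74)
    # Sanity check that root_prime is indeed a square root of Sigma'.
    assert np.allclose(root_prime.T @ root_prime, Sigma_prime)
    op = np.sqrt(s) * root                                   # s^{1/2} Sigma^{1/2}
    # grad_{x'} = (1 / S_x) R^{-T} grad_x, Eq. (75), folded into the operator.
    op_prime = np.sqrt(s_prime) * root_prime @ np.linalg.inv(R).T / S_x
    return op, op_prime

Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
op, op_prime = similarity_case(angle=0.4, S_x=1.7, s=2.0, s_prime=0.6, Sigma=Sigma)
```

With this choice of the square root, the two operator matrices `op` and `op_prime` coincide, corresponding to \(\tilde{\rho} = I\).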
Summary of main result. To summarize, this result shows that, if we under an arbitrary non-singular spatial affine transformation of the form \[\label{eq-spat-aff-transf-def-aff-sc-norm-ders-main-result} f'(x') = f(x) \quad\quad\mbox{for}\quad\quad x' = S_x A \, x,\tag{76}\] define the affine Gaussian scale-space representations \(L(x;\; s, \Sigma)\) and \(L'(x';\; s', \Sigma')\) of the images \(f\) and \(f'\), respectively, according to (64) and (65), with the spatial scale parameters \(s \in {\mathbb{R}}_+\) and \(s' \in {\mathbb{R}}_+\) as well as the \(2 \times 2\) spatial covariance matrices \(\Sigma\) and \(\Sigma'\), respectively, and then define the scale-normalized affine gradient operator over the original domain as \[\label{eq-def-sc-norm-aff-grad-op-aff-cov-main-result} \nabla_{x,\mathrm{affnorm}} = s^{1/2} \, \Sigma^{1/2} \, \nabla_x,\tag{77}\] as well as define the corresponding scale-normalized affine gradient operator over the transformed domain as \[\label{eq-def-sc-norm-aff-grad-op-aff-cov-prim-main-result} \nabla_{x',\mathrm{affnorm}} = {s'}^{1/2} \, {\Sigma'}^{1/2} \, \nabla_{x'},\tag{78}\] then the corresponding scale-normalized affine gradient vectors over the two domains will, up to some rotation matrix \(\tilde{\rho}\), be equal, such that \[\label{eq-equal-aff-scsp-repr-aff-cov-proof-again-again} (\nabla_{x',\mathrm{affnorm}} L')(x';\; s', \Sigma') = \tilde{\rho} \, (\nabla_{x,\mathrm{affnorm}} L)(x;\; s, \Sigma)\tag{79}\] holds, provided that the parameters of the underlying affine Gaussian smoothing kernels are related according to \[\label{eq-transf-prop-sc-par-spat-cov-mat-pure-aff-scsp-main-result} s' \, \Sigma' = s \, (S_x \, A) \, \Sigma \, (S_x \, A)^T = s \, S_x^2 \, A \, \Sigma \, A^T.\tag{80}\] In the case of similarity transformations \(A = S_x \, R\), the rotation matrix in (79) does specifically degenerate to an identity matrix \(\tilde{\rho} = I\).
Interpreted geometrically, if we regard the composed affine transformation \(S_x \, A\) as constituting a local linearization of the perspective mapping from the tangent plane of a local surface patch, or as a local linearization of the projective mapping between different views of the same local surface patch, then this result means that the scale-normalized affine gradient vectors \(\nabla_{x,\mathrm{affnorm}} L\) and \(\nabla_{x',\mathrm{affnorm}} L'\) can, up to an arbitrary rotation of the elements, and to first order of approximation, be perfectly matched between different views of the same local surface patch; see Figure 9 for a geometric illustration and Figure 10 for a commutative diagram.
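As an end-to-end illustration on image data with a closed-form scale-space representation, the following minimal sketch considers a Gaussian blob \(f(x) = \exp(-x^T P^{-1} x/2)\), for which the affine Gaussian smoothing can be computed analytically. The sketch assumes NumPy and assumes that the affine Gaussian kernel \(g(x;\; s, \Sigma)\) is a normalized Gaussian with covariance matrix \(s \, \Sigma\); the blob model, the function names and the numerical values are hypothetical choices for this illustration. Since the rotation \(\tilde{\rho}\) does not affect magnitudes, the sketch compares the magnitudes of the scale-normalized affine gradient vectors, which equal \((s \, \nabla L^T \Sigma \, \nabla L)^{1/2}\).

```python
import numpy as np

def smoothed_blob_gradient(x, P, cov):
    """Value and gradient at x of the blob exp(-x^T P^{-1} x / 2) after
    convolution with a normalized Gaussian kernel of covariance `cov`."""
    total = P + cov                       # covariances add under convolution
    inv_total = np.linalg.inv(total)
    amplitude = np.sqrt(np.linalg.det(P) / np.linalg.det(total))
    value = amplitude * np.exp(-0.5 * x @ inv_total @ x)
    return value, -value * (inv_total @ x)

def normalized_gradient_magnitude(x, P, cov):
    """Magnitude of s^{1/2} Sigma^{1/2} grad L, expressed via cov = s Sigma."""
    _, grad = smoothed_blob_gradient(x, P, cov)
    return np.sqrt(grad @ cov @ grad)

# Original domain: blob covariance P, smoothing covariance s Sigma, point x.
P = np.array([[1.5, 0.2], [0.2, 0.8]])
cov = 0.9 * np.array([[2.0, 0.3], [0.3, 0.5]])
x = np.array([0.4, -0.7])

# Transformed domain under x' = T x with T = S_x A, with the smoothing
# covariance matched according to s' Sigma' = s T Sigma T^T.
T = 1.3 * np.array([[1.0, 0.8], [0.2, 1.5]])
magnitude = normalized_gradient_magnitude(x, P, cov)
magnitude_prime = normalized_gradient_magnitude(T @ x, T @ P @ T.T, T @ cov @ T.T)
```

For these example values, `magnitude` and `magnitude_prime` agree up to rounding errors, in agreement with the covariance property up to a rotation.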

Figure 9: The covariance property (79) of the scale-normalized affine gradient operator (77) under general (non-singular) affine transformations means that, if we consider two cameras that view the same local surface patch from general (non-degenerate) viewing conditions, then, to first order of approximation, the resulting affine gradient responses for the different views, here illustrated as arrows before the affine scale normalization, can, up to a rotation transformation \(\tilde{\rho}\), be perfectly matched, provided that the scale parameters and the covariance matrices of the receptive fields are properly matched according to (80).

Figure 10: Commutative diagram for scale-normalized affine gradient operators under spatial affine transformations of the form \(x' = S_x \, A \, x\). This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input image \(f(x)\) is first subject to the affine transformation \(x' = S_x \, A \, x\) and then filtered with a scale-normalized affine gradient kernel \((\nabla_{x',\mathrm{affnorm}} \, g)(x';\; s', \Sigma')\), or instead directly convolved with the scale-normalized affine gradient kernel \((\nabla_{x,\mathrm{affnorm}} \, g)(x;\; s, \Sigma)\) and then subject to the same affine transformation, we get, up to a possibly unknown rotation transformation \(\tilde{\rho}\), the same result, provided that the parameters of the spatial smoothing kernels \(g(x;\; s, \Sigma)\) and \(g(x';\; s', \Sigma')\) are related according to \(s' \, \Sigma' = s \, S_x^2 \, A \, \Sigma \, A^{T}\).
By counting the number of dimensions involved in the variabilities spanned by this covariance result, we have that the spatial scale parameter \(s\) and the spatial covariance matrix \(\Sigma\) together span an effective dimensionality of 3, because of their joint occurrence in the product \(s \, \Sigma\). The space of the full 2-D affine transformations (with the spatial translation offset not considered here) spans a variability over 4 dimensions. Thus, the stated covariance result holds over independent variabilities over all the 7 dimensions of the resulting manifold.
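The covariance relation (79) under the parameter matching (80) can be checked numerically at a single image point. The following is a minimal sketch, not the paper's implementation: it assumes that \(\Sigma^{1/2}\) denotes the symmetric positive-definite square root, uses made-up example values for \(s\), \(\Sigma\), \(S_x\), \(A\) and the gradient, and relies on the chain rule \(\nabla_x = (S_x A)^T \nabla_{x'}\). The helper name `sym_sqrt` is hypothetical.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric positive-definite square root via eigendecomposition
    (an assumed convention for Sigma^{1/2})."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

# Hypothetical example values, chosen only for illustration.
s = 2.0
Sigma = np.array([[1.5, 0.4], [0.4, 0.8]])   # positive definite
S_x = 1.7
A = np.array([[1.2, 0.5], [-0.3, 0.9]])      # non-singular, det > 0
B = S_x * A                                  # composed transformation x' = B x

# Matched kernel parameters, Eq. (80): s' Sigma' = s S_x^2 A Sigma A^T.
M = s * Sigma
M_prime = B @ M @ B.T

# Regular gradients at corresponding points: nabla_{x'} L' = B^{-T} nabla_x L.
grad = np.array([0.7, -1.1])
grad_prime = np.linalg.solve(B.T, grad)

# Scale-normalized affine gradients, Eqs. (77) and (78).
g_norm = sym_sqrt(M) @ grad
g_norm_prime = sym_sqrt(M_prime) @ grad_prime

# Candidate for the matrix rho-tilde in Eq. (79).
rho = sym_sqrt(M_prime) @ np.linalg.inv(B.T) @ np.linalg.inv(sym_sqrt(M))

print("rho rho^T =\n", rho @ rho.T)
print("|g_norm|, |g_norm'| =", np.linalg.norm(g_norm), np.linalg.norm(g_norm_prime))
```

Under these assumptions, `rho` comes out orthogonal, so the two normalized gradient vectors have the same magnitude and differ only in orientation.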
Given the affine Gaussian scale-space representation \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \times \, {\mathbb{S}}_+^2 \rightarrow {\mathbb{R}}\) according to (64) of any 2-D image \(f \colon {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\), we can define the regular Hessian matrix as \[{\cal H}_x L = \nabla_x \, \nabla_x^T L = \left( \begin{array}{cc} L_{x_1x_1} & L_{x_1x_2} \\ L_{x_1x_2} & L_{x_2x_2} \end{array} \right).\] For an affine Gaussian scale-space representation \(L(x;\, s, \Sigma)\) computed for spatial scale parameter \(s \in {\mathbb{R}}_+\) and spatial covariance matrix \(\Sigma\), with the previously defined affine gradient operator \(\nabla_{x,\mathrm{affnorm}}\) according to (62), it is therefore natural to define a corresponding scale-normalized affine Hessian operator \({\cal H}_{x,\mathrm{affnorm}}\) according to \[{\cal H}_{x,\mathrm{affnorm}} = \nabla_{x,\mathrm{affnorm}} \, \nabla^T_{x,\mathrm{affnorm}},\] which, when expanded from the definition, assumes the form \[\label{eq-def-sc-norm-aff-hess-mat} \begin{aligned} {\cal H}_{x,\mathrm{affnorm}} & = s \, (\Sigma^{1/2}) \, \nabla_x \, \nabla_x^T \, (\Sigma^{1/2})^T \\ & = s \, (\Sigma^{1/2}) \, {\cal H}_x \, (\Sigma^{1/2})^T, \end{aligned}\tag{81}\] with the interpretation that, since the matrix \((\Sigma^{1/2})^T\) is to be regarded as a constant with regard to the spatial differentiation operators \(\nabla_x\) and \({\cal H}_x\), these operators can be applied through this matrix.
Consider again a spatial affine transformation of the form \[\label{eq-spat-aff-transf-def-aff-sc-norm-ders-hess} f'(x') = f(x) \quad\quad\mbox{for}\quad\quad x' = S_x \, A \, x,\tag{82}\] where \(S_x \in {\mathbb{R}}_+\) represents an overall spatial scaling factor and \(A\) is a \(2 \times 2\) affine transformation matrix, with the affine Gaussian scale-space representations \(L(x;\; s, \Sigma)\) and \(L'(x';\; s', \Sigma')\) according to (64) and (65), for matching values of the spatial scale parameters \(s \in {\mathbb{R}}_+\) and \(s' \in {\mathbb{R}}_+\) as well as the \(2 \times 2\) spatial covariance matrices \(\Sigma\) and \(\Sigma'\) over the two domains according to (80) \[\label{eq-transf-prop-sc-par-spat-cov-mat-pure-aff-scsp-hess} s' \Sigma' = s \, S_x^2 \, A \, \Sigma \, A^T,\tag{83}\] such that the affine scale-space representations \(L(x;\; s, \Sigma)\) and \(L'(x';\; s', \Sigma')\), for these parameter values of the affine Gaussian smoothing kernels, are equal (24) \[\label{eq-equal-aff-scsp-repr-aff-cov-proof-hess} L'(x';\; s', \Sigma') = L(x;\; s, \Sigma).\tag{84}\] Let us then define the scale-normalized affine Hessian operators over the two domains as \[\label{eq-def-sc-norm-aff-hess-op-aff-cov-main-result} \begin{aligned} {\cal H}_{x,\mathrm{affnorm}} & = \nabla_{x,\mathrm{affnorm}} \, \nabla^T_{x,\mathrm{affnorm}}, \\ {\cal H}_{x',\mathrm{affnorm}} & = \nabla_{x',\mathrm{affnorm}} \, \nabla^T_{x',\mathrm{affnorm}}, \end{aligned}\tag{85}\] with the underlying scale-normalized affine gradient operators of the forms (77) and (78), \[\begin{aligned} \nabla_{x,\mathrm{affnorm}} & = s^{1/2} \, \Sigma^{1/2} \, \nabla_x, \\ \nabla_{x',\mathrm{affnorm}} & = {s'}^{1/2} \, {\Sigma'}^{1/2} \, \nabla_{x'}, \end{aligned}\] which, when combining these expressions over the transformed domain, gives \[\begin{aligned} {\cal H}_{x',\mathrm{affnorm}} & = s' \, ({\Sigma'}^{1/2}) \, \nabla_{x'} \, \nabla_{x'}^T \, ({\Sigma'}^{1/2})^T \\ & = s' \, ({\Sigma'}^{1/2}) \, {\cal H}_{x'} \, ({\Sigma'}^{1/2})^T. \end{aligned}\] By additionally taking the indeterminacy with respect to a possible rotation matrix \(\tilde{\rho}\) into account, we obtain \[\label{eq-equal-aff-scsp-repr-aff-cov-proof-hess-again} {\cal H}_{x',\mathrm{affnorm}} = \tilde{\rho} \, {\cal H}_{x,\mathrm{affnorm}} \, \tilde{\rho}^T.\tag{86}\] This result thereby implies that, when these scale-normalized affine Hessian operators are applied to the affine Gaussian scale-space representations \(L(x;\; s, \Sigma)\) and \(L'(x';\; s', \Sigma')\) over their respective domains, we obtain that \[({\cal H}_{x',\mathrm{affnorm}} L')(x';\; s', \Sigma') = \tilde{\rho} \, ({\cal H}_{x,\mathrm{affnorm}} L)(x;\; s, \Sigma) \, \tilde{\rho}^T\] will hold for some rotation matrix \(\tilde{\rho}\), provided that the parameters of the underlying affine Gaussian smoothing kernels are related according to \[\label{eq-match-sc-pars-cov-mats-cov-prop-sc-norm-aff-hess} s' \, \Sigma' = s \, (S_x \, A) \, \Sigma \, (S_x A)^T = s \, S_x^2 \, A \, \Sigma \, A^T.\tag{87}\] Thus, this definition of the scale-normalized affine Hessian matrix is also covariant under the full group of non-singular spatial affine transformations. Again, the rotation matrix \(\tilde{\rho}\) reduces to the identity matrix in the case of similarity transformations.

Figure 11: The covariance property (86) of the scale-normalized affine Hessian operator (85) under general (non-singular) affine transformations means that, if we consider two cameras that view the same local surface patch from general (non-degenerate) viewing conditions, then, to first order of approximation, the resulting affine Hessian responses for the different views, here illustrated as ellipses before the affine scale normalization, can, up to a combination of two (in this 2-D case equal) rotation transformations \(\tilde{\rho}\) and \(\tilde{\rho}^T\), be perfectly matched, provided that the scale parameters and the covariance matrices of the receptive fields are properly matched according to (87).
Interpreted geometrically, this result means that if we interpret the spatial affine transformation as a local linearization of either the perspective mapping from the tangent plane of a local surface patch to the image domain, or the projective transformation between two different views of the same local surface patch, then the scale-normalized affine Hessian matrices computed for matching image points and matching receptive field parameters of the two domains can, to first order of approximation, be perfectly matched, provided that the scale parameters and the covariance matrices in the two domains are matched according to (83); see Figure 11 for an illustration.
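The Hessian relation (86) can be checked in the same pointwise manner. This is a sketch under stated assumptions rather than the paper's procedure: \(\Sigma^{1/2}\) is taken to be the symmetric positive-definite square root, all numerical values are made up, and the regular Hessian is assumed to transform as \({\cal H}_{x'} = (S_x A)^{-T} \, {\cal H}_x \, (S_x A)^{-1}\), which follows from applying \(\nabla_x = (S_x A)^T \nabla_{x'}\) twice. The helper name `sym_sqrt` is hypothetical.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric positive-definite square root via eigendecomposition
    (an assumed convention for Sigma^{1/2})."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

# Hypothetical example values, chosen only for illustration.
s = 2.0
Sigma = np.array([[1.5, 0.4], [0.4, 0.8]])
S_x = 1.7
A = np.array([[1.2, 0.5], [-0.3, 0.9]])
B = S_x * A
B_inv = np.linalg.inv(B)

# Matched kernel parameters, Eq. (83): s' Sigma' = s S_x^2 A Sigma A^T.
M = s * Sigma
M_prime = B @ M @ B.T

# Regular Hessians at corresponding points.
H = np.array([[0.9, -0.3], [-0.3, -0.5]])
H_prime = B_inv.T @ H @ B_inv

# Scale-normalized affine Hessians, in the expanded form of Eq. (81).
H_norm = sym_sqrt(M) @ H @ sym_sqrt(M).T
H_norm_prime = sym_sqrt(M_prime) @ H_prime @ sym_sqrt(M_prime).T

# Candidate for rho-tilde in Eq. (86).
rho = sym_sqrt(M_prime) @ B_inv.T @ np.linalg.inv(sym_sqrt(M))

print("eigenvalues (original):   ", np.linalg.eigvalsh(H_norm))
print("eigenvalues (transformed):", np.linalg.eigvalsh(H_norm_prime))
```

Because the two normalized Hessians are related by an orthogonal similarity transformation, their eigenvalues, and therefore their trace and determinant, agree between the two domains in this example.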
Given any 1-D temporal signal \(f \colon {\mathbb{R}}\rightarrow {\mathbb{R}}\), consider its temporal scale-space representation \(L \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\), at temporal scale level \(\tau \in {\mathbb{R}}_+\): \[L(\cdot;\; \tau) = h(\cdot;\; \tau) * f(\cdot),\] obtained by convolution with either a non-causal temporal Gaussian kernel (4), the time-causal limit kernel (5), or, more generally, some other scale-covariant temporal kernel \(h \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\), that for any temporal scaling factor \(S_t \in {\mathbb{R}}_+\) obeys the temporal scaling property (3) \[h(t';\; \tau') = \frac{1}{S_t} \, h(t;\; \tau)\] under any temporal scaling transformation of the form \[t' = S_t \, t \quad\quad\mbox{and}\quad\quad \tau' = S_t^2 \, \tau.\] Then, corresponding scale-normalized analogues of the regular temporal derivative operators (9) can be defined according to Lindeberg ([56]) \[\label{eq-temp-der-def-sc-norm} \partial_{t,\mathrm{norm}}^n = \tau^{n/2} \, \partial_t^n.\tag{88}\] In analogy with the above scale-normalized spatial derivative operators, the multiplication of the regular temporal derivative operators by the temporal scale parameter raised to a power proportional to the order of temporal differentiation will compensate for the otherwise general decrease in the magnitude of temporally smoothed temporal derivatives with increasing temporal scales, to enable truly scale-covariant temporal derivative operators, whose magnitudes can be perfectly matched under temporal scaling transformations, as will be described in the next section.
Given any 1-D temporal signal \(f \colon {\mathbb{R}}\rightarrow {\mathbb{R}}\), define a rescaled temporal signal \(f' \colon {\mathbb{R}}\rightarrow {\mathbb{R}}\) by a temporal scaling transformation of the form \[f'(t') = f(t) \quad\quad\mbox{for}\quad\quad t' = S_t \, t,\] for a temporal scaling factor \(S_t \in {\mathbb{R}}_+\), and define purely temporal scale-space representations \(L \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) of \(f\) and \(f'\), respectively, according to \[\begin{aligned} L(\cdot;\; \tau) & = h(\cdot;\; \tau) * f(\cdot), \\ L'(\cdot;\; \tau') & = h(\cdot;\; \tau') * f'(\cdot), \end{aligned}\] that obey temporal scale covariance for the underlying temporal smoothing transformation, such that \[L'(t';\; \tau') = L(t;\; \tau)\] holds for matching values of the temporal scale parameters according to \[\tau' = S_t^2 \, \tau.\] This property holds, for example, both for the non-causal temporal scale-space representation, defined from convolutions with 1-D temporal Gaussian kernels of the form (4), and for the time-causal temporal scale-space representation, defined from convolutions with the time-causal limit kernel of the form (5) (see Equations (10) and (104) in Lindeberg ([56]) for the temporal differentiation order \(n = 0\) in that paper).
Let us next define corresponding scale-normalized temporal derivatives over the transformed temporal domain according to \[\partial_{t',\mathrm{norm}}^n = {\tau'}^{n/2} \, \partial_{t'}^n.\] Then, since the scale-normalized temporal derivatives over the two different domains will be related according to (see Equations (10) and (104) in Lindeberg ([56]) for the scale normalization power \(\gamma\) set to \(\gamma = 1\) in that paper): \[\partial_{t',\mathrm{norm}} = \partial_{t,\mathrm{norm}},\] it follows that the scale-normalized temporal derivatives will be equal in the two domains, such that \[\label{eq-temp-sc-cov-property-pure-temp-ders} L'_{t',\mathrm{norm}}(t';\; \tau') = L_{t,\mathrm{norm}}(t;\; \tau),\tag{89}\] which constitute covariance properties for scale-normalized temporal derivatives of a purely temporal scale-space representation.
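As a simple illustration of (89), consider a sine wave smoothed by a non-causal temporal Gaussian kernel, for which the scale-space representation is available in closed form. The sketch below assumes that \(\tau\) is the variance of the Gaussian kernel, so that smoothing \(\sin(\omega t)\) attenuates it by the factor \(e^{-\omega^2 \tau/2}\); the function name `sine_deriv` and all numerical values are hypothetical.

```python
import numpy as np

def sine_deriv(t, omega, tau, n, normalized=True):
    """n-th order temporal derivative of the Gaussian-smoothed signal
    sin(omega * t) at temporal scale tau (assumed to be the kernel variance).
    With normalized=True the result is multiplied by tau**(n/2), Eq. (88)."""
    value = omega**n * np.exp(-omega**2 * tau / 2) * np.sin(omega * t + n * np.pi / 2)
    return tau**(n / 2) * value if normalized else value

# Hypothetical example values.
omega, tau, S_t = 2.0, 0.3, 2.5
t = np.linspace(-3.0, 3.0, 7)

# Temporal scaling t' = S_t t: the rescaled signal f'(t') = f(t'/S_t) is a
# sine wave with angular frequency omega / S_t, and tau' = S_t^2 tau.
t_p, omega_p, tau_p = S_t * t, omega / S_t, S_t**2 * tau

for n in (1, 2, 3):
    lhs = sine_deriv(t_p, omega_p, tau_p, n)   # transformed domain
    rhs = sine_deriv(t, omega, tau, n)         # original domain
    print(n, np.max(np.abs(lhs - rhs)))
```

Without the factor \(\tau^{n/2}\), the derivatives in the two domains would instead differ by the factor \(S_t^n\) in this example, which is the magnitude mismatch that the scale normalization compensates for.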
Consider any 2+1-D spatio-temporal video sequence or video stream \(f \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\) of the form \(f(x, t)\), where \(x = (x_1, x_2) \in {\mathbb{R}}^2\) denotes the spatial coordinates and \(t \in {\mathbb{R}}\) denotes the time variable. To describe the properties of scale-normalized analogues of the velocity-adapted derivative operators, which also involve the computation of spatial derivatives, let us next consider a space-time separable spatio-temporal representation \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) obtained by joint convolution with an isotropic Gaussian kernel \(g \colon {\mathbb{R}}^2 \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) and a temporal smoothing kernel \(h \colon {\mathbb{R}}\times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) according to \[L(\cdot, \cdot;\; s, \tau) = g(\cdot;\; s) *_x h(\cdot;\; \tau) *_t f(\cdot, \cdot),\] where the convolution with the spatial Gaussian kernel \(g(\cdot;\; s)\) is performed over the spatial domain only, and the convolution with the temporal kernel \(h(\cdot;\; \tau)\) is performed over the temporal domain only, here indicated by the corresponding spatial and temporal convolution operators \(*_x\) and \(*_t\), respectively. Then, we can define scale-normalized analogues of the velocity-adapted temporal derivative operators (10) according to (as an extension of Lindeberg ([2]) Equation (82)) \[\label{eq-vel-adapt-der-def-sc-norm} \partial_{{\bar t},\mathrm{norm}}^n = \tau^{n/2} \, (v^T \, \nabla_x + \partial_t)^n.\tag{90}\]
Given any 2+1-D spatio-temporal video sequence or video stream \(f \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\), a spatial scaling factor \(S_x \in {\mathbb{R}}_+\) and a temporal scaling factor \(S_t \in {\mathbb{R}}_+\), let us next define a scaled video sequence or video stream \(f' \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\) under the composition of a spatial scaling transformation and a temporal scaling transformation over the joint space-time domain of the form \[f'(x', t') = f(x, t) \quad\mbox{for}\quad x' = S_x \, x \quad\mbox{and}\quad t' = S_t \, t.\] Furthermore, let us define the space-time-separable spatio-temporal scale-space representations \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{R}}_+ \rightarrow {\mathbb{R}}\) of \(f\) and \(f'\), respectively, according to \[\begin{aligned} L(\cdot, \cdot;\; s, \tau) & = g(\cdot;\; s) *_x h(\cdot;\; \tau) *_t f(\cdot, \cdot), \\ L'(\cdot, \cdot;\; s', \tau') & = g(\cdot;\; s') *_x h(\cdot;\; \tau') *_t f'(\cdot, \cdot), \end{aligned}\] for matching values of the spatial and the temporal scale parameters according to \[s' = S_x^2 \, s \quad\mbox{and}\quad \tau' = S_t^2 \, \tau.\] Let us also, in addition to the definition of scale-normalized velocity-adapted temporal derivatives for the original video sequence or video stream \(f\) according to (90), define corresponding scale-normalized velocity-adapted temporal derivatives over the jointly rescaled video sequence or video stream \(f'\) according to \[\label{eq-def-sc-norm-vel-adapt-ders} \partial_{{\bar t}',\mathrm{norm}}^n = {\tau'}^{n/2} \, ({v'}^T \, \nabla_{x'} + \partial_{t'})^n.\tag{91}\] Then, provided that we define the transformed velocity vector \(v'\) according to \[v' = \frac{S_x}{S_t} \, v,\] it follows that the scale-normalized temporal derivatives of the spatio-temporal scale-space representations in the two domains will be equal \[\label{eq-temp-sc-cov-property-veladapt-temp-ders} L'_{{\bar t}',\mathrm{norm}}(x', t';\; s', \tau') = L_{\bar t,\mathrm{norm}}(x, t;\; s, \tau).\tag{92}\] This constitutes the covariance property for scale-normalized velocity-adapted temporal derivatives of a space-time-separable spatio-temporal scale-space representation, under compositions of spatial scaling transformations and temporal scaling transformations.
Both this temporal scale covariance property and the previously treated temporal scale covariance property (89) have the geometric interpretation that we can, to first order of approximation, perfectly match the temporal receptive field responses between different views of a similar spatio-temporal event, that occurs either faster or slower in relation to a previous view of an otherwise similar event, with the other viewing parameters, except the speed of the event, being the same.
The additional degree of freedom, obtained by also including an arbitrary uniform spatial scaling transformation of the spatial domain, has been introduced here to demonstrate that the underlying space-time separable spatio-temporal scale-space representation \(L(x, t;\; s, \tau)\) is closed also under arbitrary combinations of such variabilities, however, then with the important constraint that the image velocity must be adapted, as determined by the spatial and the temporal scaling factors \(S_x\) and \(S_t\). If we, on the other hand, would like to achieve closedness under free variabilities of the velocity vector \(v\), independent of the spatial scaling factor \(S_x\) and the temporal scaling factor \(S_t\), then a more complex joint spatio-temporal scale-space concept with additional parameters for the receptive fields is needed, as will be addressed in the next section.
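The role of the velocity adaptation \(v' = (S_x/S_t) \, v\) in (92) can be illustrated for first-order derivatives (\(n = 1\)) at a single space-time point. The sketch below uses made-up derivative values and assumes only the chain-rule relations \(\nabla_{x'} = \nabla_x / S_x\) and \(\partial_{t'} = \partial_t / S_t\) that follow from \(x' = S_x \, x\) and \(t' = S_t \, t\); all variable names are hypothetical.

```python
import numpy as np

# Hypothetical example values.
S_x, S_t = 1.6, 0.5
tau = 0.8
v = np.array([0.4, -0.2])

# Regular first-order derivatives of L at some point (x, t).
grad_x = np.array([1.3, 0.6])
L_t = -0.9

# Corresponding derivatives of L' at (x', t'), by the chain rule.
grad_xp = grad_x / S_x
L_tp = L_t / S_t

# Matched temporal scale and adapted velocity.
tau_p = S_t**2 * tau
v_p = (S_x / S_t) * v

# Scale-normalized velocity-adapted temporal derivatives, Eqs. (90) and (91).
resp = np.sqrt(tau) * (v @ grad_x + L_t)
resp_p = np.sqrt(tau_p) * (v_p @ grad_xp + L_tp)

# Same computation without adapting the velocity vector.
resp_p_unadapted = np.sqrt(tau_p) * (v @ grad_xp + L_tp)

print(resp, resp_p, resp_p_unadapted)
```

In this example the responses agree only when the velocity vector is adapted; keeping \(v\) fixed while rescaling space and time gives a different value, which illustrates the constraint stated above.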
| Section | Topic | Contribution |
|---|---|---|
| 4.1 | Covariance properties of the pure smoothing operation under individual image transformations | Review of Lindeberg ([1]) |
| 4.2 | Covariance properties of spatio-temporal derivatives under individual image transformations | Extension of Lindeberg ([1]) |
| 5.1 | Definition of the studied form of composed spatio-temporal image transformations | New |
| 5.2 | Covariance property of smoothing operation under composed spatio-temporal transformations | New |
| 5.3 | Covariance properties of derivative operators under composed spatio-temporal transformations | New |
| 5.4 | Covariance properties of geometric derivatives under composed spatio-temporal transformations | New |
| 5.5 | Covariance properties of scale-normalized spatio-temporal derivatives under compositions | New |
For processing 2+1-D spatio-temporal image data \(f \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\), convolution of the input video sequence or video stream \(f(x, t)\), where \(x = (x_1, x_2) \in {\mathbb{R}}^2\) denotes the image coordinates and \(t \in {\mathbb{R}}\) the time variable, with the purely spatio-temporal smoothing kernel \(T \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \times {\mathbb{R}}_+ \times {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) according to (1) \[\label{eq-spat-temp-RF-model-again-cov-props-basic} T(x, t;\; s, \Sigma, \tau, v) = g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau),\tag{93}\] defines spatio-temporally smoothed image data \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \times {\mathbb{R}}_+ \times {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) according to \[\label{eq-def-spat-temp-scsp} L(\cdot, \cdot;\; s, \Sigma, \tau, v) = T(\cdot, \cdot;\; s, \Sigma, \tau, v) * f(\cdot, \cdot),\tag{94}\] which is referred to as a spatio-temporal scale-space representation of \(f\) over the spatial scale parameter \(s \in {\mathbb{R}}_+\), the spatial covariance matrix \(\Sigma \in {\mathbb{S}}_+^2\), the temporal scale parameter \(\tau \in {\mathbb{R}}_+\) and the velocity vector \(v \in {\mathbb{R}}^2\) (Lindeberg [19]).
In this section, we will first in Section 4.1 restate covariance properties of spatio-temporal receptive fields under four individual classes of geometric image transformations, as previously formulated in Lindeberg ([1]). Then, we will combine these results with the transformation properties of the regular (not scale-normalized) purely spatial, purely temporal or joint spatio-temporal derivative operators expressed for the restricted subdomains in Section 3, to in Section 4.2 express transformation properties of spatio-temporal derivative operators over a joint spatio-temporal domain, for each class of individual geometric image transformations. This background will then constitute a conceptual foundation for formulating both covariance properties and transformation properties of spatio-temporal receptive fields in Section 5, for a specific way of composing the four different classes of individual geometric image transformations in cascade.
Table 2 gives a comprehensive overview of the different theoretical contributions that will follow in Sections 4 and 5.
In Lindeberg ([1]), covariance properties of this generalized Gaussian derivative model for receptive fields were studied in detail. It was specifically shown that:
Under purely spatial scaling transformations of the form \[f'(x', t') = f(x, t)\] for \(t' = t\) and \[x' = S_x \, x,\] where \(S_x \in {\mathbb{R}}_+\) denotes the spatial scaling factor, the spatio-temporal scale-space representations \(L'\) and \(L\), obtained by convolving the input signals \(f'\) and \(f\) with spatio-temporal convolution kernels of the form (1), are related according to \[L'(x', t';\; s', \Sigma', \tau', v') = L(x, t;\; s, \Sigma, \tau, v),\] provided that the spatial scale parameters and the velocity vectors are related according to \[\label{eq-transf-pars-spat-scal} s' = S_x^2 \, s \quad\quad\mbox{and}\quad\quad v' = S_x \, v.\tag{95}\]
Under spatial affine transformations of the form \[f_R(x_R, t_R) = f_L(x_L, t_L)\] for \(t_R = t_L\) and \[x_R = A \, x_L,\] where \(x_L \in {\mathbb{R}}^2\) and \(x_R \in {\mathbb{R}}^2\) denote the spatial image coordinates in the two domains and \(A\) denotes a non-singular \(2 \times 2\) affine transformation matrix, the spatio-temporal scale-space representations \(L_R\) and \(L_L\), obtained by convolving the input signals \(f_R \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\) and \(f_L \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\) with spatio-temporal convolution kernels of the form (1), are related according to \[L_R(x_R, t_R;\; s_R, \Sigma_R, \tau_R, v_R) = L_L(x_L, t_L;\; s_L, \Sigma_L, \tau_L, v_L),\] provided that the spatial covariance matrices \(\Sigma_L \in {\mathbb{S}}_+^2\) and \(\Sigma_R \in {\mathbb{S}}_+^2\) as well as the velocity vectors \(v_L \in {\mathbb{R}}^2\) and \(v_R \in {\mathbb{R}}^2\) are related according to \[\label{eq-transf-pars-spat-aff} \Sigma_R = A \, \Sigma_L \, A^T \quad\mbox{and}\quad v_R = A \, v_L,\tag{96}\] and provided that the other parameters of the receptive fields are the same.
Under purely temporal scaling transformations of the form \[f'(x', t') = f(x, t)\] for \(x' = x\) and \[t' = S_t \, t,\] where \(S_t \in {\mathbb{R}}_+\) is the temporal scaling factor, the spatio-temporal scale-space representations \(L'\) and \(L\), obtained by convolving the input signals \(f'\) and \(f\) with spatio-temporal convolution kernels of the form (1), are related according to \[L'(x', t';\; s', \Sigma', \tau', v') = L(x, t;\; s, \Sigma, \tau, v),\] provided that the temporal scale parameters \(\tau \in {\mathbb{R}}_+\) and \(\tau' \in {\mathbb{R}}_+\) as well as the velocity vectors \(v \in {\mathbb{R}}^2\) and \(v' \in {\mathbb{R}}^2\) are related according to \[\label{eq-transf-pars-temp-sc} \tau' = S_t^2 \, \tau \quad\quad\mbox{and}\quad\quad v' = v/S_t,\tag{97}\] and provided that the other parameters of the receptive field are the same.
Under Galilean transformations of the form \[f'(x', t') = f(x, t)\] for \(t' = t\) and \[\label{eq-gal-trans-sec-cov-prop} x' = x + u \, t,\tag{98}\] where \(u \in {\mathbb{R}}^2\) is a velocity vector, the spatio-temporal scale-space representations \(L'\) and \(L\), obtained by convolving the input signals \(f'\) and \(f\) with spatio-temporal convolution kernels of the form (1), are related according to \[L'(x', t';\; s', \Sigma', \tau', v') = L(x, t;\; s, \Sigma, \tau, v),\] provided that the velocity parameters \(v \in {\mathbb{R}}^2\) and \(v' \in {\mathbb{R}}^2\) for the two domains are related according to \[\label{eq-transf-pars-galilean} v' = v + u\tag{99}\] and provided that the other parameters of the receptive fields are the same.
A notable characteristic of the transformation properties of the spatio-temporal scale-space representations under these classes of natural image transformations is that the four classes of geometric image transformations are not independent. Instead, for example, both the spatial scaling transformations and the temporal scaling transformations affect the image velocities, beyond the spatial and temporal scale parameters, respectively. Furthermore, the spatial affine transformations also affect the image velocities. For this reason, it is of interest to additionally model all the four classes of image transformations jointly, which we will do in Section 5.
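The four parameter transformation rules (95), (96), (97) and (99) can be collected in a small data structure, which makes the coupling through the velocity vector explicit. This is a minimal sketch with hypothetical names (`RFParams`, `spatial_scaling`, and so on) and made-up numerical values; it encodes only the individual rules restated above, not the composed transformation treated in Section 5.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass(frozen=True)
class RFParams:
    """Receptive field parameters (s, Sigma, tau, v) of the kernel in Eq. (93)."""
    s: float
    Sigma: np.ndarray
    tau: float
    v: np.ndarray

def spatial_scaling(p, S_x):
    # Eq. (95): s' = S_x^2 s, v' = S_x v
    return replace(p, s=S_x**2 * p.s, v=S_x * p.v)

def spatial_affine(p, A):
    # Eq. (96): Sigma' = A Sigma A^T, v' = A v
    return replace(p, Sigma=A @ p.Sigma @ A.T, v=A @ p.v)

def temporal_scaling(p, S_t):
    # Eq. (97): tau' = S_t^2 tau, v' = v / S_t
    return replace(p, tau=S_t**2 * p.tau, v=p.v / S_t)

def galilean(p, u):
    # Eq. (99): v' = v + u
    return replace(p, v=p.v + u)

# Hypothetical example values.
p = RFParams(s=1.0, Sigma=np.diag([2.0, 1.0]), tau=1.0, v=np.array([1.0, 2.0]))
A = np.array([[1.0, 0.5], [0.0, 1.0]])
u = np.array([0.0, 3.0])

p_sx = spatial_scaling(p, 2.0)
p_aff = spatial_affine(p, A)
p_st = temporal_scaling(p, 2.0)
p_gal = galilean(p, u)

for name, q in [("scaling", p_sx), ("affine", p_aff), ("temporal", p_st), ("Galilean", p_gal)]:
    print(name, q.s, q.tau, q.v)
```

Each of the four functions modifies the velocity vector, while the scale parameters and the covariance matrix are each touched by only one class of transformation.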
Before that, let us, however, first also address the transformation properties of the spatial and the temporal derivative operators, which substantially extends the previous treatment of transformation properties of spatio-temporal derivative operators over joint space-time in Lindeberg ([1]), as here based on the in-depth treatment of transformation properties of spatial and temporal derivatives over spatial or temporal subdomains in Section 3 in the current paper.
Beyond the spatio-temporal smoothing kernel \(T(x, t;\; s, \Sigma, \tau, v)\), that defines the spatio-temporal scale-space representation \(L(x, t;\; s, \Sigma, \tau, v)\) in (94), the spatio-temporal receptive field model (12) \[\label{eq-spat-temp-RF-model-der-again} T_{{\varphi}^{m} {\bar t}^n}(x, t;\; s, \Sigma, \tau, v) = \partial_{\varphi}^{m} \, \partial_{\bar t}^n \left( g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau) \right),\tag{100}\] that has been used for modelling the receptive fields of simple cells in the primary visual cortex, does additionally comprise spatial derivative operators in terms of directional derivative operators \(\partial_{\varphi}^{m}\) of the form (8) as well as temporal derivative operators \(\partial_{\bar t}^n\) of the forms (9) and (10).
In summary, under the four classes of spatio-temporal image transformations studied in this work, the spatial and the temporal derivative operators transform as follows:
Under purely spatial scaling transformations of the form \[f'(x', t') = f(x, t)\] for \(t' = t\) and \[x' = S_x \, x,\] where \(S_x \in {\mathbb{R}}_+\) denotes the spatial scaling factor, the spatial gradient operator \(\nabla_x = (\partial_{x_1}, \partial_{x_2})^T\) transforms according to \[\label{eq-transf-prop-nabla-pure-sc-transf-spat-temp} \nabla_x = S_x \, \nabla_{x'},\tag{101}\] with the transformed spatial gradient operator defined as \(\nabla_{x'} = (\partial_{x'_1}, \partial_{x'_2})^T\). This means that also the directional derivative operator, defined according to (8), transforms according to \[\label{eq-transf-prop-dphi-pure-sc-transf-spat-temp} \partial_{\varphi} = S_x \, \partial_{\varphi'}.\tag{102}\] Under a purely spatial scaling transformation, the regular temporal derivative operator is, however, unchanged \[\partial_t = \partial_{t'}.\] Due to the transformation property (95) of the velocity vector \(v\) according to \(v' = S_x \, v\), the velocity-adapted temporal derivative operator (10) is also unchanged under a uniform spatial scaling transformation. With the transformed velocity-adapted temporal derivative operator of the form \[\partial_{{\bar t}'} = v'_1 \, \partial_{x'_1} + v'_2 \, \partial_{x'_2} + \partial_{t'},\] we have \[\partial_{\bar t} = \partial_{{\bar t}'}.\]
Under spatial affine transformations of the form \[f_R(x_R, t_R) = f_L(x_L, t_L)\] for \(t_R = t_L\) and \[x_R = A \, x_L,\] where \(A\) denotes a \(2 \times 2\) affine transformation matrix, the spatial gradient operator \(\nabla_x\) transforms, with \(x = x_L\) and \(x' = x_R\), according to \[\label{eq-transf-prop-nabla-spat-aff-transf-spat-temp} \nabla_x = A^T \, \nabla_{x'}.\tag{103}\] This implies that if we define the directional derivative operator (8) as \[\partial_{\varphi} = e_{\varphi}^T \, \nabla_x\] with the unit vector \(e_{\varphi}\) in the direction \(\varphi\) transforming according to \[e_{\varphi'} = \frac{A \, e_{\varphi}}{\| A \, e_{\varphi} \|}\] to guarantee that also the transformed unit vector will be of unit length, then the corresponding transformed directional derivative operator \(\partial_{\varphi'} = e_{\varphi'}^T \nabla_{x'}\) satisfies \[\partial_{\varphi} = \| A \, e_{\varphi} \| \, \partial_{\varphi'}.\] Both the regular and the velocity-adapted temporal derivative operators are, however, unchanged under purely spatial affine transformations \[\partial_t = \partial_{t'},\tag{104}\] \[\partial_{\bar t} = \partial_{{\bar t}'},\tag{105}\] when taking into account the transformation property \(v' = A \, v\) (96) of the velocity vector.
Under purely temporal scaling transformations of the form \[f'(x', t') = f(x, t)\] for \(x' = x\) and \[t' = S_t \, t,\] where \(S_t \in {\mathbb{R}}_+\) is a temporal scaling factor, both the regular spatial gradient operator and the spatial directional derivative operators are unchanged \[\begin{aligned} \nabla_{x} & = \nabla_{x'}, \\ \partial_{\varphi} & = \partial_{\varphi'}. \end{aligned}\] Both the regular and the velocity-adapted temporal derivative operators do, however, transform according to \[\begin{aligned} \partial_t & = S_t \, \partial_{t'}, \\ \partial_{\bar t} & = S_t \, \partial_{{\bar t}'} \end{aligned}\] when taking the transformation property \(v' = v/S_t\) (97) of the velocity vector into account.
Under Galilean transformations of the form \[f'(x', t') = f(x, t)\] for \(t' = t\) and \[\label{eq-gal-trans-sec-cov-prop-again} x' = x + u \, t,\tag{106}\] where \(u \in {\mathbb{R}}^2\) is a velocity vector, both the regular spatial gradient operator and the spatial directional derivative operators are unchanged \[\begin{aligned} \nabla_{x} & = \nabla_{x'}, \\ \partial_{\varphi} & = \partial_{\varphi'}. \end{aligned}\] Taking into account the transformation property \(v' = v + u\) (99) of the velocity vector, the regular temporal derivative transforms according to \[\partial_t = u^T \, \nabla_{x'} + \partial_{t'},\] whereas the velocity-adapted derivatives are equal \[\partial_{\bar t} = \partial_{{\bar t}'}.\]
Thus, we can from this summarizing overview see how the spatial and the temporal derivative operators are transformed in ways that interact strongly with the parameters of the corresponding spatio-temporal image transformations. Of particular interest is therefore to also make explicit how these transformation properties are composed, when coupling the different types of primitive image transformations in cascade, which we will do in the next section.
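As an illustration of how such operator relations can be verified, the Galilean case can be checked with central finite differences on a smooth test function. The sketch below is an assumption-laden example rather than part of the derivation: the test function `f`, the evaluation point, the step size and the helper name `partials` are all made up, and \(f'\) is constructed from \(f'(x', t') = f(x' - u \, t', t')\), which is equivalent to (106).

```python
import numpy as np

u = np.array([0.5, -0.8])           # hypothetical Galilean velocity

def f(x1, x2, t):
    """Arbitrary smooth test signal (not from the paper)."""
    return np.sin(1.3 * x1 + 0.4 * t) * np.cos(0.7 * x2 - 0.9 * t)

def f_prime(x1p, x2p, tp):
    """Galilean-transformed signal: f'(x', t') = f(x' - u t', t')."""
    return f(x1p - u[0] * tp, x2p - u[1] * tp, tp)

def partials(func, point, h=1e-5):
    """Central-difference estimates of the three first-order partial derivatives."""
    point = np.asarray(point, dtype=float)
    out = np.empty(3)
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        out[i] = (func(*(point + e)) - func(*(point - e))) / (2 * h)
    return out

x, t = np.array([0.3, -1.2]), 0.7
xp, tp = x + u * t, t               # corresponding point in the transformed domain

d = partials(f, [x[0], x[1], t])
dp = partials(f_prime, [xp[0], xp[1], tp])
grad_x, f_t = d[:2], d[2]
grad_xp, fp_t = dp[:2], dp[2]

v = np.array([0.2, 0.6])
v_p = v + u                         # Eq. (99)

print("spatial gradients:", grad_x, grad_xp)
print("d_t vs u^T grad' + d_t':", f_t, u @ grad_xp + fp_t)
print("velocity-adapted:", v @ grad_x + f_t, v_p @ grad_xp + fp_t)
```

The three printed pairs correspond to \(\nabla_x = \nabla_{x'}\), \(\partial_t = u^T \nabla_{x'} + \partial_{t'}\) and \(\partial_{\bar t} = \partial_{{\bar t}'}\), and agree up to the finite-difference error.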
In this section, we will derive a set of joint covariance properties over the composition of (i) a spatial scaling transformation, (ii) a spatial affine transformation, (iii) a Galilean transformation and (iv) a temporal scaling transformation.
For this purpose, we will first in Section 5.1 define the composed geometric transformation, and then in Section 5.2 consider how different components in the integral formulation of the convolution operation are transformed under the corresponding change of variables, which, when combined, leads to the desired transformation property regarding the spatio-temporal smoothing components of the spatio-temporal receptive fields.
Then, we will additionally in Sections 5.3–5.5 complement with explicit transformation properties regarding the spatial and the temporal derivative operators, underlying the formulation of the joint spatio-temporal receptive fields, obtained by applying composed spatio-temporal derivatives to the joint spatio-temporal smoothing kernel, specifically including explicit formulations of joint spatio-temporal covariance and transformation properties in terms of scale-normalized derivatives.
The results presented in this section will: (i) extend the individual covariance properties of the spatio-temporal smoothing operation for each one of the four primitive types of geometric image transformations reviewed in Section 4.1 to a joint spatio-temporal covariance property of the spatio-temporal smoothing operation, (ii) extend the transformation properties of the spatio-temporal derivative operators for each one of the four primitive types of geometric image transformations described in Section 4.2 to transformation properties of spatio-temporal derivative operators under a joint composition of the four primitive types of geometric image transformations, and (iii) extend the transformation properties of the regular spatio-temporal derivative operators under the composed geometric transformation to algebraically much simpler covariance and transformation properties in terms of scale-normalized derivatives.
In these ways, the presented results will show how spatio-temporal receptive field responses in terms of spatio-temporal derivatives of spatio-temporal smoothing operations can be matched under composed geometric image transformations, provided that the parameters of the receptive fields are properly matched between the domains before vs. after the composed spatio-temporal image transformation.
Consider the composition of:
a spatial scaling transformation with the spatial scaling factor \(S_x \in {\mathbb{R}}_+\),
a spatial affine transformation with the non-singular \(2 \times 2\) affine transformation matrix \(A\),
a Galilean transformation with the velocity vector \(u \in {\mathbb{R}}^2\), and
a temporal scaling transformation with the temporal scaling factor \(S_t \in {\mathbb{R}}_+\)
of the form \[\begin{align} x' & = S_x \, (A \, x + u \, t), \tag{107} \\ t' & = S_t \, t, \tag{108} \end{align}\] where \(x \in {\mathbb{R}}^2\), \(x' \in {\mathbb{R}}^2\), \(t \in {\mathbb{R}}\) and \(t' \in {\mathbb{R}}\).
As will be described in more detail in Section 6, this way of composing the four different types of primitive image transformations geometrically corresponds to interpreting:
the \(2 \times 2\) affine transformation matrix \(A\) as an orthonormal projection of surface patterns from the tangent plane of a local surface patch to a plane, that is parallel with the image plane of the observer,
the velocity vector \(u = (u_1, u_2)^T \in {\mathbb{R}}^2\) as the projection of the 3-D motion vector \(U = (U_1, U_2, U_3)^T\) of local surface patterns onto a plane, that is parallel to the image plane, by local orthonormal projection,
the spatial scaling factor \(S_x \in {\mathbb{R}}_+\) as corresponding to the perspective scaling factor proportional to the inverse depth \(Z\), which will then affect both the projection of a spatial surface pattern and the magnitude of the perceived motion in the image plane, and
the temporal scaling factor \(S_t \in {\mathbb{R}}_+\) as capturing a variability of similar spatio-temporal events that may occur either faster or slower, when observing different instances of a similar event at different occasions.
In this way, the composed image transformation model captures the variabilities of the scaled orthographic projection model, complemented with a variability over projections of 3-D motions between an observed object and the observer, including spatio-temporal events that may occur faster or slower relative to a reference view.
In the following, we will derive a joint covariance property for the spatio-temporal scale-space representation obtained by convolution with the studied class of spatio-temporal smoothing kernels, under the above class of composed spatio-temporal image transformations. By necessity, parts of this treatment may be somewhat technical. For the reader, who may be more interested in the final result, than the details of the derivation, it should be possible to, without major loss of continuity, skip the details, and then proceed to the below boldface header “Summary of main result”, for a condensed summary of the resulting joint covariance property.
Prerequisites: Let us assume that we have two video sequences or video streams \(f \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\) and \(f' \colon {\mathbb{R}}^2 \times {\mathbb{R}}\rightarrow {\mathbb{R}}\), that are related according to (107 ) and (108 ), such that \[\label{eq-f-fprim-transf-proof} f'(x', t') = f(x, t)\tag{109}\] for all \(x = (x_1, x_2)^T \in {\mathbb{R}}^2\) and \(t \in {\mathbb{R}}\) at all points \(p = (x_1, x_2, t)^T \in {\mathbb{R}}^3\), with these coordinates interpreted as local coordinates in some local region \(\Omega\) in joint space-time, around the origin \(O = (0, 0, 0)^T\), assumed to correspond to the image point \(x = (0, 0)\), and the temporal moment \(t = 0\) corresponding to the time moment when a particular receptive field response is computed.
What we want to derive is a relationship for how the scale-space representations \(L \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \times {\mathbb{R}}_+ \times {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) and \(L' \colon {\mathbb{R}}^2 \times {\mathbb{R}}\times {\mathbb{R}}_+ \times {\mathbb{S}}_+^2 \times {\mathbb{R}}_+ \times {\mathbb{R}}^2 \rightarrow {\mathbb{R}}\) of \(f\) and \(f'\), respectively, around this point in joint space-time are related, when they are defined according to \[\begin{align} L(x, t;\; s, \Sigma, \tau, v) & = \int_{\xi \in {\mathbb{R}}^2} \int_{\eta \in {\mathbb{R}}} T(\xi, \eta;\; s, \Sigma, \tau, v) \, f(x - \xi, t - \eta) \, d\xi \, d\eta, \tag{110} \\ L'(x', t';\; s', \Sigma', \tau', v') & = \int_{\xi' \in {\mathbb{R}}^2} \int_{\eta' \in {\mathbb{R}}} T(\xi', \eta';\; s', \Sigma', \tau', v') \, f'(x' - \xi', t' - \eta') \, d\xi' \, d\eta'. \tag{111} \end{align}\]
Step I: Let us first consider how the velocity-adapted Gaussian kernel \(g(x' - v' t';\; s' \, \Sigma')\) transforms in the expression for the spatio-temporal receptive field according to (1) \[\label{eq-spat-temp-RF-model-prim} T(x', t';\; s', \Sigma', \tau', v') = g(x' - v' t';\; s' \, \Sigma') \, h(t';\; \tau').\tag{112}\] Expanding the expression for the function \(g(x' - v' t';\; s' \, \Sigma')\) according to the definition (2) of the 2-D affine Gaussian kernel, and making use of the explicit expressions for the spatio-temporal image transformation in (107) and (108), then gives \[\begin{aligned} g(x' - v' t';\; s' \, \Sigma') & = \frac{1}{2 \pi s' \sqrt{\det{\Sigma'}}} \, e^{-(x' - v' t')^T {\Sigma'}^{-1} (x' - v' t')/(2 s')} \\ & = \frac{1}{2 \pi s' \sqrt{\det{\Sigma'}}} \, e^{-(S_x (A x + u t) - v' S_t t)^T {\Sigma'}^{-1} (S_x (A x + u t) - v' S_t t)/(2 s')}. \end{aligned}\] If we, inspired by the transformation property of the spatial scale parameters under a spatial scaling transformation in (95), introduce a similar relationship \[s' = S_x^2 \, s,\] as well as, inspired by the transformation property of the spatial covariance matrices \(\Sigma \in {\mathbb{S}}_+^2\) and \(\Sigma' \in {\mathbb{S}}_+^2\) under a spatial affine transformation in (96), introduce a similar relationship \[\Sigma' = A \, \Sigma \, A^T,\] which gives \[{\Sigma'}^{-1} = (A \, \Sigma A^T)^{-1} = A^{-T} \, \Sigma^{-1} \, A^{-1},\] as well as \[\det \Sigma' = | \det A |^2 \, \det \Sigma,\] we then obtain \[\begin{aligned} g(x' - v' t';\; s' \, \Sigma') & = \frac{1}{2 \pi \, S_x^2 \, s \, | \det A |\sqrt{\det{\Sigma}}} \times \\ & \phantom{=} \quad e^{-(A^{-1}(S_x (A x + u t) - v' S_t t))^T \Sigma^{-1} (A^{-1}(S_x (A x + u t) - v' S_t t))/(2 S_x^2 s)}. \end{aligned}\] Notably, this expression can be written as \[\label{eq-transf-gauss-over-prime} g(x' - v' t';\; s' \, \Sigma') = \frac{1}{2 \pi \, s \, S_x^2 \, | \det A |\sqrt{\det{\Sigma}}} \, e^{-(x - v t)^T \Sigma^{-1} (x - v t)/(2 s)},\tag{113}\] provided that the velocity parameters \(v \in {\mathbb{R}}^2\) and \(v' \in {\mathbb{R}}^2\) in the two domains are related according to \[- v = A^{-1} u - A^{-1} v' S_t/S_x,\] in other words if \[\frac{S_t}{S_x} A^{-1} \, v' = v + A^{-1} u,\] that is, provided that the velocity parameters \(v'\) and \(v\) are related according to \[v' = \frac{S_x}{S_t} (A \, v + u).\]
Step II: Consider next the spatio-temporal scale-space representation \(L'\) of \(f'\) according to (111), and perform the change of variables \[\begin{align} \xi' & = S_x \, (A \, \xi + u \, \eta), \tag{114} \\ \eta' & = S_t \, \eta, \tag{115} \end{align}\] which for \(d\xi' = d\xi'_1 \, d\xi'_2\) and \(d\xi = d\xi_1 \, d\xi_2\) gives \[\begin{align} d\xi' & = S_x^2 \, | \det A | \, d\xi, \tag{116} \\ d\eta' & = S_t \, d\eta. \tag{117} \end{align}\] Let us additionally, in the convolution integral (111), transform the kernel \(T(x', t';\; s', \Sigma', \tau', v')\) according to (112), with its components \(g(x' - v' t';\; s' \, \Sigma')\) transforming according to (113), and \(h(t';\; \tau')\) transforming according to (3), with the parameters of the receptive fields transforming according to \[\begin{aligned} s' & = S_x^2 \, s, \\ \Sigma' & = A \, \Sigma \, A^{T}, \\ \tau' & = S_t^2 \, \tau, \\ v' & = \frac{S_x}{S_t} (A \, v + u), \end{aligned}\] which then gives \[\label{eq-transf-T-in-proof} T(x', t';\; s', \Sigma', \tau', v') = \frac{1}{S_x^2 \, |\det A| \, S_t} \, T(x, t;\; s, \Sigma, \tau, v).\tag{118}\]
Step III: Thus, given that the functions \(f\) and \(f'\) transform according to (109), combined with the previous result, that \(T\) transforms according to (118), as well as that \(d\xi\), \(d\xi'\), \(d\eta\) and \(d\eta'\) transform according to (116) and (117), we obtain \[\begin{aligned} L'(x', t';\; s', \Sigma', \tau', v') & = \int_{\xi' \in {\mathbb{R}}^2} \int_{\eta' \in {\mathbb{R}}} T(\xi', \eta';\; s', \Sigma', \tau', v') \, f'(x' - \xi', t' - \eta') \, d\xi' \, d\eta' \\ & = \int_{\xi \in {\mathbb{R}}^2} \int_{\eta \in {\mathbb{R}}} T(\xi, \eta;\; s, \Sigma, \tau, v) \, f(x - \xi, t - \eta) \, d\xi \, d\eta \\ & = L(x, t;\; s, \Sigma, \tau, v). \end{aligned}\]
Figure 12: Commutative diagram for the joint spatio-temporal smoothing component (1) in the joint spatio-temporal receptive field model (12) under the composition of (i) a spatial scaling transformation, (ii) a spatial affine transformation, (iii) a Galilean transformation and (iv) a temporal scaling transformation according to (107) and (108). This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input video sequence or video stream \(f(x, t)\) is first subject to the composed transformation \(x' = S_x (A \, x + u \, t)\) and \(t' = S_t \, t\) and then smoothed with a spatio-temporal kernel \(T(x', t';\; s', \Sigma', \tau', v')\), or instead directly convolved with the spatio-temporal smoothing kernel \(T(x, t;\; s, \Sigma, \tau, v)\) and then subject to the same joint spatio-temporal transformation, we do then get the same result, provided that the parameters of the spatio-temporal smoothing kernels are related according to \(s' = S_x^2 \, s\), \(\Sigma' = A \, \Sigma \, A^{T}\), \(\tau' = S_t^2 \, \tau\) and \(v' = \frac{S_x}{S_t} (A \, v + u)\).
Summary of main result: To conclude, given two video sequences or video streams \(f'\) and \(f\), that are related according to \[f'(x', t') = f(x, t)\] under a composed image transformation of the form \[\begin{align} x' & = S_x \, (A \, x + u \, t), \tag{119} \\ t' & = S_t \, t, \tag{120} \end{align}\] we have shown that the corresponding spatio-temporal scale-space representations \(L'\) and \(L\) of \(f'\) and \(f\), respectively, are related according to \[\label{eq-joint-cov-prop-result-of-proof} L'(x', t';\; s', \Sigma', \tau', v') = L(x, t;\; s, \Sigma, \tau, v),\tag{121}\] provided that the parameters of the receptive fields transform according to \[\begin{align} s' & = S_x^2 \, s, \tag{122} \\ \Sigma' & = A \, \Sigma \, A^{T}, \tag{123} \\ \tau' & = S_t^2 \, \tau, \tag{124} \\ v' & = \frac{S_x}{S_t} (A \, v + u), \tag{125} \end{align}\] which proves the joint spatio-temporal covariance property; see Figure 12 for a commutative diagram that illustrates this joint covariance property.
Notably, this result also serves as an explicit proof of all the individual transformation properties in Lindeberg ([1]), where the explicit proofs were omitted because of lack of space.
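The kernel relation (118) and the parameter matching (122)–(125) can be checked numerically at a single point. The following is a minimal sketch and not part of the original treatment: the function names `transform_params` and `kernel_T`, the numerical values, and in particular the choice of a Gaussian temporal kernel \(h(t;\; \tau)\) are assumptions made for illustration only.

```python
import math

def mat_vec(M, x):
    """Multiply a 2x2 matrix (nested lists) by a 2-vector."""
    return (M[0][0]*x[0] + M[0][1]*x[1], M[1][0]*x[0] + M[1][1]*x[1])

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0]*M[1][1] - M[0][1]*M[1][0]

def transform_params(s, Sigma, tau, v, S_x, A, u, S_t):
    """Map the receptive field parameters according to (122)-(125)."""
    s_p = S_x**2 * s
    # Sigma' = A Sigma A^T, written out component-wise
    Sigma_p = [[sum(A[i][k]*Sigma[k][l]*A[j][l]
                    for k in range(2) for l in range(2))
                for j in range(2)] for i in range(2)]
    tau_p = S_t**2 * tau
    Av = mat_vec(A, v)
    v_p = (S_x/S_t*(Av[0] + u[0]), S_x/S_t*(Av[1] + u[1]))
    return s_p, Sigma_p, tau_p, v_p

def kernel_T(x, t, s, Sigma, tau, v):
    """T = g(x - v t; s Sigma) h(t; tau); the Gaussian h is an assumption."""
    d = (x[0] - v[0]*t, x[1] - v[1]*t)
    det = det2(Sigma)
    # quadratic form d^T Sigma^{-1} d via the 2x2 adjugate
    q = (Sigma[1][1]*d[0]*d[0] - (Sigma[0][1] + Sigma[1][0])*d[0]*d[1]
         + Sigma[0][0]*d[1]*d[1]) / det
    g = math.exp(-q/(2*s)) / (2*math.pi*s*math.sqrt(det))
    h = math.exp(-t*t/(2*tau)) / math.sqrt(2*math.pi*tau)
    return g*h

# Hypothetical transformation and receptive field parameters
S_x, S_t = 1.5, 0.8
A = [[1.2, 0.3], [-0.2, 0.9]]
u = (0.4, -0.1)
s, Sigma, tau, v = 2.0, [[1.0, 0.2], [0.2, 0.5]], 1.5, (0.3, 0.1)

# A point (x, t) and its image (x', t') under (119)-(120)
x, t = (0.7, -0.4), 0.6
Ax = mat_vec(A, x)
x_p = (S_x*(Ax[0] + u[0]*t), S_x*(Ax[1] + u[1]*t))
t_p = S_t*t

s_p, Sigma_p, tau_p, v_p = transform_params(s, Sigma, tau, v, S_x, A, u, S_t)
lhs = kernel_T(x_p, t_p, s_p, Sigma_p, tau_p, v_p)
rhs = kernel_T(x, t, s, Sigma, tau, v) / (S_x**2 * abs(det2(A)) * S_t)
print(lhs, rhs)  # the two values should agree up to rounding errors
```

The prefactor \(1/(S_x^2 \, |\det A| \, S_t)\) is exactly the factor that is cancelled by the Jacobian in (116) and (117), which is why the convolution integrals in the two domains coincide.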
Figure 13: Commutative diagram for spatio-temporal derivative operators underlying the joint spatio-temporal receptive field model (12) under the composition of (i) a spatial scaling transformation, (ii) a spatial affine transformation, (iii) a Galilean transformation and (iv) a temporal scaling transformation according to (107) and (108). This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input video sequence or video stream \(f(x, t)\) is first subject to the composed transformation \(x' = S_x (A \, x + u \, t)\) and \(t' = S_t \, t\) and then filtered with a spatio-temporal derivative kernel \((\nabla_{x'} \partial_{t'} T)(x', t';\; s', \Sigma', \tau', v')\), or instead directly convolved with the spatio-temporal derivative kernel \((\nabla_x \partial_t T)(x, t;\; s, \Sigma, \tau, v)\) and then subject to the same joint spatio-temporal transformation, we do then get the same result, provided that the spatial and the temporal derivative operators are transformed according to \(\nabla_{x'} = \frac{1}{S_x} \, A^{-T} \, \nabla_{x}\) and \(\partial_{t'} = - \frac{1}{S_t} \, u^T A^{-T} \, \nabla_x + \frac{1}{S_t} \, \partial_t\) and that the parameters of the spatio-temporal smoothing kernels are related according to \(s' = S_x^2 \, s\), \(\Sigma' = A \, \Sigma \, A^{T}\), \(\tau' = S_t^2 \, \tau\) and \(v' = \frac{S_x}{S_t} (A \, v + u)\). (In this commutative diagram, we have illustrated the general covariance properties of spatio-temporal derivatives for the particular choice of the composed spatio-temporal derivative operator \(\nabla_x \partial_t T\) in the spatio-temporal receptive field model (12). Similar covariance properties can, of course, also be obtained for other combinations of the spatial and the temporal derivative operators \(\nabla_x\) and \(\partial_t\), in a structurally similar manner.)
Figure 14: Commutative diagram for scale-normalized spatio-temporal derivative operators defined from the joint spatio-temporal receptive field model (12) under the composition of (i) a spatial scaling transformation, (ii) a spatial affine transformation, (iii) a Galilean transformation and (iv) a temporal scaling transformation according to (107) and (108). This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input video sequence or video stream \(f(x, t)\) is first subject to the composed transformation \(x' = S_x (A \, x + u \, t)\) and \(t' = S_t \, t\) and then filtered with a scale-normalized spatio-temporal derivative kernel \((\nabla_{x',\mathrm{affnorm}} \partial_{t',\mathrm{norm}} T)(x', t';\; s', \Sigma', \tau', v')\), or instead directly convolved with the scale-normalized spatio-temporal derivative kernel \((\nabla_{x,\mathrm{affnorm}} \partial_{t,\mathrm{norm}} T)(x, t;\; s, \Sigma, \tau, v)\) and then subject to the same joint spatio-temporal transformation, we do then, up to a possibly unknown rotation transformation \(\tilde{\rho}\), get the same result, provided that the parameters of the spatio-temporal smoothing kernels are related according to \(s' = S_x^2 \, s\), \(\Sigma' = A \, \Sigma \, A^{T}\), \(\tau' = S_t^2 \, \tau\) and \(v' = \frac{S_x}{S_t} (A \, v + u)\). Note, in particular, the conceptual simplification in relation to the corresponding commutative diagram based on regular partial derivatives that have not been subject to scale normalization or velocity adaptation regarding the temporal derivatives, in that the scale-normalized spatio-temporal derivatives in this commutative diagram are essentially equal, up to a possibly unknown rotation transformation. (In this commutative diagram, we have illustrated the general covariance properties of spatio-temporal derivatives for the particular choice of the composed spatio-temporal derivative operator \(\nabla_{x,\mathrm{affnorm}} \partial_{t,\mathrm{norm}} T\) in the spatio-temporal receptive field model (12). Similar covariance properties can, of course, also be obtained for other selections of the spatial and the temporal derivative operators \(\nabla_{x,\mathrm{affnorm}}\) and \(\partial_{t,\mathrm{norm}}\) for which corresponding covariance properties hold.)
Let us denote the spatio-temporal coordinates for the original and the transformed domains by \(p = (x_1, x_2, t)^T \in {\mathbb{R}}^3\) and \(p' = (x'_1, x'_2, t')^T \in {\mathbb{R}}^3\), respectively, and let us denote the components of the \(2 \times 2\) affine transformation matrix \(A\) by \(a_{ij}\) for \(i, j \in \{ 1, 2\}\) and again let the velocity vector in the Galilean transformation be \(u = (u_1, u_2)^T\). Then, the composed image transformation according to (107) and (108) can be written \[p' = \left( \begin{array}{c} x_1' \\ x_2' \\ t' \end{array} \right) = S_x \, \left( \begin{array}{ccc} a_{11} & a_{12} & u_1 \\ a_{21} & a_{22} & u_2 \\ 0 & 0 & S_t/S_x \\ \end{array} \right) \left( \begin{array}{c} x_1 \\ x_2 \\ t \end{array} \right) = Q \, p,\] where \(Q\) is a \(3 \times 3\) joint transformation matrix, that operates on the spatio-temporal coordinates \(p\).
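As an illustration of how the joint matrix \(Q\) arises from the primitive transformations, the following sketch multiplies \(3 \times 3\) matrices for the affine, the Galilean and the scaling steps and compares the product with \(Q\). The order of the factors (affine first, then Galilean, then the spatial and temporal scalings) is our reading of the form \(x' = S_x (A \, x + u \, t)\), \(t' = S_t \, t\); the helper names and the numerical values are hypothetical.

```python
def matmul(P, R):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(P[i][k]*R[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def joint_matrix(S_x, A, u, S_t):
    """Joint matrix Q with p' = Q p for x' = S_x (A x + u t), t' = S_t t."""
    return [[S_x*A[0][0], S_x*A[0][1], S_x*u[0]],
            [S_x*A[1][0], S_x*A[1][1], S_x*u[1]],
            [0.0,         0.0,         S_t]]

# Hypothetical parameter values
S_x, S_t = 1.5, 0.8
A = [[1.2, 0.3], [-0.2, 0.9]]
u = (0.4, -0.1)

# Primitive transformations acting on p = (x1, x2, t)^T
affine = [[A[0][0], A[0][1], 0.0], [A[1][0], A[1][1], 0.0], [0.0, 0.0, 1.0]]
galilean = [[1.0, 0.0, u[0]], [0.0, 1.0, u[1]], [0.0, 0.0, 1.0]]
scaling = [[S_x, 0.0, 0.0], [0.0, S_x, 0.0], [0.0, 0.0, S_t]]

# Affine step applied first, scalings applied last
Q_cascade = matmul(scaling, matmul(galilean, affine))
Q_direct = joint_matrix(S_x, A, u, S_t)
```

Since the Galilean step is applied before the scalings, the velocity vector \(u\) enters \(Q\) multiplied by \(S_x\), in agreement with the third column of the matrix above.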
According to the general transformation property of derivative operators under a linear change of variables between the domains \(p = (x_1, x_2, t)^T\) and \(p' = (x'_1, x'_2, t')^T\), which in terms of explicit partial derivatives can be expressed on the form \[\begin{aligned} \partial_{x_1} & = \frac{\partial x_1'}{\partial x_1} \, \partial_{x_1'} + \frac{\partial x_2'}{\partial x_1} \, \partial_{x_2'} + \frac{\partial t'}{\partial x_1} \, \partial_{t'}, \\ \partial_{x_2} & = \frac{\partial x_1'}{\partial x_2} \, \partial_{x_1'} + \frac{\partial x_2'}{\partial x_2} \, \partial_{x_2'} + \frac{\partial t'}{\partial x_2} \, \partial_{t'}, \\ \partial_{t} & = \frac{\partial x_1'}{\partial t} \, \partial_{x_1'} + \frac{\partial x_2'}{\partial t} \, \partial_{x_2'} + \frac{\partial t'}{\partial t} \, \partial_{t'}, \end{aligned}\] it then follows that the spatio-temporal derivative operators in the original and the transformed domains are related according to \[\label{eq-transf-spat-temp-grad} \nabla_{p} = \left( \begin{array}{c} \partial_{x_1} \\ \partial_{x_2} \\ \partial_t \end{array} \right) = S_x \left( \begin{array}{ccc} a_{11} & a_{21} & 0 \\ a_{12} & a_{22} & 0 \\ u_1 & u_2 & S_t/S_x \\ \end{array} \right) \left( \begin{array}{c} \partial_{x_1'} \\ \partial_{x_2'} \\ \partial_{t'} \end{array} \right) = Q^T \, \nabla_{p'},\tag{126}\] which, in turn, gives the following explicit transformation property for the spatio-temporal derivative operator under the inverse composed spatio-temporal transformation \[\label{eq-transf-spat-temp-grad-inv} \nabla_{p'} = Q^{-T} \, \nabla_{p}\tag{127}\] with \[Q^{-T} = \frac{1}{S_x\, \det A} \left( \begin{array}{ccc} a_{22} & -a_{21} & 0 \\ -a_{12} & a_{11} & 0 \\ \frac{S_x}{S_t} \, (- a_{22} \, u_1 + a_{12} \, u_2) & \frac{S_x}{S_t} \, (a_{21} \, u_1 - a_{11} \, u_2) & \frac{S_x \det A}{S_t} \end{array} \right).\] In terms of vector notation, after introducing the purely spatial gradient operators \(\nabla_x = (\partial_{x_1}, \partial_{x_2})^T\) and \(\nabla_{x'} = (\partial_{x_1'}, \partial_{x_2'})^T\), the transformation property (126) of the spatio-temporal gradient operators can then be written as \[\begin{align} \nabla_{x} & = S_x \, A^T \, \nabla_{x'}, \tag{128} \\ \partial_{t} & = S_x \, u^T \, \nabla_{x'} + S_t \, \partial_{t'}, \tag{129} \end{align}\] whereas the corresponding inverse relationship (127) can be expressed as \[\begin{align} \nabla_{x'} & = \frac{1}{S_x} \, A^{-T} \, \nabla_{x}, \tag{130} \\ \partial_{t'} & = - \frac{1}{S_t} \, u^T A^{-T} \, \nabla_x + \frac{1}{S_t} \, \partial_t. \tag{131} \end{align}\] Based on these relations, expressions for spatio-temporal derivatives can be transformed between the two domains under the composed image transformation, thus extending the transformation property (121) of the spatio-temporal receptive fields beyond the effect of the purely spatio-temporal smoothing operation, to also cover the spatio-temporal derivative operators in the composed spatio-temporal receptive field model of the form (12); see Figure 13 for a commutative diagram that illustrates this joint covariance property.
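The transformation of the derivative operators can be checked with finite differences on a smooth test signal. In the sketch below, the primed spatial gradient is obtained from (130), and the primed temporal derivative by solving (129) for \(\partial_{t'}\) by back-substitution; these predictions are then compared with derivatives computed directly in the transformed domain. The test signal `f`, the step size and all numerical values are assumptions chosen only for illustration.

```python
import math

# Hypothetical transformation parameters
S_x, S_t = 1.5, 0.8
A = [[1.2, 0.3], [-0.2, 0.9]]
u = (0.4, -0.1)
det_A = A[0][0]*A[1][1] - A[0][1]*A[1][0]
A_inv = [[A[1][1]/det_A, -A[0][1]/det_A],
         [-A[1][0]/det_A, A[0][0]/det_A]]

def f(x1, x2, t):
    """Hypothetical smooth test signal in the original domain."""
    return math.sin(0.7*x1 + 0.2*t) * math.cos(0.4*x2 - 0.3*t)

def f_prime(x1p, x2p, tp):
    """f'(x', t') = f(x, t), with (x, t) obtained by inverting (107)-(108)."""
    t = tp / S_t
    y1, y2 = x1p/S_x - u[0]*t, x2p/S_x - u[1]*t
    x1 = A_inv[0][0]*y1 + A_inv[0][1]*y2
    x2 = A_inv[1][0]*y1 + A_inv[1][1]*y2
    return f(x1, x2, t)

def gradient(fun, p, h=1e-5):
    """Central-difference gradient of fun at the point p = (p1, p2, p3)."""
    g = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        g.append((fun(*hi) - fun(*lo)) / (2*h))
    return g

# A point p and its image p' = Q p
p = (0.3, -0.5, 0.2)
p_prime = (S_x*(A[0][0]*p[0] + A[0][1]*p[1] + u[0]*p[2]),
           S_x*(A[1][0]*p[0] + A[1][1]*p[1] + u[1]*p[2]),
           S_t*p[2])

gx1, gx2, gt = gradient(f, p)
# (130): grad_x' = (1/S_x) A^{-T} grad_x
gx1p = (A_inv[0][0]*gx1 + A_inv[1][0]*gx2) / S_x
gx2p = (A_inv[0][1]*gx1 + A_inv[1][1]*gx2) / S_x
# (129) solved for d_t': d_t' = (d_t - S_x u^T grad_x') / S_t
gtp = (gt - S_x*(u[0]*gx1p + u[1]*gx2p)) / S_t

predicted = [gx1p, gx2p, gtp]
measured = gradient(f_prime, p_prime)
print(predicted, measured)  # should agree up to discretization errors
```

The comparison is only as accurate as the central differences allow, so agreement should be expected up to a small tolerance and not to machine precision.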
Beyond the above, essentially partial-derivative-based spatial and temporal derivative operators, it can often be convenient to also introduce more geometrically defined spatio-temporal derivative operators.
For example, given the vector notation for the derivative operators, the velocity-adapted derivative operators corresponding to (10) are with \(v = (v_1, v_2)^T\) and \(v' = (v_1', v_2')^T\) given by \[\label{eq-def-vel-adapt-ders-both-domains-repeated} \partial_{\bar t} = v^T \, \nabla_x + \partial_t \quad\mbox{and}\quad \partial_{\bar t'} = {v'}^T \, \nabla_{x'} + \partial_{t'},\tag{132}\] where the velocity vectors \(v\) and \(v'\) are related according to (125). Such velocity-adapted spatio-temporal derivative operators are natural to use, when computing spatio-temporal receptive field responses from moving image structures. Specifically, there are velocity-sensitive receptive fields in the primary visual cortex that can be rather well modelled by such spatio-temporal derivatives; see Figure 18 in Lindeberg ([4]).
By combining the transformation properties of the spatial and temporal derivative operators according to (130 ) and (131 ) with the transformation property (125 ) of the velocity parameters \(v\) and \(v'\) in the receptive fields, we can thus obtain explicit expressions for how such velocity-adapted receptive fields are transformed in a Galilean covariant way, under relative motions between the objects in the world and the observer. Specifically, inserting the expressions (128 ) and (129 ) for the spatial gradient operator \(\nabla_x\) and the regular temporal derivative operator \(\partial_t\) as well as the velocity vector \(v\) obtained by solving for this velocity vector as function of the transformed velocity vector \(v'\) in the transformation property (125 ) of the velocity vector under the composed spatio-temporal transformation defined by (119 ) and (120 ), does after simplification of this expression lead to the following simple relationship \[\label{eq-equal-veladapt-ders-composed-transf-main-result} \partial_{\bar t} = S_t \, \partial_{{\bar t}'}.\tag{133}\] In this way, the velocity-adapted derivatives constitute a geometrically very meaningful way to define spatio-temporal derivative responses on image observations of a dynamic world.
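The relationship (133) can also be verified at the level of operator coefficients. By (126), a first-order operator \(c^T \nabla_p\) in the original domain equals \((Q \, c)^T \nabla_{p'}\) in the transformed domain, so it suffices to compare \(Q \, (v_1, v_2, 1)^T\) with \(S_t \, (v_1', v_2', 1)^T\). The sketch below does this for hypothetical parameter values; the variable names are our own.

```python
# Hypothetical transformation parameters and velocity parameter
S_x, S_t = 1.5, 0.8
A = [[1.2, 0.3], [-0.2, 0.9]]
u = (0.4, -0.1)
v = (0.3, 0.1)

# Velocity parameter in the transformed domain according to (125)
Av = (A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1])
v_prime = (S_x/S_t*(Av[0] + u[0]), S_x/S_t*(Av[1] + u[1]))

# Joint transformation matrix Q with p' = Q p
Q = [[S_x*A[0][0], S_x*A[0][1], S_x*u[0]],
     [S_x*A[1][0], S_x*A[1][1], S_x*u[1]],
     [0.0, 0.0, S_t]]

# d_tbar = v^T grad_x + d_t has the coefficient vector c = (v1, v2, 1);
# in the basis (d_x1', d_x2', d_t') its coefficients are Q c
c = (v[0], v[1], 1.0)
lhs = [sum(Q[i][k]*c[k] for k in range(3)) for i in range(3)]

# S_t * d_tbar' = S_t (v'^T grad_x' + d_t') has coefficients S_t (v1', v2', 1)
rhs = [S_t*v_prime[0], S_t*v_prime[1], S_t]
```

The spatial coefficients agree precisely because the velocity parameter is matched according to (125); with an unmatched \(v'\), a residual spatial derivative term would remain.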
Similarly, the directional derivative operators \(\partial_{\varphi}\) and \(\partial_{\varphi'}\) corresponding to (8) are, with the unit vectors \[e_{\varphi} = (\cos \varphi, \sin \varphi)^T \quad\mbox{and}\quad e_{\varphi'} = (\cos \varphi', \sin \varphi')^T,\] given by \[\label{eq-def-dir-ders-composed} \partial_{\varphi} = e_{\varphi}^T \, \nabla_x \quad\mbox{and}\quad \partial_{\varphi'} = e_{\varphi'}^T \, \nabla_{x'}.\tag{134}\] With regard to the composed spatio-temporal image transformation (119), inserting the expression (128) for the spatial gradient operator \(\nabla_x\) into the definition (134), and defining the transformed unit vector \(e_{\varphi'}\) as \[e_{\varphi'} = \frac{S_x \, A \, e_{\varphi}}{\| S_x \, A \, e_{\varphi} \|} = \frac{A \, e_{\varphi}}{\| A \, e_{\varphi} \|} ,\] implies that the directional derivative operator transforms according to \[\partial_{\varphi} = \| S_x \, A \, e_{\varphi} \| \, \partial_{\varphi'}.\] In the special case, when the composed affine transformation matrix \(S_x \, A\) is a pure rotation matrix \(S_x \, A = R_{\theta}\), the eigenvectors of the spatial covariance matrix \(\Sigma\) in the spatio-temporal smoothing kernel do also transform according to a rotation, according to (123), implying that the rotational angles \(\varphi\) and \(\varphi'\) will be related according to \[\varphi' = \varphi + \theta,\] which with regard to the unit vectors used for defining the directional derivatives, can in terms of matrix operations be accomplished according to \[e_{\varphi'} = R_{\theta} \, e_{\varphi}.\] In these ways, we can in a rotationally covariant way transform the responses of the spatial components of the spatio-temporal receptive field model (12) under transformations within the similarity group over the image domain. For more general affine transformations over the image domain, the corresponding relations are, however, more complex.
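The transformed direction and the amplitude factor of the directional derivative can be computed as in the following sketch. The function name `transformed_direction` and the numerical values are hypothetical; the rotation case is included to illustrate \(\varphi' = \varphi + \theta\) with unit amplitude factor.

```python
import math

def transformed_direction(phi, S_x, A):
    """Return (phi', factor) such that d_phi = factor * d_phi'."""
    e = (math.cos(phi), math.sin(phi))
    # w = S_x A e_phi; its direction defines e_phi', its norm the factor
    w = (S_x*(A[0][0]*e[0] + A[0][1]*e[1]),
         S_x*(A[1][0]*e[0] + A[1][1]*e[1]))
    factor = math.hypot(w[0], w[1])
    return math.atan2(w[1], w[0]), factor

# Special case: S_x A is a pure rotation R_theta
theta, phi = 0.5, 0.3
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
phi_rot, factor_rot = transformed_direction(phi, 1.0, R)

# General case: hypothetical scaling factor and affine matrix
S_x = 1.5
A = [[1.2, 0.3], [-0.2, 0.9]]
phi_aff, factor_aff = transformed_direction(phi, S_x, A)
```

Note that `math.atan2` returns angles in \((-\pi, \pi]\), so for larger rotations the identity \(\varphi' = \varphi + \theta\) holds modulo \(2 \pi\).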
By using these transformation properties of spatio-temporal gradient operators, we can thus in a geometric way transform all the spatio-temporal derivative operators in the spatio-temporal receptive field models described in Sections 2.2 and 2.3.
With the introduction of scale-normalized derivatives according to Section 3, the transformation properties of the spatio-temporal receptive fields can be further simplified:
If we require the family of affine transformation matrices \(A\) to be reduced to the group of rotation matrices \(A = R_{\theta}\), such that the composed effect of the spatial scaling factor \(S_x\) and the rotation matrix \(A = R_{\theta}\) spans the variability of the similarity group, then, based on the theoretical results in Section 3.4, the affine scale-normalized directional derivative operators in the direction \(e_{\varphi} = (\cos \varphi, \sin \varphi)^T\) according to (19) \[\partial_{\varphi,\mathrm{norm}}^m = s^{m/2} \, (e_{\varphi}^T \, \Sigma \, e_{\varphi})^{m/2} \, \partial_{\varphi}^m\] are covariant under the resulting similarity group extended with the group of Galilean transformations and the group of temporal scaling transformations, such that \[\label{eq-sc-norm-dir-der-joint-cov-prop-sc-norm-ders} \partial_{\varphi',\mathrm{norm}}^m L'(x', t';\; s', \Sigma', \tau', v') = \partial_{\varphi,\mathrm{norm}}^m L(x, t;\; s, \Sigma, \tau, v),\tag{135}\] provided that the scale parameters \(s\) and \(s'\) are matched with the effect of the scaling transformation, the orientation angles \(\varphi\) and \(\varphi'\) are matched with the effect of the rotation matrix \(R_{\theta}\), and provided that the other parameters of the receptive fields are matched according to (122)–(125) such that \[\begin{aligned} s' & = S_x^2 \, s, \\ \varphi' & = \varphi + \theta, \\ \Sigma' & = R_{\theta} \, \Sigma \, R_{\theta}^T, \\ \tau' & = S_t^2 \, \tau, \\ v' & = \frac{S_x}{S_t} (R_{\theta} \, v + u). \end{aligned}\]
If we consider the group of general affine transformation matrices \(A\), and define the scale-normalized affine gradient vector according to (62) \[\nabla_{x,\mathrm{affnorm}} = s^{1/2} \, \Sigma^{1/2} \, \nabla_x\] and the scale-normalized affine Hessian matrix according to (81) \[{\cal H}_{x,\mathrm{affnorm}} = s \, (\Sigma^{1/2}) \, {\cal H}_x \, (\Sigma^{1/2})^T,\] then, based on the results in Section 3.6 and Section 3.8, these scale-normalized affine derivative-based entities will be equal up to rotation matrices \(\tilde{\rho}\) according to \[\label{eq-cov-prop-sc-norm-aff-grad-summ-overview} (\nabla_{x',\mathrm{affnorm}} L')(x', t';\; s', \Sigma', \tau', v') = \tilde{\rho} \, (\nabla_{x,\mathrm{affnorm}} L)(x, t;\; s, \Sigma, \tau, v)\tag{136}\] and \[\label{eq-cov-prop-sc-norm-aff-hess-summ-overview} ({\cal H}_{x',\mathrm{affnorm}} L')(x', t';\; s', \Sigma', \tau', v') = \tilde{\rho} \, ({\cal H}_{x,\mathrm{affnorm}} L)(x, t;\; s, \Sigma, \tau, v) \, \tilde{\rho}^T,\tag{137}\] provided that the scale parameters \(s\) and \(s'\) as well as the spatial covariance matrices \(\Sigma\) and \(\Sigma'\) are matched according to (80) \[s' \, \Sigma' = s \, (S_x \, A) \, \Sigma \, (S_x A)^T = s \, S_x^2 \, A \, \Sigma \, A^T,\] and provided that the other parameters of the receptive fields are matched according to (124)–(125) \[\begin{aligned} \tau' & = S_t^2 \, \tau, \\ v' & = \frac{S_x}{S_t} (A \, v + u). \end{aligned}\]
Irrespective of any restrictions on the family of affine transformation matrices \(A\), the velocity-adapted temporal derivative operators according to (132) \[\partial_{\bar t} = v^T \, \nabla_x + \partial_t \quad\mbox{and}\quad \partial_{\bar t'} = {v'}^T \, \nabla_{x'} + \partial_{t'},\] extended to scale-normalized velocity-adapted temporal derivatives according to (91) \[\label{eq-def-vel-adapt-ders-both-domains-sc-norm} \partial_{\bar t,\mathrm{norm}}^n = \tau^{n/2} \, \partial_{\bar t}^n \quad\mbox{and}\quad \partial_{\bar t',\mathrm{norm}}^n = {\tau'}^{n/2} \, \partial_{\bar t'}^n,\tag{138}\] will, based on the result underlying Equation (133), be equal \[\label{eq-equal-veladapt-ders-composed-transf-main-result-sc-norm} \partial_{{\bar t}',\mathrm{norm}}^n L' (x', t';\; s', \Sigma', \tau', v') = \partial_{{\bar t},\mathrm{norm}}^n L(x, t;\; s, \Sigma, \tau, v),\tag{139}\] provided that the parameters \(s\), \(s'\), \(\Sigma\), \(\Sigma'\), \(\tau\), \(\tau'\), \(v\) and \(v'\) of the receptive fields are matched according to Equations (122)–(125).
In these ways, the derived joint covariance properties for the spatio-temporal derivatives assume much simpler forms when expressed in terms of scale-normalized derivatives, in that the responses become essentially equal, up to a possibly unknown rotation transformation; see Figure 14 for an illustration in terms of a commutative diagram.
In this context, it should furthermore be specifically noted that the covariance properties of the spatial derivative operators in (135), (136) and (137) can also be combined with the transformation property of the velocity-adapted temporal derivative operator in (139), to also formulate the transformation properties for the corresponding combined spatio-temporal derivative operators, of the forms \[\begin{aligned} & \partial_{\varphi,\mathrm{norm}}^m \, \partial_{{\bar t},\mathrm{norm}}^n, \\ & \nabla_{x,\mathrm{affnorm}} \, \partial_{{\bar t},\mathrm{norm}}^n, \\ & {\cal H}_{x,\mathrm{affnorm}} \, \partial_{{\bar t},\mathrm{norm}}^n. \end{aligned}\] Thus, we can, based on the presented framework, express the covariance properties for a very rich family of geometric scale-normalized spatio-temporal derivative operators.

Figure 15: Illustration of the underlying geometric situation for the locally linearized transformations from a local, possibly moving, surface patch to an arbitrary view indexed by \(k\), with the fixation point \(F\) on the surface mapped to the origin \(O^{(k)} = 0\) in the image plane for the observer with the optic center \(P^{(k)}\). Then, any point in the tangent plane to the surface at the fixation point, as parameterized by the local coordinates \(\xi\) in a coordinate frame attached to the tangent plane of the surface with \(\xi = 0\) at the fixation point \(F\), is by the local linearization mapped to the image point \(x^{(k)}\). (Note, however, that this 3-D illustration is only intended to be schematic and not a fully quantitatively accurate representation, since the projection relations from the tangent plane to the surface have here been drawn according to a perspective projection model, whereas the algebraic model that we then will use for relating receptive field responses between the respective image domains is based on local linearizations of the underlying non-linear geometric transformations. This distinction could in principle be made explicit by having different notation for the locally linearized projections vs. the true geometric projections. Here, we do, however, refrain from making that distinction in the figure, in order to not overload the presentation with additional notation.)
In this section, we will interpret the above formulated spatio-temporal covariance and transformation properties geometrically, which will extend previous treatments of multi-view geometry (see Hartley and Zisserman ([59]) and Faugeras ([60])) from (i) multi-view observations of static scenes for mainly point and line configurations to (ii) multi-view observations of dynamic scenes in terms of spatio-temporal receptive field responses (with the extensions of this treatment to be performed in Section 7), where we have relative motions between the objects and spatio-temporal events in the world and the observer.
With regard to a visual observer that observes 3-D objects in a dynamic world, a geometric interpretation of the composed spatio-temporal image transformation according to Equations (107 ) and (108 ) can be obtained as follows:
Consider a camera, alternatively an eye, that views a local surface patch from different positions (optic centers) \(P^{(k)} = (P_1^{(k)}, P_2^{(k)}, P_3^{(k)})^T \in {\mathbb{R}}^3\) in the 3-D world, relative to a global 3-D coordinate system. For simplicity, with regard to the following analysis, we will assume that the fixation point is the same physical point \(F^{(k)} = (F_1^{(k)}, F_2^{(k)}, F_3^{(k)})^T \in {\mathbb{R}}^3\) on the surface patch for each one of the observers, however, with the 3-D coordinates for the fixation point now expressed relative to an individual coordinate system for each observer (with index \(k\)), with the origin of the individual 3-D coordinate system being at the optic center \(P^{(k)}\) of that observer.
For simplicity, we also assume that the image coordinates for each observer are chosen such that the spatial image coordinate of the fixation point is \(x = (x_1, x_2)^T = (0, 0)^T\) at the time moment \(t = 0\), when the spatio-temporal receptive field response to be studied is computed.
Given the above multi-view viewing model, to first-order of approximation, by approximating the non-linear perspective transformation for each observer by its first-order derivative, the transformation from a coordinate frame with local coordinates \(\xi = (\xi_1, \xi_2)^T\) in the tangent plane of the surface patch, with the fixation point corresponding to \(\xi = (\xi_1, \xi_2)^T = (0, 0)^T\), to the image coordinates \(x^{(k)} = (x^{(k)}_1, x^{(k)}_2)^T\) in the image plane can be written on the form \[\label{eq-aff-transf-obs-model} x^{(k)} = A^{(k)} \, \xi,\tag{140}\] where \(A^{(k)}\) represents a \(2 \times 2\) affine transformation matrix connected to the viewing position \(P^{(k)}\), and we can specifically choose a preferred reference view \(P^{(0)}\) perpendicular to the tangent plane of the surface. We can also decide to choose that preferred reference observation point, such that it corresponds to orthonormal projection, with \(A^{(0)} = I\), where \(I\) is the \(2 \times 2\) unit matrix.
By introducing an additional explicit parameterization for the observation points \(P^{(k)}\), that are at different distances from the fixation point on the local surface patch, and thus lead to different spatial scaling factors in the underlying perspective transformation, that is locally approximated by a local affine transformation, we can extend the model (140 ) to a model of the form \[\label{eq-sc-aff-transf-obs-model} x^{(k)} = S_x^{(k)} \, A^{(k)} \, \xi,\tag{141}\] where \(S_x^{(k)}\) represents the additional scaling factor that arises by changing the viewing distance relative to the observation point \(P^{(0)}\) used as the main reference. For the scaled orthographic projection model, the spatial scaling factor \(S_x^{(k)}\) will thus correspond to the inverse depth, such that \(S_x^{(k)} = 1/Z^{(k)}\), with the depth \(Z^{(k)}\) for each observer measured relative to its observation point \(P^{(k)}\).
If in addition, the local surface patch moves over time, with a 3-D motion vector \(U^{(k)} = (U_1^{(k)}, U_2^{(k)}, U_3^{(k)})^T\) relative to the observation point \(P^{(k)}\), that is then mapped to the 2-D motion vector \(u^{(k)} = (u_1^{(k)}, u_2^{(k)})^T\) under the same orthonormal projection model, as underlying the definition of the affine transformation matrix \(A^{(k)}\) above, then the motions of the spatio-temporal surface patterns projected to the image plane can, with these scaled orthographic projection models, to first order of approximation, in the view labelled by the index \(k\), be modelled as a motion over time of the form (see Figure 15 for an illustration) \[\label{eq-sc-aff-vel-transf-obs-model} x^{(k)} = S_x^{(k)} (A^{(k)} \xi + u^{(k)} \, t),\tag{142}\] where
\(x^{(k)} \in {\mathbb{R}}^2\) is the locally linearized projection of the physical point on the surface pattern, in the view from the observer with index \(k\), at time \(t\),
\(\xi \in {\mathbb{R}}^2\) constitute time-independent local coordinates in the tangent plane of the local surface patch,
\(S_x^{(k)} \in {\mathbb{R}}_+\) is a spatial scaling factor for the observer with index \(k\),
\(A^{(k)}\) is a non-singular \(2 \times 2\) affine projection matrix, that represents an orthographic projection from the tangent plane of the surface to a plane parallel with the image plane, for the observer with index \(k\),
\(u^{(k)} \in {\mathbb{R}}^2\) is a 2-D motion vector, that represents an orthographic projection of the 3-D motion vector \(U^{(k)}\) of the physical fixation point on the surface, to a plane parallel with the image plane, for the observer with index \(k\).
The role of the temporal scaling transformation according to Equation (108 ) in this context \[\label{eq-t-transf-geom} t' = S_t \, t,\tag{143}\] is to additionally account for making it possible to relate spatio-temporal events that occur either faster or slower in relation to the spatio-temporal variations relative to a reference view, for multiple observations at different time moments of otherwise qualitatively similar types of motion patterns or spatio-temporal events.
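The observation model in (142) and (143) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not part of the original treatment: the function name `project_patch` and all numerical parameter values are hypothetical, chosen only to exercise the formulas.

```python
import numpy as np

def project_patch(xi, t, S_x, A, u, S_t=1.0):
    """Map tangent-plane coordinates xi (shape (2,) or (N, 2)) at time t
    to image coordinates according to x = S_x (A xi + u t), Eq. (142),
    and rescale time according to t' = S_t t, Eq. (143)."""
    xi = np.atleast_2d(xi)
    x = S_x * (xi @ A.T + u * t)
    return x, S_t * t

# Hypothetical parameters for one view k
A = np.array([[1.0, 0.2],
              [0.0, 0.8]])   # non-singular 2x2 affine projection matrix
u = np.array([0.5, -0.1])    # projected 2-D motion vector
S_x = 1.0 / 4.0              # scaled orthographic model: S_x = 1/Z with Z = 4

# The fixation point xi = 0 maps to the image origin at t = 0
x0, _ = project_patch(np.zeros(2), 0.0, S_x, A, u)

# A general point at t = 2, observed with temporal scaling S_t = 0.5
x1, t1 = project_patch(np.array([1.0, 1.0]), 2.0, S_x, A, u, S_t=0.5)
```

The spatial scaling factor multiplies both the affine term and the motion term, which is why the image velocity of the fixation point is \(S_x^{(k)} \, u^{(k)}\) rather than \(u^{(k)}\).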

Figure 16: Illustration of the underlying geometric situation for the locally linearized transformations between pairwise views of the same, possibly moving, local surface patch, with the view indexed by \(\tilde{k}\) constituting the reference view and the view indexed by \(k\) constituting an arbitrary view. Here, the fixation point \(F\) on the surface is mapped to the origin \(O^{(\tilde{k})} = 0\) in the reference view by the observer with the optic center \(P^{(\tilde{k})}\), while the fixation point \(F\) is mapped to the origin \(O^{(k)} = 0\) in the other view by the observer with the optic center \(P^{(k)}\). Then, in turn, any other point in the tangent plane to the surface at the fixation point, as parameterized by the local coordinates \(\xi\) in a coordinate frame attached to the tangent plane of the surface with \(\xi = 0\) at \(F\), is by the local linearization mapped to the image point \(x^{(\tilde{k})}\) in the reference view indexed by \(\tilde{k}\) and by a corresponding other local linearization mapped to the point \(x^{(k)}\) in the other view indexed by \(k\). (Note, however, that this 3-D illustration is only intended to be schematic and not a fully quantitatively accurate representation, since the projection relations from the tangent plane to the surface have here been drawn according to a perspective projection model, whereas the algebraic model that we then will use for relating receptive field responses between the respective image domains is based on local linearizations of the underlying non-linear geometric transformations. This distinction could in principle be made explicit by having different notation for the locally linearized projections vs. the true geometric projections. Here, we do, however, refrain from making that distinction in the figure, in order to not overload the presentation with additional notation.)
While we above, for simplicity, decided to choose a normal view to the tangent plane as the reference view, such an assumption is, however, not in any way necessary for applying this joint spatio-temporal covariance model to relate spatio-temporal receptive field responses under composed spatio-temporal image transformations.
If we instead decide to choose some other particular observation point \(P^{(\tilde{k})}\), with its associated spatial scaling factor \(S_x^{(\tilde{k})}\), affine transformation matrix \(A^{(\tilde{k})}\) and image velocity \(u^{(\tilde{k})}\) as the reference view, then within the algebra of locally linearized approximations of the underlying projective image transformation model between the pairwise views, we obtain that Equation (142 ) will instead assume the form (see Figure 16 for an illustration) \[\label{eq-sc-aff-vel-transf-alt-obs-model} x^{(k)} = \tilde{B}^{(k)} \, x^{(\tilde{k})} + \tilde{u}^{(k)} \, t,\tag{144}\] where
\(x^{(k)} \in {\mathbb{R}}^2\) is the locally linearized projection of the physical point on the surface pattern in the view from the observer with index \(k\) at time \(t\),
\(x^{(\tilde{k})} \in {\mathbb{R}}^2\) is the locally linearized projection of the physical point on the surface pattern in the view from the observer with index \(\tilde{k}\) at time \(t\),
\(\tilde{B}^{(k)}\) is a non-singular \(2 \times 2\) affine projection matrix for the observer with index \(k\) in relation to an observation from a reference view with index \(\tilde{k}\), and
\(\tilde{u}^{(k)} \in {\mathbb{R}}^2\) is a corresponding 2-D relative motion vector for the observer with index \(k\) in relation to an observation from a reference view with index \(\tilde{k}\).
Here, we have thus, first of all, replaced the previous local coordinates \(\xi\) in the tangent plane of the surface patch by the image coordinates \(x^{(\tilde{k})}\) for a particular observation frame, to be used as the new reference. Additionally, since the transformations from the new reference frame will no longer correspond to interpretations according to a scaled orthographic model complemented by motion, we have removed the degree of freedom for the separation of the linear approximation of the perspective transformation in terms of a separate scaling factor and a separate orthonormal projection, such that in the new frame, it should instead hold that \[\begin{aligned} \tilde{B}^{(\tilde{k})} & = I, \\ \tilde{u}^{(\tilde{k})} & = 0. \end{aligned}\] Then, for \(k = \tilde{k}\) the model (144) reduces to the mere identity \[x^{(\tilde{k})} = x^{(\tilde{k})}.\] By furthermore inserting Equation (142) for \(k = \tilde{k}\) \[x^{(\tilde{k})} = S_x^{(\tilde{k})} \, (A^{(\tilde{k})} \, \xi + u^{(\tilde{k})} \, t)\] into Equation (144), we obtain \[x^{(k)} = \tilde{B}^{(k)} \, S_x^{(\tilde{k})} \, (A^{(\tilde{k})} \, \xi + u^{(\tilde{k})} \, t) + \tilde{u}^{(k)} \, t,\] which can be expanded to \[x^{(k)} = \tilde{B}^{(k)} \, S_x^{(\tilde{k})} \, A^{(\tilde{k})} \, \xi + (\tilde{B}^{(k)} \, S_x^{(\tilde{k})} \, u^{(\tilde{k})} + \tilde{u}^{(k)}) \, t.\] By identifying the coefficients for \(\xi\) and \(t\) in this expression with those in Equation (142), we get \[\begin{aligned} S_x^{(k)} \, A^{(k)} & = \tilde{B}^{(k)} \, S_x^{(\tilde{k})} \, A^{(\tilde{k})}, \\ S_x^{(k)} \, u^{(k)} & = \tilde{B}^{(k)} \, S_x^{(\tilde{k})} \, u^{(\tilde{k})} + \tilde{u}^{(k)}. \end{aligned}\] Thus, we have that the parameters \(\tilde{B}^{(k)}\) and \(\tilde{u}^{(k)}\) of the spatio-temporal transformation model (144) relative to the particular reference view for \(k = \tilde{k}\) are related to the parameters \(S_x^{(k)}\), \(A^{(k)}\) and \(u^{(k)}\) for the spatio-temporal transformation model (142) relative to the canonical frame in the tangent plane of the surface patch according to \[\tilde{B}^{(k)} = \frac{S_x^{(k)}}{S_x^{(\tilde{k})}} \, A^{(k)} (A^{(\tilde{k})})^{-1}\] and \[\tilde{u}^{(k)} = S_x^{(k)} \left( u^{(k)} - A^{(k)} \, (A^{(\tilde{k})})^{-1} \, u^{(\tilde{k})} \right).\] In this way, we can derive the transformation parameters for the transformation (144) between multiple pairwise views, from the transformation parameters for the monocular transformation (142) from the tangent plane of the surface patch to any one of the multiple single views.
Let us next choose some other view for \(k = \bar{k}\) as the reference view, such that the spatio-temporal image transformations between the multiple pairwise views are instead of the form \[\label{eq-sc-aff-vel-transf-alt-obs-model-alt} x^{(k)} = \bar{B}^{(k)} \, x^{(\bar{k})} + \bar{u}^{(k)} \, t,\tag{145}\] where
\(x^{(k)} \in {\mathbb{R}}^2\) is the locally linearized projection of the physical point on the surface pattern in the view from the observer with index \(k\) at time \(t\),
\(x^{(\bar{k})} \in {\mathbb{R}}^2\) is the locally linearized projection of the physical point on the surface pattern in the view from the observer with index \(\bar{k}\) at time \(t\),
\(\bar{B}^{(k)}\) is a non-singular \(2 \times 2\) affine projection matrix for the observer with index \(k\) in relation to an observation from a reference view with index \(\bar{k}\) and
\(\bar{u}^{(k)} \in {\mathbb{R}}^2\) is a corresponding 2-D relative motion vector for the observer with index \(k\) in relation to an observation from a reference view with index \(\bar{k}\).
To relate the parameters \(\bar{B}^{(k)}\) and \(\bar{u}^{(k)}\) in this latter transformation model to the parameters \(\tilde{B}^{(k)}\) and \(\tilde{u}^{(k)}\) in the previous transformation model (144), let us insert the expression for \(x^{(\bar{k})}\) obtained by setting \(k = \bar{k}\) in (144) \[\label{eq-sc-aff-vel-transf-alt-obs-model-for-kbar} x^{(\bar{k})} = \tilde{B}^{(\bar{k})} \, x^{(\tilde{k})} + \tilde{u}^{(\bar{k})} \, t,\tag{146}\] into (145), which gives \[\label{eq-sc-aff-vel-transf-alt-obs-model-alt-inserted-xbar} x^{(k)} = \bar{B}^{(k)} (\tilde{B}^{(\bar{k})} \, x^{(\tilde{k})} + \tilde{u}^{(\bar{k})} \, t) + \bar{u}^{(k)} \, t,\tag{147}\] and \[\label{eq-sc-aff-vel-transf-alt-obs-model-alt-inserted-xbar-2} x^{(k)} = \bar{B}^{(k)} \, \tilde{B}^{(\bar{k})} \, x^{(\tilde{k})} + (\bar{B}^{(k)} \tilde{u}^{(\bar{k})} + \bar{u}^{(k)}) \, t.\tag{148}\] Identifying the coefficients for \(x^{(\tilde{k})}\) and \(t\) with the general expression (144) for the transformation between the views \(\tilde{k}\) and \(k\), then gives that the transformation parameters \(\bar{B}^{(k)}\) and \(\bar{u}^{(k)}\) for the corresponding transformation model based on the view \(\bar{k}\) have to be related to the parameters \(\tilde{B}^{(k)}\) and \(\tilde{u}^{(k)}\) for the reference view based on \(k = \tilde{k}\) according to \[\begin{aligned} \tilde{B}^{(k)} & = \bar{B}^{(k)} \, \tilde{B}^{(\bar{k})}, \\ \tilde{u}^{(k)} & = \bar{B}^{(k)} \, \tilde{u}^{(\bar{k})} + \bar{u}^{(k)}. \end{aligned}\]
Let us also insert the expression for \(x^{(\tilde{k})}\) obtained by setting \(k = \tilde{k}\) in (145) \[\label{eq-sc-aff-vel-transf-alt-obs-model-alt-for-ktilde} x^{(\tilde{k})} = \bar{B}^{(\tilde{k})} \, x^{(\bar{k})} + \bar{u}^{(\tilde{k})} \, t,\tag{149}\] into (144), which gives \[\label{eq-sc-aff-vel-transf-alt-obs-model-inserted-xtilde} x^{(k)} = \tilde{B}^{(k)} (\bar{B}^{(\tilde{k})} \, x^{(\bar{k})} + \bar{u}^{(\tilde{k})} \, t) + \tilde{u}^{(k)} \, t\tag{150}\] and \[\label{eq-sc-aff-vel-transf-alt-obs-model-inserted-xtilde-2} x^{(k)} = \tilde{B}^{(k)} \, \bar{B}^{(\tilde{k})} \, x^{(\bar{k})} + (\tilde{B}^{(k)} \, \bar{u}^{(\tilde{k})} + \tilde{u}^{(k)}) \, t.\tag{151}\] Identifying the coefficients for \(x^{(\bar{k})}\) and \(t\) with the general expression (145) for the transformation between the views \(\bar{k}\) and \(k\), then gives that the transformation parameters \(\tilde{B}^{(k)}\) and \(\tilde{u}^{(k)}\) for the corresponding transformation model based on the view \(\tilde{k}\) have to be related to the parameters \(\bar{B}^{(k)}\) and \(\bar{u}^{(k)}\) for the reference view based on \(k = \bar{k}\) according to \[\begin{aligned} \bar{B}^{(k)} & = \tilde{B}^{(k)} \, \bar{B}^{(\tilde{k})}, \\ \bar{u}^{(k)} & = \tilde{B}^{(k)} \, \bar{u}^{(\tilde{k})} + \tilde{u}^{(k)}. \end{aligned}\]
Furthermore, by inserting the transformation (146) from \(x^{(\tilde{k})}\) to \(x^{(\bar{k})}\) into the transformation (149) from \(x^{(\bar{k})}\) to \(x^{(\tilde{k})}\), we obtain \[x^{(\tilde{k})} = \bar{B}^{(\tilde{k})} (\tilde{B}^{(\bar{k})} \, x^{(\tilde{k})} + \tilde{u}^{(\bar{k})} \, t) + \bar{u}^{(\tilde{k})} \, t\] and \[x^{(\tilde{k})} = \bar{B}^{(\tilde{k})} \, \tilde{B}^{(\bar{k})} \, x^{(\tilde{k})} + (\bar{B}^{(\tilde{k})} \, \tilde{u}^{(\bar{k})} + \bar{u}^{(\tilde{k})}) \, t.\] Identifying the coefficients for \(x^{(\tilde{k})}\) and \(t\), then gives \[\begin{aligned} \bar{B}^{(\tilde{k})} \, \tilde{B}^{(\bar{k})} & = I, \\ \bar{B}^{(\tilde{k})} \, \tilde{u}^{(\bar{k})} + \bar{u}^{(\tilde{k})} & = 0, \end{aligned}\] which can be rewritten into the following specific consistency relations between the parameters in the mutually pairwise views based on either of the reference views \(\tilde{k}\) or \(\bar{k}\) \[\begin{aligned} \bar{B}^{(\tilde{k})} & = (\tilde{B}^{(\bar{k})})^{-1}, \\ \bar{u}^{(\tilde{k})} & = - \bar{B}^{(\tilde{k})} \, \tilde{u}^{(\bar{k})}. \end{aligned}\] Due to the linearity of all the components of the first-order approximations of these composed spatio-temporal image transformations, the algebra for modelling the receptive field responses is therefore closed under the considered family of spatio-temporal image transformations.
With regard to receptive field responses, this closedness property between any set of locally linearized pairwise views of the same, possibly moving, surface patch will, in turn, imply that we can model the spatio-temporal responses computed at matching points in space-time between different pairwise views of the same local surface patch, using a joint covariance property under a corresponding class of composed spatio-temporal image transformations, as will be developed more explicitly in the next section.
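The change of reference view between the models (144) and (145) can be illustrated with a small numerical sketch. The helper `change_reference` and the parameter values for the three views are hypothetical and serve only to check the algebraic relations; the code solves the relations \(\tilde{B}^{(k)} = \bar{B}^{(k)} \, \tilde{B}^{(\bar{k})}\) and \(\tilde{u}^{(k)} = \bar{B}^{(k)} \, \tilde{u}^{(\bar{k})} + \bar{u}^{(k)}\) for the barred parameters.

```python
import numpy as np

def change_reference(B_tilde, u_tilde, k_bar):
    """Given pairwise parameters (B_tilde[k], u_tilde[k]) relative to a
    reference view k_tilde, return the parameters (B_bar[k], u_bar[k])
    relative to the new reference view k_bar, by solving
        B_tilde[k] = B_bar[k] B_tilde[k_bar],
        u_tilde[k] = B_bar[k] u_tilde[k_bar] + u_bar[k]."""
    B_ref_inv = np.linalg.inv(B_tilde[k_bar])
    B_bar, u_bar = {}, {}
    for k in B_tilde:
        B_bar[k] = B_tilde[k] @ B_ref_inv
        u_bar[k] = u_tilde[k] - B_bar[k] @ u_tilde[k_bar]
    return B_bar, u_bar

# Hypothetical views: 0 plays the role of k_tilde, 1 the role of k_bar
B_tilde = {0: np.eye(2),
           1: np.array([[1.2, 0.3], [-0.1, 0.9]]),
           2: np.array([[0.7, 0.0], [0.4, 1.1]])}
u_tilde = {0: np.zeros(2),
           1: np.array([0.3, -0.2]),
           2: np.array([-0.5, 0.1])}

B_bar, u_bar = change_reference(B_tilde, u_tilde, k_bar=1)

# The new reference view maps to itself
assert np.allclose(B_bar[1], np.eye(2)) and np.allclose(u_bar[1], 0.0)
# Consistency relations between the two reference views
assert np.allclose(B_bar[0], np.linalg.inv(B_tilde[1]))
assert np.allclose(u_bar[0], -B_bar[0] @ u_tilde[1])

# Both parameterizations predict the same image point in view 2
x_ref, t = np.array([0.4, -0.6]), 1.5        # point in view 0 at time t
x_bar = B_tilde[1] @ x_ref + u_tilde[1] * t  # the same point in view 1
x_via_tilde = B_tilde[2] @ x_ref + u_tilde[2] * t
x_via_bar = B_bar[2] @ x_bar + u_bar[2] * t
```

That the two predictions agree for an arbitrary third view is a direct numerical expression of the closedness property stated above.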
Figure 17: Commutative diagram for the joint spatio-temporal smoothing component (154) in the joint spatio-temporal receptive field model under the composition of a spatial affine transformation, a Galilean transformation and a temporal scaling transformation according to (152) and (143), for relating the spatio-temporal receptive field responses between pairwise views of a local surface patch. This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input video sequence or video stream \(f(x, t)\) is first subject to the composed transformation \(x' = \tilde{B} \, x + \tilde{u} \, t\) and \(t' = S_t \, t\) and then smoothed with a spatio-temporal kernel \(T(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}')\), or instead directly convolved with the spatio-temporal smoothing kernel \(T(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v})\) and then subject to the same joint spatio-temporal transformation, we do then get the same result, provided that the parameters of the spatio-temporal smoothing kernels are related according to \(\tilde{\Sigma}' = \tilde{B} \, \tilde{\Sigma} \, \tilde{B}^{T}\), \(\tilde{\tau}' = S_t^2 \, \tilde{\tau}\) and \(\tilde{v}' = \frac{1}{S_t} (\tilde{B} \, \tilde{v} + \tilde{u})\).
In this section, we will geometrically derive and analyze the transformation properties of the spatio-temporal receptive field responses that arise if we, instead of (i) choosing a virtual normal view with its associated affine transformation matrix \(A\) in relation to the coordinates \(\xi\) in the tangent plane of the surface as the main reference for expressing the composed geometric transformation, according to the treatments in Sections 5.1 and 6.1, (ii) choose a particular observation view with its associated affine transformation matrix \(\tilde{B}\) in relation to the actual image coordinates \(x\) as the reference, for expressing the transformation properties between the spatio-temporal receptive field responses computed from different views, according to the treatment in Section 6.2.
In this way, we will obtain more explicit transformation properties between the spatio-temporal receptive field responses computed from different pairwise views of the same surface patch, compared to the previous treatments of joint covariance properties in Sections 5.2–5.5, thereby integrating the results from the geometric analysis in Section 6 with the transformation properties of the spatio-temporal receptive fields according to Section 5.
In Section 7.2, we derive such explicit transformation properties for the underlying spatio-temporal smoothing transformation. In Section 7.3, we then derive corresponding explicit transformation properties for regular (not scale-normalized) spatio-temporal derivatives, as well as algebraically much simpler forms of transformation properties in terms of scale-normalized derivatives. The latter results clearly demonstrate the advantage of using scale-normalized derivatives, as described in Section 3, as opposed to regular spatio-temporal derivatives, since the scale-normalized spatio-temporal derivatives become essentially equal (up to a possibly unknown rotation transformation over the elements in the either vector-valued or matrix-valued image features in the case of affine-extended scale-normalized derivatives) under the composed geometric image transformation, provided that the parameters of the spatio-temporal receptive fields can be properly matched in relation to the parameters of the composed geometric image transformation.
By comparing the joint transformation between pairwise views according to (144), rewritten to the form \[\label{eq-transf-pairwise-views} x' = \tilde{B} \, x + \tilde{u} \, t,\tag{152}\] with the joint transformation property from the tangent plane of the surface according to (142), rewritten to the form (107) \[\label{eq-transf-single-view} x' = S_x \, (A \, x + u \, t),\tag{153}\] we can see that these transformations merely correspond to different parameterizations of the same underlying algebraic structure, with the parameters in the two different domains related according to \[\begin{aligned} \tilde{B} & = S_x \, A, \\ \tilde{u} & = S_x \, u. \end{aligned}\] Therefore, corresponding joint covariance properties for spatio-temporal receptive fields can be stated for the locally linearized transformations between pairwise views according to (152) as for the locally linearized transformation from the tangent plane of the surface to the image plane according to (153).
For clarity of presentation, we will in the following describe these joint covariance properties for the spatio-temporal smoothing operation and the spatio-temporal derivatives explicitly. Since this involves removing the degree of freedom corresponding to the parameter \(S_x\) in the treatment in Section 5.2, we will start by also removing the degree of freedom corresponding to the spatial scale parameter \(s\) in the model (1 ) for the purely spatio-temporal smoothing operation of the spatio-temporal receptive fields.
Figure 18: Commutative diagram for spatio-temporal derivative operators underlying the joint spatio-temporal receptive field model under the composition of a spatial affine transformation, a Galilean transformation and a temporal scaling transformation according to (152) and (143), between different pairwise views of the same local surface patch. This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input video sequence or video stream \(f(x, t)\) is first subject to the composed transformation \(x' = \tilde{B} \, x + \tilde{u} \, t\) and \(t' = S_t \, t\) and then filtered with a spatio-temporal derivative kernel \((\nabla_{x'} \partial_{t'} T)(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}')\), or instead directly convolved with the spatio-temporal derivative kernel \((\nabla_x \partial_t T)(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v})\) and then subject to the same joint spatio-temporal transformation, we do then get the same result, provided that the spatial and the temporal derivative operators are transformed according to \(\nabla_{x'} = \tilde{B}^{-T} \, \nabla_{x}\) and \(\partial_{t'} = \frac{1}{S_t} (\partial_t - \tilde{u}^T \tilde{B}^{-T} \, \nabla_x)\) and that the parameters of the spatio-temporal smoothing kernels are related according to \(\tilde{\Sigma}' = \tilde{B} \, \tilde{\Sigma} \, \tilde{B}^{T}\), \(\tilde{\tau}' = S_t^2 \, \tilde{\tau}\) and \(\tilde{v}' = \frac{1}{S_t} (\tilde{B} \, \tilde{v} + \tilde{u})\). (In this commutative diagram, we have illustrated the general covariance properties of spatio-temporal derivatives in terms of the composed spatio-temporal derivative \(\nabla_x \partial_t T\). Similar covariance properties can, of course, also be obtained for other combinations of the spatial and the temporal derivative operators \(\nabla_x\) and \(\partial_t\).)
Figure 19: Commutative diagram for scale-normalized spatio-temporal derivative operators defined from the joint spatio-temporal receptive field model (12) under the composition of a spatial affine transformation, a Galilean transformation and a temporal scaling transformation according to (152) and (143), between different pairwise views of the same local surface patch. This commutative diagram, which should be read from the lower left corner to the upper right corner, means that irrespective of whether the input video sequence or video stream \(f(x, t)\) is first subject to the composed transformation \(x' = \tilde{B} \, x + \tilde{u} \, t\) and \(t' = S_t \, t\) and then filtered with a scale-normalized spatio-temporal derivative kernel \((\nabla_{x',\mathrm{affnorm}} \partial_{t',\mathrm{norm}} T)(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}')\), or instead directly convolved with the scale-normalized spatio-temporal derivative kernel \((\nabla_{x,\mathrm{affnorm}} \partial_{t,\mathrm{norm}} T)(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v})\) and then subject to the same joint spatio-temporal transformation, we do then, up to a possibly unknown rotation transformation, get the same result, provided that the parameters of the spatio-temporal smoothing kernels are related according to \(\tilde{\Sigma}' = \tilde{B} \, \tilde{\Sigma} \, \tilde{B}^{T}\), \(\tilde{\tau}' = S_t^2 \, \tilde{\tau}\) and \(\tilde{v}' = \frac{1}{S_t} (\tilde{B} \, \tilde{v} + \tilde{u})\). Note, in particular, the conceptual simplification in relation to the corresponding commutative diagram based on regular partial derivatives that have not been subject to scale normalization or velocity adaptation regarding the temporal derivatives, in that the scale-normalized spatio-temporal derivatives in this commutative diagram are essentially equal, up to a possibly unknown rotation transformation. (In this commutative diagram, we have illustrated the general covariance properties of spatio-temporal derivatives for the particular choice of the composed spatio-temporal derivative operator \(\nabla_{x,\mathrm{affnorm}} \partial_{t,\mathrm{norm}} T\) in the spatio-temporal receptive field model (154). Similar covariance properties can, of course, also be obtained for other combinations of the spatial and the temporal derivative operators \(\nabla_{x,\mathrm{affnorm}}\) and \(\partial_{t,\mathrm{norm}}\) for which corresponding covariance properties hold.)
If we merge the degrees of freedom in the spatial scale parameter \(s \in {\mathbb{R}}_+\) and the spatial covariance matrix \(\Sigma \in {\mathbb{S}}_+^2\) in the purely spatio-temporal smoothing component of the receptive fields according to (1 ) into the joint parameter \[\tilde{\Sigma} = s^2 \Sigma,\] then we can express the purely spatio-temporal smoothing component of the receptive fields according to \[\label{eq-spat-temp-RF-model-mod} T(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}) = \tilde{g}(x - \tilde{v} \, t;\; \tilde{\Sigma}) \, h(t;\; \tilde{\tau}),\tag{154}\] where we also redefine the 2-D affine Gaussian kernel (2 ) according to \[\label{eq-gauss-fcn-2D-mod} \tilde{g}(x;\; \tilde{\Sigma}) = \frac{1}{2 \pi \sqrt{\det \tilde{\Sigma}}} \, e^{-x^T \tilde{\Sigma}^{-1} x/2}.\tag{155}\] If we correspondingly to (109 ) consider two video sequences or video streams \(f'(x', t')\) and \(f(x, t)\), that are related according to (152 ) and (143 ) such that \[f'(x', t') = f(x, t),\] and correspondingly to (110 ) and (111 ) define spatio-temporal scale-space representations of these video sequences or video streams according to \[\begin{align} \begin{aligned} & L(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}) = \end{aligned}\nonumber\\ \begin{align} \tag{156} & = \int_{\xi \in {\mathbb{R}}^2} \int_{\eta \in {\mathbb{R}}} T(\xi, \eta;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}) \, f(x - \xi, t - \eta) \, d\xi \, d\eta, \end{align}\\ \begin{align} & L'(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}') = \\ \end{align}\nonumber\\ \begin{aligned} & = \int_{\xi' \in {\mathbb{R}}^2} \int_{\eta' \in {\mathbb{R}}} T(\xi', \eta';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}') \times \end{aligned}\nonumber\\ \begin{align} \tag{157} & \phantom{= \int_{\xi' \in {\mathbb{R}}^2} \int_{\eta' \in {\mathbb{R}}}} \quad f'(x' - \xi', t' - \eta') \, d\xi' \, d\eta', \end{align} \end{align}\] it then follows, from similar calculations as lead to the transformation properties 
(121 )–(125 ), that the spatio-temporal scale-space representations \(L(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v})\) and \(L'(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}')\) of the video sequences or video streams \(f(x, t)\) and \(f'(x', t')\) are related according to \[\label{eq-joint-cov-prop-result-of-proof-mod} L'(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}') = L(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}),\tag{158}\] provided that the parameters of the receptive fields transform according to \[\begin{align} \tilde{\Sigma}' & = \tilde{B} \, \tilde{\Sigma} \, \tilde{B}^{T}, \tag{159}\\ \tilde{\tau}' & = S_t^2 \, \tilde{\tau}, \tag{160}\\ \tilde{v}' & = \frac{1}{S_t} (\tilde{B} \, \tilde{v} + \tilde{u}). \tag{161} \end{align}\] This follows from similar calculations as done in Section 5.2, by replacing the previous affine transformation matrix \(A\) by the new affine transformation matrix \(\tilde{B}\), while simultaneously replacing the spatial scaling factor \(S_x\) by \(1\); see Figure 17 for an illustration in terms of a commutative diagram.
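The parameter matching in (159 )–(161 ) can be illustrated by a minimal numerical sketch. It assumes that the composed transformation has the form \(x' = \tilde{B} \, x + \tilde{u} \, t\), \(t' = S_t \, t\), which is what the velocity transformation (161 ) suggests; the function names and all numerical values are hypothetical and chosen only for illustration. The sketch checks that the velocity-adapted spatial argument transforms as \(x' - \tilde{v}' t' = \tilde{B} (x - \tilde{v} \, t)\), so that the affine Gaussian kernel (155 ) with matched parameters agrees between the two domains up to the Jacobian factor \(|\det \tilde{B}|\), which is absorbed by the change of variables in the convolution integral.

```python
import numpy as np

def transform_rf_parameters(Sigma, tau, v, B, S_t, u):
    """Map (Sigma, tau, v) to (Sigma', tau', v') following Eqs. (159)-(161)."""
    Sigma_p = B @ Sigma @ B.T
    tau_p = S_t**2 * tau
    v_p = (B @ v + u) / S_t
    return Sigma_p, tau_p, v_p

def affine_gaussian(x, Sigma):
    """2-D affine Gaussian kernel of the form in Eq. (155)."""
    quad = x @ np.linalg.solve(Sigma, x)
    return np.exp(-quad / 2.0) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Illustrative (hypothetical) receptive field and transformation parameters.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
tau, v = 1.5, np.array([0.3, -0.2])
B = np.array([[1.2, 0.4], [-0.3, 0.9]])
S_t, u = 2.0, np.array([0.5, 0.1])

Sigma_p, tau_p, v_p = transform_rf_parameters(Sigma, tau, v, B, S_t, u)

# A space-time point and its image under the assumed transformation
# x' = B x + u t, t' = S_t t.
x, t = np.array([0.7, -1.1]), 0.8
x_p, t_p = B @ x + u * t, S_t * t

# Spatial kernel at the velocity-compensated positions in both domains.
# The factor |det B| compensates for the change of the area element.
lhs = affine_gaussian(x_p - v_p * t_p, Sigma_p) * abs(np.linalg.det(B))
rhs = affine_gaussian(x - v * t, Sigma)
print(lhs, rhs)
```

The temporal kernel \(h(t;\; \tilde{\tau})\) is left out of the sketch, since its covariance under temporal scaling depends on the specific choice of temporal smoothing kernel.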
By similarly replacing the affine transformation matrix \(A\) by the affine transformation matrix \(\tilde{B}\), while simultaneously replacing the spatial scaling factor \(S_x\) by \(1\) in the transformation properties (128 )–(129 ) and (130 )–(131 ) of the spatial and temporal derivative operators under the composed spatio-temporal transformation defined by (107 ) and (108 ), we obtain that the spatial and the temporal derivative operators in the two domains under the composed spatio-temporal transformation defined by (152 ) and (143 ) are related according to \[\begin{align} \nabla_{x} & = \tilde{B}^T \, \nabla_{x'}, \tag{162}\\ \partial_{t} & = \tilde{u}^T \, \nabla_{x'} + S_t \, \partial_{t'}, \tag{163} \end{align}\] and \[\begin{align} \nabla_{x'} & = \tilde{B}^{-T} \, \nabla_{x}, \tag{164}\\ \partial_{t'} & = - \frac{1}{S_t} \, \tilde{u}^T \tilde{B}^{-T} \, \nabla_x + \frac{1}{S_t} \, \partial_t, \tag{165} \end{align}\] see Figure 18 for a commutative diagram that illustrates these joint covariance properties.
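The forward relations (162 ) and (163 ) between the derivative operators can be checked numerically with central differences. The following sketch again assumes the transformation \(x' = \tilde{B} \, x + \tilde{u} \, t\), \(t' = S_t \, t\); the test signal `f_prime` and all numerical values are hypothetical. The last two error terms check the inverse relations obtained by solving (162 ) and (163 ) for \(\nabla_{x'}\) and \(\partial_{t'}\).

```python
import numpy as np

# Illustrative (hypothetical) transformation parameters.
B = np.array([[1.2, 0.4], [-0.3, 0.9]])
u = np.array([0.5, 0.1])
S_t = 2.0

def f_prime(xp, tp):
    """Arbitrary smooth test signal in the transformed domain."""
    return np.sin(1.3 * xp[0] - 0.7 * xp[1] + 0.4 * tp) + xp[0] * xp[1] * tp

def f(x, t):
    """Same signal in the original domain: f(x, t) = f'(B x + u t, S_t t)."""
    return f_prime(B @ x + u * t, S_t * t)

def grad_and_dt(func, x, t, h=1e-5):
    """Central-difference spatial gradient and temporal derivative."""
    grad = np.array([
        (func(x + h * e, t) - func(x - h * e, t)) / (2 * h)
        for e in np.eye(2)
    ])
    dt = (func(x, t + h) - func(x, t - h)) / (2 * h)
    return grad, dt

x, t = np.array([0.7, -1.1]), 0.8
xp, tp = B @ x + u * t, S_t * t

grad_x, d_t = grad_and_dt(f, x, t)
grad_xp, d_tp = grad_and_dt(f_prime, xp, tp)

# Forward relations: grad_x = B^T grad_x', d_t = u^T grad_x' + S_t d_t'.
err_grad = np.max(np.abs(grad_x - B.T @ grad_xp))
err_dt = abs(d_t - (u @ grad_xp + S_t * d_tp))

# Inverse relations obtained by solving the forward ones.
Binv_T_grad_x = np.linalg.solve(B.T, grad_x)
err_grad_inv = np.max(np.abs(grad_xp - Binv_T_grad_x))
err_dt_inv = abs(d_tp - (d_t - u @ Binv_T_grad_x) / S_t)

print(err_grad, err_dt, err_grad_inv, err_dt_inv)
```

All four discrepancies should be at the level of the finite-difference error.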
In analogy with the previous treatment of the transformation properties of scale-normalized derivatives in Section 5.5, these transformation properties also simplify when they are expressed in terms of scale-normalized derivatives, and when the partial temporal derivative operators are replaced by velocity-adapted temporal derivatives:
If we consider the group of general\(^{11}\) affine transformation matrices \(\tilde{B}\), and define the scale-normalized affine gradient vector and the scale-normalized affine Hessian matrix according to (62 ) and (81 ), with the spatial scale parameter set to \(s = 1\) and the spatial covariance matrix \(\Sigma\) replaced by \(\tilde{\Sigma}\), then, based on the results in Section 3.6 and Section 3.8, these scale-normalized affine derivative-based entities will be equal up to rotation matrices \(\tilde{\rho}\) according to \[\begin{gather} (\nabla_{x',\scriptsizeaffnorm} L')(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}') = \\ = \tilde{\rho} \, (\nabla_{x,\scriptsizeaffnorm} L)(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}) \end{gather}\] and \[\begin{gather} ({\cal H}_{x',\scriptsizeaffnorm} L')(x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}') = \\ = \tilde{\rho} \, ({\cal H}_{x,\scriptsizeaffnorm} L)(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}) \, \tilde{\rho}^T, \end{gather}\] provided that the parameters \(\tilde{\Sigma}\), \(\tilde{\Sigma}'\), \(\tilde{\tau}\), \(\tilde{\tau}'\), \(\tilde{v}\) and \(\tilde{v}'\) of the receptive fields are matched according to Equations (159 )–(161 ).
The velocity-adapted spatio-temporal derivative operators according to (132 ), with \(v\) replaced by \(\tilde{v}\), and extended to scale-normalized derivatives with \(\tau\) replaced by \(\tilde{\tau}\), will, based on the result underlying Equation (133 ), be equal \[\begin{gather} \label{eq-equal-veladapt-ders-composed-transf-main-result-sc-norm-again} \partial_{{\bar t}',\scriptsizenorm}L' (x', t';\; \tilde{\Sigma}', \tilde{\tau}', \tilde{v}') = \\ = \partial_{{\bar t},\scriptsizenorm}L(x, t;\; \tilde{\Sigma}, \tilde{\tau}, \tilde{v}), \end{gather}\tag{166}\] provided that the parameters \(\tilde{\Sigma}\), \(\tilde{\Sigma}'\), \(\tilde{\tau}\), \(\tilde{\tau}'\), \(\tilde{v}\) and \(\tilde{v}'\) of the receptive fields are matched according to Equations (159 )–(161 ).
Figure 19 illustrates the combined effects of these covariance properties in a joint commutative diagram.
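The two relations above for the normalized derivatives can be made concrete by a minimal numerical sketch. It rests on assumptions about definitions that are given elsewhere in the paper: that the first-order scale-normalized affine gradient has the form \(\tilde{\Sigma}^{1/2} \, \nabla_x L\) with the principal square root, and that the first-order scale-normalized velocity-adapted temporal derivative has the form \(\tilde{\tau}^{1/2} (\partial_t + \tilde{v}^T \nabla_x) L\). The derivative values and all parameters are hypothetical. Under these assumptions, the matrix \(\tilde{\rho} = \tilde{\Sigma}'^{1/2} \tilde{B}^{-T} \tilde{\Sigma}^{-1/2}\) is orthogonal, and the normalized temporal derivatives are equal in the two domains.

```python
import numpy as np

def sqrtm_spd(S):
    """Principal square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(S)
    return U @ np.diag(np.sqrt(w)) @ U.T

# Illustrative (hypothetical) receptive field and transformation parameters.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
tau, v = 1.5, np.array([0.3, -0.2])
B = np.array([[1.2, 0.4], [-0.3, 0.9]])
S_t, u = 2.0, np.array([0.5, 0.1])

# Matched parameters according to Eqs. (159)-(161).
Sigma_p = B @ Sigma @ B.T
tau_p = S_t**2 * tau
v_p = (B @ v + u) / S_t

# Arbitrary derivative values of L' at a point in the transformed domain,
# and the corresponding values of L in the original domain via the
# chain-rule relations grad_x = B^T grad_x', d_t = u^T grad_x' + S_t d_t'.
grad_xp, d_tp = np.array([0.8, -0.5]), 0.3
grad_x = B.T @ grad_xp
d_t = u @ grad_xp + S_t * d_tp

# Assumed form of the scale-normalized affine gradient: Sigma^{1/2} grad.
g_norm = sqrtm_spd(Sigma) @ grad_x
g_norm_p = sqrtm_spd(Sigma_p) @ grad_xp

# Candidate matrix relating the two normalized gradients.
rho = sqrtm_spd(Sigma_p) @ np.linalg.inv(B.T) @ np.linalg.inv(sqrtm_spd(Sigma))

# Assumed form of the normalized velocity-adapted temporal derivative:
# sqrt(tau) * (d_t + v^T grad).
vt_norm = np.sqrt(tau) * (d_t + v @ grad_x)
vt_norm_p = np.sqrt(tau_p) * (d_tp + v_p @ grad_xp)

print(np.round(rho @ rho.T, 12))
print(g_norm_p, rho @ g_norm)
print(vt_norm_p, vt_norm)
```

Since \(\tilde{\rho} \, \tilde{\rho}^T = \tilde{\Sigma}'^{1/2} (\tilde{B} \, \tilde{\Sigma} \, \tilde{B}^T)^{-1} \tilde{\Sigma}'^{1/2} = I\), the orthogonality holds for any non-singular \(\tilde{B}\), not only for the values chosen here.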
In this way, consider a vision system, either biological or based on computer vision operations, that records spatial and spatio-temporal image structures, observed by viewing local surface patches in either a static or a dynamic world, in terms of receptive field responses. Then, the above geometric analysis, in combination with the previously derived joint transformation properties according to Equations (121 )–(125 ) of the underlying spatial or spatio-temporal smoothing operations in the receptive fields, and together with the corresponding explicit transformation properties of the spatial and temporal derivative operators according to Equations (128 )–(131 ), fully describes how the spatial and spatio-temporal receptive field responses can be related or matched when viewing the same physical scene from multiple views, up to a trivial and usually unknown spatial translation between the origins of the coordinate systems of the different image domains.
When complemented by temporal scaling transformations, this matching property does furthermore extend to relating or matching the receptive field responses between different views of similarly looking motion patterns or spatio-temporal events that may occur either faster or slower between different instances of the same event.
In this context it should be remarked, however, that because the spatial or spatio-temporal image transformations are modelled in terms of local linearizations only, the matches between the receptive field responses obtained according to the joint covariance property will not be fully perfect in situations where the spatial or spatio-temporal support regions of the receptive fields cover larger regions in image space or space-time than can be compactly modelled by local linearizations. Compared to not attempting to compensate for the effect of the spatial or spatio-temporal image transformations on the receptive field responses, incorporating covariance properties of the receptive field responses with respect to local linearizations of the underlying non-linear perspective or projective image transformations should, however, be expected to lead to substantial improvements. Handling the locally linear approximations of the underlying non-linear perspective or projective image transformations can in this context also be expected to be conceptually much simpler than aiming at compensating for more complex non-linear image deformation models.
With regard to observations of more complex scenes, containing multiple local image structures with different characteristics in terms of e.g. local surface geometry, it should be noted that linearized transformations of receptive field responses could also be computed regionally, over larger regions of image space than could be well modelled by a single locally linearized image transformation. Then, if regional statistics of receptive field responses are to be computed, e.g. for purposes of spatial or spatio-temporal recognition, an overall compensation of the receptive field responses with respect to gross geometric and motion effects of the entire region could also be performed, thus with the parameters of the spatio-temporal image transformation not determined by the local spatio-temporal geometry and motion, but instead determined by a coarser-scale regional geometry and motion.
With regard to the axiomatically\(^{12}\) derived model for spatio-temporal receptive fields (1 ), that we build the analysis in this treatment on, the geometric analysis that we have presented in Section 6 shows that the degrees of freedom in this spatio-temporal receptive field model (the parameters \(s\), \(\Sigma\), \(\tau\) and \(v\) in (1 )) span the degrees of freedom in the locally linearized scaled orthographic model, complemented with a Galilean motion to account for relative motions between objects in the world and the observer, as well as a temporal scaling transformation to account for spatio-temporal events that may occur either faster or slower relative to a reference view (the parameters \(S_x\), \(A\), \(S_t\) and \(u\) in (142 ) and (143 )).
The degrees of freedom in the slightly modified spatio-temporal receptive field model (154 ) (the parameters \(\tilde{\Sigma}\), \(\tilde{\tau}\) and \(\tilde{v}\)) do also span the degrees of freedom in the locally linearized projective projection model between pairwise views according to (152 ) and (143 ) (the parameters \(\tilde{B}\), \(S_t\) and \(\tilde{u}\)).
In this respect, these spatio-temporal receptive field models make it possible to perfectly capture the first-order linearized approximations of the variabilities generated by observing the surfaces of smooth objects in the world, that move in relation to the observer in a dynamic 3-D environment. This does specifically imply that with regard to modelling the first-order linearizations of receptive field responses under the perspective or projective transformations in either single-view or multi-view observations of 3-D scenes, we can isomorphically perform these operations as joint spatio-temporal image transformations in image space only. Thus, the algebra of the interaction between the receptive fields and the first-order linearized geometric transformations constitutes a sufficient\(^{13}\) algebra to handle either single-view or multi-view observations of smooth surface patches in a dynamic world.
In this way, it is not really necessary to make use of explicit models of 3-D scene geometry or 3-D object motion when operating on the spatio-temporal image data that originate from different views. Instead, it is sufficient to just make use of the composed spatio-temporal image transformations between the multiple views of the same scene, which in that way constitute a minimal type of model of the world with respect to the image-based observer’s view.
As we previously described in Section 2.3, comparisons with biological receptive fields obtained by neurophysiological recordings of neurons in the primary visual cortex (V1) have shown that the receptive fields of simple cells can be qualitatively rather well modelled by idealized receptive fields of the form (see Lindeberg ([4]) Section 4 for explicit comparisons between biological receptive fields and these idealized receptive fields) \[\begin{gather} \label{eq-spat-temp-RF-model-der-again-again} T_{{\varphi}^{m} {\bar t}^n}(x, t;\; s, \Sigma, \tau, v) = \\ = \partial_{\varphi}^{m} \, \partial_{\bar t}^n \left( g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau) \right). \end{gather}\tag{167}\] By extending this definition with the affine scale-normalized directional derivative operator \(\partial_{\varphi,\scriptsizenorm}^{m}\) according to (14 ), again in one of the eigendirections \(\varphi\) of the spatial covariance matrix \(\Sigma\), as well as complementing with scale-normalized velocity-adapted temporal derivatives \(\partial_{{\bar t},\scriptsizenorm}^n\) in the direction \(v\) according to (88 ), we can thus also express a corresponding scale-normalized model of the spatio-temporal receptive fields according to (100 ) as \[\begin{gather} \label{eq-spat-temp-RF-model-der-norm} T_{{\varphi}^{m} {\bar t}^n,\scriptsizenorm}(x, t;\; s, \Sigma, \tau, v) = \\ = \partial_{\varphi,\scriptsizenorm}^{m} \, \partial_{{\bar t},\scriptsizenorm}^n \left( g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau) \right), \end{gather}\tag{168}\] which then extends the applicability of the previous model (100 ) to provable covariance properties under compositions of spatial similarity transformations.
If we would additionally extend the interpretation of those modelling results, corresponding to spatial derivatives of orders 1 and 2, by interpreting the spatial derivative operators as components of the scale-normalized affine gradient vector according to (62 ) or as components of the scale-normalized affine Hessian matrix according to (81 ) \[\begin{aligned} T_{\nabla_x {\bar t}^n,\scriptsizenorm}(x, t;\; s, \Sigma, \tau, v) & = \nabla_{x,\scriptsizeaffnorm} \, \partial_{{\bar t},\scriptsizenorm}^n \left( g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau) \right), \\ T_{{\cal H}_x {\bar t}^n,\scriptsizenorm}(x, t;\; s, \Sigma, \tau, v) & = {\cal H}_{x,\scriptsizeaffnorm} \, \partial_{{\bar t},\scriptsizenorm}^n \left( g(x - v \, t;\; s, \Sigma) \, h(t;\; \tau) \right), \end{aligned}\] then such a model would additionally allow for provable covariance properties under arbitrary combinations of spatial affine transformations and Galilean transformations, which is of clear relevance for a biological visual agent that needs to handle the variability of image structures under natural image transformations.
Furthermore, considering that the receptive fields of simple cells in the primary visual cortex can be qualitatively very well modelled by such spatio-temporal receptive fields, these results can be taken as further support for the working hypothesis that the receptive fields in the primary visual cortex may be regarded as being very well adapted to the structure of our environment, as also previously proposed in connection with the formulation of the normative theory of visual receptive fields that underlies the definition of the idealized spatio-temporal receptive field model that we have used as a basis for this theoretical treatment, see Lindeberg ([4]) Section 6, for a condensed summary of such conceptual theoretical arguments and Lindeberg ([1], [61]) for formulations of more explicit hypotheses regarding possible affine covariance and Galilean covariance for the receptive fields in biological vision.
With respect to the inference of cues to the 3-D structure of the world, it should furthermore be noted that:
knowledge about the affine transformation \(A^{(k)}\), in the locally linearized perspective projection model (142 ), provides direct cues to the local surface orientation of the surface patch, according to the theoretical analysis in Gårding and Lindeberg ([62]) Section 5.2,
provided that the affine transformation \(A^{(k)}\) in the locally linearized perspective projection model (142 ) is normalized such that it constitutes a pure orthographic projection, then knowledge about the spatial scaling factor \(S_x^{(k)}\) provides direct cues to the depth \(Z^{(k)} = 1/S_x^{(k)}\),
knowledge about the affine transformation matrix \(\tilde{B}^{(k)}\), in the locally linearized projective transformation (144 ) between pairwise views, provides direct cues to the local surface orientation, according to the theoretical analysis in Gårding and Lindeberg ([62]) Section 6.1.
In these ways, the parameters of the joint spatio-temporal transformation models are therefore directly related to the 3-D structure of the scene, provided that appropriate matching of the positions in image space for the receptive field responses can be obtained, to compute the image deformation parameters.
Of particular importance in this context is to really adapt the shapes of the receptive fields according to the covariance property of the actual image deformation. In Lindeberg and Gårding ([18]), it was specifically shown that such shape-adaptation of the receptive fields can improve the accuracy of surface orientation estimates by typically an order of magnitude, compared to not adapting the shapes of the receptive fields to the actual image deformation, see Tables 1–4 in Lindeberg and Gårding ([18]).
For more extensive treatments of the topic of deriving cues to 3-D scene structure by combination of information from multiple views, see the monographs by Hartley and Zisserman ([59]) and Faugeras ([60]) and the references therein.
We have presented an in-depth unified theory for covariance properties and transformation properties of the spatio-temporal receptive fields according to the generalized Gaussian model for spatio-temporal receptive fields, which extends the previous work on this topic both to joint compositions of multiple types of geometric image transformations and to the basic types of spatio-temporal differentiation operators that occur in the models of the spatio-temporal receptive fields, including the extension of the covariance and transformation properties to algebraically much simpler forms in terms of scale-normalized derivatives.
After first in Section 2 giving an overview of the spatio-temporal receptive field model that we base this work on, as well as its biological relevance, we have in Section 3 described a general theoretical foundation for obtaining provable covariance properties for spatial and temporal scale derivatives at multiple spatial and temporal scales, by formulating scale-normalized spatial or temporal derivative operators over lower-dimensional spatial or temporal domains.
Specifically, we have in Sections 3.3–3.8 both formulated and analyzed a set of new notions of affine scale-normalized directional derivative operators as well as scale-normalized affine gradient and affine Hessian operators, to be applied to affine Gaussian scale-space representations, obtained by convolution with anisotropic affine Gaussian kernels, and shown that these concepts lead to provable covariance properties. For the notion of affine scale-normalized directional derivatives, the covariance properties hold with respect to two important subgroups of the group of general spatial affine transformations, while for the notions of the scale-normalized affine gradient and the scale-normalized affine Hessian matrix, the covariance properties hold over the full group of non-singular spatial affine transformations.
Then, we have in Section 4 described extensions of such transformation properties and covariance properties to higher-dimensional joint spatio-temporal receptive field models, for the four classes of single image transformations, in terms of either (i) a pure spatial scaling transformation, (ii) a pure spatial affine transformation, (iii) a pure temporal scaling transformation, or (iv) a pure Galilean transformation.
To handle more general geometric configurations, where variabilities due to different types of image transformations may occur together, we have then in Section 5 derived a set of joint covariance properties for the composition of a spatial scaling transformation, a spatial affine transformation, a Galilean transformation and a temporal scaling transformation, with explicit expressions for how the receptive field parameters should be transformed under the composed image transformation, to make it possible to perfectly match the receptive field responses under convolutions with spatio-temporal receptive fields according to the generalized Gaussian derivative model. This analysis has been performed with regard to both the underlying joint spatio-temporal smoothing transformation and both the regular and the scale-normalized spatio-temporal derivative operators, that are applied to the output of the pure smoothing transformation, to produce the receptive field responses for different combinations of spatio-temporal derivative operators.
Specifically, we have shown that when using the notion of scale-normalized spatio-temporal derivative operators, the resulting spatio-temporal derivative responses become essentially equal, up to a possibly unknown rotation transformation for the case of affine-extended scale-normalized derivatives, under the composed spatio-temporal image transformation, provided that the parameters of the spatio-temporal receptive fields can be properly matched to the actual form of the geometric image transformation.
To interpret the class of studied joint covariance properties geometrically, we have then in Section 6 performed a geometric analysis of locally linearized projections from the 3+1-D spatio-temporal world to 2+1-D spatio-temporal image domains, to interpret the studied class of composed spatio-temporal image transformations as locally linearized scaled orthographic projections of a local surface patch, from the tangent plane of the surface at the fixation point to the image planes for different viewers, and also complemented with Galilean motions to represent the possibly a priori unknown relative motions between the observed object and the observers, as well as complemented with temporal scaling transformations, to represent spatio-temporal motion patterns and events that may occur either faster or slower relative to previous observations of similarly looking motion patterns or spatio-temporal events.
In this context, we have also shown how a slight reformulation of that model can be used for modelling the locally linearized projective transformations between pairwise views of the same surface patch, including an explicit derivation of how the corresponding algebra of locally linearized spatio-temporal transformations will then be closed between different reference views, with accompanying explicit transformation properties for the parameters of those locally linearized projection models, when the reference view is changed between different visual observations.
For the modified composed spatio-temporal transformation model between pairwise views of the same local surface patch, we have also in Section 7 presented explicit expressions for the corresponding joint spatio-temporal covariance properties, regarding both the underlying spatio-temporal smoothing transformation as well as its associated spatio-temporal derivative operators that form the receptive fields.
With regard to biological interpretations of these results, we have then in Section 8 described how the degrees of freedom spanned by the free parameters in the spatio-temporal receptive field model span the same degrees of freedom as spanned by the free parameters in the locally linearized scaled orthographic projection model complemented by a local Galilean motion and a temporal scaling transformation, to account for motion patterns and spatio-temporal events that may occur either faster or slower relative to a previous observation of a similarly looking motion pattern or event.
In view of previously obtained biological modelling results, that the receptive fields of simple cells in the primary visual cortex can be qualitatively rather well modelled by idealized receptive fields according to the theoretical model of visual receptive fields used in this treatment, we have in this way obtained complementary support for a previously formulated working hypothesis that the shapes of the receptive fields found in the primary visual cortex may be regarded as very well adapted to the structure of the environment.
Finally, we have in Section 9 described how direct cues to the structure of 3-D scenes can be obtained from the parameters in the locally linearized perspective or projective image formation models according to (142 ) and (144 ).
While previous work with the generalized Gaussian derivative model for spatio-temporal receptive fields has primarily focused on using either the non-causal 1-D Gaussian kernel or the time-causal limit kernel for temporal smoothing in the spatio-temporal smoothing process, the derivations in this paper have been made under a weaker assumption of only requiring temporal scale covariance (according to (3 )) for the temporal smoothing kernels. Thus, the results presented in this article do immediately generalize to the use of other temporal smoothing kernels, provided that the kernels are covariant under temporal scaling transformations.
In relation to the presented geometric interpretations of the joint covariance properties in Section 6, the derived explicit transformation properties for receptive field responses in Section 7, defined in terms of spatio-temporal derivatives of the underlying covariant spatio-temporal smoothing kernels, do notably show how to both interpret and relate spatio-temporal receptive field responses, when viewing dynamic scenes under different composed geometric viewing conditions.
Specifically, we propose that this theoretical analysis should have direct relevance when interpreting the functional properties of biological receptive fields, both computationally and with regard to the simple cells in the primary visual cortex, whose functional properties we here model with an idealized axiomatically derived spatio-temporal receptive field model. From the viewpoint of the theory presented here, in combination with previous biological modelling results that demonstrate a very good qualitative agreement between idealized receptive field models according to this theory and neurophysiological recordings of actual biological receptive fields in the primary visual cortex of higher mammals, the shapes of these joint spatio-temporal receptive fields can be regarded as very well adapted to the structural properties of the environment.
The theoretical results derived in this treatment are more generally intended as a theoretical foundation for computer vision modules, that make use of populations of spatio-temporal receptive field responses as the first processing layers in the visual hierarchy, as well as for formulating models of biological vision and interpreting the functional properties of biological vision from a computational viewpoint, as well as with regard to constraints from the environment, that may strongly influence the formation of the receptive fields from a combination of learning and evolution mechanisms over time.
I would like to thank Jens Pedersen for valuable interactions concerning an earlier form of joint covariance property of the spatio-temporal smoothing transformation.
I would also like to thank the Editor and the Reviewer for valuable comments that improved the presentation.
The support from the Swedish Research Council (contracts 2018-03586 and 2022-02969) is gratefully acknowledged.↩︎
Except for the fact that Sections 3.9–3.10 generalize the previous treatments for scale-normalized derivatives, computed based on convolution with either non-causal 1-D temporal kernels or the time-causal limit kernel, to temporal derivatives computed based on temporal smoothing with an arbitrary scale-covariant temporal smoothing kernel.↩︎
To derive this expression, we can set \(B = A\), \(\Sigma_L = s \, \Sigma\) and \(\Sigma_R = s' \, \Sigma'\) in the transformation property of the spatial covariance matrices \(\Sigma_R = B \, \Sigma_L \, B^T\) under a spatial affine transformation of the form \(x_R = B \, x_L\) according to Equation (30) in Lindeberg and Gårding ([18]). In this paper, we do, however, overparameterize the degrees of freedom in the affine Gaussian convolution kernels, in order to later more clearly separate the degree of freedom in uniform scaling transformations from the degrees of freedom in non-isotropic affine image transformations, see also Lindeberg ([57]) for a canonical parameterization of the degrees of freedom in 2-D spatial affine image transformations, based on a singular value decomposition of the affine transformation matrix \(A\). The underlying reasons for this aim are: (i) to prepare for the degrees of freedom in uniform scaling transformations and more general non-isotropic affine transformations to be studied in Section 5, and also (ii) to prepare for possible different ways of normalizing the spatial covariance matrices \(\Sigma\) with regard to specific geometric interpretations of the imaging situation.↩︎
Concerning the notation, we throughout this paper denote the transpose of an inverse matrix as \(A^{-T} = (A^{-1})^T\).↩︎
In the expression below, as well as in the following mathematical equations in this paper, the notation “\(\times\)” means a mere multiplication, used here to separate different components in a product, or later also as a binding symbol between multiplications of multiple components in more complex expressions over multiple lines.↩︎
The tilt direction in a monocular projection model is the projection of the surface normal, at the observation point, to the image plane. The slant angle is, in turn, the angle between the surface normal and the viewing direction.↩︎
To understand the origin of the indeterminacy of this form, consider first a singular-value-type decomposition of a general square root matrix \(\Sigma^{1/2}\) of the covariance matrix \(\Sigma\) of the form \(\Sigma^{1/2} = U_{1/2} \, \Lambda^{1/2} \, V_{1/2}^T\), where \(\Lambda^{1/2}\) is a diagonal matrix with not necessarily ordered eigenvalues. With regard to the expression (61 ), this corresponds to setting \(U_{1/2} = \rho\) and \(V_{1/2}^T = U^T\). For our convention for the square root matrix \(\Sigma^{1/2}\) according to (59 ), the chosen form of the principal square root thus corresponds to choosing \(U_{1/2} = I\).↩︎
Because of physical constraints, as arising when viewing a 3-D surface patch in terms of 2-D images, we do, however, restrict the spatial scaling factors \(\sigma_1\) and \(\sigma_2\) in a decomposition of the 2-D affine transformation matrix according to \(A = {\cal R}_{\psi/2} \, {\cal R}_{\varphi/2} \, {\operatorname{diag}}(\sigma_1, \sigma_2) \, {\cal R}_{\varphi/2} \, {\cal R}_{-\psi/2}\), where \({\cal R}_{\psi/2}\) and \({\cal R}_{\varphi/2}\) are rotation matrices (see Equation (15) in Lindeberg ([58])), to be positive.↩︎
Notably, several of these transformation properties become simpler, when expressed in terms of scale-normalized derivatives according to Section 3, where (i) the scale-normalized spatial derivative operators \(\partial_{\varphi,\scriptsizenorm}^m\) and \(\nabla_{x,\scriptsizenorm}\) according to (14 ) and (15 ) will absorb the spatial scaling factor \(S_x\) in Equations (101 ) and (102 ) analogous to Equations (17 ) and (16 ), (ii) the scale-normalized affine gradient operator \(\nabla_{x,\scriptsizeaffnorm}\) according to (62 ) will absorb the affine transformation matrix \(A\) in Equation (103 ) analogous to Equation (67 ), and (iii) the scale-normalized temporal derivative operators \(\partial_{t,\scriptsizenorm}^n\) and \(\partial_{{\bar t},\scriptsizenorm}^n\) according to (88 ) and (90 ) will absorb temporal scaling factor \(S_t\) in Equations (104 ) and (105 ) analogous to Equations (89 ) and (92 ). To save space, we, however, postpone introducing these scale-normalized derivatives into the transformation properties of the composed spatio-temporal receptive fields, until also addressing the joint covariance properties of the spatio-temporal receptive fields in Section 5.5.↩︎
In addition to the above three main cases treated here, it is also possible to extend the covariance property of the affine scale-normalized directional operator in (135 ) to the spatio-temporal extension of the case with coupled eigendecompositions of the affine transformation matrix \(A\) and the spatial covariance matrix \(\Sigma\) studied in Section 3.4.3. To save space, we do, however, leave those details to the reader.↩︎
For the purpose of modelling the transformations between pairwise views, we do here not explicitly consider the special case of the similarity group, since the geometric viewing configurations that lead to such a restricted form of variability are very degenerate, in relation to the case of multiple views of the same object from different viewing directions.↩︎
For the axiomatically formulated theory of visual receptive fields, that leads to the principled model for spatio-temporal receptive fields that underlies this treatment, see Lindeberg ([19]) concerning the foundations and Lindeberg ([4]) Appendix B for a complement.↩︎
Note, however, that complementary mechanisms may be needed to handle discontinuities in depth or surface orientation, as well as for handling the effects of illumination variations. With regard to a subset of the space of variability spanned by illumination variations, it should, however, be noted that if the studied idealized receptive field model is applied over a logarithmic brightness scale, then the receptive field responses will be automatically invariant under local multiplicative illumination variations and exposure mechanisms, see Lindeberg ([2]) Section 2.3 and Lindeberg ([4]) Section 3.4.↩︎