August 05, 2024
This paper investigates tree species classification using Sentinel-2 multispectral satellite image time-series. Despite their critical importance for many applications, such maps are often unavailable, outdated, or inaccurate for large areas. The interest of using remote sensing time series to produce these maps has been highlighted in many studies. However, many methods proposed in the literature still rely on a standard classification algorithm, usually the Random Forest (RF) algorithm with vegetation indices. This study shows that the use of deep learning models can lead to a significant improvement in classification results, especially in an imbalanced context where the RF algorithm tends to predict towards the majority class. In our use case in the center of France with 10 tree species, we obtain an overall accuracy (OA) around 95% and a F1-macro score around 80% using three different benchmark deep learning architectures. In contrast, using the RF algorithm yields an OA of 93% and an F1 of 60%, indicating that the minority classes are not classified with sufficient accuracy. Therefore, the proposed framework is a strong baseline that can be easily implemented in most scenarios, even with a limited amount of reference data. Our results highlight that standard multilayer perceptron can be competitive with batch normalization and a sufficient amount of parameters. Other architectures (convolutional or attention-based) can also achieve strong results when tuned properly. Furthermore, our results show that DL models are naturally robust to imbalanced data, although similar results can be obtained using dedicated techniques.
Forest monitoring, tree species classification, deep learning, multispectral satellite time series, imbalanced data
Tree species identification is critical for many applications related to forest monitoring. For example, species information is used in combination with allometric equations for carbon estimation, since carbon growth and storage is species dependent [1]–[3]. The proportion of tree species and species combinations is also used as an indicator of biodiversity and forest resilience [4], [5]. In addition, climate change is increasingly affecting the forests, with large-scale abiotic (fire, drought, windthrows) and biotic (insects and pathogens) disturbances impacting the species composition. Hence, there is an urgent need for updated tree species maps, e.g. for accurate analysis of tree health [6] or to support reforestation decisions [7], [8]. Unfortunately, such information is not always available, accurate or updated at fine scale and for large areas. Standard ground-based inventories are time-consuming and cannot be conducted on a yearly basis. For instance, in France, the last National Forest Inventory (NFI) data provided by IGN (Institut national de l’information géographique et forestière) is outdated since it was conducted between 2007 and 2018 (BD Forêt V2, [9]).
The use of remote sensing data has been identified as a very efficient way to map tree species in a timely manner over large areas and with fine spatial resolution [10]. Sentinel-2 (S2) satellites, are increasingly used for such application [11]–[13]. Indeed, they provide multispectral images with fine spatial resolution (up to 10m) and low revisit time (\(\sim\) 5 days in Europe) [14]. This high spatio-temporal resolution can be used to monitor timely changes in vegetation cover, which is important to highlight the specific phenology of each tree species (or more generally other type of vegetation) [15]. Some studies have shown the potential interest of using additional data, such as Sentinel-1 [11], but it is not the focus of our analysis, mainly because Sentinel-1 data have much higher computational and storage costs from an operational perspective.
Standard frameworks for mapping tree species with in situ data often rely on the Random Forest (RF) algorithm [16]. For example, [10] observed in their review that the RF algorithm is one of the most used algorithms for such tasks, with other standard machine learning methods such as support vector machine (SVM). In more recent studies, the RF algorithm is still largely used, e.g., [11], [17]–[19]. The popularity of RF with remote sensing data can be explained by several advantages: it is less prone to overfitting, requires less data for training, and is faster than deep learning (DL) methods. Moreover, the RF algorithm is more easily interpretable than other algorithms, which can be interesting depending on the task at hand (see for example [6] in the case of forest health detection).
Despite its advantages, the RF algorithm is known to be affected by imbalanced data [20], which is a common problem when dealing with tree species mapping since the different species are not naturally equally distributed. Moreover, it is also known that the RF algorithm can be outperformed by deep learning approaches to classify remote sensing data at large scale [21]. In line with these points, a recent review has highlighted the shift towards deep learning models for tree species classification [22]. However, the methods reviewed focus on patch-level approaches, as in [23], where the patches are 400\(\times\)400 S2 pixels in size. In many cases, having access to such patches (in the review made in [22], the patch sizes range from 64\(\times\)64 to 500\(\times\)500 pixels) can be difficult. For instance, in our use case we had plots of various size, many of them being of size 3\(\times\)3 S2 pixels. In addition, working with patches is more computational intensive than working with single pixel time series. The results obtained in [24] have showed that pixel-level approaches can outperform patch-level approaches for tree species classification with S2 images. This is particularly important in mixed-species forests, such as those in our study area, where large patches can consist of different tree species.
Therefore, in this paper we propose to explore pixel-level deep learning strategies adapted to work with time series [25] and compare their results with those obtained with the RF algorithm, which is still used as a standard classifier in many recent studies. Our use case focuses on a relatively small dataset, \(\sim\) 4400 reference plots, which is nevertheless larger than the dataset used in many studies, such as [24]. Moreover, the study area is the Centre-Val de Loire region of France and its surroundings, which corresponds to a large area (11 S2 tiles, 110000 km\(^2\)). We also validated our results on an independent validation dataset for 4 key species (oak, pine, beech and chestnut), which was mainly used to analyze the generalization of the mapping to unseen areas and minority species. Complementary to previous studies such as [24], some key steps are analyzed in more details in our study, in particular the imbalanced data problem and the hyperparameter choice of the different algorithms. Finally, the configuration files and implemented DL models used with the open source iota2 software [26], a processing chain for the operational production of land cover maps from remotely sensed image time series, have been made available: https://framagit.org/fl.mouret/tree_species_classification_iota2.
This section present the study area, reference plots and satellite data used for our analysis.
Our study area is the Centre-Val de Loire region and its surroundings (11 S2 tiles, 110000 km\(^2\)), and is depicted in 1. The same area was analyzed in a previous study related to the detection of oak dieback, where the need for accurate mapping of tree species was identified [6]. Our study area is a large region in northern France with a temperate climate and diverse forests. The region is a plateau with some hills, drained by the Loire River and its tributaries. Soil acidity and rainfall determine the type of forest. Thus, oak forests (65% of the total [27]) with hornbeam, birch and chestnut dominate most of the region, which has drier and more acidic soils.
Our reference data were provided by the ONF (Office National des Forêts). The reference plots are pure stands, i.e. more than 75% of a single species (further work could be interesting to include plots with mixed species). The distribution of each species is provided in 1. The total number of pixels is 67k (the ratio between the classes is about the same). It can be seen that oak plots dominate the data set, with more than 70% of the plots belonging to this class (which is consistent with the fact that oak trees are predominant in our study area). The spatial distribution of the reference plots is neither systematic nor regular. 1 shows a higher concentration of reference plots in the northwest of the study area, and some areas have no reference plots. Finally, 2 provides an example of plots showing that 1) reference plots can be small and 2) different species can be found close to each other.
Species | Definition | \(\#\) Plots |
---|---|---|
Birch | Tree species of the genre Betula | 52 |
Hornbeam | Tree species of the genre Carpinus | 48 |
Chestnut | Castanea sativa Mill. | 61 |
Oak | Quercus robur L. and Quercus petraea (Matt.) Liebl. | 3219 |
Douglas Fir | Pseudotsuga menziesii (Mirb.) Franco. | 131 |
Fraxinus | Tree species of the genre Fraxinus | 39 |
Beech | Fagus sylvatica L. | 254 |
Poplars | Tree species of the genre Populus | 78 |
Pines | Tree species of the genre Pinus | 486 |
Robinia | Robinia pseudoacacia L. | 20 |
We were able to validate our mappings using independent plots of 4 key species from forest health campaign assessments (these plots were used, for example, in [6], [28] to map forest dieback). In total, we had at our disposal 1497 oak, 197 chestnut, 91 pine and 136 beech pure plots from these campaigns. Having these independent plots is interesting since our training data is 1) highly imbalanced and 2), does not cover all the massifs of the region.
As explained in the introduction of this paper, our mapping is based on S2 data only. We used the same processing chain as in [6], consequently a brief description is given below, more details are available in that reference. The S2 satellites (S2-A and S2-B) are operated by the European Space Agency for the European Union’s Copernicus Earth observation program [14]. S2 data are multispectral images, we used 10 spectral bands for our analysis to cover the visible (bands 2, 3, 4), red-edge (bands 5, 6, 7), near infrared (NIR) (bands 8 and 8a) and short-wave infrared (SWIR) (bands 11 and 12) parts of the spectrum. Each image was resampled to pixels of size \(10\times10\)m. The MAJA processing chain [29] was used to produce ortho-rectified level-2A images, with cloud and shadow masks. Finally, S2 images acquired between 2019 and 2020 were used, providing long-term canopy information (the effect of changing the temporal length of the input data is discussed in 5).
This section provides the methodological steps used to map tree species. A simplified workflow is depicted in 3. As mentionned in the introduction, the \(iota^2\) processing chain [26] was used to extract training samples and produce maps of our study area. As explained in the previous paragraph, level L2A S2 images are acquired and cloud-filtered. Then, they are interpolated on a regular time grid to produce 740 features for each pixel (10 bands over 74 dates). Since our dataset is highly imbalanced (with a large majority of oak trees), we tested different strategies to deal with this problem (see 5 for a detailed discussion on this point). Finally, a classification algorithm is trained on the reference data and used to produce a large-scale tree species map.
A linear interpolation (i.e., gap-filling) was used to have consistent time series for each S2 tile (1 interpolated data every 10 days) [6], [30], [31]. At the end of these preprocessing steps, each pixel is characterized by 740 features corresponding to 10 bands acquired over 2 years (74 dates). Other strategies for dealing with missing data and irregularly sampled time series were also tested (without improving our results) and are discussed in 5.
In order to be used efficiently by the classification algorithm (this is particularly true for the convolutional- and attention-based DL models), it is common to have standardized data with mean equal to 0 and standard deviation equal to 1. In our case, especially for the convolutional operation, it is important to preserve the temporal correlation of the S2 bands [32]. Therefore, each S2 band was standardized by taking all the acquisitions over time (and not by standardizing each acquisition independently).
Among the many classification algorithms available in the literature, the RF algorithm is widely used due to its scalability, robustness, and ability to model complex phenomena. The RF algorithm is an ensemble method based on decision trees: it constructs many decision trees and averages their predictions by majority voting. The RF algorithm uses bagging (from bootstrap aggregation) to improve the stability and reduce the variances of the classification: each tree is trained on a subset of the original dataset, and at each split, a random subset of the features is selected. As explained in the introduction of this paper, we used the RF algorithm as a benchmark method because it is one of the most popular classification approaches in remote sensing. A major advantage of the RF algorithm is its robustness to the choice of its hyperparameters. For our experiments, we used default values, i.e. 100 fully grown decision trees (our tests have confirmed that changing these hyperparameters does not significantly affect the classification results).
In our analysis, we focused on 3 different DL architectures based on different mechanisms. We used two classical architectures, which are strong baselines to classify time series [25], namely multilayer perceptron (MLP or fully connected neural network) and temporal convolutional network (TempCNN)[32]. These two models are easy to implement, and our results will show that they can work without extensive tuning, which is interesting for operational purposes. These standard architectures are depicted in 4: it consists of a succession of layers (we have fixed the number of layers to 3, as it provided the best results), each layer being composed of a transformation layer (i.e., a linear or convolutional layer), a batch normalization, and a nonlinear activation function (the rectified linear unit (ReLU) is commonly used). Since we are working with time series, note that the convolution layers are 1-dimensional convolution layers. Unlike in [25], the MLP network also uses batch normalization instead of dropout, as it was found to be more efficient and stable. We also implemented a more recent method, the Lightweight Temporal Attention Encoder (LTAE) [33], which is based on the attention mechanism and is a state-of-the-art method for land cover classification [21].
From a simplified point of view, these 3 structures are very similar: they all aim at extracting relevant features, which are used by the output linear layer to classify the time series. The main difference is related to the feature extraction mechanism (dense layers, convolutions with global max pooling or attention mechanism).
Training a deep learning network is known to be more complex than classical machine learning algorithms such as RF. In the following, we propose a simple guideline for the choice of each parameter. The values used in our experiments are reported in 2. Our results show that these default values can lead to good results. We used the ADAM optimizer, other optimizers were tested and did not change our results significantly. An adaptive learning rate strategy was used to reduce the learning rate by a factor of 2 if the validation loss did not improve after 20 epochs. Moreover, the training was stopped after 40 epochs without improvements. The batch size was set to 4096, apart for the MLP with SMOTE where it was set to 8192 (in that case, smaller values lead to very poor results, see our discussion on that point). Finally, the Cross-entropy loss was used, other losses were tested (e.g., margin loss) without improving our results.
MLP | TempCNN | LTAE | |
---|---|---|---|
Learning rate | \(1^{-4}\) | \(1^{-3}\) | \(1^{-3}\) |
Neurons / conv. filters | 1024, 512, 256 | 128, 128, 128 | 512 |
Filter size | - | 3,3,2 | - |
Heads | - | - | 6 |
Embedding size | - | - | 370 |
Learnable parameters | 1.4M | 113K | 271K |
The MLP and TempCNN can be easily tuned by setting a sufficient number of neurons/filters and an appropriate learning rate. Since the MLP tends to overfit more easily, we have found that reducing its learning rate to \(1^{-4}\) instead of \(1^{-3}\) was more stable. In particular, the use of numerous neurons (1024 or 2048) or filters (128) was found to be efficient even when the number of training samples is relatively small (see discussion in 5). For the MLP, we divided the number of neurons by 2 at each layer to reduce the size of the model and improve performances. The LTAE is slightly more complex to tune (number of heads, embedding size, final MLP), which is a disadvantage from an operational point of view. Setting the number of heads to 6 was the most effective in our case; reducing this number can lead to underfitting, and increasing it can lead to convergence problems (this may, of course, depend on the data set at hand). The dimension of query/key vector was set to 8 as in the original paper. Similarly, the embedding size was set to 370 (number of features divided by 2). As for the other models, we have found that setting the number of neurons in the MLP to a relatively high value (512 instead of 128 in the original paper) was found more efficient.
Most of the methods used to deal with class imbalance can be grouped into algorithm-level methods (i.e., the training algorithm directly takes the class imbalance into account) and data-level methods (i.e., a modification of the training dataset is done)[34]. Our analysis focuses on one method from each family. The first approach is an algorithmic-level method and consists in using class weights to penalize errors related to the underrepresented classes during training. The second strategy, which belongs to the data-level methods, consists in generating synthetic samples from the minority classes. We have used the classical algorithm Synthetic Minority Oversampling Technique (SMOTE), which has been widely used for a wide range of applications [35]. Finally, we also trained our DL models on our raw dataset without considering the class imbalance problem.
Note that other methods have been tested and a discussion of their performance is provided in 5. In particular, we tested a variant of SMOTE, namely ADASYN [36], without any improvement in our results (see the additional results provided in 4.3).
Our experimental results have been validated by stratified cross-validation (CV, 10 folds). The train/test separation is done at the plot level to avoid auto-correlation problems. The main metric used to validate our results is the F1 macro score, i.e., the average F1 score of each class, which allow a metric robust to imbalanced data. More precisely, the F1 score is the harmonic mean of precision and recall, where precision is the percentage of samples correctly labeled in class \(j\) and recall is the percentage of samples in class \(j\) that were correctly labeled. In addition, we also show the overall accuracy (OA), which is the percentage of samples correctly classified (so it is affected by class imbalance), and the balanced accuracy (BA), which is the recall averaged over each class (so it is not affected by class imbalance). These two additional metrics can be useful since their interpretation is intuitive.
The CV results (F1, OA and BA) obtained with the different tested configurations are displayed in 3. Overall, it is clear that the DL approaches largely outperform the RF algorithm, especially in terms of F1 and BA, which means that the RF algorithm struggles to correctly classify the underrepresented classes. Moreover, all DL models can achieve very close classification metrics, regardless of the strategy used to deal with unbalanced data. The MLP provides the best F1 score (0.81) without needing any strategy to deal with imbalanced data.
Model | Imb. strat. | F1 | OA | BA |
---|---|---|---|---|
RF | - | 0.59 (0.06) | 0.93 (0.01) | 0.52 (0.05) |
RF | Class weight | 0.62 (0.06) | 0.93 (0.01) | 0.54 (0.05) |
RF | SMOTE | 0.62 (0.06) | 0.93 (0.01) | 0.54 (0.05) |
MLP | - | 0.81 (0.03) | 0.96 (0.01) | 0.80 (0.04) |
MLP | Class weight | 0.81 (0.03) | 0.96 (0.01) | 0.80 (0.04) |
MLP | SMOTE | 0.80 (0.03) | 0.95 (0.01) | 0.82 (0.03) |
TempCNN | - | 0.79 (0.03) | 0.95 (0.01) | 0.79 (0.03) |
TempCNN | Class weight | 0.80 (0.02) | 0.95 (0.01) | 0.80 (0.02) |
TempCNN | SMOTE | 0.80 (0.02) | 0.95 (0.01) | 0.80 (0.02) |
LTAE | - | 0.80 (0.02) | 0.95 (0.01) | 0.77 (0.03) |
LTAE | Class weight | 0.80 (0.04) | 0.95 (0.01) | 0.81 (0.04) |
LTAE | SMOTE | 0.80 (0.04) | 0.95 (0.01) | 0.82 (0.04) |
In addition to this table, we have provided normalized confusion matrices averaged after the 10 runs for each classifier using class weights for the MLP and SMOTE for the other classifiers in 5 (similar results are obtained with the other strategies). Looking at these confusion matrices, it is clear that the RF algorithm tends to classify the minority species (except conifers, poplars and beech) as oak, which explains the poor results in terms of F1 score and BA obtained in 3. A significant improvement is observed with the DL methods, even if some minority species (such as birch, hornbeam, beech, and Fraxinus) can be partially classified as oak (between 10 and 30% confusion, depending on the species and model). In addition, DL models also tend to confuse hornbeam with beech and Robinia with Fraxinus, tree species that are more similar to each other than to oak. Finally, it can be observed that the LTAE has the best BA overall (i.e., recall is higher on average), but with a lower recall for oak, which may affect the precision for other species.
The RF, MLP, TempCNN and LTAE models were validated on our independent datasets described in 2.3. Each model was trained using the SMOTE algorithm on the dataset presented in 2.2 (very similar results were obtained with or without class weights). For each species (oak, pine, chestnut and beech) we computed the percentage of plots correctly classified by the different models (i.e., recall). For this experiment, a plot is considered correctly classified if more than 50% of its pixels are correctly classified. These results are reported in 4. Overall, MLP performs best and can retrieve an average of 75% of the plots (which is close to the BA obtained in the previous section). LTAE and TempCNN provide a lower BA with a drop of 5%, which could be a sign of overfitting, convergence problem or lack of generalization. Finally, the RF algorithm is largely outperformed with a BA of 57% (which is consistent with the results obtained with the standard CV in the previous section). In particular, almost all oak and pine plots are correctly classified.
We also perform the same analysis by considering that a plot is correctly classified when more than 20% of its pixels have been correctly detected (this is an easier task since only a few pixels need to be correctly detected). In this case, the average recall (BA) is 0.66, 0.85, 0.77, and 0.82 for RF, MLP, TempCNN, and LTAE, respectively. This significant improvement shows that at least a part of the pixels are generally correctly detected, which could be valuable for practical applications. Finally, pixel-level results were computed and are very close to the results given in 4 (BA equal to 0.58, 0.72, 0.67, and 0.69 for RF, MLP, TempCNN, and LTAE).
Model (SMOTE) | Oak | Pine | Chestnut | Beech | Mean (BA) |
---|---|---|---|---|---|
RF | 0.991 | 0.923 | 0.325 | 0.59 | 0.575 |
MLP | 0.971 | 0.978 | 0.645 | 0.419 | 0.753 |
TempCNN | 0.970 | 0.934 | 0.563 | 0.257 | 0.681 |
LTAE | 0.965 | 0.956 | 0.523 | 0.375 | 0.705 |
To complement the previous results, 5 provides additional experiments conducted to evaluate some variations in the classification workflow.
Mod. | Config. | Imb. | Data | F1 | BA |
---|---|---|---|---|---|
MLP | Same | CW | 1 year | 0.78 (0.04) | 0.78 (0.04) |
MLP | 128, 64, 32 | CW | Same | 0.80 (0.05) | 0.80 (0.04) |
MLP | 128, 64, 32 | - | Same | 0.79 (0.05) | 0.77 (0.04) |
CNN | 64, 64, 64 | - | Same | 0.76 (0.03) | 0.70 (0.04) |
LTAE | 4 heads | - | Same | 0.79 (0.03) | 0.80 (0.04) |
LTAE | 8 heads | - | Same | 0.75 (0.04) | 0.78 (0.03) |
MLP | BS 256 | - | Same | 0.80 (0.03) | 0.77 (0.03) |
MLP | BS 256 | CW | Same | 0.80 (0.03) | 0.79 (0.03) |
MLP | BS 8k | CW | Same | 0.80 (0.03) | 0.80 (0.04) |
MLP | BS 8k no BN | CW | Same | 0.73 (0.05) | 0.80 (0.04) |
MLP | No BN | - | Same | 0.79 (0.02) | 0.77 (0.02) |
LTAE | BS 2048 | - | Same | 0.79 (0.03) | 0.76 (0.04) |
LTAE | BS 2048 | SMOTE | Same | 0.80 (0.03) | 0.82 (0.04) |
CNN | BS 2048 | - | Same | 0.79 (0.03) | 0.74 (0.03) |
CNN | BS 2048 | CW | Same | 0.80 (0.02) | 0.79 (0.02) |
RF | Same | CW | Under. | 0.70 (0.03) | 0.68 (0.04) |
MLP | Same | CW | Under. | 0.69 (0.03) | 0.80 (0.04) |
CNN | Same | - | Under. | 0.72 (0.02) | 0.77 (0.03) |
MLP | BS 2048 | S5 | Same | 0.73 (0.03) | 0.82 (0.03) |
MLP | BS 4096 | S5 | Same | 0.77 (0.03) | 0.82 (0.03) |
MLP | Same | S100 | Same | 0.79 (0.03) | 0.81 (0.03) |
MLP | Same | A5 | Same | 0.78 (0.03) | 0.79 (0.03) |
MLP | Same | A100 | Same | 0.79 (0.02) | 0.81 (0.03) |
CNN | Same | S100 | Same | 0.80 (0.02) | 0.80 (0.02) |
CNN | Same | A5 | Same | 0.76 (0.03) | 0.72 (0.03) |
CNN | Same | A100 | Same | 0.80 (0.01) | 0.79 (0.02) |
Time range used for the analysis: using 1 year of S2 data instead of 2 slightly decreases the classification results (F1=0.78 instead of 0.81), we observed this with the other models.
Size of the DL models: regarding the size of the DL models, a good F1 score can be obtained with an MLP with 3 layers of 128, 64 and 32 neurons, with or without class weights. Note that without class weight, the BA is lower (0.77 instead of 0.80). Overall, we have found that smaller network tend to be more impacted by imbalanced data. In practice, we have observed that choosing a larger number (typically 1024 or 2048 neurons in the first layer) is not a problem and can lead to a slight improvement in classification, which is interesting because it can be adapted to larger datasets. We found that at least 128 filters were needed for TempCNN, with a drop in accuracy observed when using 3 layers of 64 filters. On the other hand, we observed that adding layers was not beneficial and led to a slight deterioration of our classification results. For LTAE, using too many heads can lead to lower accuracy due to convergence problems. However, using 4 heads instead of 6 results in a slightly lower BA (0.80 instead of 0.83).
Regularization: a small batch size (here, 256) can reduce the accuracy of the results. In our experiment, when using a batch size equal to 256 with an MLP, even if the F1 score is close to the optimal value, we observed a decrease in the BA (0.77 instead of 0.80). In this case, using class weights mitigates this problem (BA=0.79 instead of 0.80). For TempCNN and LTAE, this observation is more pronounced: reducing the batch size to 2048 instead of 4096 also reduces the BA without class weights (0.76 for LTAE and 0.74 for TempCNN). Using a very large batch size (e.g., 8192) was not found to affect the overall results when using batch normalization (without it, F1 score significantly drops to 0.73). In addition, with a batch size of 4096 and no batch normalization, the F1 score obtained using the MLP drops to 0.79 instead of 0.81 (and the BA drops to 0.77 instead of 0.80). The same conclusions were found with or without class weights, and also with the TempCNN architecture.
Other strategies to deal with imbalanced data: our results show that undersampling the oak plots (here, by randomly selecting 400 plots) improves the RF predictions (which are still far from the optimal scores) but worsens the F1 score obtained with DL methods, implying that overall it is better to use all available samples with an appropriate strategy to deal with unbalanced data. Regarding the oversampling strategies, SMOTE (S) and ADASYN (A) can lead to very similar results. Regarding the choice of their hyperparameters, we have found that the number of neighbors used in the SMOTE algorithm does not have much impact on the classification results, while it is important to choose a large number (e.g., 100 or more) in the ADASYN algorithm. Finally, when using MLP with SMOTE, we observed a significant reduction in the F1 score when reducing the batch size (0.73 instead of 0.80 when using a batch size of 2048).
The fact that the RF algorithm can be affected by unbalanced data has already been identified in the literature [11], [20], [37]. In [38], in a land cover classification context, the authors highlighted that RF tends to predict towards the majority class, which is consistent with our observations. In our case, we have found that using RF with SMOTE or class weight failed to handle this issue. As observed in 4.3, undersampling the oak plots improves the results of the RF algorithm, but the F1 score we obtained is still much lower than those obtained with DL methods. For example, this strategy was used in [11] to map tree species in Germany, where the authors found that mapping less common species was still challenging, confirming the results we obtained with the RF algorithm. We also tested a variant of the RF algorithm, namely the Balanced RF[37] implemented in the Python toolbox imbalanced-learn[39]. However, this variant lead to very poor results. As a key takeaway, we recommend that practitioners consider these potential issues for future work, especially since our results highlight the potential benefit of using DL methods, which we found to be able to generalize better, especially for less frequent classes. Other classification methods such as Sparse Gaussian Processes ([21]), Support Vector Machine (SVM) algorithm [40] or Tree Boosting Algorithms (XGboost) [41] have also been tested without providing competitive results. In particular, we also tested other DL architectures (e.g., ResNet, 2D convolutional neural network, recurrent neural network, etc.) without improving our classification results. This means that most of the DL architectures can achieve similar accuracy for our task.
While our results highlight the potential benefits of using DL models, some simple guidelines can be followed to achieve good accuracy. Our results tend to show that choosing a model sufficiently large (e.g., in our case at least 200k parameters), batch normalization with a sufficient batch size (e.g., 512 to 8192) is a good start. Based on our results, we recommend using an MLP with a sufficient number of neurons, batch normalization and class weights since it lead to better classification and generalization in our validation data. On the basis of these recommendations, a review of the literature revealed that there was a tendency for some studies to use DL models with too few parameters. For example, [42] implemented an MLP with only 1 hidden layer with 100 neurons for tree species classification, which led to poor results. Similarly, [43] used a single layer with 18 hidden units. In both cases, the small number of neurons could explain the poor performances obtained with DL methods.
Regarding the method used to address data imbalance, our results show that all methods generally lead to similar conclusions, when tuned properly. When using SMOTE with the MLP, we observed that the use of a very large batch size (8192) was necessary, which is probably due to the fact that MLP can easily overfit, hence they are sensitive to small batches that can be composed of synthetic noisy samples. In general, we found that choosing a sufficient model size as well as a sufficient batch size can help improve BA. Using class weight can help to be more robust overall to the choice of model size and batch size, so it may be a good place to start. Additional tests using different losses (e.g., margin or focal loss, label smoothing) or the ADASYN algorithm did not improve our results. Overall, it appears that DL methods can also learn patterns related to the minority classes if the model is sufficiently large, hence they can be robust to the imbalanced data problem. This is consistent with previous studies (conducted for other types of data) that showed that good tuning of the models can be sufficient to achieve optimal accuracy in an imbalanced data context [44].
Regarding the input data used for classification, we observed that the direct use of the raw S2 band is sufficient. Similarly to [6], we observed that using 2 years of data instead of 1 can be useful to better characterize the forests and thus improve the classification results. Using additional sensors and information, such as Sentinel-1, is an interesting perspective that could improve our results [11].
We have also tested other strategies to deal with missing data (related to clouds). Our results confirm that standard gap-filling (i.e., linear interpolation) is a strong baseline adapted to work at large scale (11 S2 tiles) and in an operational context [6], [21], [45]. Working on this point is an interesting perspective, recent studies have shown that improvement is possible by working directly with raw (irregular and unaligned) SITS [46]. However, our initial tests were not very conclusive: we tested the method developed in [46], but this lead to inaccurate classification results. This approach, tested for land cover classification, proposes an end-to-end learning framework that makes use of an attention-based interpolation to embed the raw SITS into a regular temporal grid. More tests should be done to investigate why such an approach fails in our case (imbalance data, small dataset, classes with very similar behavior, etc.). We also conducted experiments using the LTAE with raw S2 data and a cloud mask, as tested in [46]. While theoretical results obtained with standard CV appears close to those obtained with our proposed framework, our tests at large scale obtained by producing a map of the study area have shown important changes at the frontier of two S2 tiles. This highlights the fact that such framework tend to largely overfit the specific acquisitions of a given S2 tile, which confirm the results observed in [46].
Other interesting perspectives could be explored, two of which are developed below. We observed that combining the different maps produced with the different models, e.g., by taking the mode of the predictions, could filter some noisy areas in some cases. However, our first tests show that the gain in classification accuracy is not always obvious, hence further work is needed to develop an optimal ensemble learning approach. Finally, for operational mapping, it may be interesting to add more reference data to produce maps that are as accurate as possible. Using NFI plots is an obvious way of increasing the size of the training dataset, however since this database is not up-to-date adding such samples might be non-trivial. Working on mixed-species plots could be another way to add more training samples to the dataset, the main problem being to automatically separate these plots into pure class pixels.
This paper investigates tree species classification in temperate forests based on multispectral remote sensing data in an imbalanced context and over a large area (110000 km\(^2\)). Our study area is located around the Centre-Val de Loire region of France, which is dominated by oak forests (75% of our training plots), and our training dataset is relatively small (less than 5000 plots). The proposed framework uses 2 years of Sentinel-2 (S2) data as input features (a preprocessing is done to remove clouds and interpolate the different input time series on the same temporal grid).
Our results highlight that deep learning (DL) models can be used in that context with good accuracy and largely outperform classical machine learning model such as the Random Forest algorithm, which is highly biased toward the majority class in our case. We tested 3 different architecture based on different mechanisms, i.e., a fully connected network (MLP), a convolutional-based network (1D convolutional network, TempCNN) and an attention-based network (LTAE). These implementations and configuration files have been made available and can be used with the iota2 processing chain to produce land-cover or vegetation maps with remote sensing time series (https://framagit.org/fl.mouret/tree_species_classification_iota2).
Our results show that the 3 architectures provide similar results, with the MLP being more efficient and robust overall (F1 score equal to 0.81). Since the MLP is also the easiest to tune, we recommend using it as the baseline. In particular, we observed that good convergence was possible with a sufficiently large model, batch normalization and sufficient batch size (typically 4096). We recommend to use class weights since it is generally more robust to changes in the model configuration. In practice, using other strategies such as SMOTE can also work well, in this case we observed that the MLP was very sensitive to the batch size, it is recommended to use a very large batch size in this case.
This work was supported by the SYCOMORE program, with the financial support of the Région Centre-Val de Loire (France), in collaboration with the SuFoSaT project of the GRAINE ADEME program. The autors would like to thank the M. Fauvel, H. Touchais and all the TOSCA PARCELLE team for his help and expertise on the use of iota2 software. We thank ONF for sharing reference data and CNES for the use of its HPC center.
Florian Mouret and Cécile Vincent-Barbaroux are with Université of Orléans, USC INRAE 1328 / P2E laboratory (Physiology, Ecology and Environment), BP 6759, rue de Chartres 45067 Cedex 2, Orléans, France (e-mail: florian.mouret; cecile.vincent1@univ-orleans.fr).↩︎
Florian Mouret, David Morin and Milena Planells are with CESBIO, Université de Toulouse, CNES/CNRS/INRAE/IRD/UT3-Paul Sabatier, 18, Avenue Edouard Belin, 31401 Toulouse, France.↩︎