Tree species classification at the pixel-level using deep learning and multispectral time series in an imbalanced context


Abstract

This paper investigates tree species classification using Sentinel-2 multispectral satellite image time series (SITS). Despite their importance for many applications and users, such maps are often unavailable or outdated. The value of using SITS to classify tree species at large scale has been demonstrated in numerous studies. However, many methods proposed in the literature still rely on a standard machine learning algorithm, usually the Random Forest (RF) algorithm. Our analysis shows that the use of deep learning (DL) models can lead to a significant improvement in classification results, especially in an imbalanced context, where the RF algorithm tends to favor the majority class. In our case study in central France with 10 tree species, we obtained an overall accuracy (OA) of around 95% and an F1-macro score of around 80% using three different benchmark DL architectures (fully connected, convolutional and attention-based networks). In contrast, the RF algorithm yields an OA of 92% and an F1 score of 60%, indicating that the minority classes are poorly classified. Our results also show that DL models are robust to imbalanced data, although small improvements can be obtained by specifically addressing this issue. Validation on independent in-situ data shows that all models struggle to predict in areas not well covered by training data, but even in this situation, the RF algorithm is largely outperformed by the DL models on minority classes. The proposed framework can easily be implemented as a strong baseline, even with a limited amount of reference data.

Keywords: Forest monitoring; tree species classification; deep learning; multispectral satellite time series; imbalanced data; Sentinel-2

1 Introduction↩︎

Tree species identification is necessary for many applications related to forest monitoring. For example, species information is used in combination with allometric equations for carbon estimation, since carbon growth and storage are species dependent [1][3]. The proportion of tree species and species combinations is also used as an indicator of biodiversity and forest resilience [4], [5]. In addition, climate change is increasingly affecting forests, with large-scale abiotic (fire, drought, windthrow) and biotic (insects and pathogens) disturbances impacting species composition. Hence, there is an urgent need for up-to-date tree species maps, e.g., for accurate analysis of tree health [6] or to support reforestation decisions [7][9]. Unfortunately, such information is not always available, accurate or up to date at fine scale and for large areas. Standard ground-based inventories are time-consuming and cannot be conducted on a yearly basis. For instance, in France, the last open dataset provided by IGN (Institut national de l’information géographique et forestière) is outdated, since it was compiled between 2007 and 2018 (BD Forêt® V2, [10]). In addition, National Forest Inventory (NFI) plots are not freely available and do not cover forests continuously.

The use of remote sensing data has been identified as a very efficient way to map tree species in a timely manner over large areas and with fine spatial resolution [8], [11]. Sentinel-2 (S2) satellites are increasingly used for such applications [12][14]. Indeed, they provide multispectral images worldwide with fine spatial resolution (up to 10 m) and a short revisit time (\(\sim\) 5 days in Europe) [15]. This high spatio-temporal resolution can be used to monitor changes in vegetation cover over time, which is important to capture the specific phenology of each tree species (or, more generally, of other vegetation types) [16]. Some studies have shown the potential interest of using additional data, such as Sentinel-1 [12], but this is not the focus of our analysis, mainly because Sentinel-1 data have much higher computational and storage costs from an operational perspective.

Standard frameworks for mapping tree species with in situ data often rely on the Random Forest (RF) algorithm [17]. For example, [11] observed in their review that the RF algorithm is one of the most used algorithms for such tasks, along with other standard machine learning methods such as the support vector machine (SVM). In more recent studies, the RF algorithm is still largely used, e.g., [12], [18][20]. The popularity of RF with remote sensing data can be explained by several advantages: it is less prone to overfitting, requires less training data, and is faster than deep learning (DL) methods. Moreover, the RF algorithm is more easily interpretable than other algorithms, which can be valuable depending on the task at hand (see for example [6] in the case of forest health detection).

Despite its advantages, the RF algorithm is known to be affected by imbalanced data [21], which is a common problem when dealing with tree species mapping, since the different species are not naturally equally distributed. Moreover, it is also known that the RF algorithm can be outperformed by deep learning approaches for classifying remote sensing data at large scale [22]. In line with these points, a recent review has highlighted the shift towards deep learning models for tree species classification [23]. However, the methods reviewed focus on patch-level approaches, as in [24], where the patches are 400\(\times\)​400 S2 pixels in size. In many cases, obtaining ground data that match the size of very large patches can be difficult (in the review made in [23], the patch sizes range from 64\(\times\)​64 to 500\(\times\)​500 pixels). For instance, in our use case we had field plots of various sizes, many of them covering only 3\(\times\)​3 S2 pixels. In addition, working with patches is more computationally intensive than working with single-pixel time series. The results obtained in [25] showed that pixel-level approaches can outperform patch-level approaches for tree species classification with S2 images. This is particularly important in mixed-species forests, such as those in our study area, where large patches can contain different tree species.

Therefore, in this paper we explore pixel-level deep learning strategies adapted to time series [26] and compare their results with those obtained with the RF algorithm, which is still used as a standard classifier in many recent studies. Our use case focuses on a relatively small dataset, \(\sim\) 4400 reference plots, which is nevertheless larger than the datasets used in many studies, such as [25]. Moreover, the study area is the Centre-Val de Loire region of France and its surroundings, which corresponds to a large area (11 S2 tiles, 110000 km\(^2\)) with a wide variety of tree species and silvicultural practices. We also validated our results on an independent validation dataset for 4 key species (oak, pine, beech and chestnut), mainly to analyze how the mappings generalize to unseen areas and minority species. Complementary to previous studies such as [25], some key steps are analyzed in more detail in our study, in particular the imbalanced data problem and the hyperparameter choices of the different algorithms. Finally, the DL models implemented with the open-source iota2 Python library [27], a processing chain for the operational production of land cover maps from remotely sensed image time series, have been made available at https://framagit.org/fl.mouret/tree_species_classification_iota2. In this respect, the proposed framework can easily be applied to other case studies.

2 Study area and data↩︎

This section presents the study area, reference plots and satellite data used in our analysis.

2.1 Study area↩︎

Our study area is the Centre-Val de Loire region and its surroundings (11 S2 tiles, 110000 km\(^2\)), depicted in Figure 1. The same area was analyzed in a previous study related to the detection of oak dieback, where the need for accurate mapping of tree species was identified [6]. Our study area is a large region in northern France with a temperate climate and diverse forests. The region is a plateau with some hills, drained by the Loire River and its tributaries. Soil acidity and rainfall determine the type of forest. Thus, oak forests (65% of the total [28]), mixed with hornbeam, birch and chestnut, dominate most of the region, which has drier and more acidic soils.

Figure 1: Our study area is delimited in grey, the boundaries between its 11 Sentinel-2 tiles are shown in lighter grey, and administrative departments are outlined in black. The training plots are shown in blue, while the independent validation plots (4 species) are shown in orange.

2.2 Training reference data↩︎

Our reference data were provided by the ONF (Office National des Forêts). The reference plots are pure stands, i.e., more than 75% of a single species (further work could include plots with mixed species). The distribution of each species is provided in Table 1. The total number of pixels is 67k (the class proportions at the pixel level are about the same as at the plot level). It can be seen that oak plots dominate the dataset, with more than 70% of the plots belonging to this class (which is consistent with the fact that oak trees are predominant in our study area). The spatial distribution of the reference plots is neither systematic nor regular. Figure 1 shows a higher concentration of reference plots in the northwest of the study area, and some areas have no reference plots. Finally, Figure 2 provides an example of plots showing that 1) reference plots can be small and 2) different species can be found close to each other.

Table 1: Reference data used for our analysis.
Species Definition \(\#\) Plots
Birch Tree species of the genus Betula 52
Hornbeam Tree species of the genus Carpinus 48
Chestnut Castanea sativa Mill. 61
Oak Quercus robur L. and Quercus petraea (Matt.) Liebl. 3219
Douglas Fir Pseudotsuga menziesii (Mirb.) Franco 131
Fraxinus Tree species of the genus Fraxinus 39
Beech Fagus sylvatica L. 254
Poplars Tree species of the genus Populus 78
Pines Tree species of the genus Pinus 486
Robinia Robinia pseudoacacia L. 20
Figure 2: Illustration of different reference plots from our database. Each plot covers different S2 pixels (here the plots are very small, covering 6 to 8 pixels). Each dominant tree species is represented by a different color with the corresponding name.

2.3 Additional independent validation data↩︎

We were able to validate our mappings using independent plots of 4 key species from forest health assessment campaigns (these plots were used, for example, in [6], [29] to map forest dieback). In total, we had at our disposal 1497 oak, 197 chestnut, 91 pine and 136 beech pure plots from these campaigns. Having these independent plots is valuable since our training data 1) is highly imbalanced and 2) does not cover all the forest massifs of the region.

2.4 Satellite data↩︎

As explained in the introduction of this paper, our mapping is based on S2 data only, as we focus on a large area and want to propose a classification framework that can easily be reused for similar tasks. We used the same processing chain as in [6]; consequently, only a brief description is given below, and more details are available in that reference. The S2 satellites (S2-A and S2-B) are operated by the European Space Agency for the European Union’s Copernicus Earth observation program [15]. S2 data are multispectral images; we used 10 spectral bands for our analysis to cover the visible (bands 2, 3, 4), red-edge (bands 5, 6, 7), near-infrared (NIR) (bands 8 and 8a) and short-wave infrared (SWIR) (bands 11 and 12) parts of the spectrum. Each image was resampled to pixels of size \(10\times10\) m. The MAJA processing chain [30] was used to produce ortho-rectified level-2A images, with cloud and shadow masks. Finally, S2 images acquired between 2019 and 2020 were used, providing long-term canopy information (the effect of changing the temporal length of the input data is discussed in Section 5).

3 Methods↩︎

This section describes the methodological steps used to map tree species. A simplified workflow is depicted in Figure 3. The \(iota^2\) Python processing chain [27] was used to extract training samples and produce maps of our study area. As explained in the previous section, level-2A S2 images are acquired and cloud-filtered. They are then interpolated on a regular time grid to produce 740 features for each pixel (10 bands over 74 dates). Finally, a classification algorithm is trained on the features of the reference data and used to produce a large-scale tree species map. More details on each step are given below.

Figure 3: Methodological steps used to map tree species with S2 time series.

3.1 Preprocessing↩︎

A linear interpolation (i.e., gap-filling) was used to obtain consistent time series for each S2 tile (one interpolated value every 10 days) [6], [31], [32]. At the end of these preprocessing steps, each pixel is characterized by 740 features corresponding to 10 bands acquired over 2 years (74 dates). Other strategies for dealing with missing data and irregularly sampled time series were also tested (without improving our results) and are discussed in Section 5.
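As an illustration (the toy series and variable names below are ours; the actual gap-filling is performed by the iota2 chain), the 10-day linear interpolation of one pixel's band can be sketched with NumPy:

```python
import numpy as np

def gap_fill(dates, values, valid, grid):
    """Linearly interpolate one pixel's band onto a regular time grid.

    dates:  acquisition days since the start of the period (1-D array)
    values: reflectances at those dates (1-D array)
    valid:  boolean mask, True where the acquisition is cloud/shadow free
    grid:   target regular grid, e.g. one date every 10 days
    """
    # np.interp holds the first/last valid value constant outside the range
    return np.interp(grid, dates[valid], values[valid])

# Toy series with two cloudy acquisitions masked out
dates = np.array([0, 5, 12, 20, 27, 33, 41])
vals = np.array([0.10, 0.12, np.nan, 0.20, 0.22, np.nan, 0.30])
grid = np.arange(0, 50, 10)                     # 5 regular dates
series = gap_fill(dates, vals, ~np.isnan(vals), grid)
```

In the full chain this interpolation is applied to each band of each pixel, yielding the 10 bands \(\times\) 74 dates feature vector.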

To be used efficiently by the classification algorithm (this is particularly true for the convolutional- and attention-based DL models), the data are commonly standardized to zero mean and unit standard deviation. In our case, especially for the convolutional operation, it is important to preserve the temporal correlation of the S2 bands [33]. Therefore, each S2 band was standardized using all acquisitions over time (and not by standardizing each acquisition independently).
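A minimal sketch of this per-band standardization, assuming the samples are stored as an array of shape (n_pixels, n_bands, n_dates):

```python
import numpy as np

def standardize_per_band(x, eps=1e-8):
    """Standardize each band with statistics pooled over all pixels and all
    dates, so that the temporal profile of a band keeps its shape."""
    mean = x.mean(axis=(0, 2), keepdims=True)   # one mean per band
    std = x.std(axis=(0, 2), keepdims=True)     # one std per band
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.2, scale=0.05, size=(100, 10, 74))  # 10 bands, 74 dates
z = standardize_per_band(x)
```

Pooling the statistics over the time axis (rather than standardizing each date separately) is what preserves the within-band temporal correlation mentioned above.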

3.2 Classification algorithms↩︎

3.2.1 RF algorithm↩︎

Among the many classification algorithms available in the literature, the RF algorithm is widely used due to its scalability, robustness, and ability to model complex phenomena. The RF algorithm is an ensemble method based on decision trees: it constructs many decision trees and aggregates their predictions by majority voting. The RF algorithm uses bootstrap aggregation (known as bagging) to improve the stability and reduce the variance of the classification: each tree is trained on a subset of the original dataset, and at each split, a random subset of the features is selected. As explained in the introduction of this paper, we used the RF algorithm as a benchmark method because it is one of the most popular classification approaches in remote sensing. A major advantage of the RF algorithm is its robustness to the choice of its hyperparameters. For our experiments, we used default values, i.e., 100 fully grown decision trees (our tests confirmed that changing these hyperparameters does not significantly affect the classification results).
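The benchmark configuration described above matches the scikit-learn defaults; a sketch with synthetic stand-in data (our own, replacing the real 740-dimensional pixel time series) could read:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 740))          # 740 features = 10 bands x 74 dates
y = rng.integers(0, 10, size=500)        # 10 tree species labels

# Benchmark configuration: 100 fully grown trees (max_depth=None)
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            n_jobs=-1, random_state=0).fit(X, y)
proba = rf.predict_proba(X)              # per-class probabilities
```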

3.2.2 Deep learning models↩︎

In our analysis, we focused on 3 different DL architectures based on different mechanisms. We used two classical architectures, which are strong baselines for time series classification [26], namely the multilayer perceptron (MLP, or fully connected neural network) and the temporal convolutional network (TempCNN) [33]. These two models are easy to implement, and our results will show that they can work without extensive tuning, which is interesting for operational purposes. These standard architectures are depicted in Figure 4: each consists of a succession of layers (we fixed the number of layers to 3, as it provided the best results), each layer being composed of a transformation layer (i.e., a linear or convolutional layer), batch normalization, and a nonlinear activation function (the rectified linear unit (ReLU) was used). Since we are working with time series, note that the convolutional layers are one-dimensional. Unlike in [26], the MLP network also uses batch normalization instead of dropout, as it was found to be more efficient and stable. We also implemented a more recent method, the Lightweight Temporal Attention Encoder (LTAE) [34], which is based on the attention mechanism and is a state-of-the-art method for land cover classification [22].

In summary, from a simplified point of view, these 3 architectures are very similar: they all aim at extracting relevant features, which are used by the output linear layer to classify the time series. The main difference lies in the feature extraction mechanism (dense layers, convolutions with global max pooling, or attention).

a

b

c

Figure 4: The different deep learning architectures tested in our analysis. The last layer is a standard linear layer with a number of neurons equal to the number of classes to be predicted, followed by a softmax activation.
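As an illustration, the MLP variant can be sketched in PyTorch as follows (a reimplementation from the description above, not the released iota2 code; the TempCNN follows the same pattern with one-dimensional convolutional blocks and global max pooling in place of the linear layers):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Three (Linear -> BatchNorm -> ReLU) blocks followed by a linear
    output layer; the softmax is applied inside the cross-entropy loss."""
    def __init__(self, n_features=740, n_classes=10, widths=(1024, 512, 256)):
        super().__init__()
        layers, d = [], n_features
        for w in widths:
            layers += [nn.Linear(d, w), nn.BatchNorm1d(w), nn.ReLU()]
            d = w
        layers.append(nn.Linear(d, n_classes))  # logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
logits = model(torch.randn(32, 740))  # a batch of 32 pixel time series
```

With the widths of Table 2 (1024, 512, 256), this sketch has roughly 1.4M learnable parameters, consistent with the MLP size reported there.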

3.2.3 Hyperparameters used for the deep learning models↩︎

Training a deep learning network is known to be more complex than training classical machine learning algorithms such as RF. In the following, we propose simple guidelines for the choice of each parameter. The values used in our experiments are reported in Table 2. Our results show that these default parameters produced good results for each model. We used the ADAM optimizer; other optimizers were tested and did not change our results significantly. The number of epochs was fixed to 100. An adaptive learning rate strategy was used to reduce the learning rate by a factor of 2 if the validation loss did not improve after 20 epochs. Moreover, the training was stopped after 40 epochs without improvement. The batch size was set to 2048, except for the MLP with SMOTE, where it was set to 8192 (smaller values led to very poor results; see our discussion on that point). Finally, the cross-entropy loss was used; other losses were tested (e.g., margin loss) without improving our results.

Table 2: Hyperparameter values used in the different DL models.
MLP TempCNN LTAE
Learning rate \(10^{-4}\) \(10^{-3}\) \(10^{-3}\)
Neurons / conv. filters 1024, 512, 256 128, 128, 128 512
Filter size - 3,3,2 -
Heads - - 6
Embedding size - - 370
Learnable parameters 1.4M 113K 271K

The MLP and TempCNN can easily be tuned by setting a sufficient number of neurons/filters and an appropriate learning rate. Since the MLP tends to overfit more easily, we found that reducing its learning rate to \(10^{-4}\) instead of \(10^{-3}\) was more stable. In particular, the use of numerous neurons (1024 or 2048) or filters (128) was found to be efficient even when the number of training samples is relatively small (see the discussion in Section 5). For the MLP, we halved the number of neurons at each layer to reduce the size of the model and improve performance. The LTAE is slightly more complex to tune (number of heads, embedding size, final MLP). Setting the number of heads to 6 was the most effective in our case; reducing this number can lead to underfitting, and increasing it can lead to convergence problems (this may, of course, depend on the dataset at hand). The dimension of the query/key vectors was set to 8, as in the original paper. Similarly, the embedding size was set to 370 (number of features divided by 2). As for the other models, setting the number of neurons in the final MLP to a relatively high value (512 instead of 128 in the original paper) was found to be more efficient.
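The training schedule described above (ADAM, learning rate halved after 20 stale epochs, early stopping after 40, cross-entropy loss) can be sketched as follows; the toy data and the single full-batch step per epoch are our own simplifications:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 740)                 # stand-in pixel time series
y = torch.randint(0, 10, (512,))          # stand-in species labels

model = nn.Sequential(nn.Linear(740, 128), nn.BatchNorm1d(128), nn.ReLU(),
                      nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when the monitored loss stalls for 20 epochs
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5,
                                                   patience=20)
loss_fn = nn.CrossEntropyLoss()

best, stale = float("inf"), 0
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)           # one full batch per (toy) epoch
    loss.backward()
    opt.step()
    sched.step(loss.item())               # the paper monitors validation loss
    if loss.item() < best:
        best, stale = loss.item(), 0
    else:
        stale += 1
    if stale >= 40:                       # early stopping
        break
```

In the real setting, the loop iterates over mini-batches of size 2048 and the scheduler/early stopping monitor a held-out validation loss rather than the training loss.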

3.3 Methods for dealing with imbalanced data↩︎

Most of the methods used to deal with class imbalance can be grouped into algorithm-level methods (i.e., the training algorithm directly takes the class imbalance into account) and data-level methods (i.e., the training dataset is modified) [35]. Our analysis mainly focuses on one method from each family. The first approach is an algorithm-level method and consists of using class weights to penalize errors on the underrepresented classes during training. The second strategy, which belongs to the data-level methods, consists of generating synthetic samples from the minority classes. We used the classical Synthetic Minority Oversampling Technique (SMOTE), which has been applied to a wide range of problems [36]. Finally, we also trained our DL models on the raw dataset without considering the class imbalance problem.

Note that other methods were tested, and a discussion of their performance is provided in Section 5. In particular, we tested a variant of SMOTE, namely ADASYN [37], without any improvement in our results (see the additional results provided in Section 4.3).

3.4 Validation experiments and metrics↩︎

Our experimental results were validated by stratified cross-validation (CV, 10 folds). The train/test separation is done at the plot level to avoid spatial autocorrelation problems. The main metric used to validate our results is the F1-macro score, i.e., the average of the per-class F1 scores, a metric robust to imbalanced data. More precisely, the F1 score is the harmonic mean of precision and recall, where precision is the percentage of samples labeled as class \(j\) that actually belong to class \(j\), and recall is the percentage of samples of class \(j\) that were correctly labeled. In addition, we also report the overall accuracy (OA), which is the percentage of samples correctly classified (and is therefore affected by class imbalance), and the balanced accuracy (BA), which is the recall averaged over the classes (and is therefore not affected by class imbalance). These two additional metrics are useful since their interpretation is intuitive.

4 Results↩︎

4.1 Classifier comparison and main results↩︎

The CV results (F1, OA and BA) obtained with the different tested configurations are displayed in Table 3. Overall, it is clear that the DL approaches largely outperform the RF algorithm, especially in terms of F1 and BA, which means that the RF algorithm struggles to correctly classify the underrepresented classes. Moreover, all DL models achieve very similar classification metrics, regardless of the strategy used to deal with imbalanced data. The MLP provides the best F1 score (0.81) without needing any strategy to deal with imbalanced data. Finally, note that the best BA (0.83) is obtained using LTAE with SMOTE (0.82 is obtained using MLP with SMOTE), which means that the recall of the minority classes is higher on average.

Table 3: F1, OA and BA obtained for different classification configurations after 10-fold cross-validation. The 95% confidence interval is in parentheses. Best values are in bold.
Model Imb. strat. F1 OA BA
RF - 0.59 (0.06) 0.93 (0.01) 0.52 (0.05)
RF Class weight 0.62 (0.06) 0.93 (0.01) 0.54 (0.05)
RF SMOTE 0.62 (0.06) 0.93 (0.01) 0.54 (0.05)
MLP - 0.81 (0.03) 0.96 (0.01) 0.80 (0.04)
MLP Class weight 0.81 (0.03) 0.96 (0.01) 0.80 (0.04)
MLP SMOTE 0.80 (0.03) 0.95 (0.01) 0.82 (0.03)
TempCNN - 0.79 (0.03) 0.95 (0.01) 0.79 (0.03)
TempCNN Class weight 0.80 (0.02) 0.95 (0.01) 0.80 (0.02)
TempCNN SMOTE 0.80 (0.02) 0.95 (0.01) 0.79 (0.02)
LTAE - 0.80 (0.02) 0.95 (0.01) 0.77 (0.03)
LTAE Class weight 0.80 (0.04) 0.95 (0.01) 0.81 (0.04)
LTAE SMOTE 0.80 (0.04) 0.95 (0.01) 0.83 (0.04)

In addition to this table, normalized confusion matrices averaged over the 10 folds are provided for each classifier using SMOTE oversampling in Figure 5 (similar results are obtained with other strategies). Looking at these confusion matrices, it is clear that the RF algorithm tends to classify the minority species (except conifers, poplars and beech) as oak, which explains the poor F1 score and BA obtained in Table 3. A significant improvement is observed with the DL methods, even if some minority species (such as birch, hornbeam, beech, and Fraxinus) can be partially classified as oak (between 10 and 30% confusion, depending on the species and model). In addition, DL models also tend to confuse hornbeam with beech and Robinia with Fraxinus, tree species that are more similar to each other than to oak. Finally, the models with the best BA (typically LTAE and MLP with SMOTE) tend to have higher recall for the minority classes (e.g., the best recall for 5 minority species is obtained with LTAE). However, the recall for oak tends to be slightly lower in this case (0.97 instead of 0.98 or 0.99), which affects the precision for the other species, as the dataset is highly imbalanced.

a

b

c

d

Figure 5: Normalized confusion matrices averaged over the 10 CV folds, using SMOTE oversampling.

4.2 Validation on independent data↩︎

The RF, MLP, TempCNN and LTAE models were validated on the independent datasets described in Section 2.3. Each model was trained on the dataset presented in Section 2.2. For each species (oak, pine, chestnut and beech), we computed the percentage of plots correctly classified by the different models (i.e., the recall). For this experiment, a plot is considered correctly classified if more than 50% of its pixels are correctly classified (results obtained with other classification rules are discussed below). These results are reported in Table 4.
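This plot-level decision rule can be sketched as follows (hypothetical helper and toy predictions of our own; note that "more than 50%" is a strict inequality):

```python
import numpy as np

def plot_retrieval_rate(pixel_preds, plot_labels, threshold=0.5):
    """Fraction of plots retrieved: a plot counts as correctly classified
    if strictly more than `threshold` of its pixels get the true label."""
    hits = [np.mean(pred == label) > threshold
            for pred, label in zip(pixel_preds, plot_labels)]
    return float(np.mean(hits))

# Three toy plots of species 1; the second has exactly half its pixels correct
preds = [np.array([1, 1, 1]), np.array([1, 1, 0, 0]), np.array([0, 0, 1])]
rate_50 = plot_retrieval_rate(preds, [1, 1, 1], threshold=0.5)  # only plot 1
rate_20 = plot_retrieval_rate(preds, [1, 1, 1], threshold=0.2)  # all plots
```

Lowering the threshold to 0.2, as in the complementary analysis below, retrieves a plot as soon as a small fraction of its pixels is correct.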

Overall, most confusion was observed for the chestnut and beech species, which tend to be predicted as oak. LTAE with SMOTE performs best and retrieves on average 71% of the plots, with a significantly higher recall for beech compared to the other models (38% for LTAE versus 29% for MLP, the second-best model for this species). Very close results are obtained using MLP with SMOTE, which was found to be better for chestnut retrieval. Overall, it appears that using the SMOTE algorithm leads to better validation results than using class weights, indicating better generalization to areas without training data. The RF algorithm is again largely outperformed, with a BA of 57% and poor retrieval of chestnut and beech (which is consistent with the results obtained with the standard CV in the previous section).

We also performed the same analysis by considering a plot correctly classified when more than 20% of its pixels are correctly detected (an easier task, since only a few pixels need to be correctly detected). In this case, with SMOTE oversampling, the average recall (BA) is 0.66, 0.85, 0.77, and 0.82 for RF, MLP, TempCNN, and LTAE, respectively. This significant improvement shows that at least part of the pixels are generally correctly detected, which is valuable for real-world applications. Finally, pixel-level results were also computed and are very close to the results given in Table 4 (BA equal to 0.58, 0.72, 0.67, and 0.69 for RF, MLP, TempCNN, and LTAE).

Table 4: Percentage of validation plots correctly retrieved (i.e., 50% of the pixels are correctly classified).
Model Imb. strat. Oak Pine Chestnut Beech Mean (BA)
RF SMOTE 0.99 0.92 0.33 0.06 0.58
MLP Class weight 0.98 0.91 0.62 0.24 0.69
MLP SMOTE 0.97 0.95 0.60 0.29 0.70
TempCNN Class weight 0.98 0.91 0.42 0.18 0.62
TempCNN SMOTE 0.97 0.93 0.56 0.26 0.68
LTAE Class weight 0.98 0.87 0.43 0.38 0.67
LTAE SMOTE 0.96 0.96 0.52 0.38 0.71

4.3 Additional results↩︎

To complement the previous results, Table 5 reports additional experiments conducted to evaluate some variations of the classification workflow.

Table 5: F1 and BA obtained for different configurations after 10-fold CV. TempCNN is abbreviated CNN. “Same” means that the configuration is the same as in 3. “BS” stands for batch size and “BN” for batch normalization. “Imb. Strat.” is the strategy used to deal with imbalanced data, “Under.” means that the oak plots were undersampled by randomly selecting 400 training plots. For SMOTE and ADASYN, the number of neighbors is given (SMOTE k=100 means SMOTE with 100 neighbors). The 95% confidence interval is in parentheses.
Model Config. Imb. Strat. Data F1 BA
MLP Same Class weight 1 year 0.78 (0.04) 0.78 (0.04)
MLP 128, 64, 32 Class weight Same 0.80 (0.05) 0.80 (0.04)
MLP 128, 64, 32 - Same 0.79 (0.05) 0.77 (0.04)
CNN 64, 64, 64 - Same 0.76 (0.03) 0.70 (0.04)
LTAE 4 heads - Same 0.79 (0.03) 0.80 (0.04)
LTAE 8 heads - Same 0.75 (0.04) 0.78 (0.03)
MLP BS 256 - Same 0.80 (0.03) 0.77 (0.03)
MLP BS 256 Class weight Same 0.80 (0.03) 0.79 (0.03)
MLP BS 8k Class weight Same 0.80 (0.03) 0.80 (0.04)
MLP BS 8k no BN Class weight Same 0.73 (0.05) 0.80 (0.04)
MLP No BN - Same 0.79 (0.02) 0.77 (0.02)
RF Same Class weight Under. 0.70 (0.03) 0.68 (0.04)
MLP Same Class weight Under. 0.69 (0.03) 0.80 (0.04)
CNN Same - Under. 0.72 (0.02) 0.77 (0.03)
MLP BS 2048 SMOTE, k=5 Same 0.73 (0.03) 0.82 (0.03)
MLP BS 4096 SMOTE, k=5 Same 0.77 (0.03) 0.82 (0.03)
MLP Same SMOTE, k=100 Same 0.79 (0.03) 0.81 (0.03)
MLP Same ADASYN, k=5 Same 0.78 (0.03) 0.79 (0.03)
MLP Same ADASYN, k=100 Same 0.79 (0.02) 0.81 (0.03)
CNN Same SMOTE, k=100 Same 0.80 (0.02) 0.80 (0.02)
CNN Same ADASYN, k=5 Same 0.76 (0.03) 0.72 (0.03)
CNN Same ADASYN, k=100 Same 0.80 (0.01) 0.79 (0.02)
  • Time range used for the analysis: using 1 year of S2 data instead of 2 slightly decreases the classification results (F1 = 0.78 instead of 0.81); we observed the same behavior with the other models.

  • Size of the DL models: a good F1 score can be obtained with an MLP with 3 layers of 128, 64 and 32 neurons, with or without class weights. Note that without class weights, the BA is lower (0.77 instead of 0.80). Overall, we found that smaller networks tend to be more impacted by imbalanced data. In practice, we observed that choosing a larger number of neurons (typically 1024 or 2048 in the first layer) is not a problem and can lead to a slight improvement in classification, which is interesting because it can be adapted to larger datasets. We found that at least 128 filters were needed for TempCNN, with a drop in accuracy observed when using 3 layers of 64 filters. On the other hand, adding layers was not beneficial and led to a slight deterioration of our classification results. For LTAE, using too many heads can lead to lower accuracy due to convergence problems; however, using 4 heads instead of 6 results in a slightly lower BA (0.80 instead of 0.83).

  • Regularization: a small batch size (here, 256) can reduce the accuracy of the results. In our experiment, when using a batch size equal to 256 with an MLP, even if the F1 score is close to the optimal value, we observed a decrease in the BA (0.77 instead of 0.80). In this case, using class weights mitigates this problem (0.79 instead of 0.80). Using a very large batch size (e.g., 8192) was not found to affect the overall results when using batch normalization (without it, F1 score drops to 0.73 when using a batch of size 8192). Finally, without batch normalization, the F1 score drops to 0.79 instead of 0.81 (and the BA to 0.77 instead of 0.80). The same conclusions were found with or without class weights, and also with the TempCNN architecture.

  • Other strategies to deal with imbalanced data: our results show that undersampling the oak plots (here, by randomly selecting 400 plots) improves the RF predictions (which are still far from the best scores) but worsens the F1 score obtained with the DL methods, implying that it is generally better to use all available samples with an appropriate strategy to deal with imbalanced data. Regarding the oversampling strategies, SMOTE and ADASYN can lead to very similar results. Regarding the choice of their hyperparameters, we found that the number of neighbors used in the SMOTE algorithm does not have much impact on the classification results, while it is important to choose a large number (e.g., 100 or more) in the ADASYN algorithm. Finally, when using the MLP with SMOTE, we observed a significant reduction in the F1 score when reducing the batch size (0.73 instead of 0.80 when using a batch size of 2048).

5 Discussion and perspectives↩︎

The fact that the RF algorithm can be affected by imbalanced data has already been identified in the literature [12], [21], [38]. In [39], in a land cover classification context, the authors highlighted that RF tends to favor the majority class, which is consistent with our observations. In our case, we found that using RF with SMOTE or class weights failed to handle this issue. As observed in Section 4.3, undersampling the oak plots improves the results of the RF algorithm, but the F1 score obtained is still much lower than those obtained with DL methods. For example, this strategy was used in [12] to map tree species in Germany, where the authors found that mapping less common species was still challenging, confirming the results we obtained with the RF algorithm. We also tested a variant of the RF algorithm, namely the Balanced RF [38] implemented in the Python toolbox imbalanced-learn [40]; however, this variant led to very poor results. As a key takeaway, we recommend that practitioners consider these potential issues in future work, especially since our results highlight the potential benefit of using DL methods, which we found to generalize better, especially for less frequent classes. Other classification methods, such as sparse Gaussian processes [22], the support vector machine (SVM) [41] and tree boosting (XGBoost) [42], were also tested without providing competitive results. We also tested other DL architectures (e.g., ResNet, 2D convolutional neural networks, recurrent neural networks) without improving our classification results, which suggests that most DL architectures can achieve similar accuracy on our task.

Our results highlight the potential benefits of using DL models, and some simple guidelines can be followed to achieve good accuracy. Specifically, choosing a sufficiently large model (in our case, at least 200k parameters) and applying batch normalization and class weights is a good start. Furthermore, an MLP with these specifications provides a strong baseline model that requires minimal hyperparameter tuning. In light of these recommendations, a review of the literature revealed a tendency in some studies to use DL models with too few parameters. For example, [43] implemented an MLP with only 1 hidden layer of 100 neurons for tree species classification, which led to poor results. Similarly, [44] used a single layer with 18 hidden units. In both cases, the small number of neurons could explain the poor performance obtained with DL methods.
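As a hedged sketch of these guidelines, a PyTorch MLP with batch normalization and inverse-frequency class weights could look as follows. The layer sizes, input dimension and class counts are illustrative assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn

def make_mlp(n_features=220, n_classes=10, hidden=(512, 256, 128)):
    """MLP with batch normalization; default sizes give ~280k parameters."""
    layers, d = [], n_features
    for h in hidden:
        layers += [nn.Linear(d, h), nn.BatchNorm1d(h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

model = make_mlp()
n_params = sum(p.numel() for p in model.parameters())  # > 200k, as recommended

# Inverse-frequency class weights passed to the loss to counter imbalance
# (the per-class plot counts below are illustrative, oak being dominant).
counts = torch.tensor([3500., 150., 80.] + [50.] * 7)
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)
loss = criterion(model(torch.randn(8, 220)), torch.randint(0, 10, (8,)))
```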

Regarding the method used to address data imbalance, slight improvements and better generalization can be achieved by oversampling with the SMOTE algorithm, but in this case we found that choosing an appropriate batch size was important. Indeed, we observed that if the SMOTE algorithm is used with too small a batch size, poor results can be obtained. This observation is especially true for the MLP, probably because MLPs can easily overfit and are therefore sensitive to small batches that may be dominated by synthetic, noisy samples. Finally, the LTAE architecture gave the best performance when used with the SMOTE algorithm, although it required more hyperparameter tuning. Additional tests using different losses (e.g., margin or focal loss, label smoothing) or the ADASYN algorithm did not improve our results. Overall, it appears that DL methods can also learn patterns related to the minority classes if the model is sufficiently large, and hence can be robust to the imbalanced data problem. This is consistent with previous studies (conducted on other types of data) showing that careful tuning of the models can be sufficient to achieve optimal accuracy in an imbalanced data context [45].

Regarding the input data used for classification, we observed that the direct use of the raw S2 bands is sufficient. Similarly to [6], we observed that using 2 years of data instead of 1 better characterizes the forests and thus improves the classification results. Using additional sensors and information, such as synthetic aperture radar (SAR) [12], [46] or hyperspectral data [47], is an interesting perspective that could further improve our results.

We have also tested other strategies to deal with missing data (related to clouds). Our results confirm that standard gap-filling (i.e., linear interpolation) is a strong baseline suited to working at large scale (11 S2 tiles) and in an operational context [6], [22], [48]. However, we observed that poor interpolation could impact the generalization of the results, especially for minority classes, since few training examples are available. Working on this point is therefore an interesting perspective: recent studies have shown that improvement is possible by working directly with raw (irregular and unaligned) SITS [49]. Our initial tests were not very conclusive: we tested the method developed in [49], but it led to inaccurate classification results. This approach, tested for land cover classification, proposes an end-to-end learning framework that uses attention-based interpolation to embed the raw SITS into a regular temporal grid. More tests should be done to investigate why such an approach fails in our case (imbalanced data, small dataset, classes with very similar behavior, etc.). We also conducted experiments using the LTAE with raw S2 data and a cloud mask, as tested in [49]. While the results obtained with standard CV appear close to those obtained with our proposed framework, our large-scale tests, obtained by producing a map of the study area, have shown important discontinuities at the border between two S2 tiles. This highlights that such a framework tends to overfit the specific acquisitions of a given S2 tile, which confirms the results observed in [49].
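The gap-filling baseline (linear interpolation of cloud-masked acquisitions onto a regular temporal grid) can be sketched for a single pixel and band as follows; the dates and reflectance values are illustrative, not real S2 data:

```python
import numpy as np

def gapfill(dates, values, cloudy, grid):
    """Linearly interpolate cloud-free samples onto a regular temporal grid."""
    valid = ~cloudy
    return np.interp(grid, dates[valid], values[valid])

dates = np.array([0, 10, 20, 30, 40], dtype=float)   # days since start
refl = np.array([0.12, 0.90, 0.15, 0.18, 0.20])      # 0.90 is a cloudy spike
cloudy = np.array([False, True, False, False, False])
grid = np.arange(0, 41, 5, dtype=float)              # 5-day target grid
series = gapfill(dates, refl, cloudy, grid)
```

The cloudy acquisition is discarded before interpolation, so the resampled series is free of the spike while covering the full temporal grid.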

Other interesting perspectives could be explored, three of which are developed below. First, we observed that combining the different maps produced with the different models, e.g., by taking the mode of the predictions, could filter some noisy areas in some cases. However, our first tests show that the gain in classification accuracy is not always obvious, hence further work is needed to develop an optimal ensemble learning approach. Nevertheless, we observed that the three DL architectures produced diverging predictions in certain areas, which could serve as an indicator of mapping uncertainty. Interestingly, using different initializations within the same DL architecture yielded significantly more consistent results, providing little to no basis for uncertainty measurement. Finally, for operational mapping, it may be interesting to add more reference data to produce maps that are as accurate as possible. Using accessible information on the tree species from BD Forêt® V2 [10] could be an obvious way to increase the size of the training dataset. However, since this database is not up-to-date, adding such samples might be non-trivial. Working on mixed-species plots could be another way to add more training samples to the dataset, the main difficulty being to automatically separate these plots into pure-class pixels.
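A minimal sketch of the mode-based map combination mentioned above, with toy integer labels standing in for the per-pixel predictions of the three DL models:

```python
import numpy as np

# Each row is one model's per-pixel class predictions (toy values standing
# in for the MLP / TempCNN / LTAE maps); each column is one pixel.
preds = np.array([
    [0, 1, 2, 2],   # model A
    [0, 1, 1, 2],   # model B
    [0, 2, 1, 2],   # model C
])
# Per-pixel majority vote (ties resolved toward the smallest label).
voted = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
```

Pixels where the three models disagree (here the second and third columns) are also the candidates for an uncertainty indicator.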

6 Conclusion↩︎

This paper investigates tree species classification in temperate forests based on multispectral remote sensing data, in an imbalanced context and over a large area (110,000 km\(^2\)). Our study area is located around the Centre-Val de Loire region of France, which is dominated by oak forests (75% of our training plots), and our training dataset is relatively small (fewer than 5000 plots). The proposed framework uses 2 years of Sentinel-2 (S2) data as input features (preprocessing is applied to remove clouds and interpolate the different input time series onto a common temporal grid).

Our results highlight that deep learning (DL) models can be used in this context with good accuracy and largely outperform classical machine learning models such as the Random Forest (RF) algorithm, which is highly biased toward the majority class in our case. We tested 3 different architectures based on different mechanisms, i.e., a fully connected network (MLP), a convolutional network (1D convolutional network, TempCNN) and an attention-based network (LTAE). Our implementations and configuration files have been made freely available and can be used as baselines for other use cases (and potentially other applications).

Our results show that the 3 DL architectures tested provide similar accuracy. The MLP architecture was found to be a strong baseline (F1 score of 0.81) that can be used without extensive hyperparameter tuning. While more complex to implement, the LTAE also provided good generalization capabilities and appears to be more efficient at retrieving minority classes when used with the SMOTE algorithm, an oversampling technique. Overall, we found that using the SMOTE algorithm with an appropriate batch size could slightly improve the generalization of the classification models.

Data availability↩︎

The configuration files used to train the models and map the study area with the iota2 processing chain are available here: https://framagit.org/fl.mouret/tree_species_classification_iota2. The open-source iota2 project can be used to produce land cover or vegetation maps from remotely sensed time series, more information and implementation can be found here: https://framagit.org/iota2-project/iota2.

Acknowledgments↩︎

The authors would like to thank M. Fauvel, H. Touchais and the TOSCA PARCELLE team for their help and expertise with the iota2 software. We thank the ONF for sharing reference data and CNES for the use of its HPC center.

Author contributions↩︎

Conceptualization, F.M, D.M., M.P. and C.V-B.; methodology, F.M, D.M., M.P. and C.V-B.; software, F.M.; validation, F.M, D.M., M.P. and C.V-B.; formal analysis, F.M, D.M., M.P. and C.V-B.; investigation, F.M.; resources, F.M.; data curation, F.M. and D.M; writing—original draft preparation, F.M.; writing—review and editing, F.M, D.M., M.P. and C.V-B.; visualization, F.M.; supervision, M.P. and C.V-B.; project administration, M.P. and C.V-B.; funding acquisition, M.P. and C.V-B. All authors have read and agreed to the published version of the manuscript.

Funding↩︎

This work was supported by the SYCOMORE program, with the financial support of the Région Centre-Val de Loire (France), in collaboration with the SuFoSaT project of the GRAINE ADEME program.

References↩︎

[1]
Wang, C., 2006. Biomass allometric equations for 10 co-occurring tree species in Chinese temperate forests. For. Ecol. Manag. 222, 9–16. https://www.sciencedirect.com/science/article/pii/S0378112705005888.
[2]
Návar, J., 2009. Allometric equations for tree species and carbon stocks for forests of northwestern Mexico. For. Ecol. Manag. 257, 427–434. https://dx.doi.org/10.1016/j.foreco.2008.09.028.
[3]
Henry, M., Bombelli, A., Trotta, C., Alessandrini, A., Birigazzi, L., Sola, G., Vieilledent, G., Santenoise, P., Longuetaud, F., Valentini, R., Picard, N., Saint-André, L., 2013. GlobAllomeTree: international platform for tree allometric equations to support volume, biomass and carbon assessment. iForest - Biogeosciences and Forestry 6, 326–330. https://dx.doi.org/10.3832/ifor0901-006.
[4]
Cavers, S., Cottrell, J.E., 2014. The basis of resilience in forest tree species and its use in adaptive forest management in Britain. Forestry 88, 13–26. https://dx.doi.org/10.1093/forestry/cpu027.
[5]
Kacic, P., Kuenzer, C., 2022. Forest biodiversity monitoring based on remotely sensed spectral diversity: A review. Remote Sens. 14. https://www.mdpi.com/2072-4292/14/21/5363.
[6]
Mouret, F., Morin, D., Martin, H., Planells, M., Vincent-Barbaroux, C., 2023. Toward an operational monitoring of oak dieback with multispectral satellite time series: A case study in Centre-Val de Loire region of France. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 643–659.
[7]
MacKenzie, W.H., Mahony, C.R., 2021. An ecological approach to climate change-informed tree species selection for reforestation. For. Ecol. Manag. 481, 118705. https://www.sciencedirect.com/science/article/pii/S0378112720314742.
[8]
Haneda, L.E., Brancalion, P.H., Molin, P.G., Ferreira, M.P., Silva, C.A., de Almeida, C.T., Resende, A.F., Santoro, G.B., Rosa, M., Guillemot, J., Le Maire, G., Feret, J.B., de Almeida, D.R.A., 2023. Forest landscape restoration: Spectral behavior and diversity of tropical tree cover classes. Remote Sens. Appl.: Soc. Environ. 29, 100882.
[9]
Shovon, T.A., Auge, H., Haase, J., Nock, C.A., 2024. Positive effects of tree species diversity on productivity switch to negative after severe drought mortality in a temperate forest experiment. Glob. Change Biol. 30, e17252. https://onlinelibrary.wiley.com/doi/abs/10.1111/gcb.17252.
[10]
IGN, 2019. LA BD FORÊT V2. Inventaire Forestier (IF). https://inventaire-forestier.ign.fr/IMG/pdf/lif_46_poster.pdf.
[11]
Fassnacht, F.E., Latifi, H., Stereńczak, K., Modzelewska, A., Lefsky, M., Waser, L.T., Straub, C., Ghosh, A., 2016. Review of studies on tree species classification from remotely sensed data. Remote Sens. Environ. 186, 64–87. https://dx.doi.org/10.1016/j.rse.2016.08.013.
[12]
Blickensdörfer, L., Oehmichen, K., Pflugmacher, D., Kleinschmit, B., Hostert, P., 2024. National tree species mapping using Sentinel-1/2 time series and German national forest inventory data. Remote Sens. Environ. 304, 114069.
[13]
Liu, P., Ren, C., Wang, Z., Jia, M., Yu, W., Ren, H., Xia, C., 2024. Evaluating the potential of Sentinel-2 time series imagery and machine learning for tree species classification in a mountainous forest. Remote Sens. 16. https://www.mdpi.com/2072-4292/16/2/293.
[14]
Vaghela Himali, P., Raja, R.A.A., 2024. Automatic identification of tree species from Sentinel-2A images using band combinations and deep learning. IEEE Geosci. Remote Sens. Lett. 21, 1–5. https://dx.doi.org/10.1109/LGRS.2024.3354814.
[15]
Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., Meygret, A., Spoto, F., Sy, O., Marchese, F., Bargellini, P., 2012. Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sens. Environ. 120, 25–36.
[16]
Kowalski, K., Senf, C., Hostert, P., Pflugmacher, D., 2020. Characterizing spring phenology of temperate broadleaf forests using Landsat and Sentinel-2 time series. Int. J. Appl. Earth Obs. Geoinf. 92, 102172.
[17]
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
[18]
Persson, M., Lindberg, E., Reese, H., 2018. Tree species classification with multi-temporal Sentinel-2 data. Remote Sens. 10. https://www.mdpi.com/2072-4292/10/11/1794.
[19]
Karasiak, N., Dejoux, J.F., Monteil, C., Sheeren, D., 2021. Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing. Mach. Learn. 111, 2715–2740.
[20]
Hemmerling, J., Pflugmacher, D., Hostert, P., 2021. Mapping temperate forest tree species using dense Sentinel-2 time series. Remote Sens. Environ. 267, 112743. https://dx.doi.org/10.1016/j.rse.2021.112743.
[21]
More, A.S., Rana, D.P., 2017. Review of random forest classification techniques to resolve data imbalance, in: 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), IEEE, Aurangabad, India. pp. 72–78. https://dx.doi.org/10.1109/ICISIM.2017.8122151.
[22]
Bellet, V., Fauvel, M., Inglada, J., 2023. Land cover classification with Gaussian processes using spatio-spectro-temporal features. IEEE Trans. Geosci. Remote Sens. 61, 1–21.
[23]
Zhong, L., Dai, Z., Fang, P., Cao, Y., Wang, L., 2024. A review: Tree species classification based on remote sensing data and classic deep learning-based methods. Forests 15, 852. https://dx.doi.org/10.3390/f15050852.
[24]
Bolyn, C., Lejeune, P., Michez, A., Latte, N., 2022. Mapping tree species proportions from satellite imagery using spectral–spatial deep learning. Remote Sens. Environ. 280, 113205. https://dx.doi.org/10.1016/j.rse.2022.113205.
[25]
Xi, Y., Ren, C., Tian, Q., Ren, Y., Dong, X., Zhang, Z., 2021. Exploitation of time series Sentinel-2 data and different machine learning algorithms for detailed tree species classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 14, 7589–7603. https://dx.doi.org/10.1109/JSTARS.2021.3098817.
[26]
Wang, Z., Yan, W., Oates, T., 2017. Time series classification from scratch with deep neural networks: A strong baseline, in: Proc. IJCNN, IEEE, Anchorage, Alaska, USA. pp. 1578–1585. https://doi.org/10.1109/ijcnn.2017.7966039.
[27]
Inglada, J., Vincent, A., Arias, M., Tardy, B., 2016b. iota2-a25386. Software. CESBIO.
[28]
Simon, M., Colin, A., Letouz, F., 2018. Étude de l'évaluation de la ressource et des disponibilités futures en bois région CENTRE-VAL-DE-LOIRE. Technical Report. Institut national de l'information géographique et forestière (IGN). https://inventaire-forestier.ign.fr/IMG/pdf/powerpoint_presentation_etude_dispo_region_centre_06_07_18_cle8133c2.pdf.
[29]
Mouret, F., Morin, D., Martin, H., Planells, M., Vincent-Barbaroux, C., 2024. Mapping the dieback of several tree species in Centre of France using Sentinel-2 derived indices: a preliminary analysis, in: Proc. 2024 IEEE IGARSS, Athens, Greece. pp. 1–4.
[30]
Hagolle, O., Huc, M., Villa Pascual, D., Dedieu, G., 2015. A multi-temporal and multi-spectral method to estimate aerosol optical thickness over land, for the atmospheric correction of Formosat-2, Landsat, VEN\(\mu\)S and Sentinel-2 images. Remote Sens. 7, 2668–2691. https://www.mdpi.com/2072-4292/7/3/2668.
[31]
Inglada, J., Vincent, A., Arias, M., Marais-Sicre, C., 2016a. Improved early crop type identification by joint use of high temporal resolution SAR and optical image time series. Remote Sens. 8, 362. https://dx.doi.org/10.3390/rs8050362.
[32]
Vuolo, F., Ng, W.T., Atzberger, C., 2017. Smoothing and gap-filling of high resolution multi-spectral time series: Example of Landsat data. Int. J. Appl. Earth Obs. Geoinf. 57, 202–213. https://www.sciencedirect.com/science/article/pii/S0303243416302100.
[33]
Pelletier, C., Webb, G., Petitjean, F., 2019. Temporal convolutional neural network for the classification of satellite image time series. Remote Sens. 11, 523. https://dx.doi.org/10.3390/rs11050523.
[34]
Garnot, V.S.F., Landrieu, L., 2020. Lightweight temporal self-attention for classifying satellite images time series, in: Lemaire, V., Malinowski, S., Bagnall, A., Guyet, T., Tavenard, R., Ifrim, G. (Eds.), Workshop on Advanced Analytics and Learning on Temporal Data (AALTD), Springer International Publishing, Ghent, Belgium. pp. 171–181.
[35]
Johnson, J.M., Khoshgoftaar, T.M., 2019. Survey on deep learning with class imbalance. Journal of Big Data 6. https://dx.doi.org/10.1186/s40537-019-0192-5.
[36]
He, H., Garcia, E., 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284. https://doi.org/10.1109/tkde.2008.239.
[37]
He, H., Bai, Y., Garcia, E.A., Li, S., 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: Proc. IEEE Int. Conf. Neural Netw. (IJCNN), Hong Kong, China. pp. 1322–1328.
[38]
Chen, C., Liaw, A., Breiman, L., 2004. Using Random Forest to Learn Imbalanced Data. Technical Report 666. Department of Statistics, UC Berkeley. https://xtf.lib.berkeley.edu/reports/SDTRWebData/accessPages/666.html.
[39]
Mellor, A., Boukir, S., Haywood, A., Jones, S., 2015. Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin. ISPRS J. Photogramm. Remote Sens. 105, 155–168. https://dx.doi.org/10.1016/j.isprsjprs.2015.03.014.
[40]
Lemaître, G., Nogueira, F., Aridas, C.K., 2017. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18, 1–5. https://jmlr.org/papers/v18/16-365.html.
[41]
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297. https://doi.org/10.1007/bf00994018.
[42]
Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system, in: Proc. SIGKDD, ACM, New York, NY, USA. pp. 785–794. https://doi.acm.org/10.1145/2939672.2939785.
[43]
Verhulst, M., Heremans, S., Blaschko, M.B., Somers, B., 2024. Temporal transferability of tree species classification in temperate forests with Sentinel-2 time series. Remote Sens. 16, 2653. https://dx.doi.org/10.3390/rs16142653.
[44]
Zagajewski, B., Kluczek, M., Raczko, E., Njegovec, A., Dabija, A., Kycko, M., 2021. Comparison of random forest, support vector machines, and neural networks for post-disaster forest species mapping of the Krkonoše/Karkonosze Transboundary Biosphere Reserve. Remote Sensing 13, 2581. https://dx.doi.org/10.3390/rs13132581.
[45]
Shwartz-Ziv, R., Goldblum, M., Li, Y.L., Bruss, C.B., Wilson, A.G., 2023. Simplifying neural network training under class imbalance, in: Thirty-seventh Conference on Neural Information Processing Systems, New Orleans, Louisiana, USA. https://openreview.net/forum?id=iGmDQn4CRj.
[46]
Priyanka, Rajat, Avtar, R., Malik, R., Musthafa, M., Rathore, V.S., Kumar, P., Singh, G., 2023. Forest plantation species classification using full-pol time-averaged SAR scattering powers. Remote Sens. Appl.: Soc. Environ. 29, 100924.
[47]
Dmitriev, P.A., Kozlovsky, B.L., Dmitrieva, A.A., Varduni, T.V., 2023. Maple species identification based on leaf hyperspectral imaging data. Remote Sens. Appl.: Soc. Environ. 30, 100964.
[48]
Inglada, J., Arias, M., Tardy, B., Hagolle, O., Valero, S., Morin, D., Dedieu, G., Sepulcre, G., Bontemps, S., Defourny, P., et al., 2015. Assessment of an operational system for crop type map production using high temporal and spatial resolution satellite optical imagery. Remote Sens. 7, 12356–12379. https://dx.doi.org/10.3390/rs70912356.
[49]
Bellet, V., Fauvel, M., Inglada, J., Michel, J., 2024. End-to-end learning for land cover classification using irregular and unaligned SITS by combining attention-based interpolation with sparse variational Gaussian processes. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 2980–2994. https://dx.doi.org/10.1109/JSTARS.2023.3343921.