January 09, 2025
Calving front position variation of marine-terminating glaciers is an indicator of ice mass loss and a crucial parameter in numerical glacier models. DL systems can automatically extract this position from SAR imagery, enabling continuous, weather- and illumination-independent, large-scale monitoring. This study presents the first comparison of DL systems on a common calving front benchmark dataset. A multi-annotator study with ten annotators is performed to contrast the best-performing DL system with human performance. The best DL model's outputs deviate 221 m on average, whereas the average deviation of the human annotators is 38 m. This significant difference shows that current DL systems do not yet match human performance and that further research is needed to enable fully automated monitoring of glacier calving fronts. The study of Vision Transformers, foundation models, and strategies for including and processing additional information are identified as avenues for future research.
Glacier Calving Front Delineation, Deep Learning, Comparison Study, Foundation Model, Vision Transformer
Climate change is altering our world. One significant change is the recession of glaciers [1], [2]. For marine-terminating glaciers, major ice mass loss occurs not only due to increasing meltwater runoff but also due to changes in ice dynamics [3], [4]. Glacier calving and changes in the calving front position are two of the main mechanisms controlling these dynamic changes. Hence, the calving front position is an essential indicator of glacier dynamics and stability for any marine- or lacustrine-terminating glacier. Frontal positions of glacier termini are required to quantify frontal ablation and, thus, mass change. Neglecting frontal ablation and calving front dynamics can lead to an underestimation of ice thickness by up to 30 % [5] and a reduction of the projected glacier contribution to mean sea level rise by 2 % for all temperature change scenarios from 2015 to 2100 [6]. Numerical glacier models use calving front positions to calibrate and validate their performance or to readjust the model via data assimilation [7], [8]. Meanwhile, modern satellite systems provide weekly to sub-daily observations, depending on the region, in which the positions of the calving fronts can be localized. SAR imagery offers the advantage of continuous monitoring, since radar signals are illumination- and cloud-independent, in contrast to optical imagery. Especially since the launch of the Sentinel-1 mission, the amount of publicly accessible SAR imagery has increased substantially. This vast amount of data poses a new challenge: manual detection of the front in the individual images becomes infeasible. In addition, there is a large archive of SAR imagery from previous missions reaching back to the 1990s. Therefore, algorithms for the automated analysis of large data quantities are required.
Since 2019, several studies have applied DL techniques to delineate the calving front of marine-terminating glaciers or the coastline of entire ice shelves in satellite imagery. The first studies [9]–[22] are all based on the U-Net architecture [23], which to this day forms the basis of many state-of-the-art networks in image segmentation. Later studies [24]–[28] employ networks such as DeepLabv3+ [29], Xception [30], and VGG16 [31]. Zhu et al. [32] explore the combination of CNNs and ViTs [33]. Currently, only one study [34] relies on a fully ViT-based network [33]. As different datasets and metrics were used to train and evaluate these algorithms, their results are not directly comparable.
This study compares these algorithms in terms of their ability to delineate the calving front of marine-terminating glaciers in SAR imagery. In total, we assess the performance of 22 DL systems by adapting, re-training, and evaluating every single system on a common benchmark dataset, which was published in prior work [12]. We address the questions of whether a particular neural network architecture is better suited for localizing the calving front than others, what influence the label used for training has on performance, and whether more global-scale semantic information in the input is beneficial. The in-depth analysis of the assessment reveals potential avenues for future research. To put the DL performance in perspective, we conduct a multi-annotator study. Ten anonymous annotators manually labeled each SAR image, allowing us to assess the variance between human annotators and check whether automatic front extraction has already reached human performance.
All DL systems are optimized, trained, and evaluated on the same dataset: CaFFe, which was introduced by Gourmelon et al. [12]. The dataset encompasses 681 SAR images of seven tidewater glaciers dating from 1996 to 2020. Five glaciers are located on the Antarctic Peninsula, one in Greenland, and one in Alaska. The dataset comprises multiple missions (ERS-1/2, Envisat, RADARSAT-1, ALOS PALSAR, TSX, TDX, and Sentinel-1). The imagery was multi-looked, calibrated, geo-referenced, and ortho-rectified. Image sizes in pixels vary between \(405\,\times\,382\) and \(3561\,\times\,2768\), depending on the sensor and captured glacier, while the spatial resolution ranges between 6 and 20 m per pixel. Each SAR image has two manually annotated labels with the same geolocation. One label shows the calving front as a binary segmentation mask, where each pixel belongs to either the front or the background. The other label provides a multi-class segmentation into landscape zones: ocean and ice mélange (a mixture of sea ice and icebergs), rock outcrop, glacier, and an NA area that comprises SAR shadows and regions outside the radar scene. Depending on the type of label used in the original publication, each DL system is trained on the binary front labels, the zone labels, or both labels together. Additionally, CaFFe provides a bounding box for each image that marks the region of interest and is used to exclude static glacier fronts. For the zone labels, the calving front is extracted during post-processing as the edge between the glacier and ocean zones within this bounding box. For the front labels, the prediction inside the bounding box is taken as the final calving front prediction. For the evaluation of the trained DL systems, the dataset contains a test set of 122 images, which are withheld during training. Which part of the training set is used for validation during hyperparameter optimization is left to the user. The test set includes all images of the Columbia Glacier in Alaska and the Mapple Glacier on the Antarctic Peninsula. This intercontinental spread of the test set and the spatial separation of test and training sets ensure that the evaluation assesses the reproducibility of the DL systems' performance in a global context.
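To make this zone-based post-processing concrete, the following minimal sketch derives a front mask as the glacier-ocean boundary inside the bounding box. It is illustrative only: the class indices, the bounding box format, and the 4-neighborhood boundary rule are our assumptions, not CaFFe's actual implementation.

```python
import numpy as np

# Hypothetical class indices for the zone labels; the actual encoding
# in the published dataset may differ.
OCEAN, GLACIER = 1, 3

def front_from_zones(zones: np.ndarray, bbox: tuple) -> np.ndarray:
    """Mark glacier pixels that touch the ocean zone, restricted to the
    region-of-interest bounding box (r0, r1, c0, c1)."""
    glacier = zones == GLACIER
    ocean = zones == OCEAN
    # A glacier pixel is a front pixel if any 4-neighbour is ocean.
    neighbour_ocean = np.zeros_like(ocean)
    neighbour_ocean[1:, :] |= ocean[:-1, :]
    neighbour_ocean[:-1, :] |= ocean[1:, :]
    neighbour_ocean[:, 1:] |= ocean[:, :-1]
    neighbour_ocean[:, :-1] |= ocean[:, 1:]
    front = glacier & neighbour_ocean
    # Keep only the part of the boundary inside the bounding box.
    mask = np.zeros_like(front)
    r0, r1, c0, c1 = bbox
    mask[r0:r1, c0:c1] = True
    return front & mask
```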
For the comparison, we selected studies that take satellite imagery as input to a neural network and extract either the calving front of a marine-terminating glacier or the coastline of an ice shelf. Only three studies are excluded: We do not evaluate the studies of Baumhoer et al. [9] and Zhang et al. [22], as both were superseded by their successor models, Heidler et al. [15] and Zhang et al. [27]. Similarly, we do not evaluate the study by Heidler et al. [25], because their network inherently requires that a single image contain only one coastline and not multiple calving fronts. Therefore, Heidler et al. [25]'s model is not applicable to the CaFFe dataset, which shows multiple calving fronts in several images. Additionally, we explore the performance of foundation models, i.e., large deep neural networks that have been trained on enormous amounts of data and aim to handle various downstream tasks with only minimal fine-tuning [35]. For segmentation tasks, several foundation models have emerged recently [36]–[38]. As a representative, we evaluate the promptable SAM [36] in the advertised zero-shot manner, i.e., without fine-tuning.
Adjustments are necessary to enable a comparison between the algorithms of the different studies. We regard each paper's code as a system, meaning we try to minimize the adaptations we perform. The pre-processing, the DL model, and the post-processing are kept unchanged as far as possible. We only adapt those parts of a pipeline needed to make the code run with the employed dataset, which might differ from the dataset initially used to train and test the code. For example, if the original dataset contains binary zone segmentation masks (glacier vs. ocean), the loss function will likely be or contain binary cross-entropy (BCE), which we have to change to categorical cross-entropy to work with our multi-class zone segmentation masks. We tweak the pipelines of the DL systems so that they take in SAR imagery and learn to extract the calving front using either CaFFe's zone or front labels. Systems that were previously trained on binary coastline masks or binary calving front masks are trained on CaFFe's binary calving front masks. Systems previously trained on binary ocean masks or multi-class segmentation masks are trained on CaFFe's multi-class zone masks. Any manual steps in the pre- or post-processing of the systems are skipped, as we want to test the systems' ability to delineate calving fronts fully automatically. Since most of the standard pre-processing is already complete for the benchmark dataset (see Sec. 2), only pre-processing techniques related to the specific architecture of the neural network need to be applied. Concerning the post-processing, we append bounding box masking and the deletion of overly short fronts to the end of each post-processing scheme. The bounding boxes and the minimum front length are dataset-specific prior knowledge, which Gourmelon et al. [12] use in their post-processing scheme; hence, this prior knowledge needs to be integrated into the other systems to keep the comparison fair. Bounding box masking alters the prediction so that all pixels outside the bounding box belong to the background. The minimum front length for the given dataset is 1.5 km; all predicted front pixels belonging to a connected line shorter than half of this minimum length are set to background. System-specific adjustments can be found in the supplementary information. An overview of the segmentation masks originally used, the network architecture on which each system is based, the strategy for dealing with image sizes, and the augmentations performed can be found in Tables 2, 3, and 4.
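A minimal sketch of these two dataset-specific post-processing steps might look as follows; the function signature and the use of the pixel count as a proxy for line length are our simplifications, not code from any of the compared systems.

```python
import numpy as np
from scipy import ndimage

def postprocess_front(pred: np.ndarray, bbox_mask: np.ndarray,
                      resolution_m: float, min_length_m: float = 1500.0) -> np.ndarray:
    """Apply bounding box masking, then remove connected front lines
    shorter than half the dataset's minimum front length."""
    pred = pred & bbox_mask  # everything outside the box becomes background
    labeled, n = ndimage.label(pred, structure=np.ones((3, 3)))  # 8-connected lines
    keep = np.zeros_like(pred)
    for i in range(1, n + 1):
        component = labeled == i
        # Pixel count is a crude proxy for line length; a vectorized
        # polyline length would be more faithful.
        if component.sum() * resolution_m >= min_length_m / 2:
            keep |= component
    return keep
```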
To ensure fairness in the comparison, we re-optimize the hyperparameters on the benchmark's training set. For this purpose, we split the training set into a train set and a validation set, where the network is trained on the train set and evaluated on the validation set. The split ratio is taken from the respective study. For the optimization, we choose the hyperparameters that were specified as being optimized in the corresponding publication. Additionally, if not already mentioned in the publication, we optimize the learning rate, or the base and maximum learning rate if a scheduler is used. A list of the final set of re-optimized hyperparameters with the best validation results can be found in the supplementary information (see Table 5). Next, we train each system five times. The number of training epochs is calculated such that the model sees 150 times the number of pixels in the training set; the calculation takes into account the amount of patch overlap, resizing, and the number of iterations in one epoch. Lastly, the five trained systems are evaluated on CaFFe's test set, and the mean and standard deviation of the evaluation metrics are computed over the five runs.
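For illustration, the epoch budget could be computed with a hypothetical helper like the one below, where patch overlap and resizing enter through the number and size of the patches seen per epoch.

```python
def epochs_for_budget(train_pixels: int, patch_size: int,
                      patches_per_epoch: int, budget_factor: int = 150) -> int:
    """Number of epochs such that the pixels shown to the model amount to
    budget_factor times the pixels in the training set."""
    pixels_per_epoch = patches_per_epoch * patch_size ** 2
    return max(1, round(budget_factor * train_pixels / pixels_per_epoch))
```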
For the evaluation, two metrics are employed, both introduced by Gourmelon et al. [12] alongside the benchmark dataset: the MDE and the number of images with no predicted front. The latter counts the images in which the DL system finds no front. The MDE evaluates the distance between the predicted locations of the calving fronts and the locations of the manually labeled calving fronts. It is calculated as: \[\begin{gather} \label{eq:mean_distance_error} \mathrm{MDE}(\mathcal{I}) = \frac{1}{\sum_{(\mathcal{P}, \mathcal{Q}) \in \mathcal{I}} (|\mathcal{P}| + |\mathcal{Q}|)} \cdot \\ \sum_{(\mathcal{P}, \mathcal{Q}) \in \mathcal{I}} \bigg( \sum_{\vec{p} \in \mathcal{P}} \min_{\vec{q} \in \mathcal{Q}} \lVert \vec{p}-\vec{q} \rVert_2 + \sum_{\vec{q} \in \mathcal{Q}} \min_{\vec{p} \in \mathcal{P}} \lVert \vec{p}-\vec{q} \rVert_2 \bigg) \end{gather}\tag{1}\] where \(\mathcal{I}\) is the set of all images for which a front is predicted, \(|.|\) the cardinality of a set, \(\mathcal{P}\) all ground truth front pixels of one image, and \(\mathcal{Q}\) all predicted front pixels of the same image. Images with no predicted front pixels are ignored during the calculation.
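In code, Eq. (1) can be sketched as follows; distances are in pixels and must be multiplied by the spatial resolution to obtain meters, and the list-of-pairs input format is our assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def mde(images):
    """images: list of (P, Q) pairs, each an (n, 2) array of ground-truth
    and predicted front pixel coordinates; pairs without predictions are
    excluded beforehand, matching Eq. (1)."""
    total, count = 0.0, 0
    for p, q in images:
        d_pq = cKDTree(q).query(p)[0]  # each ground-truth pixel to nearest prediction
        d_qp = cKDTree(p).query(q)[0]  # each predicted pixel to nearest ground truth
        total += d_pq.sum() + d_qp.sum()
        count += len(p) + len(q)
    return total / count
```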
The MDE is closely related to two other metrics: the Average Symmetric Surface Distance [39], a well-known metric in medical image segmentation, and the Chamfer Distance [40], which is used for distance calculations between point clouds.
A trade-off exists between the MDE and the number of images with no predicted front. For a front that is difficult to predict, the outcome is typically either a complete absence of a predicted front or an inconsistent, distant prediction. In the first scenario, the number of images without any predicted front increases, keeping the MDE low. In the second scenario, few images lack a predicted front, resulting in a higher MDE.
To check whether the resulting differences between the DL systems are significant, a Kruskal-Wallis test [41] is conducted. The best-performing model is compared with the second-, third-, and fourth-best-performing models to check whether the performance gain is significant. For this purpose, six one-sided Mann-Whitney U-tests are performed: three based on the models' MDEs and three based on the models' numbers of images with no predicted front. Subsequently, the results are grouped according to different properties of the DL systems to discover whether these properties have an impact on performance. To test the hypothesis that a certain group is more suitable, the MDEs of the groups are compared with a Kruskal-Wallis test [41], followed by one-sided Mann-Whitney U-tests that check whether the performance differences between the best group and the remaining ones are significant. First, the results are grouped by base architecture. The DL systems build on a total of four base architectures: the U-Net [23], DeepLabv3+ [29], the ViT [33], and VGG16 [31] (see Table 2). Only one model, the GLA-STDeepLab [32], mixes ViT and DeepLabv3+, which we regard as a fifth type of base architecture. Second, the results are grouped into models trained on CaFFe's binary front labels, models trained on CaFFe's zone labels, and models trained in a multi-task manner on both labels (see Table 2). To test the hypothesis that more global-scale semantic information is beneficial for performance, the correlation coefficients between the MDE and two variables are calculated. The first variable is each model's mean input size in pixels during training, taken as a surrogate for how much information goes into the network. The second variable is the depth of the employed U-Net architectures, taken as an estimate of how much local-global information interaction takes place.
For the statistical analysis of the results, all post hoc tests following Kruskal-Wallis tests were hypothesis-driven and Bonferroni-corrected where applicable.
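The MDE-based part of this procedure can be sketched with SciPy as below; the data layout, the choice of challengers, and the Bonferroni handling are illustrative assumptions rather than the authors' exact analysis script.

```python
from scipy import stats

def compare_best(mde_runs, best, challengers, alpha=0.05):
    """mde_runs: dict mapping system name to its five per-run MDEs.
    Omnibus Kruskal-Wallis test over all systems, then one-sided
    Mann-Whitney U-tests of the best system against each challenger."""
    _, p_omnibus = stats.kruskal(*mde_runs.values())
    corrected_alpha = alpha / len(challengers)  # Bonferroni correction
    verdicts = {}
    for other in challengers:
        # One-sided: are the best system's MDEs stochastically smaller?
        _, p = stats.mannwhitneyu(mde_runs[best], mde_runs[other],
                                  alternative="less")
        verdicts[other] = (p, p < corrected_alpha)
    return p_omnibus, verdicts
```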
SAR imagery is not easy to interpret. Ice mélange, for example, exhibits similar characteristics to glacial ice and is therefore easily mistaken for part of the glacier (see Fig. 4 (c) and 4 (d) for examples). Hence, we conducted a multi-annotator study on CaFFe's test set to visualize and quantify the differences between human annotations. Nine annotators participated in our study, which, together with the original annotator of the CaFFe dataset, results in ten annotations for each image in the test set. The annotators' levels of proficiency in QGIS and knowledge about glaciers are given in the supplementary information (Fig. 7). The annotators were asked to delineate the calving fronts in QGIS, following a provided manual. In addition to the SAR images, they were provided with a catchment for each glacier and one optical image per glacier (not per SAR image) to help understand the glacier's geometry. The resulting shape files were to be post-processed by removing everything within the catchment area. However, some annotators also deliberately labeled the rocky coastline, resulting in fragmented, spurious fronts once everything within the catchment was removed. Therefore, we had to buffer all catchments by 120 m to remove false fronts. In addition, the Columbia Glacier catchments were expanded at the coastline between the eastern and western glacier tongues, and finally, all fronts shorter than 750 m were deleted.
As no objective ground truth exists due to the subjectivity of labeling, we consider the aggregation over all annotators as ground truth. However, simply calculating the MDE between each annotator and the combination of all ten annotators would introduce a bias. Instead, for each annotator, we aggregate the nine remaining annotators and compare the annotator with this combined version. For the aggregation of the nine annotators, we conduct a majority vote. To combine the manually labeled calving fronts, the fronts are used together with the catchment areas to create one PNG per annotator showing the ocean area. For each pixel in the combined image, the number of annotators that assigned that pixel to the ocean area is counted. If five or more annotators assign the pixel to the ocean, it is also attributed to the ocean in the combined image. We subtract an eroded version of the ocean from the ocean area to obtain a coastline from the combined ocean area. Next, we remove the parts of the coastline that lie within the catchment that was also used for the individual annotations, leaving us with the calving front. Finally, we delete fronts that are shorter than 750 m, which occur where rocky coastline was labeled as front and is not covered by the buffered catchment area.
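Assuming the per-annotator ocean areas and the catchment are given as boolean rasters, the aggregation can be sketched as follows; the deletion of fronts shorter than 750 m would follow as a separate step, analogous to the earlier post-processing sketch.

```python
import numpy as np
from scipy import ndimage

def aggregate_front(ocean_masks: np.ndarray, catchment: np.ndarray) -> np.ndarray:
    """Majority-vote calving front from a stack of per-annotator ocean
    rasters of shape (n_annotators, H, W); `catchment` is a boolean raster
    of the (buffered) glacier catchment."""
    votes = ocean_masks.sum(axis=0)
    ocean = votes >= 5  # five or more annotators call the pixel ocean
    # Coastline = ocean minus its erosion, i.e., the one-pixel ocean rim.
    coastline = ocean & ~ndimage.binary_erosion(ocean)
    # Coastline parts inside the catchment are removed, leaving the front.
    return coastline & ~catchment
```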
To examine whether DL has already reached human performance in detecting glacier calving fronts in SAR images, we take the DL system with the lowest MDE and compare it to human performance. Since training a neural network is a stochastic process, the DL system is trained five times. As the CaFFe benchmark dataset was labeled by annotator number ten, the DL system is trained on this annotator's labels and might, therefore, be biased towards this annotator. Still, as the DL system's outputs are not identical to the annotations of annotator number ten, we compare the DL system to the aggregation of all ten annotators. The combination of the ten annotators is performed in exactly the same way as the combination of the nine annotators. To make the comparison between the DL system and the annotators fair, the predictions of HookFormer [34] are further post-processed just like the multi-annotator annotations (removal of predicted front pixels within the buffered catchment area, deletion of fronts shorter than 750 m). We then calculate the number of images with no predicted front and the MDE with respect to the combined annotations instead of CaFFe's ground truth. To test whether the difference in MDE between humans and the DL system is significant, a one-sided Mann-Whitney U-test is carried out.
 
The number of images with no predicted front varies strongly between the 22 systems. One system fails to detect a calving front in 100 of the 122 test images, while two systems detect fronts in all 122 images. The MDEs of the systems range from 338 m to 4712 m (Fig. 1). We provide a visual comparison between the predictions of the five DL systems with the lowest MDE for sample images of the Columbia and Mapple glaciers in Figs. 2 and 3, respectively. Many systems were designed for tasks different from the CaFFe dataset, which may explain the low performance. The use of SAR data in the CaFFe dataset as opposed to optical imagery, the extraction of laterally bounded glacier calving fronts rather than ice shelf edges, and the construction of the test set containing only glaciers not seen during training provide a challenging basis for calving front delineation. In an attempt to explain the significant differences in performance between the systems, we sort the DL systems according to certain characteristics and check whether there is a link with performance. The statistical methods used can be found in Sec. 3.3 and the numerical results in the supplementary information. The first characteristic we examine is the base architecture, i.e., the underlying neural network composition upon which the individual model is built. ViTs [33] significantly outperform other architectures such as DeepLabv3+ [29] and U-Net [23]. Further analyses suggest that the inclusion of global-scale semantic information through larger input sizes and strategies for the targeted use of this information, e.g., deeper U-Net [23] architectures, are crucial factors for the performance of DL systems. The integration of other strategies for utilizing global and multi-scale information, such as ASPP [42], the HookNet architecture [43], or attention mechanisms as in the ViT [33], is also beneficial. Employing additional information with regard to the training labels offers another advantage: both multi-task learning (MTL) approaches [15], [16], [24] and systems trained only on CaFFe's zone labels [12]–[14], [18], [20], [21], [26]–[28], [32], [34], [36] outperform systems trained only on CaFFe's binary front labels [10]–[12], [17], [19].
 
 
Nonetheless, several DL systems show difficulties in segmenting images of the Columbia Glacier taken by Sentinel-1, suggesting that the training data for DL systems designed to work with Sentinel-1 imagery should include more Sentinel-1 samples than the benchmark dataset [12] used for this comparison. Sentinel-1 images are under-represented in CaFFe's training set (15 Sentinel-1, 52 ERS-1/2, 72 Envisat, 54 RADARSAT-1, 40 ALOS PALSAR, 326 TSX/TDX images) and, additionally, smaller than the average image size in the training set (S1: \(998\,\times\,651\) vs. complete training set: \(2163\,\times\,2174\)).
One of the DL systems in our comparison is a foundation model [36]. For CaFFe, using this foundation model [36] in the advertised zero-shot way resulted in a higher MDE than that of the model [34] used to generate its input prompts. Nevertheless, further research on foundation models is required to determine whether fine-tuned versions could outperform specialized models. In addition, future foundation models developed for the segmentation of radar images may be more suitable, as the current versions are generally trained on optical images.
The post-processing of a network's outputs is also worth further investigation, as the difference in MDE between Gourmelon et al. [12]'s Zones system and Gourmelon et al. [13]'s system is solely due to improved post-processing.
The DL system with the lowest MDE is the HookFormer [34], a ViT with two connected branches at different resolution levels, trained on CaFFe's zone labels. One branch receives a down-scaled image showing the greater global surroundings, while the other takes in the current high-resolution region of interest. This mimics the human approach of first mapping the surroundings and then zooming into the area of the calving front once the overarching formation is recognized. Although the HookFormer achieves the lowest MDE, it still encounters issues with some predictions. In certain test images, the system incorrectly identifies ice mélange as part of the glacier, erroneously shifting the calving front towards the ocean. This misclassification reduces the system's performance during the winter months. In other images, rocky coastline is misidentified as part of the calving front. However, this might also be an issue of the test set: as the rock class was not picked manually, the rock may actually be covered by ice or snow. If the system is used to extract calving fronts for new glaciers rather than for comparison with other DL systems on the benchmark, this situation could easily be avoided by using a static mask that excludes the rocky coastlines of laterally bounding mountains. Moreover, the HookFormer, like the other DL systems, exhibits a decreased delineation performance for Sentinel-1 images of the Columbia Glacier. Additionally, the outputs of the HookFormer show slight patching artifacts. Since the complete images provided in CaFFe are too large to be fed unchanged into the neural network, the images must be divided into patches, which, in this case, sometimes leads to completely straight edges between the predicted classes. This problem could most likely be mitigated if the region-of-interest patches were extracted with an overlap and the outputs at the overlap were averaged. Lastly, HookFormer's delineated calving fronts appear ragged, which could be fixed during post-processing.
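A minimal sketch of such overlap averaging is given below; the patch size, stride, class count, and the omission of incomplete border windows are simplifying assumptions.

```python
import numpy as np

def predict_with_overlap(image, model, n_classes=4, patch=224, stride=112):
    """Average per-class logits over overlapping patches to soften seams.
    `model` maps a (patch, patch) array to logits of shape (n_classes,
    patch, patch). Incomplete border windows are ignored for brevity;
    real pipelines pad the image or shift the last window instead."""
    h, w = image.shape
    logits = np.zeros((n_classes, h, w))
    weight = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            logits[:, y:y + patch, x:x + patch] += model(image[y:y + patch, x:x + patch])
            weight[y:y + patch, x:x + patch] += 1
    return logits / np.maximum(weight, 1)  # uncovered border pixels stay zero
```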
In most cases, the labeled calving fronts of the multi-annotator study do not differ much between annotators. The average MDE of all annotators for the complete test set is 38 m with a standard deviation of 15 m. The MDEs for each annotator and for different subsets of the test set are given in Table 1. The labeling of images from the Mapple Glacier was more ambiguous than that of the Columbia Glacier. This ambiguity is likely due to the presence of ice mélange in front of the calving front in several Mapple Glacier images, which complicated the mapping of the glacier front. For example, in Fig. 4 (c), the region between the ocean and the glacier was recognized as glacial ice by two annotators, while the other annotators assigned it to the ocean as ice mélange. This shows that ice mélange in SAR imagery is a challenge not only for DL systems but also for humans, thereby constraining the learning possibilities of DL systems. Consequently, annotations for Envisat and ALOS PALSAR images of the Mapple Glacier exhibit higher MDEs than those of other sensors. On average, the MDE for Sentinel-1 images is higher than for ERS-1/2 and TanDEM-X, suggesting that Sentinel-1 images are also more challenging for humans to interpret.
Table 1: MDE in meters for each human annotator and each run of the best-performing DL system (HookFormer), computed against the aggregated annotations: for the complete test set (All), by season (Sum.: summer, Win.: winter), by glacier (Map.: Mapple, Col.: Columbia), by sensor (S1, Envi.: Envisat, ERS, PAL.: ALOS PALSAR, TSX), and by spatial resolution (20, 17, and 7 m per pixel).

| | All | Sum. | Win. | Map. | Col. | S1 | Envi. | ERS | PAL. | TSX | 20 m | 17 m | 7 m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anno. # 1 | 81 | 75 | 87 | 238 | 44 | 149 | 816 | 21 | 651 | 30 | 240 | 651 | 30 |
| Anno. # 2 | 30 | 24 | 36 | 32 | 30 | 86 | 36 | 22 | 61 | 19 | 79 | 61 | 19 |
| Anno. # 3 | 45 | 44 | 47 | 56 | 43 | 135 | 60 | 24 | 226 | 23 | 125 | 226 | 23 |
| Anno. # 4 | 38 | 31 | 46 | 58 | 34 | 117 | 239 | 12 | 36 | 18 | 129 | 36 | 18 |
| Anno. # 5 | 35 | 28 | 42 | 34 | 35 | 103 | 78 | 29 | 47 | 20 | 99 | 47 | 20 |
| Anno. # 6 | 23 | 18 | 28 | 28 | 22 | 66 | 78 | 22 | 41 | 13 | 67 | 41 | 13 |
| Anno. # 7 | 33 | 27 | 40 | 27 | 35 | 95 | 57 | 32 | 58 | 21 | 89 | 58 | 21 |
| Anno. # 8 | 30 | 24 | 35 | 29 | 30 | 92 | 51 | 15 | 56 | 17 | 86 | 56 | 17 |
| Anno. # 9 | 28 | 26 | 30 | 37 | 26 | 74 | 65 | 16 | 47 | 18 | 72 | 47 | 18 |
| Anno. # 10 | 41 | 39 | 44 | 28 | 44 | 73 | 34 | 26 | 68 | 35 | 68 | 68 | 35 |
| Anno. Mean | **38** | **34** | **43** | **57** | **34** | **99** | **151** | **22** | **129** | **21** | **105** | **129** | **21** |
| Run # 1 | 207 | 161 | 256 | 132 | 223 | 811 | 222 | 145 | 211 | 104 | 724 | 211 | 104 |
| Run # 2 | 206 | 171 | 244 | 107 | 228 | 850 | 285 | 199 | 174 | 99 | 764 | 174 | 99 |
| Run # 3 | 239 | 172 | 312 | 115 | 266 | 1067 | 266 | 199 | 188 | 103 | 949 | 188 | 103 |
| Run # 4 | 240 | 176 | 307 | 91 | 274 | 1023 | 185 | 75 | 133 | 114 | 895 | 133 | 114 |
| Run # 5 | 212 | 188 | 238 | 103 | 237 | 823 | 190 | 165 | 224 | 105 | 735 | 224 | 105 |
| Run Mean | **221** | **174** | **271** | **110** | **245** | **915** | **230** | **157** | **186** | **105** | **813** | **186** | **105** |




Figure 4: Visualizations of all ten human annotations (shades of blue), the five post-processed HookFormer runs (shades of red), and the aggregation of the human annotations (yellow). (a) shows the Mapple Glacier on 2 November 2009, acquired by the TSX satellite; (b) shows the Columbia Glacier on 15 March 2016, acquired by the TDX satellite; (c) shows the Mapple Glacier on 9 June 2007, acquired by the Envisat satellite; and (d) shows the Columbia Glacier on 6 January 2018, acquired by the Sentinel-1 satellite.
Furthermore, we compare the annotators with the best-performing DL system, the HookFormer. The MDE of the HookFormer's post-processed automatic calving front predictions is high, with an average of 221 m and a standard deviation of 15 m. This is significantly higher than the comparatively low average MDE of the manual annotations of 38 m with a standard deviation of 15 m (refer to Sec. 3.4 for the employed statistical tests and to the supplementary information for their numerical results).
A visual comparison between the DL system's and the annotators' performance is shown in Fig. 1. The MDEs for each run, subdivided into different subsets of the test set, are given in Table 1. The MDE of the predicted fronts is higher for winter images than for summer images. During winter, ice mélange is more prevalent, which might cause this drop in performance. For the DL system, predictions for the Columbia Glacier have a higher MDE than those for the Mapple Glacier. This difference could be due to stronger patching artifacts in the Columbia predictions, as the Columbia Glacier images tend to be larger than the Mapple Glacier images and, therefore, need to be split into more patches. Of the various sensors, Sentinel-1 has by far the highest MDE for the outputs of the DL system. The low resolution of 20 m, combined with ice mélange in front of the calving front, leads to false predictions for the Columbia Glacier (see, e.g., Fig. 4 (d)) and increases the total MDE for Sentinel-1 images. Although to a lesser extent, this also holds for the human annotators, for whom a drop in performance on low-resolution images containing ice mélange is observable; for the annotators, this also applies to the Mapple Glacier. In general, higher-resolution images seem to be easier for the DL system to interpret, as indicated by the low MDE for images with a resolution of 7 m. Two examples where the predictions of the annotators and the runs of the DL system closely agree are shown in Fig. 4 (a) and 4 (b).
If one of the evaluated DL systems is to be used to generate a calving front dataset for analysis, we strongly recommend automated or manual checks to ascertain the plausibility of the delineated calving fronts. Note that our evaluation is restricted to the scenarios covered by CaFFe's test set: laterally bounded glaciers not seen during training and captured by SAR sensors. Scenarios like ice shelves, optical images, and glaciers already seen during training are not covered by the CaFFe dataset and have, therefore, not been tested in this study.
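Such an automated check could be as simple as the following toy sketch, which flags a delineation whose median distance to the previously accepted front exceeds a glacier-specific threshold; both the criterion and the threshold are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def plausible(front_new, front_prev, resolution_m, max_jump_m=2000.0):
    """Toy check: accept a new front (arrays of (n, 2) pixel coordinates)
    only if its median distance to the previously accepted front stays
    below a glacier-specific jump threshold."""
    dist = cKDTree(front_prev).query(front_new)[0] * resolution_m
    return float(np.median(dist)) <= max_jump_m
```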
Our research shows that DL has not yet reached human performance in delineating glacier calving fronts. The best-performing DL system produces calving front predictions that are, on average, 183 m away from the average human-labeled calving front after post-processing. As an example, for the Mapple Glacier, the difference of 110 m to the average human-labeled calving front would lead to an error of 0.38 km² in glacier area if the difference is multiplied by the average length of Mapple's front in CaFFe's test set.
However, an assessment of frontal ablation rates for the large number of tidewater glaciers at high temporal resolution and on regional scales is still missing [44], [45]. Consequently, we are faced with the need to improve DL systems, as manual mapping is not feasible at this scale. From our analyses of the influences on DL system performance, several avenues for future research are derived to improve calving front delineation: future research should further explore the possibilities of ViTs and foundation models and focus on the efficient provision and integration of global information. Until human performance is achieved, we strongly recommend automated or manual checks to ascertain the plausibility of the delineated calving fronts.
The authors would like to sincerely thank all anonymous annotators for their contribution to this research. This research was funded by the Bayerisches Staatsministerium für Wissenschaft und Kunst within the Elite Network Bavaria with the International Doctorate Program "Measuring and Modelling Mountain Glaciers in a Changing Climate" (IDP M3OCCA) as well as the German Research Foundation (DFG) project "Large-scale Automatic Calving Front Segmentation and Frontal Ablation Analysis of Arctic Glaciers using Synthetic-Aperture Radar Image Sequences (LASSI)". The authors thank the Jet Propulsion Laboratory, California Institute of Technology, for support of their work under a contract with the National Aeronautics and Space Administration. The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR projects b110dc and b194dc. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the DFG (project 440719683). The author team acknowledges the provision of satellite data under various AOs from the respective space agencies (DLR, ESA, JAXA, CSA).
The benchmark dataset CaFFe is available at https://doi.org/10.1594/PANGAEA.940950 [46].
Figures 2, 3, 4, 5, and 6 show a subset of CaFFe's images. The full set of visualizations is provided at https://doi.org/10.5281/zenodo.11484341.
Codes for the dl systems can be found in their studies’ respective repositories:
https://github.com/daniel-cheng/CALFIN,
https://github.com/VChristlein/PixelwiseDistanceRegression4GlacierSegmentation,
https://github.com/zetaSaahil/Glacier-CFL-detection_DMapBCE,
https://github.com/Nora-Go/Calving_Fronts_and_Where_to_Find_Them,
https://github.com/EntChanelt/GlacierCRF,
https://github.com/VChristlein/BayesianUNet4GlacierSegmentation/,
https://github.com/khdlr/HED-UNet,
https://github.com/ho11laqe/nnUNet_calvingfront_detection,
https://github.com/VChristlein/AttentionUNet4GlacierSegmentation/,
https://github.com/facebookresearch/segment-anything,
https://github.com/eloebel/glacier-front-extraction,
https://github.com/PCdurham/SEE_ICE,
https://github.com/yaramohajerani/FrontLearning,
https://github.com/VChristlein/MostOutOfUNet4GlacierSegmentation,
https://github.com/RiverNA/AMD-HookNet,
https://github.com/RiverNA/HookFormer,
https://github.com/enzezhang/FrontDL3,
https://zenodo.org/records/8270875, and
https://github.com/Tangyu35/Calving-front-detection.
Nora Gourmelon: Conceptualization, Methodology, Software, Experiments, Statistical Analysis, Project administration, Writing - Original draft preparation. Konrad Heidler: Software, Experiments, Writing - review & editing. Erik Loebel: Software, Experiments, Writing - review & editing. Daniel Cheng: Software, Writing - review & editing. Julian Klink: Software, Experiments, Writing - review & editing. Fei Wu: Software, Experiments, Writing - review & editing. Noah Maul: Writing - review & editing. Moritz Koch: Writing - review & editing. Marcel Dreier: Writing - review & editing. Dakota Pyles: Writing - review & editing. Thorsten Seehaus: Supervision, Writing - review & editing. Matthias Braun: Supervision, Writing - review & editing. Andreas Maier: Supervision, Writing - review & editing. Vincent Christlein: Supervision, Validation, Writing - review & editing.
Nora Gourmelon received the B.Sc. and M.Sc. degrees (passed with distinction) in computer science from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2019 and 2020, respectively, where she is currently pursuing the Ph.D. degree in computer science with the Pattern Recognition Laboratory. She joined the Pattern Recognition Laboratory, FAU, in 2020, and is part of the International Doctorate Program “Measuring and Modeling Mountain glaciers and ice caps in a Changing ClimAte (M3OCCA).” She was honored as AI Newcomer 2023 in the field of natural and life sciences by the German Association of Computer Science. Her main research interests include applications of AI on topics related to sustainability and natural sciences.
Konrad Heidler (Student Member, IEEE) received the Bachelor’s degree in Mathematics, the Master’s degree in Mathematics in Data Science, and the doctorate in engineering (Dr.-Ing.) from Technical University of Munich (TUM), Munich, Germany, in 2017, 2020, and 2024, respectively. He is currently a postdoctoral researcher at TUM, where he is leading the working group for Visual Learning and Reasoning at the Chair for Data Science in Earth Observation. His current research work focuses on the application of deep learning in polar regions.
Erik Loebel received the Bachelor's degree in Geodesy and Geoinformation and the Master's degree in Geodesy from Technische Universität Dresden (TU Dresden), Germany, in 2016 and 2019, respectively. Since 2020, he has been pursuing a Ph.D. degree at the Chair of Geodetic Earth System Research, TU Dresden. His research interests include remote sensing and earth observation, machine learning, signal processing, and time series analysis, with a special focus on the cryosphere and polar regions. His current research aims at developing and applying deep learning methods for monitoring glacier changes in Greenland.
Daniel Cheng received his B.Sc., M.Sc., and Ph.D. in computer science from the University of California, Irvine. His Ph.D. work focused on the automatic extraction of glacial features, for use in modeling frameworks such as the Ice-sheet and Sea-level System Model. His interests involve applying such machine learning methods to the processing of remote sensing data products, as well as the use of such data to model aspects of the earth’s climate. He is currently pursuing these interests as a postdoc at the Jet Propulsion Laboratory, where his work focuses on the state estimation of the Antarctic Ice Sheet, using ISSM to provide controls on ocean models at the ice-ocean interface.
Julian Klink received the B.Sc. degree in computer science from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany in 2022 and is currently pursuing the M.Sc. degree in computer science with the Pattern Recognition Laboratory, FAU.
Anda Dong received his B.Sc. degree in 2023 and is currently pursuing an M.Sc. in computer science at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany. In 2023/2024, he participated in the Cooperative Laboratory Study program at Tohoku University, Japan.
Fei Wu received the B.E. degree in electronic and information engineering from Central South University, Changsha, China, in 2017, and the Ph.D. degree in signal and information processing from University of Chinese Academy of Sciences, Beijing, China, in 2023. He is currently a postdoctoral research fellow with the Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. His research interests include computer vision and machine learning with a focus on handwriting analysis, object tracking, and segmentation.
Noah Maul received his B.Sc. and M.Sc. degrees in computer science from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2017 and 2020, respectively. He joined the Pattern Recognition Lab at FAU in 2020, where his research mainly focused on the symbiosis of machine learning, blood flow simulation, and X-ray imaging. His research interests further include CT image reconstruction, segmentation, and neural PDE solving.
Moritz Koch received the B.Sc and M.Sc. degrees in physical geography and climate and environmental sciences from the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2019 and 2021, respectively. He is currently enrolled as a PhD Student at the Institute of Geography at the FAU and an affiliated student at the International Doctorate Program “Measuring and Modeling Mountain glaciers and ice caps in a Changing ClimAte (M3OCCA).”
Marcel Dreier earned his Bachelor's and Master's degrees in computer science at Friedrich-Alexander-Universität Erlangen-Nürnberg. In his Master's thesis, completed in August 2023, he used diffusion models to generate offline handwritten text images. He then joined the Pattern Recognition Lab in October 2023 as a Ph.D. candidate under the supervision of Prof. Andreas Maier. His current research focuses on machine learning on radargrams.
Dakota Pyles received a B.Sc. in Geosciences from the University of Montana in 2019 and a M.Sc. in Geology from the University of Idaho in 2022. He is currently pursuing a Ph.D. degree in the Working Group of Geographic Information Systems (GIS) and Remote Sensing at the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). In 2023, he joined the Institute of Geography, FAU, and is affiliated with the International Doctorate Program “Measuring and Modeling Mountain glaciers and ice caps in a Changing ClimAte (M3OCCA)”. His current research focuses on estimating frontal ablation in the Arctic and understanding spatiotemporal drivers of observed tidewater glacier changes.
Thorsten Seehaus received the Diploma degree in physics from the University of Würzburg, Würzburg, Germany, in 2011, and the Ph.D. degree in geography from Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany, in 2016. He finished an apprenticeship as a Mechatronics Technician at Jopp GmbH, Bad Neustadt an der Saale, Germany, in 2003. In 2012, he joined the Working Group of Geographic Information System (GIS) and Remote Sensing, Institute of Geography, Friedrich-Alexander-Universität Erlangen-Nürnberg, where he is currently a Research Assistant. He uses mainly multimission synthetic aperture radar (SAR) imagery to assess glacier variables, such as mass balances and area changes. His research interests include developing and applying remote sensing techniques for monitoring glacier changes on various scales and in various regions worldwide.
Matthias Braun received the Diploma degree in hydrology and the Dr. rer.nat. (Ph.D.) degree (Hons.) from the University of Freiburg, Breisgau, Germany, in 1997 and 2001, respectively. From 2001 to 2010, he was the Scientific Coordinator of the interdisciplinary Center for Remote Sensing of Land Surfaces at Bonn University, Germany. He was appointed as an Associate Professor of geophysics with the University of Alaska Fairbanks, Fairbanks, AK, USA, in 2010, and as a Professor with Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, in 2011. His research interests cover mass change of glaciers for which he combines in-situ observations, remote sensing and modelling with a strong focus on large-scale Earth observation data analysis. He has been leading numerous field campaigns in Antarctica, Greenland, Svalbard, Patagonia, High Mountain Asia, and the Alps.
Andreas Maier (Senior Member, IEEE) was born in Erlangen, Germany, in November 1980. He received the Diploma degree in computer science and the Ph.D. degree from Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, in 2005 and 2009, respectively. From 2005 to 2009, he was with the Pattern Recognition Laboratory, Computer Science Department, Friedrich-Alexander-Universität Erlangen-Nürnberg. His major research subject was medical signal processing in speech data. In this period, he developed the first online speech intelligibility assessment tool, PEAKS, which has been used to analyze over 4000 patients and control subjects so far. From 2009 to 2010, he worked on flat-panel C-arm CT as a Post-Doctoral Fellow at the Radiological Sciences Laboratory, Department of Radiology, Stanford University, Stanford, CA, USA. From 2011 to 2012, he was with Siemens Healthcare, Erlangen, Germany, as the Innovation Project Manager and was responsible for reconstruction topics in the angiography and X-ray business unit. In 2012, he returned to Friedrich-Alexander-Universität Erlangen-Nürnberg as the Head of the Medical Reconstruction Group, Pattern Recognition Laboratory, where he became a Professor and the Head in 2015. His research interests include medical imaging, image and audio processing, digital humanities, interpretable machine learning, and the use of known operators. Dr. Maier has been a member of the Steering Committee of the European Time Machine Consortium since 2016. In 2018, he received the ERC Synergy Grant "4D nanoscope."
Vincent Christlein received the degree in computer science and the Ph.D. (Dr.-Ing.) degree from Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2012 and 2018, respectively. During his studies, he worked on automatic handwriting analysis with a focus on writer identification and writer retrieval. Since 2018, he has been a Research Associate with the Pattern Recognition Laboratory, FAU, where he was promoted to an Academic Councilor in 2020 and heads the Computer Vision Group, which covers a wide variance of topics, e.g., environmental projects such as glacier segmentation or solar cell crack recognition, but also computational humanities topics, such as document and art analysis.
In this section, we review the methodologies of the compared systems and the adjustments made. Tables 2, 3, and 4 summarize the segmentation masks originally used, the network architecture on which each system is based, the original strategy for dealing with image sizes, and the augmentations performed. Table 5 lists the values of the re-optimized hyperparameters for each model in the comparison.
Table 2: Segmentation masks originally used by each system (BCL: binary coastline, BCF: binary calving front, BO: binary ocean, Multi: multi-class) and the base model each system builds on.

| Paper | BCL | BCF | BO | Multi | Conv. U-Net | DeepLabv3+ | ViT | VGG16 |
|---|---|---|---|---|---|---|---|---|
| Cheng | ✔ | ✔ | ✔ | ||||||
| Davari (a) | ✔ | ✔ | |||||||
| Davari (b) | ✔ | ✔ | |||||||
| Gourm. (22) | Front | ✔ | ✔ | ||||||
| Zones | ✔ | ✔ | |||||||
| Gourm. (23) | ✔ | ✔ | |||||||
| Hartmann | ✔ | ✔ | |||||||
| Heidler | ✔ | ✔ | ✔ | ||||||
| Herrmann | ✔ | ✔ | ✔ | ||||||
| Holzmann | ✔ | ✔ | |||||||
| Kirillov | ✔ | ✔ | |||||||
| Loebel | ✔ | ✔ | |||||||
| Marochov | ✔ | ✔ | |||||||
| Mohajerani | ✔ | ✔ | |||||||
| Periya. | ✔ | ✔ | |||||||
| Wu (a) | ✔ | ✔ | |||||||
| Wu (b) | ✔ | ✔ | |||||||
| Zhang (21) | ✔ | ✔ | |||||||
| Zhang (23) | ✔ | ✔ | |||||||
| Zhu | ✔ | ✔ | ✔ | ||||||
Table 3: Original strategy of each system for dealing with image sizes, via resizing and/or patch extraction (sizes and overlaps in pixels; "/" indicates that the step is not used).

| Paper | Resizing Size | Patch Size | Train-time Overlap | Test-time Overlap |
|---|---|---|---|---|
| Cheng et al. [24] | \(256\,\times\,256\) | \(224\,\times\,224\) | / | 208 |
| Davari et al. [11] | \(512\,\times\,512\) | / | / | / |
| Davari et al. [10] | / | \(256\,\times\,256\) | 0 | 0 |
| Gourmelon et al. [12] | / | \(256\,\times\,256\) | 0 | 128 |
| Gourmelon et al. [13] | / | \(256\,\times\,256\) | 0 | 128 |
| Hartmann et al. [14] | / | \(256\,\times\,256\) | 0 | 0 |
| Heidler et al. [15] | / | \(768\,\times\,768\) | 384 | 384 |
| Herrmann et al. [16] | / | \(1280\,\times\,1024\) | 0 | 640, 512 |
| Holzmann et al. [17] | \(512\,\times\,512\) | / | / | / |
| Kirillov et al. [36] | \(1024\,\times\,X\) or \(X\,\times\,1024\) | / | / | / |
| Loebel et al. [18] | / | \(512\,\times\,512\) | a. n. | a. n. |
| Marochov et al. [26], Phase 1 | / | \(50\,\times\,50\) | 30 | 0 |
| Marochov et al. [26], Phase 2 | / | \(15\,\times\,15\) | 14 | 14 |
| Mohajerani et al. [19], Training | \(150\,\times\,240\) | / | / | / |
| Mohajerani et al. [19], Testing | \(200\,\times\,300\) | / | / | / |
| Periyasamy et al. [20] | / | \(256\,\times\,256\) | 0 | /\(^*\) |
| Wu et al. [21], Target | / | \(288\,\times\,288\) | 0 | 0 |
| Wu et al. [21], Context | \(288\,\times\,288\) | \(576\,\times\,576\) | 288 | 288 |
| Wu et al. [34], Target | / | \(224\,\times\,224\) | 0 | 0 |
| Wu et al. [34], Context | \(224\,\times\,224\) | \(448\,\times\,448\) | 224 | 224 |
| Zhang et al. [27] | / | \(960\,\times\,720\) | 320, 240 | 384, 288 |
| Zhang et al. [28] | / | \(960\,\times\,720\) | 320, 240 | 384, 288 |
| Zhu et al. [32] | / | \(384\,\times\,384\) | 192, 192 | 192, 192 |
| Paper | Flips | Rot. | Noise | Sharp. | Crop | Bright. | Elastic | Gray | Other | |
|---|---|---|---|---|---|---|---|---|---|---|
| Cheng et al. [24] | ✔ | ✔ | ✔ | ✔ | ✔ | |||||
| Davari et al. [11] | ✔ | ✔ | ||||||||
| Davari et al. [10] | ✔ | ✔ | ||||||||
| Gourmelon et al. [12] | ✔ | ✔ | ✔ | ✔ | ✔ | |||||
| Gourmelon et al. [13] | ✔ | ✔ | ✔ | ✔ | ✔ | |||||
| Hartmann et al. [14] | ||||||||||
| Heidler et al. [15] | ✔ | ✔ | ||||||||
| Herrmann et al. [16] | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ||||
| Holzmann et al. [17] | ✔ | ✔ | ||||||||
| Kirillov et al. [36] | ||||||||||
| Loebel et al. [18] | ✔ | ✔ | ||||||||
| Marochov et al. [26] | Phase 1 | ✔ | ✔ | |||||||
| Phase 2 | ||||||||||
| Mohajerani et al. [19] | ✔ | ✔ | ||||||||
| Periyasamy et al. [20] | ✔ | ✔ | ||||||||
| Wu et al. [21] | ✔ | ✔ | ||||||||
| Wu et al. [34] | ✔ | ✔ | ||||||||
| Zhang et al. [27] | ✔ | ✔ | ||||||||
| Zhang et al. [28] | ✔ | ✔ | ||||||||
| Zhu et al. [32] | ✔ | |||||||||
| Paper | LR | \(\gamma\) | Dilation kernel | bin. thres. | w | k | R | Tile size | Kernel size | Loss weight. | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cheng et al. [24] | \([3e^{-5}; 3e^{-4}]\) | / | / | / | / | / | / | / | / | \(0.01 * L_z + 0.99 * L_f\) | |
| Davari et al. [11] | \(1e^{-2}; 1e^{-6}\) | \(8\) | \(2\,\times\,2\) | / | / | / | / | / | / | / | |
| Davari et al. [10] | \([1e^{-7}, 1e^{-4}]\) | / | / | \(0.05\) | \(3\) | \(0.1\) | \(1\) | / | / | / | |
| Gourmelon et al. [12], Front | / | / | / | / | / | / | / | / | / | / |
| Gourmelon et al. [12], Zones | / | / | / | / | / | / | / | / | / | / |
| Gourmelon et al. [13] | / | / | / | / | / | / | / | / | / | / | |
| Hartmann et al. [14] | \(1e^{-4}\) | / | / | / | / | / | / | / | / | / | |
| Heidler et al. [15] | \([4e^{-5}, 2e^{-4}]\) | / | / | / | / | / | / | / | / | / | |
| Herrmann et al. [16] | / | / | / | / | / | / | / | / | / | / | |
| Holzmann et al. [17] | \([1e^{-6}, 1e^{-3}]\) | / | / | \(0.5\) | \(8\) | / | / | / | / | / | |
| Kirillov et al. [36], Iterative | / | / | / | / | / | / | / | / | / | / |
| Kirillov et al. [36], Parallel | / | / | / | / | / | / | / | / | / | / |
| Loebel et al. [18] | \(1e^{-4}\) | / | / | / | / | / | / | / | / | / | |
| Marochov et al. [26] | \(1e^{-3}; 1e^{-3}\) | / | / | / | / | / | / | \(32\) | \(15\,\times\,15\) | / | |
| Mohajerani et al. [19] | \(1e^{-3}\) | / | / | / | / | / | / | / | / | / | |
| Periyasamy et al. [20] | \([1e^{-7}, 1e^{-3}]\) | / | / | / | / | / | / | / | / | \(0.8 * CE + 0.2* Dice\) | |
| Wu et al. [21] | / | / | / | / | / | / | / | / | / | / | |
| Wu et al. [34] | / | / | / | / | / | / | / | / | / | / | |
| Zhang et al. [27] | \(1e^{-4}\) | / | / | / | / | / | / | / | / | / | |
| Zhang et al. [28] | \(0.05\) | / | / | / | / | / | / | / | / | / | |
| Zhu et al. [32] | \(0.05\) | / | / | / | / | / | / | / | / | / | |
Cheng et al. [24] employ the DeepLabv3 [29] architecture to segment optical and SAR imagery into land and sea, including ice mélange. They use the Xception model [30] as the backbone, as in the original DeepLabv3 paper, but add Atrous Spatial Pyramid Pooling [29] between the encoder and decoder. Their loss function is a weighted sum of the cross-entropy and the Dice loss. Cheng et al. [24]'s network, called CALFIN, outputs two probability masks: sea versus land and coastline versus background. Their training and testing datasets consist of 1,541 Landsat images of Greenland and 232 Sentinel-1A/B images of Antarctica. The dataset is part of the published dataset of Cheng et al. [47]. All images are centered over basins and are precision- and terrain-corrected. Only images with low cloud coverage and a low number of NODATA pixels are further considered. Next, they are resized to 256 \(\times\) 256 pixels and enhanced using the pseudo-HDR toning (HDR) and shadows/highlights (S/H) options in Adobe Photoshop. Before feeding the input into CALFIN, patches of size 224 \(\times\) 224 are extracted. The input patches are augmented randomly on the fly and have three channels: the original image, the HDR-enhanced image, and the S/H-enhanced image. Augmentations include flips, Gaussian noise, sharpening filters, rotations of up to 12\(^{\circ}\), as well as crops and rescaling. A polyline extraction via a minimum spanning tree is performed to extract the final calving front prediction from CALFIN's output probability masks, and the result is masked with the corresponding fjord boundaries. To assess their prediction quality, Cheng et al. [24] calculate the mean and median of the distances between the closest pixels in the predicted and target fronts in meters.
For the comparison, we omit the resizing step during pre-processing because information is lost during resizing. Instead, we directly extract patches of size 224 \(\times\) 224 pixels. We adjust the number of output layers from two to five so that one channel predicts the front labels of the CaFFe dataset, and the remaining four channels predict the zone labels. Masking with fjord boundaries during post-processing is prior knowledge and, therefore, cannot be used for the comparison. The polyline extraction without masking the fjord boundaries would result in a coastline prediction, not a calving front prediction. Therefore, to extract the final front prediction, we use the post-processing of Gourmelon et al. [12]'s Zones network instead of the original post-processing.
Davari et al. [11] convert the typical binary front segmentation into a regression problem by applying a distance map transform to the calving front segmentation mask. Their dataset includes SAR imagery of two glacier systems at the Antarctic Peninsula. The images are multi-looked, calibrated to sigma-0, geo-referenced, ortho-rectified, and resized to \(512\,\times\,512\) pixels. Flips and rotations are used to augment the dataset. Their DL model predicts each pixel's distance to the closest point of the calving front. The architecture of their model is a simple U-Net [23], trained with the mean squared error loss. From the predicted distance map, the calving front is extracted during post-processing. In their paper, Davari et al. [11] test three different post-processing schemes: statistical thresholding, a conditional random field, and a second U-Net, and show that the second U-Net gives the most accurate results. The second U-Net takes the predicted distance map as input and outputs a segmentation prediction for the front. It is trained on the front segmentation masks using a binary cross-entropy loss. The one-pixel-wide front segmentation masks are thickened with a kernel of size \(5\,\times\,5\) to ease the class imbalance problem for the second U-Net. The output of the second U-Net is post-processed with morphological thinning.
We add the post-processing of Gourmelon et al. [12]'s Front network. This results in a one-pixel-wide front prediction, which is essential for a fair comparison, as a broader prediction would skew the distance error computation.
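The distance map transform at the core of this formulation can be sketched with SciPy's Euclidean distance transform; note that distance_transform_edt measures the distance to the nearest zero, so the front pixels must be the zeros of its input.

```python
import numpy as np
from scipy import ndimage

def distance_map_target(front_mask: np.ndarray) -> np.ndarray:
    """Regression target: per-pixel Euclidean distance to the closest
    front pixel of a binary front mask."""
    return ndimage.distance_transform_edt(~front_mask.astype(bool))
```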
Davari et al. [10] test three different versions of the U-Net [23] for calving front extraction. The best-performing version uses mcc as an early stopping criterion and is trained on binary segmentation masks showing the calving front versus background using an improved distance map loss. Davari et al. [10]’s dataset includes SAR imagery of the Jakobshavn Isbrae Glacier located in Greenland and two glacier systems at the Antarctic Peninsula. The images are multi-looked, calibrated to sigma-0, geo-referenced, and ortho-rectified. Additionally, only the images of Jakobshavn Isbrae are median-filtered to reduce speckle noise. The calving front segmentation masks are dilated with a \(5\,\times\,5\) kernel to alleviate the class imbalance. All images are divided into non-overlapping patches of size \(256\,\times\,256\), and the resulting dataset is artificially enlarged by flip and rotation augmentations. No post-processing is performed.
For the comparison, we omit median-filtering as all images need to be treated similarly, and we add the post-processing of Gourmelon et al. [12]’s Front network to extract the calving front.
The baselines for the benchmark dataset were presented in the same paper as the dataset by Gourmelon et al. [12]. As the benchmark features two label categories, Gourmelon et al. [12] provide two separate networks, which from now on will be called “Zones” and “Front” after the labels used to train the networks. Both networks have a U-Net structure with aspp [42] in the bottleneck. The Front network is trained with an improved distance map loss [10], while the loss function of the Zones network is a weighted combination of Dice [48] and cross-entropy [49]. Only the front labels are pre-processed via a morphological dilation employing a rectangular structuring element of size \(5\,\times\,5\) pixels. For further processing, both networks extract patches of size \(256\,\times\,256\) with no overlap for training and 128 pixels overlap for testing. Image patches are augmented online by rotations, horizontal flips, brightness adjustments, Gaussian noise, and elastic transforms. Neural network outputs are combined by patch merging with Gaussian importance weighting. Post-processing for the Zones network includes filling gaps in the ocean zone prediction and removing all but the largest connected predicted ocean zone. The boundary between the ocean and glacier zones is taken as the predicted calving front. For the Front network, the predicted front is skeletonized, and the longest path in each separate skeleton is identified to obtain 1-pixel-wide lines.
No adaptations were performed for the comparison.
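The following is a compact sketch of the Zones post-processing described above, i. e., filling gaps in the ocean prediction, keeping only the largest connected ocean component, and taking the ocean-glacier boundary as the front; it is a simplified illustration rather than the published implementation, and the class ids are assumptions:

```python
import numpy as np
from scipy import ndimage
from skimage import measure

def extract_front_from_zones(zone_pred: np.ndarray,
                             ocean_id: int, glacier_id: int) -> np.ndarray:
    """Boundary between the largest ocean component and the glacier zone."""
    ocean = ndimage.binary_fill_holes(zone_pred == ocean_id)
    labels = measure.label(ocean)
    if labels.max() == 0:
        return np.zeros_like(ocean)   # no ocean predicted at all
    largest = labels == np.argmax(np.bincount(labels.ravel())[1:]) + 1
    # Front pixels: ocean pixels that touch a glacier pixel.
    glacier_neighbors = ndimage.binary_dilation(zone_pred == glacier_id)
    return largest & glacier_neighbors
```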
Gourmelon et al. [13] change the post-processing of Gourmelon et al. [12]’s Zones network by introducing a crf. The crf replaces the commonly used argmax, which determines the predicted zone of each pixel solely from that pixel’s output logits. Instead of considering each pixel individually, as argmax does, the crf optimizes the predicted zones while taking the predictions and logits of all other pixels into account.
No further adaptations were made to the system pipeline. Moreover, retraining the network was not necessary.
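Such a fully connected crf step could, for instance, be realized with the pydensecrf package; whether Gourmelon et al. [13] use this package is not stated here, and the pairwise parameters below are illustrative. For single-channel SAR input, the image can be replicated to three channels for the bilateral term.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(probs: np.ndarray, image: np.ndarray, iters: int = 5) -> np.ndarray:
    """probs: (n_classes, H, W) softmax output; image: (H, W, 3) uint8."""
    n_classes, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)       # location-only smoothness term
    d.addPairwiseBilateral(sxy=80, srgb=13,      # appearance-aware term
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iters)
    return np.argmax(q, axis=0).reshape(h, w)    # refined zone map
```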
To increase the accuracy in uncertain image regions, Hartmann et al. [14] approximate two Bayesian U-Nets by inserting dropout as random sampling layers and chain them into a two-stage pipeline that first determines uncertain regions and then focuses on these regions to refine the prediction. Hartmann et al. [14]’s multi-looked, geo-referenced, and ortho-rectified dataset comprises SAR imagery of two glacier systems at the Antarctic Peninsula. For training and testing, patch extraction with a patch size of \(256\,\times\,256\) and no overlap is conducted. The first Bayesian U-Net takes the SAR image as input, while the second, in addition to the SAR image, receives an uncertainty map, which is computed as the binarized variance of 20 forward passes of the first U-Net. Both networks are trained to segment ocean versus non-ocean regions using the binary cross-entropy loss and early stopping on the validation loss with a patience of 30 epochs.
For the comparison, we adapt the U-Nets from binary zone segmentation to multi-zone segmentation with four output channels and a categorical cross-entropy loss. The second U-Net receives four uncertainty maps, one for each zone. To obtain the final prediction, an argmax is applied to the four output channels of the second U-Net, and the post-processing of Gourmelon et al. [12]’s Zones network is applied.
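A sketch of the Monte Carlo dropout step that produces the uncertainty maps, written in PyTorch; the model is a placeholder and the binarization threshold is an assumption:

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model: torch.nn.Module, x: torch.Tensor,
                           n_passes: int = 20,
                           threshold: float = 0.05) -> torch.Tensor:
    """Binarized per-pixel variance over stochastic forward passes.

    x: (B, 1, H, W) SAR patches; returns (B, C, H, W) binary uncertainty maps.
    """
    model.eval()
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()   # keep dropout stochastic at inference time
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_passes)])
    return (probs.var(dim=0) > threshold).float()

# The second U-Net then receives the image and the four uncertainty maps:
# x2 = torch.cat([x, mc_dropout_uncertainty(first_unet, x)], dim=1)
```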
Heidler et al. [15]’s network is based on the U-Net architecture but has two output heads: one for delineating the coastline and one for the segmentation into sea and land. Both heads separately merge up-scaled feature maps from the U-Net’s decoder using an attention mechanism and employ deep supervision with an adaptively balanced cross-entropy loss function. The dataset used to train and test the network includes 16 Antarctic Sentinel-1 scenes taken between June 2017 and December 2018, each covering an area of 315 km \(\times\) 263 km. During pre-processing, all scenes are processed in the Antarctic Polar Stereographic Projection (EPSG:3031), converted to decibels, and divided into overlapping patches of 768 \(\times\) 768 pixels. Applied augmentations are rotations by multiples of 90\(^{\circ}\) and horizontal and vertical mirroring. As both polarizations of Sentinel-1 are used, the input to the network has two channels (HH and HV). Heidler et al. [15] conduct no post-processing of the network’s output. The sea/land segmentation is evaluated using the mean IoU; for the edge detection result, both the F1 scores at the optimal image and dataset scale and the average distance to the target coastline over all predicted coastline pixels are employed. Moreover, Heidler et al. [15] show that adding down-sampled TanDEM-X elevation maps as a third input channel can be beneficial.
The input channels are reduced to one to accommodate the caffe dataset. In addition, the sea/land segmentation head is extended to encompass multiple landscape zones. The loss function for this multi-class head is set to categorical cross-entropy, and the post-processing of Gourmelon et al. [12]’s Zones network is applied to extract the calving front. The coastline segmentation head did not need any adaptation for calving front segmentation; only the post-processing of Gourmelon et al. [12]’s Front network is added to obtain a calving front prediction from this binary segmentation head.
The nnU-Net [50], a framework initially designed for biomedical image segmentation, adapts the U-Net to a given dataset and automates design decisions and hyperparameter tuning, eliminating the need for manual intervention. Herrmann et al. [16] train and test the nnU-Net on the caffe dataset and experiment with multi-task learning, concluding that fusing the front and zone labels and training the nnU-Net on this fused label yields the lowest mde. Front labels are dilated with a structuring element of \(5\,\times\,5\) pixels and inserted into the zone label. Patch extraction is performed with the median image size, which for the caffe dataset is \(1280\,\times\,1024\). The dataset is augmented online using rotations and scaling, Gaussian noise, Gaussian blur, brightness and contrast adjustments, simulation of low resolution, gamma augmentation, and mirroring. nnU-Net’s loss function is a combination of the cross-entropy and Dice loss. Since the nnU-Net assumes that the final segmentation objective is the label itself, Herrmann et al. [16] add additional post-processing to extract the calving front. For this purpose, the front zone in the fused label is assigned to the ocean zone, and the glacier zone is dilated with a structuring element of \(7\,\times\,7\) pixels. Afterward, the post-processing of Gourmelon et al. [12]’s Zones network is applied.
The nnU-Net usually employs five-fold cross-validation and takes the ensemble of the five trained networks as the final prediction. Instead of taking the ensemble, we treat the five cross-validation networks as the five training runs of our comparison and compute the mean and standard deviation of the mde over them.
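The fused-label post-processing described above can be sketched as follows; the class ids are hypothetical:

```python
import numpy as np
from scipy import ndimage

OCEAN, GLACIER, FRONT = 1, 2, 4   # hypothetical class ids of the fused label

def defuse(fused: np.ndarray) -> np.ndarray:
    """Map the fused front+zone prediction back to a pure zone map:
    the front strip becomes ocean, then the glacier zone is dilated with a
    7 x 7 structuring element to close the gap the front strip left behind."""
    zones = fused.copy()
    zones[zones == FRONT] = OCEAN
    grown = ndimage.binary_dilation(zones == GLACIER,
                                    structure=np.ones((7, 7)))
    zones[grown] = GLACIER
    return zones   # afterwards, the Zones post-processing extracts the front
```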
Holzmann et al. [17] introduce attention gates into the skip connections of the U-Net and train the U-Net on labels distinguishing front and background using a distance-weighted loss function. Their dataset consists of SAR imagery showing two glacier systems in the Antarctic Peninsula. For pre-processing, Holzmann et al. [17] apply a median filter on the SAR images and resize both labels and images to \(512\,\times\,512\) pixels. The front labels are dilated to a width of six pixels to ease the class imbalance. Flipping and rotation augmentations are applied to enlarge the dataset. During post-processing, the output of the U-Net is simply binarized.
As we need a one-pixel-wide calving front to calculate the mde, we add skeletonization after the binarization.
The recently introduced SAM [36] is a promptable foundation model for zero-shot image segmentation. We test the version pre-trained on the SA-1B dataset with ViT-H as the backbone in a zero-shot manner on caffe; i. e., we do not fine-tune SAM on caffe but use it as is. As SAM is trained on RGB images, we repeat our single-channel input three times to artificially create three input channels, as suggested by the authors. For SAM’s image encoder, a ViT, the images are rescaled so that the longest image side has \(1024\) pixels and the aspect ratio is preserved. The resulting image embeddings are fed into the mask decoder alongside prompts specifying the object to be segmented and an optional segmentation mask that can be used for refinement. SAM can take prompts in text, point, dense (i. e., coarse segmentation map), and bounding box form. We generate point prompts using the Contextual HookFormer [34] with the goal of enhancing the zone segmentations already created by the Contextual HookFormer. For each zone, a sigmoid is applied to the corresponding output channel to obtain probability maps. Next, these probability maps are thresholded such that only the areas with the highest probability remain, and the resulting high-probability maps are eroded to focus on points in the center of the specific zone. As SAM cannot conduct semantic segmentation, we focus on predicting the ocean zone. Hence, positive prompts are randomly drawn from the eroded high-probability ocean map, and negative prompts are randomly drawn from the three remaining eroded high-probability maps. We test two approaches to feed prompts to SAM: parallel and iterative. For parallel prompt feeding, we draw ten positive prompts and ten negative prompts per zone (rock, glacier, NA) and pass all prompts to SAM at once such that SAM’s mask decoder is run only once. Additionally, we use the Contextual HookFormer’s logits of the ocean channel as a dense prompt. For iterative prompt feeding, the point prompts are drawn in the same way, but instead of being handed to SAM altogether, they are fed into SAM one after another. SAM’s mask decoder is run after every new prompt, receiving the new prompt and the last segmentation output as a dense prompt, so that the segmentation mask is iteratively refined. To extract the calving front from the segmentation mask, we overlay the binary ocean mask with the rock and NA predictions of the Contextual HookFormer and add the post-processing of Gourmelon et al. [12]’s Zones network.
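A condensed sketch of the parallel prompting variant using the official segment-anything package; the image, prompt coordinates, logits, and checkpoint path below are placeholders, while the call structure follows the description above:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholders: a SAR patch and prompts derived from the HookFormer outputs.
sar_image = np.zeros((512, 512), dtype=np.uint8)        # stand-in SAR image
pos_points = np.array([[100, 100], [120, 140]])         # from eroded ocean map
neg_points = np.array([[400, 400], [380, 60]])          # from rock/glacier/NA maps
ocean_logits = np.zeros((256, 256), dtype=np.float32)   # HookFormer ocean logits

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # path is a placeholder
predictor = SamPredictor(sam)
# set_image handles the rescaling to a longest side of 1024 pixels.
predictor.set_image(np.repeat(sar_image[..., None], 3, axis=-1))

masks, scores, logits = predictor.predict(
    point_coords=np.concatenate([pos_points, neg_points]).astype(np.float32),
    point_labels=np.concatenate([np.ones(len(pos_points)),
                                 np.zeros(len(neg_points))]).astype(int),
    mask_input=ocean_logits[None],   # dense prompt, shape (1, 256, 256)
    multimask_output=False,
)
ocean_mask = masks[0]                # binary ocean segmentation
```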
Loebel et al. [18] analyze the effect of different inputs on a neural network, including multi-spectral, topographic, and textural inputs. For this purpose, Loebel et al. [18] train a U-Net with six down- and upsampling layers on binary labels distinguishing ocean and non-ocean areas. The employed loss function is the binary cross-entropy. Loebel et al. [18]’s dataset includes radiometrically calibrated and ortho-rectified level-1 Landsat-8 imagery of 23 Greenland outlet glaciers and two glaciers at the Antarctic Peninsula. Each glacier is covered either by a single \(512\,\times\,512\) image or by multiple overlapping \(512\,\times\,512\) images if the area is too large for a single image. During pre-processing, histogram clipping is performed for each multi-spectral band. The dataset is augmented eight-fold by rotations and flipping. During post-processing, images of the same glacier are merged, if necessary, by averaging the overlap. Next, the segmentation output is binarized, and the coastline is vectorized using a contour algorithm. The calving front is extracted from the coastline with a static mask, which is manually created for each glacier.
For the comparison, we use only SAR images as input and alter the U-Net to perform multi-class instead of binary segmentation. To do so, we change the number of output channels to four and replace the binary cross-entropy with the categorical cross-entropy. Additionally, we employ the post-processing of Gourmelon et al. [12]’s Zones network to extract the final calving front.
A different approach to front delineation is taken by Marochov et al. [26]. Instead of segmenting the entire images directly into the desired classes, Marochov et al. [26] use classification networks to determine the class of every single pixel in each image separately. The differentiated classes include open water, iceberg water, mélange, glacier ice, snow on ice, snow on rock, and bare bedrock. The employed dataset comprises Sentinel-2 images from three glaciers in Greenland. The approach is separated into two phases: In the first phase, a VGG16 network [31] is trained on image tiles of \(50\,\times\,50\) pixels in which more than 95 % of pixels have the same class. The tiles are augmented using rotation and uniformly distributed noise. Hence, the input is an image tile, and the output is the predominant class in this image tile. With this first phase, the authors aim to overcome the need to produce pixel-wise labels for training, as the training labels for the VGG16 network can be coarse polygons, and the trained VGG16 network can then generate the pixel-wise labels for the second phase by classifying each pixel in the given training images. In the second phase, a small CNN takes in a small image patch of \(15\,\times\,15\) pixels and is trained to predict the center pixel’s class. Both networks employ the categorical cross-entropy loss function. After training, the small CNN is used to classify each pixel in the test images. The calving front is extracted during post-processing. The largest glacier object is isolated and refined with morphological geodesic active contours, and the boundary pixels of this glacier object are extracted. The classes associated with the ocean (open water, mélange, icebergs) are taken together, and objects larger than 1 km\(^2\) are dilated by 30 pixels. The intersection of these ocean objects and the extracted glacier boundary gives the front prediction. Moreover, Marochov et al. [26] fine-tune the trained model on one image from each of the glaciers in the test set. These images are not taken from the test set directly but show the test-set glaciers at time points not included in the test set.
For a fair comparison, we omit the fine-tuning. We adapt the networks to predict the four classes prevalent in caffe’s zone labels. To counter class imbalance, we do not perform augmentations for glacier tiles, as glacier tiles occur much more frequently in the training set than the other three classes. Moreover, as the prominent feature of the NA class is a smooth black region, we do not add Gaussian noise to the tiles of this class. In the original code of phase 1, training is stopped when a validation accuracy of 0.985 is reached. We change this to early stopping when the change in validation accuracy is less than 0.005, with a patience of 10 epochs, as a validation accuracy of 0.985 is never reached on the caffe dataset.
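Phase two amounts to dense sliding-window classification, which the following numpy sketch illustrates; tile_classifier stands in for the trained CNN, and the formulation is deliberately memory-naive for clarity:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def classify_pixels(image: np.ndarray, tile_classifier, patch: int = 15) -> np.ndarray:
    """Predict each pixel's class from its patch x patch neighborhood.

    tile_classifier maps (N, patch, patch) arrays to (N,) class ids and
    stands in for the trained phase-2 CNN; real systems batch this step.
    """
    pad = patch // 2
    padded = np.pad(image, pad, mode="reflect")       # keep border pixels classifiable
    windows = sliding_window_view(padded, (patch, patch))  # (H, W, 15, 15)
    h, w = image.shape
    flat = windows.reshape(-1, patch, patch)
    return tile_classifier(flat).reshape(h, w)
```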
Mohajerani et al. [19] employ a U-Net with a weighted binary cross-entropy as a loss function to segment multi-spectral Landsat images into calving front and background. Their data comprises 123 images of four Greenlandic glaciers. During pre-processing, these images are cropped to the region around the front with a buffer of 300 m, rotated such that the front is oriented in the y-direction, and resized to 200 \(\times\) 300 pixels using cubic interpolation. For training, the resulting 200 \(\times\) 300 sized images are cropped to a size of 150 \(\times\) 240 pixels. Moreover, Mohajerani et al. [19] normalize the image contrast, equalize grey-scale intensities to create a uniform distribution, and apply smoothing and edge enhancement kernels. As augmentation, the images are additionally flipped horizontally, and grey-scale intensities are inverted. The U-Net produces a probability mask, which must be post-processed to attain the final calving front prediction. The post-processing entails computing the least-cost path through the probability mask, with the values of the probability mask as step weights.
Since rotating the images so that the front has a certain orientation also requires prior knowledge of the test set, this step is omitted for the comparison. We replace the multiple cropping and resizing steps in pre-processing by rescaling the images to the average bounding box size, as resizing the entire images to their average size resulted in a memory error. This procedure gave better validation results than cropping to the bounding box size and then resizing to 150 \(\times\) 240 pixels. During testing, we omit the rescaling altogether. Further pre-processing steps are kept unchanged. The labels used are the front labels of caffe; hence, no changes to the architecture or loss function were needed. The post-processing is exchanged with that of Gourmelon et al. [12]’s Front network, since the original relies on knowledge of the fjord boundaries, which we consider prior knowledge.
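The least-cost-path post-processing maps naturally onto skimage.graph; the sketch below is our own illustration, with the start and end points left as placeholders since the original relies on the fjord boundaries to anchor the path:

```python
import numpy as np
from skimage.graph import route_through_array

def front_from_probability(prob: np.ndarray, start: tuple, end: tuple) -> np.ndarray:
    """Trace the least-cost path through a front probability mask.

    start and end are (row, col) seed points on opposite image borders;
    high front probability means low step cost, so the path hugs the front.
    """
    costs = 1.0 - prob
    path, _ = route_through_array(costs, start, end,
                                  fully_connected=True, geometric=True)
    front = np.zeros(prob.shape, dtype=bool)
    front[tuple(np.asarray(path).T)] = True
    return front
```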
Periyasamy et al. [20] aim to find an optimal configuration for a U-Net trained to differentiate between ocean and non-ocean regions by optimizing data pre-processing, data augmentation, the loss function, normalization layer, dropout rate, bottleneck layer, and transfer learning. Their dataset consists of multi-looked, geo-referenced, ortho-rectified SAR imagery of two glaciers in the Antarctic Peninsula and one glacier in Greenland. The best-performing model takes inputs pre-processed with a bilateral and a CLAHE filter. For training, images are divided into non-overlapping patches of size \(256\,\times\,256\) pixels and augmented eight-fold by rotation and flipping. During inference, images are fed into the network as a whole. The bottleneck of the best-performing U-Net includes a residual connection and dilated convolutions. The loss function combines the binary cross-entropy and the Dice loss with equal weighting. During post-processing, the calving front is extracted by dropping all but the largest connected ocean component and applying the Canny edge detector to obtain the contour of the ocean.
For the comparison, we employ the optimized U-Net and alter the binary zone segmentation to a multi-zone segmentation. Hence, the binary cross-entropy is replaced with a categorical cross-entropy. Moreover, we use a softmax as the final activation layer instead of a sigmoid and employ an argmax instead of a simple threshold to obtain the zone predictions. Lastly, we replace the post-processing with that of Gourmelon et al. [12]’s Zones network.
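The input filtering of the best configuration could look as follows with OpenCV; the filter parameters are illustrative, not the tuned values of the study:

```python
import cv2
import numpy as np

def preprocess(sar: np.ndarray) -> np.ndarray:
    """Bilateral de-speckling followed by CLAHE contrast enhancement."""
    smoothed = cv2.bilateralFilter(sar.astype(np.float32), d=9,
                                   sigmaColor=75, sigmaSpace=75)
    # CLAHE expects an 8-bit image, so rescale to [0, 255] first.
    smoothed = cv2.normalize(smoothed, None, 0, 255,
                             cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(smoothed)
```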
In all systems designed for calving front extraction, images are either divided into patches or resized to alleviate GPU memory issues. Both resizing and patch extraction have their downsides: during resizing, high-frequency details are lost, while patches lack the global information around the patch. Wu et al. [21] address this trade-off by employing the HookNet [43]. The HookNet consists of two connected U-Nets. The first U-Net takes in the target patch, while the other receives a downsized patch of the context that covers both the target patch and the surrounding area. Therefore, this approach combines local high-frequency details and coarse global information in the input. Wu et al. [21] improve the HookNet by integrating an attention mechanism into multiple hooking connections between the U-Nets and by adding deep supervision of the feature pyramid. The improved network is called AMD-HookNet. The zone labels of the caffe dataset are employed for training and testing. Wu et al. [21] extract non-overlapping target patches with a size of \(288\,\times\,288\) pixels. The extracted context patches are of size \(567\,\times\,567\) pixels, with the corresponding target patch in the center. The context patches overlap by 288 pixels and are resized to \(288\,\times\,288\) pixels before being fed into the U-Net of the AMD-HookNet’s context branch. The patches are jointly augmented via rotations and flipping. Wu et al. [21]’s AMD-HookNet is trained with a combination of the categorical cross-entropy and Dice loss on the outputs of the target and context branches, as well as deep supervision of upsampled feature maps of the hooking mechanism. The output patches of the target branch are stitched together and post-processed with the post-processing of Gourmelon et al. [12]’s Zones network to extract the calving front.
Apart from the length of training, no adaptations had to be performed for the comparison.
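Extracting an aligned target-context patch pair can be sketched as follows; this is our own illustration of the described geometry, not the AMD-HookNet code:

```python
import numpy as np
import cv2

def target_and_context(image: np.ndarray, top: int, left: int,
                       target: int = 288, context: int = 567) -> tuple:
    """Cut a target patch and its (approximately) centered context patch.

    The context patch covers the target patch plus surrounding area and is
    resized to the target resolution before entering the context branch.
    """
    pad = (context - target) // 2
    padded = np.pad(image, pad, mode="reflect")   # allow context at the borders
    tgt = image[top:top + target, left:left + target]
    ctx = padded[top:top + context, left:left + context]  # shares tgt's center
    ctx = cv2.resize(ctx, (target, target), interpolation=cv2.INTER_AREA)
    return tgt, ctx
```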
The HookFormer is the first fully Transformer-based network for calving front extraction. Wu et al. [34] base the HookFormer on the AMD-HookNet but exchange the convolution blocks with Swin Transformer blocks [51] and improve the hooking mechanism by introducing a Cross-Attention Swin-Transformer module and a Cross-Interaction module. The dataset and labels are the caffe dataset and its zone labels, the same as for the AMD-HookNet. The target patch size is \(224\,\times\,224\), while the context patch size is \(488\,\times\,488\), which is rescaled to \(224\,\times\,224\) as well. Context patches are extracted with an overlap of \(224\) pixels, while the predictions of non-overlapping target patches are used as network outputs. All patches are augmented by rotation and flipping. During training, a combination of categorical cross-entropy and Dice loss is used to supervise target and context branch outputs and the upsampled target bottleneck map. To attain the final calving front prediction, the post-processing of Gourmelon et al. [12]’s Zones network is applied.
For the comparison, no adaptations were necessary.
Zhang et al. [27] replace the U-Net with the DeepLabv3+ [29] architecture to segment optical and SAR imagery into land and sea. Their dataset, with corresponding manual delineations, is published by Zhang et al. [52]. As pre-processing, the images are cropped to the region of interest, de-speckled, and their histograms normalized. Before rotation and flipping augmentations are applied, the images are subdivided into patches of 960 \(\times\) 720 pixels. Zhang et al. [27] compare the U-Net and the DeepLabv3+ with different backbones. The tested backbones include ResNet [53], DRN [54], and MobileNet [55]. Their post-processing is the same as that of Zhang et al. [22], except for an added final step in which fronts with overly complex shapes are discarded based on their frequency, amplitude of vibration, and the convexity of the polygon. The performance metric, the mean difference, is likewise taken from Zhang et al. [22].
For the comparison, the mentioned pre-processing steps are omitted, as they have already been performed for the caffe dataset. The binary segmentation is altered to a multi-class segmentation to accommodate caffe’s zone labels. For this purpose, we increase the number of output channels to four and apply the categorical cross-entropy loss instead of the binary cross-entropy loss. Moreover, the post-processing of Gourmelon et al. [12]’s Zones network is integrated, which uses an argmax instead of a threshold to obtain the prediction. The original post-processing could not be applied because, first, it assumes the existence of only two classes and, second, it would require prior knowledge of the test set.
A complete calving front delineation pipeline for gee is presented by Zhang et al. [28]. The automated pipeline includes a screening module for erroneous predictions as well as an uncertainty estimation. To train the included DeepLabv3+ [29], Zhang et al. [28] curated the TermPicks dataset [56] and added further manually annotated fronts, totaling 17,906 samples from 249 glaciers in Greenland. Only satellites available on gee are included. Before the images are fed into the model, a cloud screening ensures that the calving front is visible. Next, histogram equalization is conducted, and images with a width of less than 1000 pixels are resized to a width just above 1000 pixels. Patches of size \(960\,\times\,720\) are extracted with an overlap of 320 \(\times\) 240 pixels (width \(\times\) height) for training and 384 \(\times\) 288 pixels for testing. Using flipping and rotation, Zhang et al. [28] enlarge the dataset artificially. The model learns to differentiate ocean from non-ocean using a binary cross-entropy loss. During post-processing, patches are merged by averaging the overlap, and the values are thresholded at 0.5. To extract the calving front, the prediction is converted to a polygon; small polygons and the image border are removed, leaving the predicted calving front. Lastly, the predicted calving fronts undergo a screening to remove erroneous fronts. The screening checks the calving front curvature and length, the number of intersections between glacier flowlines and the front, and the size of enclosed areas between temporally adjacent calving fronts.
For the comparison, we omit the cloud screening and the histogram equalization, as SAR penetrates cloud cover and histogram equalization was already performed on the benchmark dataset. We change the output channels of the DeepLabv3+ to four, train the network using the categorical cross-entropy, and use an argmax instead of a threshold for binarization to accommodate caffe’s zone labels. Moreover, the screening module could not be applied, as three of the four checks are based on thresholds that can only be calculated using optical imagery, and the last check relies on glacier flowlines, which would be prior knowledge of the test set.
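Merging overlapping probability patches by averaging, followed by the 0.5 threshold, can be sketched as follows (our own illustration):

```python
import numpy as np

def merge_patches(patches, coords, out_shape):
    """Average overlapping probability patches, then threshold at 0.5.

    patches: list of (h, w) probability arrays;
    coords: list of (row, col) top-left positions in the full image.
    """
    acc = np.zeros(out_shape, dtype=np.float64)   # summed probabilities
    cnt = np.zeros(out_shape, dtype=np.float64)   # how often each pixel was covered
    for p, (r, c) in zip(patches, coords):
        h, w = p.shape
        acc[r:r + h, c:c + w] += p
        cnt[r:r + h, c:c + w] += 1
    return (acc / np.maximum(cnt, 1)) >= 0.5      # binary ocean prediction
```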
Zhu et al. [32] leverage the properties of both CNNs and ViTs by incorporating gla-sts into DeepLabv3+. They dub the resulting model gla and train it with a weighted combination of the binary cross-entropy loss and the Dice loss on the final output and the binary cross-entropy loss on an auxiliary output. Experiments to assess the model’s performance are based on the caffe dataset. Zhu et al. [32] fuse all classes but the ocean class in the zone labels, resulting in a binary ocean segmentation. Consistent with Swin-L [51], patches of size \(384\,\times\,384\) are extracted with an overlap of 50 % for both training and testing. The only augmentation performed during the model’s training is random horizontal flipping. The post-processing needed to calculate the mde is neither described in the study nor published with the code.
To enable the calculation of the mde, the network is adjusted to predict all zones provided by the caffe dataset, the binary cross-entropy terms in the combined loss function are exchanged with the categorical cross-entropy loss, and the post-processing of Gourmelon et al. [12]’s Zones network is applied.
This section provides the means and standard deviations of the mdes for subsets of the test set for all 22 dl systems, a visual examination of the predictions (Figures 5 and 6), and the numerical results of the statistical analyses. Table 6 shows the mde for the complete test set as well as for only summer and only winter images. Table 7 provides mdes for the Mapple Glacier, encompassing all its images and further categorized into summer and winter sets. Similarly, mdes for the Columbia Glacier are given in Table 8, with a breakdown into summer and winter images as well as an overall measure for all images of the glacier. A breakdown of the test set results by sensor is provided in Tables 9 and 10.
 
 
Table 6: mde in meters and number of images without a predicted front (\(\varnothing\)) on the complete test set (122 images) and its summer (68) and winter (54) subsets.

| Paper | \(\downarrow\) MDE all [m] | \(\downarrow\) \(\varnothing\) of 122 | \(\downarrow\) MDE summer [m] | \(\downarrow\) \(\varnothing\) of 68 | \(\downarrow\) MDE winter [m] | \(\downarrow\) \(\varnothing\) of 54 | |
| Cheng et al. [24] | \(1767 \pm 536\) | \(15 \pm 9\) | \(1697 \pm 433\) | \(10 \pm 6\) | \(1828 \pm 660\) | \(5 \pm 4\) | |
| Davari et al. [11] | \(2414 \pm 425\) | \(27 \pm 5\) | \(2205 \pm 288\) | \(15 \pm 5\) | \(2626 \pm 618\) | \(12 \pm 3\) | |
| Davari et al. [10] | \(4327 \pm 248\) | \(69 \pm 1\) | \(4159 \pm 304\) | \(43 \pm 0\) | \(4523 \pm 273\) | \(26 \pm 1\) | |
| Gourmelon et al. [12] | Front | \(887 \pm 189\) | \(7 \pm 3\) | \(738 \pm 111\) | \(4 \pm 1\) | \(1054 \pm 308\) | \(4 \pm 2\) | 
| Zones | \(753 \pm 76\) | \(1 \pm 1\) | \(732 \pm 93\) | \(1 \pm 1\) | \(776 \pm 65\) | \(0 \pm 0\) | |
| Gourmelon et al. [13] | \(726 \pm 76\) | \(1 \pm 1\) | \(696 \pm 93\) | \(1 \pm 1\) | \(757 \pm 67\) | \(0 \pm 0\) | |
| Hartmann et al. [14] | \(1011 \pm 46\) | \(12 \pm 10\) | \(1085 \pm 82\) | \(8 \pm 5\) | \(942 \pm 59\) | \(5 \pm 4\) | |
| Heidler et al. [15] | Front | \(499 \pm 31\) | \(2 \pm 2\) | \(478 \pm 43\) | \(1 \pm 1\) | \(519 \pm 37\) | \(1 \pm 1\) | 
| Zones | \(646 \pm 67\) | \(6 \pm 5\) | \(640 \pm 74\) | \(3 \pm 2\) | \(648 \pm 95\) | \(3 \pm 4\) | |
| Herrmann et al. [16] | \(546 \pm 98\) | \(4 \pm 2\) | \(459 \pm 121\) | \(1 \pm 1\) | \(636 \pm 82\) | \(2 \pm 1\) | |
| Holzmann et al. [17] | \(2498 \pm 283\) | \(77 \pm 4\) | \(2587 \pm 314\) | \(50 \pm 1\) | \(2445 \pm 300\) | \(26 \pm 5\) | |
| Kirillov et al. [36] | Iterative | \(708 \pm 74\) | \(13 \pm 2\) | \(726 \pm 88\) | \(6 \pm 2\) | \(688 \pm 109\) | \(7 \pm 1\) | 
| Parallel | \(753 \pm 105\) | \(9 \pm 0\) | \(576 \pm 79\) | \(4 \pm 0\) | \(929 \pm 144\) | \(5 \pm 0\) | |
| Loebel et al. [18] | \(582 \pm 41\) | \(7 \pm 2\) | \(521 \pm 52\) | \(5 \pm 2\) | \(645 \pm 38\) | \(2 \pm 1\) | |
| Marochov et al. [26] | \(2670 \pm 349\) | \(97 \pm 2\) | \(2279 \pm 290\) | \(56 \pm 1\) | \(2880 \pm 420\) | \(40 \pm 1\) | |
| Mohajerani et al. [19] | \(1990 \pm 33\) | \(\mathbf{0 \pm 0}\) | \(1883 \pm 47\) | \(\mathbf{0 \pm 0}\) | \(2099 \pm 55\) | \(\mathbf{0 \pm 0}\) | |
| Periyasamy et al. [20] | \(1065 \pm 47\) | \(12 \pm 4\) | \(1144 \pm 55\) | \(6 \pm 3\) | \(992 \pm 36\) | \(6 \pm 1\) | |
| Wu et al. [21] | \(451 \pm 34\) | \(4 \pm 1\) | \(421 \pm 43\) | \(3 \pm 1\) | \(482 \pm 41\) | \(1 \pm 1\) | |
| Wu et al. [34] | \(\mathbf{360 \pm 13}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{333 \pm 13}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{389 \pm 21}\) | \(\mathbf{0 \pm 0}\) | |
| Zhang et al. [27] | \(1297 \pm 273\) | \(45 \pm 5\) | \(1455 \pm 268\) | \(31 \pm 5\) | \(1162 \pm 335\) | \(15 \pm 3\) | |
| Zhang et al. [28] | \(909 \pm 180\) | \(12 \pm 5\) | \(1047 \pm 233\) | \(9 \pm 3\) | \(800 \pm 148\) | \(3 \pm 2\) | |
| Zhu et al. [32] | \(914 \pm 77\) | \(26 \pm 3\) | \(988 \pm 74\) | \(13 \pm 3\) | \(838 \pm 101\) | \(13 \pm 2\) | |
Table 7: mde in meters and number of images without a predicted front (\(\varnothing\)) for the Mapple Glacier: all images (57), summer (40), and winter (17).

| Paper | \(\downarrow\) MDE all [m] | \(\downarrow\) \(\varnothing\) of 57 | \(\downarrow\) MDE summer [m] | \(\downarrow\) \(\varnothing\) of 40 | \(\downarrow\) MDE winter [m] | \(\downarrow\) \(\varnothing\) of 17 | |
| Cheng et al. [24] | \(696 \pm 250\) | \(4 \pm 2\) | \(688 \pm 268\) | \(3 \pm 1\) | \(705 \pm 213\) | \(1 \pm 1\) | |
| Davari et al. [11] | \(233 \pm 29\) | \(8 \pm 3\) | \(251 \pm 38\) | \(6 \pm 2\) | \(192 \pm 34\) | \(2 \pm 2\) | |
| Davari et al. [10] | \(2140 \pm 41\) | \(56 \pm 1\) | \(2140 \pm 41\) | \(39 \pm 1\) | / | \(17 \pm 0\) | |
| Gourmelon et al. [12] | Front | \(150 \pm 24\) | \(6 \pm 2\) | \(140 \pm 26\) | \(2 \pm 1\) | \(173 \pm 33\) | \(2 \pm 1\) | 
| Zones | \(287 \pm 48\) | \(0 \pm 1\) | \(262 \pm 29\) | \(0 \pm 1\) | \(340 \pm 93\) | \(0 \pm 0\) | |
| Gourmelon et al. [13] | \(263 \pm 40\) | \(1 \pm 1\) | \(241 \pm 20\) | \(1 \pm 1\) | \(311 \pm 86\) | \(0 \pm 0\) | |
| Hartmann et al. [14] | \(411 \pm 28\) | \(1 \pm 1\) | \(346 \pm 27\) | \(1 \pm 1\) | \(546 \pm 45\) | \(0 \pm 0\) | |
| Heidler et al. [15] | Front | \(308 \pm 43\) | \(2 \pm 2\) | \(291 \pm 39\) | \(1 \pm 1\) | \(346 \pm 61\) | \(1 \pm 1\) | 
| Zones | \(256 \pm 32\) | \(3 \pm 3\) | \(225 \pm 17\) | \(2 \pm 1\) | \(325 \pm 69\) | \(1 \pm 1\) | |
| Herrmann et al. [16] | \(\mathbf{107 \pm 8}\) | \(1 \pm 1\) | \(\mathbf{108 \pm 9}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{104 \pm 18}\) | \(\mathbf{0 \pm 0}\) | |
| Holzmann et al. [17] | \(609 \pm 348\) | \(56 \pm 1\) | \(709 \pm 448\) | \(39 \pm 1\) | \(775 \pm 0\) | \(17 \pm 0\) | |
| Kirillov et al. [36] | Iterative | \(373 \pm 89\) | \(7 \pm 2\) | \(216 \pm 17\) | \(3 \pm 1\) | \(658 \pm 213\) | \(4 \pm 0\) | 
| Parallel | \(219 \pm 20\) | \(4 \pm 0\) | \(167 \pm 13\) | \(1 \pm 0\) | \(342 \pm 43\) | \(2 \pm 1\) | |
| Loebel et al. [18] | \(215 \pm 43\) | \(6 \pm 2\) | \(195 \pm 27\) | \(4 \pm 2\) | \(254 \pm 89\) | \(2 \pm 2\) | |
| Marochov et al. [26] | \(945 \pm 202\) | \(48 \pm 1\) | \(1011 \pm 182\) | \(34 \pm 1\) | \(888 \pm 280\) | \(15 \pm 1\) | |
| Mohajerani et al. [19] | \(607 \pm 9\) | \(\mathbf{0 \pm 0}\) | \(508 \pm 24\) | \(\mathbf{0 \pm 0}\) | \(822 \pm 58\) | \(\mathbf{0 \pm 0}\) | |
| Periyasamy et al. [20] | \(567 \pm 22\) | \(4 \pm 1\) | \(439 \pm 27\) | \(2 \pm 1\) | \(817 \pm 53\) | \(3 \pm 1\) | |
| Wu et al. [21] | \(207 \pm 42\) | \(3 \pm 1\) | \(202 \pm 52\) | \(1 \pm 1\) | \(217 \pm 37\) | \(1 \pm 1\) | |
| Wu et al. [34] | \(184 \pm 19\) | \(\mathbf{0 \pm 0}\) | \(138 \pm 30\) | \(\mathbf{0 \pm 0}\) | \(285 \pm 21\) | \(\mathbf{0 \pm 0}\) | |
| Zhang et al. [27] | \(652 \pm 260\) | \(33 \pm 3\) | \(626 \pm 224\) | \(24 \pm 3\) | \(702 \pm 355\) | \(9 \pm 1\) | |
| Zhang et al. [28] | \(534 \pm 78\) | \(5 \pm 3\) | \(506 \pm 106\) | \(2 \pm 1\) | \(603 \pm 100\) | \(2 \pm 2\) | |
| Zhu et al. [32] | \(466 \pm 10\) | \(14 \pm 3\) | \(421 \pm 23\) | \(9 \pm 3\) | \(560 \pm 20\) | \(4 \pm 1\) | |
Table 8: mde in meters and number of images without a predicted front (\(\varnothing\)) for the Columbia Glacier: all images (65), summer (28), and winter (37).

| Paper | \(\downarrow\) MDE all [m] | \(\downarrow\) \(\varnothing\) of 65 | \(\downarrow\) MDE summer [m] | \(\downarrow\) \(\varnothing\) of 28 | \(\downarrow\) MDE winter [m] | \(\downarrow\) \(\varnothing\) of 37 | |
| Cheng et al. [24] | \(2375 \pm 884\) | \(11 \pm 8\) | \(2633 \pm 872\) | \(7 \pm 5\) | \(2197 \pm 917\) | \(4 \pm 3\) | |
| Davari et al. [11] | \(3102 \pm 510\) | \(19 \pm 5\) | \(3170 \pm 413\) | \(9 \pm 4\) | \(3054 \pm 650\) | \(10 \pm 3\) | |
| Davari et al. [10] | \(4331 \pm 252\) | \(12 \pm 1\) | \(4166 \pm 308\) | \(3 \pm 1\) | \(4523 \pm 273\) | \(9 \pm 1\) | |
| Gourmelon et al. [12] | Front | \(1032 \pm 227\) | \(2 \pm 1\) | \(907 \pm 131\) | \(\mathbf{0 \pm 0}\) | \(1157 \pm 350\) | \(2 \pm 1\) | 
| Zones | \(840 \pm 84\) | \(\mathbf{0 \pm 0}\) | \(854 \pm 111\) | \(\mathbf{0 \pm 0}\) | \(826 \pm 66\) | \(\mathbf{0 \pm 0}\) | |
| Gourmelon et al. [13] | \(814 \pm 86\) | \(\mathbf{0 \pm 0}\) | \(822 \pm 115\) | \(\mathbf{0 \pm 0}\) | \(807 \pm 71\) | \(\mathbf{0 \pm 0}\) | |
| Hartmann et al. [14] | \(1158 \pm 96\) | \(12 \pm 9\) | \(1372 \pm 203\) | \(7 \pm 5\) | \(998 \pm 76\) | \(5 \pm 4\) | |
| Heidler et al. [15] | Front | \(536 \pm 38\) | \(\mathbf{0 \pm 0}\) | \(532 \pm 56\) | \(\mathbf{0 \pm 0}\) | \(539 \pm 41\) | \(\mathbf{0 \pm 0}\) | 
| Zones | \(716 \pm 77\) | \(3 \pm 3\) | \(745 \pm 94\) | \(1 \pm 0\) | \(684 \pm 102\) | \(2\pm 3\) | |
| Herrmann et al. [16] | \(628 \pm 117\) | \(3 \pm 2\) | \(556 \pm 157\) | \(1 \pm 1\) | \(693 \pm 91\) | \(2 \pm 1\) | |
| Holzmann et al. [17] | \(2510 \pm 277\) | \(21 \pm 4\) | \(2608 \pm 289\) | \(11 \pm 1\) | \(2449 \pm 297\) | \(10 \pm 5\) | |
| Kirillov et al. [36] | Iterative | \(787 \pm 76\) | \(5 \pm 1\) | \(892 \pm 110\) | \(3 \pm 1\) | \(690 \pm 112\) | \(3 \pm 1\) | 
| Parallel | \(860 \pm 128\) | \(5 \pm 0\) | \(702 \pm 109\) | \(3 \pm 0\) | \(993 \pm 158\) | \(2 \pm 0\) | |
| Loebel et al. [18] | \(642 \pm 44\) | \(1 \pm 1\) | \(598 \pm 60\) | \(1 \pm 0\) | \(684 \pm 40\) | \(\mathbf{0 \pm 0}\) | |
| Marochov et al. [26] | \(2855 \pm 346\) | \(48 \pm 2\) | \(2558 \pm 394\) | \(22 \pm 1\) | \(2995 \pm 372\) | \(26 \pm 1\) | |
| Mohajerani et al. [19] | \(2155 \pm 55\) | \(\mathbf{0 \pm 0}\) | \(2118 \pm 80\) | \(\mathbf{0 \pm 0}\) | \(2191 \pm 65\) | \(\mathbf{0 \pm 0}\) | |
| Periyasamy et al. [20] | \(1155 \pm 52\) | \(7 \pm 4\) | \(1332 \pm 55\) | \(5 \pm 3\) | \(1011 \pm 42\) | \(3 \pm 2\) | |
| Wu et al. [21] | \(497 \pm 44\) | \(1 \pm 1\) | \(481 \pm 62\) | \(1 \pm 1\) | \(511 \pm 45\) | \(\mathbf{0 \pm 0}\) | |
| Wu et al. [34] | \(\mathbf{392 \pm 14}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{383 \pm 11}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{400 \pm 23}\) | \(\mathbf{0 \pm 0}\) | |
| Zhang et al. [27] | \(1407 \pm 283\) | \(13 \pm 3\) | \(1681 \pm 306\) | \(7 \pm 2\) | \(1208 \pm 335\) | \(6 \pm 2\) | |
| Zhang et al. [28] | \(989 \pm 209\) | \(7 \pm 2\) | \(1254 \pm 314\) | \(7 \pm 1\) | \(820 \pm 155\) | \(1 \pm 1\) | |
| Zhu et al. [32] | \(999 \pm 91\) | \(12 \pm 3\) | \(1144 \pm 95\) | \(3 \pm 1\) | \(870 \pm 112\) | \(9 \pm 2\) | |
Table 9: mde in meters on the test set, broken down by sensor.

| | Sentinel-1 | ENVISAT | ERS | PALSAR | TSX | |
| Paper | \(\downarrow\) MDE [m] | \(\downarrow\) MDE [m] | \(\downarrow\) MDE [m] | \(\downarrow\) MDE [m] | \(\downarrow\) MDE [m] | |
| Cheng et al. [24] | \(2510 \pm 813\) | \(604 \pm 96\) | \(466 \pm 165\) | \(644 \pm 243\) | \(1737 \pm 561\) | |
| Davari et al. [11] | \(3549 \pm 112\) | \(462 \pm 197\) | \(422 \pm 196\) | \(258 \pm 31\) | \(2336 \pm 523\) | |
| Davari et al. [10] | \(4285 \pm 412\) | \(2140 \pm 41\) | / | / | \(4342 \pm 264\) | |
| Gourmelon et al. [12] | Front | \(2806 \pm 300\) | \(\mathbf{191 \pm 32}\) | \(127 \pm 38\) | \(197 \pm 41\) | \(\mathbf{63 \pm 188}\) | 
| Zones | \(2201 \pm 246\) | \(493 \pm 119\) | \(403 \pm 172\) | \(437 \pm 172\) | \(547 \pm 61\) | |
| Gourmelon et al. [13] | \(2287 \pm 260\) | \(491 \pm 86\) | \(449 \pm 153\) | \(408 \pm 48\) | \(218 \pm 51\) | |
| Hartmann et al. [14] | \(2255 \pm 206\) | \(583 \pm 81\) | \(465 \pm 133\) | \(524 \pm 140\) | \(850 \pm 34\) | |
| Heidler et al. [15] | Front | \(1167 \pm 142\) | \(354 \pm 138\) | \(152 \pm 21\) | \(595 \pm 99\) | \(395 \pm 38\) | 
| Zones | \(2106 \pm 372\) | \(441 \pm 103\) | \(156 \pm 49\) | \(481 \pm 114\) | \(474 \pm 73\) | |
| Herrmann et al. [16] | \(2605 \pm 316\) | \(270 \pm 85\) | \(99 \pm 43\) | \(\mathbf{195 \pm 44}\) | \(302 \pm 118\) | |
| Holzmann et al. [17] | \(3908 \pm 78\) | / | \(1135 \pm 0\) | \(1176 \pm 0\) | \(2103 \pm 314\) | |
| Kirillov et al. [36] | Iterative | \(1650 \pm 126\) | \(499 \pm 54\) | \(215 \pm 168\) | \(420 \pm 86\) | \(598 \pm 77\) | 
| Parallel | \(1653 \pm 103\) | \(325 \pm 90\) | \(\mathbf{69 \pm 3}\) | \(383 \pm 53\) | \(655 \pm 119\) | |
| Loebel et al. [18] | \(2196 \pm 187\) | \(608 \pm 200\) | \(469 \pm 278\) | \(360 \pm 107\) | \(344 \pm 43\) | |
| Marochov et al. [26] | \(1924 \pm 122\) | / | \(1469 \pm 0\) | \(380 \pm 241\) | \(4251 \pm 867\) | |
| Mohajerani et al. [19] | \(1491 \pm 221\) | \(431 \pm 54\) | \(682 \pm 135\) | \(457 \pm 70\) | \(2085 \pm 48\) | |
| Periyasamy et al. [20] | \(2175 \pm 86\) | \(1032 \pm 339\) | \(801 \pm 240\) | \(633 \pm 89\) | \(950 \pm 44\) | |
| Wu et al. [21] | \(1504 \pm 207\) | \(468 \pm 70\) | \(208 \pm 112\) | \(328 \pm 130\) | \(303 \pm 20\) | |
| Wu et al. [34] | \(\mathbf{918 \pm 76}\) | \(253 \pm 42\) | \(174 \pm 47\) | \(263 \pm 28\) | \(286 \pm 8\) | |
| Zhang et al. [27] | \(3927 \pm 837\) | \(1926 \pm 68\) | \(1368 \pm 709\) | \(1838 \pm 402\) | \(1158 \pm 257\) | |
| Zhang et al. [28] | \(1905 \pm 540\) | \(688 \pm 53\) | \(642 \pm 388\) | \(557 \pm 90\) | \(725 \pm 135\) | |
| Zhu et al. [32] | \(3094 \pm 730\) | \(1276 \pm 266\) | \(395 \pm 172\) | \(700 \pm 106\) | \(812 \pm 63\) | |
Table 10: number of images without a predicted front (\(\varnothing\)) on the test set, broken down by sensor; the number of test images per sensor is given in the header.

| | Sentinel-1 | ENVISAT | ERS | PALSAR | TSX | |
| Paper | \(\downarrow\) \(\varnothing\) of 33 | \(\downarrow\) \(\varnothing\) of 10 | \(\downarrow\) \(\varnothing\) of 2 | \(\downarrow\) \(\varnothing\) of 8 | \(\downarrow\) \(\varnothing\) of 69 | |
| Cheng et al. [24] | \(4 \pm 3\) | \(1 \pm 1\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(10 \pm 7\) | |
| Davari et al. [11] | \(8 \pm 1\) | \(4 \pm 2\) | \(\mathbf{0 \pm 0}\) | \(2 \pm 1\) | \(33 \pm 1\) | |
| Davari et al. [10] | \(16 \pm 1\) | \(9 \pm 1\) | \(2 \pm 0\) | \(8 \pm 0\) | \(33 \pm 1\) | |
| Gourmelon et al. [12] | Front | \(2 \pm 1\) | \(2 \pm 2\) | \(\mathbf{0 \pm 0}\) | \(3 \pm 2\) | \(\mathbf{0 \pm 0}\) | 
| Zones | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | |
| Gourmelon et al. [13] | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | |
| Hartmann et al. [14] | \(2 \pm 3\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(9 \pm 10\) | |
| Heidler et al. [15] | Front | \(\mathbf{0 \pm 0}\) | \(2 \pm 1\) | \(\mathbf{0 \pm 0}\) | \(1 \pm 0\) | \(\mathbf{0 \pm 0}\) | 
| Zones | \(3 \pm 3\) | \(2 \pm 2\) | \(\mathbf{0 \pm 0}\) | \(1 \pm 1\) | \(\mathbf{0 \pm 0}\) | |
| Herrmann et al. [16] | \(3 \pm 2\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | |
| Holzmann et al. [17] | \(16 \pm 1\) | \(10 \pm 0\) | \(2 \pm 0\) | \(8 \pm 0\) | \(41 \pm 4\) | |
| Kirillov et al. [36] | Iterative | \(8 \pm 1\) | \(1 \pm 0\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(3 \pm 1\) | 
| Parallel | \(4 \pm 0\) | \(2 \pm 1\) | \(\mathbf{0 \pm 0}\) | \(1 \pm 0\) | \(1 \pm 0\) | |
| Loebel et al. [18] | \(2 \pm 2\) | \(3 \pm 2\) | \(\mathbf{0 \pm 0}\) | \(1 \pm 1\) | \(\mathbf{0 \pm 0}\) | |
| Marochov et al. [26] | \(22 \pm 2\) | \(10 \pm 0\) | \(2 \pm 0\) | \(7 \pm 0\) | \(57 \pm 3\) | |
| Mohajerani et al. [19] | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | |
| Periyasamy et al. [20] | \(4 \pm 3\) | \(3 \pm 1\) | \(1 \pm 0\) | \(\mathbf{0 \pm 0}\) | \(3 \pm 1\) | |
| Wu et al. [21] | \(1 \pm 1\) | \(1 \pm 1\) | \(\mathbf{0 \pm 0}\) | \(1 \pm 1\) | \(1 \pm 1\) | |
| Wu et al. [34] | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | \(\mathbf{0 \pm 0}\) | |
| Zhang et al. [27] | \(24 \pm 2\) | \(9 \pm 1\) | \(1 \pm 1\) | \(6 \pm 1\) | \(4 \pm 3\) | |
| Zhang et al. [28] | \(\mathbf{0 \pm 0}\) | \(4 \pm 3\) | \(\mathbf{0 \pm 0}\) | \(0 \pm 1\) | \(7 \pm 2\) | |
| Zhu et al. [32] | \(15 \pm 1\) | \(8 \pm 2\) | \(1 \pm 0\) | \(1 \pm 1\) | \(\mathbf{0 \pm 0}\) | |
There is no single reason why one system attains a lower mde than another; rather, several factors contribute to the differences in mde.
For systems with an mde higher than 1200 m, the possible reasons diverge: For Davari et al. [10], the network output is heavily speckled. In some images, edges such as the calving front and the boundary between glacier and rock show a higher density of predicted front pixels, but still no connected front line emerges. Marochov et al. [26]’s system recognizes some higher-level structures, such as the approximate position of rocks, but cannot assign the patterns to the correct classes. The system of Holzmann et al. [17] predicts too few front pixels, and the resulting fronts lack curvature and detail and do not lie close to the ground truth front. Davari et al. [11]’s system sometimes predicts the front in the wrong place; moreover, the predicted front is usually too short and also lacks curvature and detail, and the edge between the rock and glacier zones is often mistaken for part of the front. Mohajerani et al. [19]’s system acts as a pixel-level edge detector, i. e., at a level where noise has a strong influence, rather than capturing global information. This also explains why its number of images with no predicted front is zero: every image contains pixel-level edges, which are then incorrectly predicted as calving fronts. For Cheng et al. [24], the predictions are speckled, and the system cannot recognize the classes correctly; sometimes edges are found in the images, but not between the correct classes. Lastly, Zhang et al. [27]’s system does not seem to capture the general, global structure of the SAR images: classes are mixed up, and the NA region is not predicted correctly.
For systems with an mde between 600 m and 1200 m, the main influences are varying degrees of patching artifacts ([20]; [14]; [32]; [28]; [12] Zones; [36] Parallel; [13]; [36] Iterative; [15] Zones), confusion of the glacier and ocean classes ([20]; [14]; [32]; [28]; [12] Zones; [36] Parallel; [13]; [36] Iterative; [15] Zones), misclassification of ice mélange as glacial ice ([20]; [14]; [32]; [28]; [12] Zones; [36] Parallel; [13]; [36] Iterative; [15] Zones), and confusion of the coastline and other edges between zones with the calving front ([20]; [14]; [32]; [28]; [12] Front; [12] Zones; [36] Parallel; [13]; [36] Iterative; [15] Zones). In addition, the ocean class has many false positive predictions ([20]; [14]; [32]; [28]; [12] Zones; [36] Parallel; [13]; [36] Iterative; [15] Zones), and sometimes no ocean is predicted at all ([32]; [36] Parallel; [36] Iterative). When the ocean is predicted in the correct image region, the ocean outline and, thus, the calving front often do not have the correct shape ([20]; [14]; [32]; [28]; [12] Zones; [36] Parallel; [13]; [36] Iterative; [15] Zones). In binary front segmentation, the predicted fronts cover only parts of the ground truth in the majority of images, and many additional false positive fronts are predicted [12].
Only five systems have an mde lower than 600 m: Loebel et al. [18], Herrmann et al. [16], Heidler et al. [15]’s front output, Wu et al. [21], and Wu et al. [34]. All five systems confuse parts of the rocky coastline with the calving front, have slight issues with ice mélange, and show decreased delineation performance on images of the Columbia Glacier captured by Sentinel-1. The outputs of the model with the lowest average mde, the HookFormer [34], additionally show slight patching artifacts and ragged edges between classes.
Statistical testing across all 22 dl systems reveals a significant difference for both the mde (Chi-Squared(21)=\(101.72\), p=\(1.43e^{-12} < 0.05\)) and the number of images with no predicted front (Chi-Squared(21)=\(96.99\), p=\(9.80e^{-12} < 0.05\)). On average, the HookFormer [34] has the predictions with the lowest mde, as can be seen in Fig. 1. All four differences in mde to the other systems with an mde lower than 600 m, i. e., Wu et al. [21]’s system, Heidler et al. [15]’s front output, Herrmann et al. [16]’s system, and Loebel et al. [18]’s system, are significant (in each case \(U = 0.0\), p=\(3.97e^{-3} < 1.25e^{-2}\)), with effect sizes of \(-3.58\), \(-5.89\), \(-2.66\), and \(-7.36\) (Cohen’s d), respectively. For the number of images with no predicted front, the differences to Wu et al. [21]’s, Herrmann et al. [16]’s, and Loebel et al. [18]’s systems are significant (\(U = 0.0\), p=\(3.54e^{-3} < 1.25e^{-2}\); \(U = 0.0\), p=\(3.35e^{-3} < 1.25e^{-2}\); \(U = 0.0\), p=\(3.65e^{-3} < 1.25e^{-2}\)), with effect sizes of \(-6.32\), \(-2.18\), and \(-4.50\) (Cohen’s d). However, the difference to Heidler et al. [15]’s front output is not significant (\(U = 5.0\), p=\(3.60e^{-2} > 1.25e^{-2}\)).
The differences between base architecture groups are significant (Chi-Squared(4)=\(24.82\), p=\(5.47e^{-5} < 0.05\)). The average mde for each architecture group is 2670 m for VGG16 [31], 1324 m for DeepLabv3+ [29], 1314 m for U-Nets [23], 914 m for the mix of DeepLabv3+ and ViT, and 607 m for ViTs [33]. The ViT-based architectures significantly outperform the mixed, DeepLabv3+-, U-Net-, and VGG16-based architectures (\(U = 4.0\), p=\(7.74e^{-4}<1.25e^{-2}\); \(U = 12.0\), p=\(1.68e^{-5}<1.25e^{-2}\); \(U = 283.0\), p=\(2.68e^{-3}<1.25e^{-2}\); \(U = 0.0\), p=\(6.45e^{-5}<1.25e^{-2}\)), with effect sizes of \(-1.78\), \(-1.88\), \(-0.71\), and \(-8.78\) (Cohen’s d), respectively. The differences between models trained on caffe’s binary front labels, models trained on caffe’s zone labels, and models trained in a multi-task manner on both labels are also significant (Chi-Squared(2)=\(36.30\), p=\(1.31e^{-8} < 0.05\)). The average mdes are 2423 m for binary front labels, 938 m for zone labels, and 864 m for mtl. Both mtl dl systems and systems trained solely on the zone labels have a significantly lower mde than dl systems trained solely on the binary front labels (\(U = 45.0\), p=\(1.50e^{-6}<1.67e^{-2}\); \(U = 198.0\), p=\(1.59e^{-8}<1.67e^{-2}\)), with effect sizes of \(-1.66\) and \(-1.92\) (Cohen’s d). The difference between mtl and training on the zone labels is not significant (\(U = 480.0\), p=\(3.95e^{-2} > 1.67e^{-2}\)).
With a Kendall’s \(\tau\) of \(-0.15\) (p=\(2.53e^{-2} < 0.05\)), the mde and the mean input size in pixels during training are significantly negatively correlated, i. e., the larger the input size, the lower the mde. Moreover, the number of down-sampling steps in U-Nets is significantly negatively correlated with the mde, with a Kendall’s \(\tau\) of \(-0.48\) (p=\(2.03e^{-7} < 0.05\)), i. e., the more local-global information interaction, the lower the mde.
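The pairwise tests and correlations above can be reproduced in outline with scipy; all numeric values below are illustrative placeholders, not results from our experiments, and the threshold of \(1.25e^{-2}\) corresponds to a Bonferroni correction of \(0.05/4\):

```python
import numpy as np
from scipy import stats

# Illustrative per-run mdes of two systems (five training runs each).
mde_a = np.array([360., 355., 372., 348., 365.])
mde_b = np.array([451., 430., 470., 445., 462.])

u, p = stats.mannwhitneyu(mde_a, mde_b, alternative="two-sided")
significant = p < 0.05 / 4   # Bonferroni-corrected threshold for four tests

# Effect size: Cohen's d with pooled standard deviation.
pooled_sd = np.sqrt((mde_a.var(ddof=1) + mde_b.var(ddof=1)) / 2)
cohens_d = (mde_a.mean() - mde_b.mean()) / pooled_sd

# Correlation between a system property and its mde (values illustrative).
input_sizes = np.array([224, 256, 288, 512, 768])
mdes = np.array([360, 546, 451, 887, 499])
tau, p_tau = stats.kendalltau(input_sizes, mdes)
```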
Fig. 7 gives an overview of the annotators’ levels of expertise. The mde of the automatic annotations from the best-performing dl system is significantly higher than that of the manual annotations (\(U = 50.0\), \(p = 3.33e^{-4}\)), with an effect size of \(11.82\) (Cohen’s d).
 
\(^{1}\)Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
\(^{2}\)School of Engineering and Design, Technische Universität München, Munich, Germany
\(^{3}\)Institut für Planetare Geodäsie, Technische Universität Dresden, Dresden, Germany
\(^{4}\)Jet Propulsion Laboratory, California Institute of Technology, USA
\(^{5}\)Institut für Geographie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
\(^{*}\)Corresponding author, nora.gourmelon@fau.de