JESS - Joint Energy-Based Semantic Segmentation

About

This is the project page for the paper "Utilization of deep learning tools to map and monitor biological soil crusts". In this project, we propose a new domain adaptation method called Joint Energy-Based Semantic Segmentation (JESS), which improves the transferability of semantic segmentation models and thereby addresses the challenge of data drift (domain drift) in long-term biomonitoring.

Abstract

Biological soil crusts (biocrusts) form a layer only one to a few centimeters deep on the soil surface and occur mostly in hot and cold deserts. Biocrusts have a major impact on different processes in these ecosystems, such as carbon and nitrogen cycling, biodiversity preservation, erosion protection and reduction of soil dust emission, but they also react very sensitively to climate alterations and land use intensification. Therefore, monitoring tools are required to keep track of the changes in these specialized communities in a changing environment. In the current study, we applied a semantic image segmentation approach using neural networks. One main problem to be solved was that the training data and the target data, to which the model is applied, are often recorded with different camera devices. This leads to different statistical properties of the image data, such as different scale, resolution and brightness, which can significantly affect the model's performance. To solve this problem, we propose a new domain adaptation method using a joint energy-based approach. To test a semantic segmentation approach in general, we used biocrust imagery taken in Utah (United States of America) and two sub datasets from the National Park Gesäuse (Austria). Here, we achieved highly reliable results with an overall classification accuracy of 85.9% for the USA data and 88.6% and 91.4%, respectively, for the two sub datasets of the National Park Gesäuse. To test our joint energy-based domain adaptation approach, we used the two sub datasets from the National Park Gesäuse, which were recorded with different camera devices. With this newly established approach, we improved the accuracy of our segmentation on the unlabeled sub dataset from 70.4% to 75.3%. The results suggest that joint energy-based modelling is a well-suited domain adaptation method for semantic segmentation that could be applied to various deep learning and image-based biomonitoring challenges.

Figure 1: Methodological workflow of the Joint Energy-Based Semantic Segmentation (JESS). Green and orange arrows describe fieldwork and the annotation of images with Labelbox, respectively. Light blue arrows describe the data processing, which was done in Python. The deep learning part, implemented with the Python library PyTorch, is shown in dark blue. The energy-based optimization step of the JESS model is outlined with a red dashed line. The baseline model is trained without this optimization step.

Figure 2: Classification results of the standard image segmentation of the USA biocrust dataset. Left-hand column: manual image annotation; central column: original image; right-hand column: standard image segmentation result.

Figure 3: Classification results of the Joint Energy-Based segmentation compared to the baseline segmentation. 1st column: ground truth, 2nd column: biocrust input image, 3rd column: joint energy-based prediction, 4th column: baseline model prediction.

Mathematical Derivation

The Boltzmann distribution,
$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)},$$
originally used in statistical mechanics, is the basis for energy-based models (LeCun et al., 2006), where Eθ(x) is the energy function, which is parameterized by θ, and Z(θ) is the so-called partition function, with
$$Z(\theta) = \int_x \exp(-E_\theta(x))\,dx$$
to normalize the prior distribution pθ(x). Energy-based models, as a form of generative model, are used to learn the data distribution of a dataset in order to generate new data. In contrast, discriminative models learn data features or representations to classify data by using a function fθ(x) to map an input x ∈ ℝ^D to an output of logits y ∈ ℝ^K, where K is the number of classes and D = H×W×3 for RGB image data, with H and W the height and width of the image, respectively.
A so-called softmax function
$$p_\theta(y \mid x) = \frac{\exp(f_\theta(x)_y)}{\sum_{j=1}^{K} \exp(f_\theta(x)_j)}$$
produces a categorical distribution over the K possible classes from the model output (Hinton, 2002). The Boltzmann distribution can be used to define an energy-based model of the joint probability of x and y,
$$p_\theta(x, y) = \frac{\exp(f_\theta(x)_y)}{Z(\theta)},$$
with the negative model output as energy function, Eθ(x,y) = -fθ(x)_y. We marginalize out y by computing the sum over y and obtain the prior distribution
$$p_\theta(x) = \sum_{y} p_\theta(x, y) = \frac{\sum_{y} \exp(f_\theta(x)_y)}{Z(\theta)}.$$
The posterior distribution is computed as
$$p_\theta(y \mid x) = \frac{p_\theta(x, y)}{p_\theta(x)} = \frac{\exp(f_\theta(x)_y)}{\sum_{j=1}^{K} \exp(f_\theta(x)_j)}.$$
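As a quick numerical check of the marginalization and posterior steps above, the following minimal Python sketch builds the joint from a vector of logits, divides it by the prior, and compares the result with a softmax. The logits and the values tried for log Z(θ) are made up for illustration, and the function names are not part of this repository.

```python
import math


def posterior_from_joint(logits, log_z):
    """p(y|x) = p(x,y) / p(x), with p(x,y) = exp(f(x)_y) / Z."""
    joint = [math.exp(v - log_z) for v in logits]  # p(x, y) for every class y
    prior = sum(joint)                             # p(x) = sum over y of p(x, y)
    return [j / prior for j in joint]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # made-up classifier output for K = 3 classes

# Whatever value is assumed for log Z, it cancels in the ratio.
for log_z in (0.0, 3.7):
    posterior = posterior_from_joint(logits, log_z)
    print(log_z, [round(p, 4) for p in posterior])

print("softmax", [round(p, 4) for p in softmax(logits)])
```

The printed posteriors agree with the softmax for every value of log Z(θ), which is the cancellation the derivation relies on.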
The unknown normalization constant Z(θ) cancels out and we obtain the softmax function again. This means that there is a generative model hidden in every standard discriminative model, as proposed by Grathwohl et al. (2020). Consequently, we can combine discriminative and generative learning by optimizing pθ(y|x) and the log-likelihood of pθ(x) together, so that our model fits our data better. To maximize the log-likelihood of pθ(x), it is necessary to compute the derivative of log pθ(x),
$$\frac{\partial \log p_\theta(x)}{\partial \theta} = \mathbb{E}_{p_\theta(x')}\left[\frac{\partial E_\theta(x')}{\partial \theta}\right] - \frac{\partial E_\theta(x)}{\partial \theta},$$
with the expectation term
$$\mathbb{E}_{p_\theta(x')}\left[\frac{\partial E_\theta(x')}{\partial \theta}\right].$$
This term can only be estimated, which is done by sampling x' through Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011), where x denotes the dataset on which we want to maximize the log-likelihood and x' denotes the adversarial dataset, sampled from pθ(x) via SGLD.
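To illustrate the sampling step, here is a minimal sketch of the Langevin update commonly used for SGLD sampling, written in plain Python. It assumes a hypothetical one-dimensional toy energy E(x) = x²/2 with a known gradient instead of a neural network, and the step size and chain length are illustrative choices rather than the settings used in train_jess.py.

```python
import math
import random


def sgld_sample(grad_energy, x0, step_size, n_steps, rng):
    """Langevin chain: x <- x - (step_size / 2) * dE/dx + noise, noise ~ N(0, step_size)."""
    x = x0
    chain = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, math.sqrt(step_size))
        x = x - 0.5 * step_size * grad_energy(x) + noise
        chain.append(x)
    return chain


def grad_energy(x):
    """Gradient of the toy energy E(x) = x^2 / 2 (assumption: p(x) is a standard normal)."""
    return x


rng = random.Random(0)  # fixed seed so the sketch is reproducible
chain = sgld_sample(grad_energy, x0=5.0, step_size=0.1, n_steps=20000, rng=rng)

samples = chain[1000:]  # discard the burn-in phase that still remembers x0
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

For this toy energy the samples should have a mean close to 0 and a variance close to 1. In the image setting, x would be an image tensor and the gradient of the energy with respect to x would come from automatic differentiation.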
The output of a classifier can be interpreted as an energy function
$$E_\theta(x) = -\log \sum_{y} \exp(f_\theta(x)_y)$$
without changing the parameterization θ of the neural network (Liu et al., 2020). During the training of our neural network, a discriminative loss function is minimized. To maximize the log-likelihood of pθ(x), we minimize its negative in addition to our discriminative loss. So, in addition to the gradient descent of the discriminative training, we can compute the generative loss of our model as
Joint Energy-Based Semantic Segmentation Derivation
as proposed by Grathwohl et al. (2020). The term log Z(θ) can be ignored here, since it is a constant and has no influence on the minimization.
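The pieces above could be combined roughly as follows. This is a sketch under assumptions, not the code in train_jess.py: the energy is taken as the negative log-sum-exp of the logits, the generative loss is assumed to be the surrogate commonly used in joint energy-based training (mean energy of the data minus mean energy of the SGLD samples x'), and the value 0.01 of the p_x_weight option from the Usage section is assumed to weight the generative term. All logits are made-up numbers.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of numbers."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy(logits):
    """E(x) = -log sum_y exp(f(x)_y), computed from the classifier logits."""
    return -logsumexp(logits)


def cross_entropy(logits, label):
    """Discriminative loss -log p(y|x) for a single example."""
    return logsumexp(logits) - logits[label]


def generative_loss(logits_data, logits_sampled):
    """Assumed surrogate: mean energy of data minus mean energy of SGLD samples."""
    e_data = sum(energy(l) for l in logits_data) / len(logits_data)
    e_sampled = sum(energy(l) for l in logits_sampled) / len(logits_sampled)
    return e_data - e_sampled


# Hypothetical logits for two labelled examples x and two SGLD samples x'.
logits_data = [[3.0, 0.1, -1.0], [0.2, 2.5, 0.0]]
labels = [0, 1]
logits_sampled = [[0.5, 0.4, 0.3], [0.1, 0.0, -0.2]]

p_x_weight = 0.01  # assumed role: weight of the generative term in the total loss
discriminative = sum(cross_entropy(l, y) for l, y in zip(logits_data, labels)) / len(labels)
generative = generative_loss(logits_data, logits_sampled)
total_loss = discriminative + p_x_weight * generative
print(f"discriminative={discriminative:.4f} generative={generative:.4f} total={total_loss:.4f}")
```

Minimizing the generative term lowers the energy of the data relative to the sampled examples, which is what raises the likelihood of the data under the model.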
In contrast to image classification, image segmentation yields one classification for every pixel of an image instead of one classification for the whole image. Additional considerations are therefore required to apply Joint Energy-Based Models to semantic segmentation. The pixel index q denotes the position of the respective classification.
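To make the per-pixel view concrete, the sketch below applies a softmax separately at every pixel index q of a small, made-up logit map and takes the most probable class per pixel. The nested-list layout logits[h][w][k] and the function names are assumptions for illustration; a PyTorch model would typically return a tensor instead.

```python
import math


def pixelwise_softmax(logits):
    """Apply a softmax over the K class scores of every pixel q = (h, w)."""
    probs = []
    for row in logits:
        prob_row = []
        for pixel_logits in row:
            m = max(pixel_logits)
            exps = [math.exp(v - m) for v in pixel_logits]
            total = sum(exps)
            prob_row.append([e / total for e in exps])
        probs.append(prob_row)
    return probs


def predict_classes(probs):
    """Pick the most probable class index for every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]


# Made-up logits for a 2 x 2 image with K = 3 classes.
logits = [
    [[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]],
    [[0.5, 0.5, 4.0], [1.0, 0.0, 0.0]],
]

probs = pixelwise_softmax(logits)
mask = predict_classes(probs)
print(mask)
```

Each pixel gets its own categorical distribution, so the segmentation mask is simply the per-pixel argmax of these distributions.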
The softmax function
$$p_\theta(y \mid x, q) = \frac{\exp(f_\theta(x)_{q,y})}{\sum_{j=1}^{K} \exp(f_\theta(x)_{q,j})}$$
also depends on the pixel index q. The joint, the prior and the posterior distributions depend on the pixel index q as well, with
$$p_\theta(x, y \mid q) = \frac{\exp(f_\theta(x)_{q,y})}{Z(\theta, q)},$$
$$p_\theta(x \mid q) = \frac{\sum_{y} \exp(f_\theta(x)_{q,y})}{Z(\theta, q)}$$
and
$$p_\theta(y \mid x, q) = \frac{p_\theta(x, y \mid q)}{p_\theta(x \mid q)}.$$
By solving pθ(y|x,q), the normalization term Z(θ,q) cancels out and we obtain the softmax function
$$p_\theta(y \mid x, q) = \frac{\exp(f_\theta(x)_{q,y})}{\sum_{j=1}^{K} \exp(f_\theta(x)_{q,j})}$$
again. The joint probability of pθ(x,q) is given as
$$p_\theta(x, q) = p_\theta(x \mid q)\, p(q).$$
The pixel index q is distributed uniformly. So, p(q) is a uniform distribution
$$p(q) = \frac{1}{W \cdot H}$$
with W and H as the image size. The joint distribution of
Joint Energy-Based Semantic Segmentation Derivation
As a simplification assumption, we suppose that the prior factorizes over the pixel index q, i.e.,
Joint Energy-Based Semantic Segmentation Derivation
Joint Energy-Based Semantic Segmentation Derivation
Joint Energy-Based Semantic Segmentation Derivation
The energy function finally results in
Joint Energy-Based Semantic Segmentation Derivation
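As an illustration of a per-pixel energy, the following sketch computes the negative log-sum-exp of the logits at every pixel index q and then aggregates the resulting energy map into one value per image. Summing over pixels is an assumption made here to match the factorization over q; the exact aggregation used by the authors should be taken from the equation above and from train_jess.py. Logit values and function names are hypothetical.

```python
import math


def pixel_energy(pixel_logits):
    """Energy at one pixel q: -log sum_y exp(f(x)_{q,y})."""
    m = max(pixel_logits)
    return -(m + math.log(sum(math.exp(v - m) for v in pixel_logits)))


def energy_map(logits):
    """Energy for every pixel of a logit map laid out as logits[h][w][k]."""
    return [[pixel_energy(p) for p in row] for row in logits]


def image_energy(logits):
    """Assumption: the image energy is the sum of the per-pixel energies."""
    return sum(sum(row) for row in energy_map(logits))


# Made-up logits for a 2 x 2 image with K = 3 classes.
logits = [
    [[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]],
    [[0.5, 0.5, 4.0], [1.0, 0.0, 0.0]],
]

for row in energy_map(logits):
    print([round(e, 3) for e in row])
print("image energy:", round(image_energy(logits), 3))
```

Using the mean instead of the sum would only rescale the energy by the constant factor 1/(W·H).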

Usage

To run the Standard Semantic Segmentation of the USA dataset, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test norm --set usa

To run the Standard Semantic Segmentation of the Johnsbachtal Camera dataset, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test norm --set john_cam

To run the Standard Semantic Segmentation of the Johnsbachtal Cellphone dataset, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test norm --set john_handy

To run the Joint Energy-Based Semantic Segmentation, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy True --num_classes 8 --num_tests 10 --test jess

and, with the energy-based optimization switched off (--energy False):

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test jess

To evaluate the model of the Standard Semantic Segmentation of the USA dataset, execute:

    python evaluate_model.py --test norm --set usa --num_classes 8 --batch_size 8

To evaluate the model of the Standard Semantic Segmentation of the Johnsbachtal Camera dataset, execute:

    python evaluate_model.py --test norm --set john_cam --num_classes 8 --batch_size 8

To evaluate the model of the Standard Semantic Segmentation of the Johnsbachtal Cellphone dataset, execute:

    python evaluate_model.py --test norm --set john_handy --num_classes 8 --batch_size 8

To evaluate the model of the Joint Energy-Based Semantic Segmentation, execute:

    python evaluate_model.py --test jess --num_classes 8 --batch_size 8

Make sure you have run the training beforehand, so that the models can be loaded for evaluation.
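Since the three standard evaluation calls above differ only in the --set value, they could be generated by a small Python helper. The sketch below only builds and prints the command lines; it uses exactly the options shown above, and the helper name evaluation_command is hypothetical.

```python
import shlex

# Dataset identifiers as they appear in the --set option above.
DATASETS = ["usa", "john_cam", "john_handy"]


def evaluation_command(dataset, num_classes=8, batch_size=8):
    """Build the argument list for one standard-segmentation evaluation run."""
    return [
        "python", "evaluate_model.py",
        "--test", "norm",
        "--set", dataset,
        "--num_classes", str(num_classes),
        "--batch_size", str(batch_size),
    ]


commands = [evaluation_command(dataset) for dataset in DATASETS]
for command in commands:
    # Printing only; pass the list to subprocess.run(command, check=True)
    # from the repository root to actually start the evaluation.
    print(shlex.join(command))
```

Building the arguments as a list avoids shell quoting issues if the commands are later executed with subprocess.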

To run the neighbor analysis of the USA dataset with neighbors.py, execute:

    python neighbors.py --test norm --set usa

To run the neighbor analysis of the Johnsbachtal Camera dataset, execute:

    python neighbors.py --test norm --set john_cam

To run the neighbor analysis of the Johnsbachtal Cellphone dataset, execute:

    python neighbors.py --test norm --set john_handy