JESS - Joint Energy-Based Semantic Segmentation

About

This is the project page for the paper "Utilization of deep learning tools to map and monitor biological soil crusts". In this project, we propose a new domain adaptation method, Joint Energy-Based Semantic Segmentation (JESS), which improves the transferability of semantic segmentation models and thereby addresses the challenge of data drift (domain drift) in long-term biomonitoring.

Abstract

Biological soil crusts (biocrusts) form a layer of only one to a few centimeters in depth on the soil surface and occur mostly in hot and cold deserts. Biocrusts have a major impact on different processes in these ecosystems, such as carbon and nitrogen cycling, biodiversity preservation, erosion protection and soil dust emission reduction, but they also react very sensitively to climate alterations and land use intensification. Therefore, monitoring tools are required to keep track of the changes of these specialized communities in a changing environment. In the current study, we applied a semantic image segmentation approach using neural networks. One main problem to be solved was that the training data and the target data, to which the model is applied, are often recorded with different camera devices. This leads to different statistical properties of the image data, such as different scale, resolution and brightness, which can significantly affect the model's performance. To solve this problem, we propose a new domain adaptation method using a joint energy-based approach. To test the semantic segmentation approach in general, we utilized biocrust imagery taken in Utah (United States of America) and two sub-datasets from the National Park Gesäuse (Austria). Here, we achieved highly reliable results with an overall classification accuracy of 85.9% for the USA data and 88.6% and 91.4%, respectively, for the two sub-datasets of the National Park Gesäuse. To test our joint energy-based domain adaptation approach, we used the two sub-datasets from the National Park Gesäuse, which were recorded with different camera devices. With this newly established approach, we improved the accuracy of our segmentation on the unlabeled sub-dataset from 70.4% to 75.3%. The results suggest that joint energy-based modelling is a well-suited domain adaptation method for semantic segmentation that could be applied to various deep learning and image-based biomonitoring challenges.

Figure 1: Methodological workflow of the Joint Energy-Based Semantic Segmentation (JESS). Green and orange arrows describe fieldwork and the annotation of images with Labelbox, respectively. Light blue arrows describe the data processing, which was done in Python. The deep learning part, implemented with the Python library PyTorch, is shown in dark blue. The energy-based optimization step of the JESS model is outlined with a red dashed line. The baseline model is trained without this optimization step.

Figure 2: Classification results of the standard image segmentation of the USA biocrust dataset. The central column shows the original image, the left-hand column the manual image annotation, and the right-hand column the standard image segmentation result.

Mathematical Derivation

The Boltzmann distribution,

    p_θ(x) = exp(−E_θ(x)) / Z(θ),

originally used in statistical mechanics, is the basis for energy-based models (LeCun et al., 2006), where E_θ(x) is the energy function, which is parameterized by θ, and Z(θ) is the so-called partition function, with

    Z(θ) = ∫ exp(−E_θ(x)) dx,

to normalize the prior distribution p_θ(x). Energy-based models, as a form of generative model, are used to learn the data distribution of a dataset in order to generate new data. In contrast, discriminative models learn data features or representations to classify data by using a function f_θ(x) to map an input x ∈ R^D to an output of logits y ∈ R^K, where K is the number of classes and D = H×W×3 for RGB image data, with H and W being the height and width of the image, respectively.
A so-called softmax function

produces a categorical distribution over K different possible classes from the model output (Hinton, 2002). The Boltzmann distribution can be used to define an energy-based model of the joint probability of

with the model output f_θ(x)_j as energy function E_θ(x,y). We marginalize out y by computing the sum over y and obtain the prior distribution.

The posterior distribution is computed as

The unknown normalization constant Z(θ) cancels out and we obtain the softmax function again. This means that there is a generative model hidden in every standard discriminative model, as proposed by Grathwohl et al. (2020). Consequently, we can combine discriminative and generative learning by optimizing p_θ(y|x) and additionally maximizing the log-likelihood of p_θ(x), so that our model fits our data better. To maximize the log-likelihood of p_θ(x), it is necessary to compute the derivative of log p_θ(x),

with the expectation term

This term can only be estimated, which is done by sampling x' through Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011), where x denotes the dataset on which we want to maximize the log-likelihood and x' denotes the adversarial dataset, sampled from p_θ(x) via SGLD.
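
The SGLD sampling step can be illustrated on a toy problem. The sketch below is not the JESS implementation: it assumes a one-dimensional energy E(x) = (x − 2)² / 2 with a hand-written gradient, whereas the model described here would obtain the gradient of the network energy with respect to an image by automatic differentiation. The names `sgld_sample` and `grad_energy` and the step size are hypothetical choices for illustration.

```python
import math
import random


def sgld_sample(grad_energy, x0, step_size=0.1, n_steps=100, rng=random):
    """Run one SGLD chain and return its final state.

    Update rule: x <- x - (step_size / 2) * dE/dx + sqrt(step_size) * noise,
    where noise is drawn from a standard normal distribution.
    """
    x = x0
    noise_scale = math.sqrt(step_size)
    for _ in range(n_steps):
        x = x - 0.5 * step_size * grad_energy(x) + noise_scale * rng.gauss(0.0, 1.0)
    return x


def grad_energy(x):
    """Gradient of the toy energy E(x) = (x - 2)^2 / 2."""
    return x - 2.0


# Start every chain from uniform noise and let it drift towards low energy.
rng = random.Random(0)
samples = [
    sgld_sample(grad_energy, x0=rng.uniform(-1.0, 1.0), rng=rng)
    for _ in range(2000)
]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The Boltzmann distribution of this toy energy is a normal distribution with mean 2 and unit variance, so the sample mean and variance should land close to those values. This gives a quick sanity check that the chain samples from low-energy regions.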
The output of a classifier can be interpreted as an energy function

without changing the parameterization θ of the neural network (Liu et al., 2020). During the training of our neural network, a discriminative loss function is minimized. To maximize the log-likelihood of p_θ(x), we minimize the negative log-likelihood in addition to our discriminative loss. So, in addition to the gradient descent of the discriminative training, we can compute the generative loss of our model as

as proposed by Grathwohl et al. (2020). The term log Z(θ) can be ignored here, since it is a constant and has no influence on the minimization.
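
Written out in the notation of Grathwohl et al. (2020), the classification-case quantities described above would take roughly the following form. This is a reconstruction from the surrounding text, not a copy of the paper's equations. The bracket notation f_θ(x)[y] for the y-th logit, the sign convention E_θ(x, y) = −f_θ(x)[y], and the single-sample form of the generative loss are assumptions.

```latex
\begin{align*}
% Softmax over the K logits
p_\theta(y \mid x) &= \frac{\exp\big(f_\theta(x)[y]\big)}{\sum_{y'} \exp\big(f_\theta(x)[y']\big)} \\
% Joint distribution with E_\theta(x, y) = -f_\theta(x)[y]
p_\theta(x, y) &= \frac{\exp\big(f_\theta(x)[y]\big)}{Z(\theta)} \\
% Prior: marginalize out y
p_\theta(x) &= \sum_{y} p_\theta(x, y) = \frac{\sum_{y} \exp\big(f_\theta(x)[y]\big)}{Z(\theta)} \\
% Posterior: Z(\theta) cancels and the softmax is recovered
p_\theta(y \mid x) &= \frac{p_\theta(x, y)}{p_\theta(x)}
  = \frac{\exp\big(f_\theta(x)[y]\big)}{\sum_{y'} \exp\big(f_\theta(x)[y']\big)} \\
% Energy of an input, read off from p_\theta(x) = \exp(-E_\theta(x)) / Z(\theta)
E_\theta(x) &= -\log \sum_{y} \exp\big(f_\theta(x)[y]\big) \\
% Gradient of the log-likelihood
\frac{\partial \log p_\theta(x)}{\partial \theta}
  &= \mathbb{E}_{x' \sim p_\theta(x')}\!\left[\frac{\partial E_\theta(x')}{\partial \theta}\right]
   - \frac{\partial E_\theta(x)}{\partial \theta} \\
% Generative loss, with x' drawn by SGLD (single-sample estimate)
\mathcal{L}_{\text{gen}} &= E_\theta(x) - E_\theta(x')
\end{align*}
```

Minimizing the last expression lowers the energy of real data x and raises the energy of sampled data x', which matches the sign of the negated gradient in the line above it.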
In contrast to image classification, image segmentation yields one classification for every pixel of an image instead of one classification for the whole image. Additional considerations are therefore required to apply Joint Energy-Based Models to semantic segmentation. The pixel index q describes the position of the respective classification.
The softmax function

also depends on the pixel index q. The joint, the prior and the posterior distributions depend on the pixel index q as well, with

and

When computing p_θ(y|x,q), the normalization term Z(θ,q) cancels out and we obtain the softmax function

again. The joint probability p_θ(x,q) is given as

The pixel index q is distributed uniformly, so p(q) is the uniform distribution

    p(q) = 1 / (W·H),

with W and H as the image width and height. The joint distribution of

As a simplifying assumption, we suppose that the prior factorizes over the pixel index q, i.e.,

The energy function finally results in
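
As an illustration of the segmentation case, the following minimal Python sketch computes per-pixel energies and posteriors from per-pixel logits. It rests on one possible reading of the factorization assumption: if the prior is a product over pixels, the log-prior is a sum over pixels, so the image energy is the sum of the per-pixel terms −log Σ_y exp(f_θ(x)_q[y]). Averaging instead of summing would only rescale the result by the constant 1 / (W·H). The function names and the nested-list layout (one list of K logits per pixel) are hypothetical. A PyTorch implementation would operate on a tensor of logits instead.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def pixel_energies(logits):
    """Energy of each pixel q: the negative logsumexp over its K class logits."""
    return [-logsumexp(pixel) for pixel in logits]


def image_energy(logits):
    """Image energy under the assumed factorization: sum of per-pixel energies."""
    return sum(pixel_energies(logits))


def pixel_posterior(pixel_logits):
    """Softmax over the classes of one pixel; Z(theta, q) has cancelled out."""
    lse = logsumexp(pixel_logits)
    return [math.exp(v - lse) for v in pixel_logits]


def generative_loss(real_logits, sampled_logits):
    """Energy of the real image minus energy of the sampled image."""
    return image_energy(real_logits) - image_energy(sampled_logits)


# Two pixels, two classes each.
logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
energy = image_energy(logits)
posterior = pixel_posterior(logits[1])
```

In this example the second pixel has unnormalized class scores 3 and 1, so its posterior assigns three quarters of the probability mass to the first class.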

Usage

To run the Standard Semantic Segmentation of the USA dataset, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test norm --set usa

To run the Standard Semantic Segmentation of the Johnsbachtal Camera dataset, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test norm --set john_cam

To run the Standard Semantic Segmentation of the Johnsbachtal Cellphone dataset, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test norm --set john_handy

To run the Joint Energy-Based Semantic Segmentation, execute:

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy True --num_classes 8 --num_tests 10 --test jess

and

    python train_jess.py --batch_size 8 --learnrate 0.0001 --p_x_weight 0.01 --optimizer adam --eval_every 1 --print_every 1 --ckpt_every 20 --energy False --num_classes 8 --num_tests 10 --test jess

To evaluate the model of the Standard Semantic Segmentation of the USA dataset, execute:

    python evaluate_model.py --test norm --set usa --num_classes 8 --batch_size 8

To evaluate the model of the Standard Semantic Segmentation of the Johnsbachtal Camera dataset, execute:

    python evaluate_model.py --test norm --set john_cam --num_classes 8 --batch_size 8

To evaluate the model of the Standard Semantic Segmentation of the Johnsbachtal Cellphone dataset, execute:

    python evaluate_model.py --test norm --set john_handy --num_classes 8 --batch_size 8

To evaluate the model of the Joint Energy-Based Semantic Segmentation, execute:

    python evaluate_model.py --test jess --num_classes 8 --batch_size 8

Make sure you have run the corresponding training first, so that the trained models can be loaded for evaluation.

To run the neighbor analysis of the USA dataset, execute:

    python neighbors.py --test norm --set usa

To run the neighbor analysis of the Johnsbachtal Camera dataset, execute:

    python neighbors.py --test norm --set john_cam

To run the neighbor analysis of the Johnsbachtal Cellphone dataset, execute:

    python neighbors.py --test norm --set john_handy
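
The standard-segmentation commands above differ only in the value passed to --set, so the train, evaluate and neighbor-analysis steps can be generated per dataset. The following sketch only builds and prints the command lines and does not execute them. The helper name `standard_pipeline` is hypothetical, and the sketch assumes that the weight option is spelled `--p_x_weight` like the other flags.

```python
import shlex

# Hyperparameters copied from the training commands in this section.
# Assumption: the weight option is spelled --p_x_weight, like the other flags.
TRAIN_DEFAULTS = [
    "--batch_size", "8", "--learnrate", "0.0001", "--p_x_weight", "0.01",
    "--optimizer", "adam", "--eval_every", "1", "--print_every", "1",
    "--ckpt_every", "20", "--energy", "False", "--num_classes", "8",
    "--num_tests", "10",
]


def standard_pipeline(dataset):
    """Return the train, evaluate and neighbor-analysis commands for one dataset."""
    select = ["--test", "norm", "--set", dataset]
    return [
        ["python", "train_jess.py", *TRAIN_DEFAULTS, *select],
        ["python", "evaluate_model.py", *select, "--num_classes", "8", "--batch_size", "8"],
        ["python", "neighbors.py", *select],
    ]


for dataset in ("usa", "john_cam", "john_handy"):
    for argv in standard_pipeline(dataset):
        print(shlex.join(argv))
```

Each inner list could be passed to `subprocess.run(argv, check=True)` to execute the steps in order, so that evaluation only starts after training has finished successfully.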