FPCD: An Open Aerial VHR Dataset for Farm Pond Change Detection
Chintan Tundia, Rajiv Kumar, Om Damani and G. Sivakumar
Indian Institute of Technology Bombay, Mumbai, India
(These authors made equal contribution.)
ORCIDs: C. Tundia: https://orcid.org/0000-0003-3169-1775; R. Kumar: https://orcid.org/0000-0003-4174-8587; O. Damani: https://orcid.org/0000-0002-4043-9806; G. Sivakumar: https://orcid.org/0000-0003-2890-6421
Keywords:
Object Detection, Instance Segmentation, Change Detection, Remote Sensing.
Abstract:
Change detection for aerial imagery involves locating and identifying changes associated with areas of interest between co-registered bi-temporal or multi-temporal images of a geographical location. Farm ponds are man-made structures belonging to the category of minor irrigation structures, used to collect surface run-off water for future irrigation. Detecting farm ponds from aerial imagery and tracking their evolution over time aids land surveying and the analysis of agricultural shifts, policy implementation, seasonal effects and climate changes. In this paper, we introduce a publicly available object detection and instance segmentation (OD/IS) dataset for localizing farm ponds in aerial imagery. We also collected and annotated bi-temporal data over a time-span of 14 years across 17 villages, resulting in a binary change detection dataset called the Farm Pond Change Detection Dataset (FPCD). We benchmark and analyze the performance of various object detection and instance segmentation methods on our OD/IS dataset, and of change detection methods on the FPCD dataset. The datasets are publicly accessible at https://huggingface.co/datasets/ctundia/FPCD.
1 INTRODUCTION
Accurate and timely detection of geographical changes on the earth's surface gives extensive information about the various activities and phenomena happening on earth. The change detection task helps in analyzing and understanding co-registered images for change information. A change instance between two images refers to the semantic-level differences in appearance between the two images, in association with the regions of interest captured at different points in time. On geographical images, change detection helps to track changes, to analyze the evolution of land geography or land objects, and to mitigate hazards at local and global scales. The availability of high-resolution aerial imagery has enabled land use and land cover monitoring to detect objects such as wells, farm ponds, check dams, etc. at the instance level.
Change detection can be bi-temporal, when two points in time are compared, or multi-temporal, when multiple points in time are compared. When multi-temporal data is captured by satellites, drones or aerial
vehicles, it is constrained by spatial, spectral and temporal factors, in addition to atmospheric conditions, resolution, etc. Cloud cover, shadows and seasonal changes are visible in many aerial images and affect their overall appearance, making the co-registered images appear to come from different domains. Moreover, acquiring and annotating satellite images to build change detection datasets is a costly process that involves many underlying tasks. The absence of paired images of the same location captured at different times makes it difficult to obtain a useful change detection dataset. Even when paired bi-temporal images are available, there may be no changes present, or the bi-temporal images may have been captured at very short intervals.
Figure 1: Different categories of farm ponds. From left to
right: Wet farm pond (lined), Dry farm pond (lined), Wet
farm pond (unlined) and Dry farm pond (unlined).
Farm ponds have become popular as private irrigation sources in developing countries like India over the last two decades. A farm pond is an artificial dug-out structure with an inlet and an outlet for collecting
the surface runoff water flowing from the farm area (Tundia et al., 2020). Farm ponds are used to collect and store rainwater so as to provide irrigation to crops during periods of water scarcity. The farm pond is one of the many minor irrigation structures (Tundia. et al., 2022), a category defined by a cultivable command area of up to 2000 hectares. Farm ponds can be classified into two categories based on the presence or absence of water: wet farm ponds and dry farm ponds. Farm ponds can also be either lined or unlined, depending on the use of a plastic lining to prevent water seepage into the ground. Overall, based on their structure and the presence of water, farm ponds can be classified into four sub-categories (see Fig. 1): lined wet farm pond, unlined wet farm pond, lined dry farm pond and unlined dry farm pond.
Though efforts have been made to promote farm ponds, serious concerns have been raised over their implementation and usage. The purpose of building farm ponds has long drifted from the original objective of storing rainwater for protective irrigation to serving as storage tanks for pumped-out groundwater, exposing that water to evaporation losses. Over time, farm ponds have accelerated the rate of groundwater exploitation many-fold (Prasad et al., 2022). The objective of our work is to develop change detection models to detect and visualize changes associated with farm ponds and to compute percentage differences in the presence or absence of wet and dry farm ponds. This can help in making policy decisions and in analyzing the impacts of farm ponds on aspects like shifts in agricultural practice, policy implementation, seasonal effects and changes, etc.
1.1 Contributions
Our contributions in this paper are:
1. Farm pond change detection dataset (FPCD):
A publicly available dataset for change detection
tasks on farm pond categories for different pur-
poses and stakeholders.
2. A small-scale public dataset for object detection
and instance segmentation of four farm pond cat-
egories.
The paper is organized as follows: Section 1.3 covers related work, Section 2 details the proposed dataset, Section 3 covers the experiments, Section 4 presents results and observations, and Section 5 concludes.
Table 1: Comparison of change detection datasets.

Dataset                           | Image Pairs | Res. (m)   | Image Size    | Bands | Mask Type | Change Inst.
CLCD (Liu et al., 2022)           | 600         | 0.5 to 2   | 512 x 512     | RGB   | Binary    | -
S2Looking (Shen et al., 2021)     | 5000        | 0.5 to 0.8 | 1024 x 1024   | RGB   | Binary    | 65,920
SYSU-CD (Shi et al., 2022)        | 20000       | 0.5        | 256 x 256     | RGB   | Binary    | -
DSIFN (Zhang et al., 2020)        | 3988        | -          | 512 x 512     | RGB   | Binary    | -
LEVIR-CD (Chen and Shi, 2020)     | 637         | 0.5        | 1024 x 1024   | RGB   | Binary    | 31,333
WHU Building CD (Ji et al., 2019) | 1           | 0.075      | 32207 x 15354 | RGB   | Binary    | 2297
SZTAKI (Bourdis et al., 2011)     | 13          | 1.5        | 952 x 640     | RGB   | Binary    | 382
FPCD                              | 694         | 0.156      | 1024 x 768    | RGB   | Binary    | 616
1.2 Problem Formulation
Generally, change detection tasks involve an input set of multi-temporal images and the corresponding ground truth mask, with most change detection datasets mapping bi-temporal images to binary mask labels. A general assumption in most change detection tasks is that there is a pixel-to-pixel correspondence between the two images and that these correspondences are registered to the same point on a geographical area. Based on the correspondences, each pixel in the change mask can be assigned a label indicating whether there is a change or not. In other words, a pixel of the change mask is assigned the change label if the corresponding area of interest has geographical changes between the two images, and the no-change label otherwise. In a binary change detection setting, the paired input images are $T_0 \in \mathbb{R}^{C \times H \times W}$ and $T_1 \in \mathbb{R}^{C \times H \times W}$ of size $C \times H \times W$, where $H$ and $W$ are the spatial dimensions and $C = 3$ is the input image channel dimension. The ground truth can be represented as a pixel-based mask label $M \in \mathbb{R}^{C \times H \times W}$ for the bi-temporal input images of the change detection task.
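To make these shapes concrete, the following minimal PyTorch sketch builds one hypothetical bi-temporal pair and its mask; the tensor names and random values are purely illustrative and not part of the dataset tooling.

```python
import torch

# Illustrative shapes for one FPCD-style bi-temporal pair (values are random).
C, H, W = 3, 768, 1024              # RGB channels, height, width (FPCD images are 1024 x 768)
t0 = torch.rand(C, H, W)            # image at the earlier timestamp T0
t1 = torch.rand(C, H, W)            # image at the later timestamp T1
mask = torch.randint(0, 2, (H, W))  # binary change mask: 1 = change, 0 = no change

# Pixel-to-pixel correspondence: mask[i, j] labels the change between
# t0[:, i, j] and t1[:, i, j], which are registered to the same geographical point.
assert t0.shape == t1.shape == (C, H, W) and mask.shape == (H, W)
```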
1.3 Related Work
1.3.1 Datasets
Change detection datasets generally use RGB images (Chen and Shi, 2020) or hyperspectral images (Daudt et al., 2018), with some datasets (Van Etten et al., 2021) having up to 11 million change instances. The image resolution of CD datasets ranges from a few centimeters (Ji et al., 2019), (Shao et al., 2021), (Tian et al., 2020) to 10 meters (Van Etten et al., 2021), (Daudt et al., 2018). Some of these datasets have image pairs numbering from a few hundred (Liu et al., 2022) to tens of thousands (Shi et al., 2022), and some have very few input image pairs of very high resolution.
Most CD datasets have fixed, modest image sizes, with the exception of a few (Wu et al., 2017), (Ji et al., 2019). We summarize the various change detection datasets along with their details in Table 1.
1.3.2 Change Detection Techniques
Deep learning based change detection methods can be classified into feature-based, patch-based and image-based approaches. Earlier CNN architectures used siamese networks with triplet loss (Zhang et al., 2019) and weighted contrastive loss (Zhan et al., 2017) to learn discriminative features between change and no-change images, and used Euclidean distances between image features to generate difference images. Recent developments in deep learning have led to the use of attention mechanisms and feature fusion at various scales, improving feature extraction capabilities. We compare some of the existing encoder-decoder based models used for change detection below.
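As a concrete illustration of the feature-distance idea behind these early siamese approaches, the sketch below computes a per-pixel Euclidean distance map and a weighted contrastive loss; the margin value and weighting scheme are our own illustrative assumptions, not the exact formulations of the cited papers.

```python
import torch
import torch.nn.functional as F

def contrastive_change_loss(feat0, feat1, change_mask, margin=2.0, w_change=0.7):
    """Weighted contrastive loss over per-pixel siamese feature distances.

    feat0, feat1: (B, D, H, W) features of the two timestamps from a shared encoder.
    change_mask:  (B, H, W) float mask with 1 = change, 0 = no change.
    """
    # Euclidean distance between the feature stacks acts as the "difference image".
    dist = torch.norm(feat0 - feat1, p=2, dim=1)        # (B, H, W)
    # Pull no-change pixels together; push change pixels beyond the margin.
    loss_same = (1 - change_mask) * dist.pow(2)
    loss_diff = change_mask * F.relu(margin - dist).pow(2)
    # Class weighting compensates for the scarcity of change pixels.
    return (w_change * loss_diff + (1 - w_change) * loss_same).mean()
```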
Deeplabv3+ (Chen et al., 2018) uses a spatial pyramid pooling module to encode multi-scale context at multiple effective fields-of-view and an encoder-decoder structure to capture sharper object boundaries. It improves upon DeepLabv3 (Chen et al., 2017) by applying depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules. The Pyramid Scene Parsing Network (PSPNet) (Zhao et al., 2017), with its pyramid pooling module, applies different-region-based context aggregation to produce a global prior representation for pixel-level prediction tasks. The Unified Perceptual Parsing Network (UPerNet) (Xiao et al., 2018) is a multi-task framework that can recognize visual concepts from a given image using a training strategy developed to learn from heterogeneous image annotations. The Multi-scale Attention Network (MA-Net) (Fan et al., 2020) uses multi-scale feature fusion to improve segmentation performance by introducing a self-attention mechanism that integrates local features with their global dependencies. It uses a Position-wise Attention Block (PAB) to model feature inter-dependencies in the spatial dimensions and a Multi-scale Fusion Attention Block (MFAB) to capture channel dependencies between feature maps via multi-scale semantic feature fusion.
LinkNet (Chaurasia and Culurciello, 2017) proposes a novel deep neural network architecture that learns efficiently with only 11.5 million parameters. The Pyramid Attention Network (PAN) (Li et al., 2018) uses a Feature Pyramid Attention module and a Global Attention Upsample module to combine the attention mechanism and spatial pyramid to extract precise dense features for pixel labeling. Unet++ (Zhou et al., 2018) improves on Unet (Ronneberger et al., 2015) with redesigned skip pathways, based on the idea that the optimizer can learn more easily when the semantic gap between the feature maps of the encoder and decoder sub-networks is reduced. The Bitemporal Image Transformer (BiT) (Hao Chen and Shi, 2021) models contexts within the spatial-temporal domain in a deep feature differencing-based CD framework. The bi-temporal images are encoded as tokens, a transformer encoder models the contexts in the compact token-based space-time, and a transformer decoder then refines the original features from the learned context-rich tokens back to pixel space.
1.3.3 Object Detection and Instance
Segmentation Techniques
Object detection is a computer vision task that localizes and classifies objects of interest in an image. Instance segmentation, on the other hand, provides a detailed inference for every single pixel in the input image. Since the advent of deep learning (Kumar et al., 2021), object detection has vastly benefited from sophisticated architectures and image representations. Over the years, deep learning based detectors have evolved into one-stage and two-stage detectors. In one-stage detection, the input image is divided into regions simultaneously with the probabilistic prediction of objects, while in two-stage detection the object proposals are classified in the second stage from a sparse set of candidate proposals generated in the first stage. Well-known two-stage detectors include FasterRCNN (Ren et al., 2017) and GridRCNN (Lu et al., 2019), while YOLOv3 (Redmon and Farhadi, 2018), Generalized Focal Loss (Li et al., 2020) and the Gradient Harmonized SSD (Li et al., 2019) are one-stage detectors. Detecting objects in aerial imagery is affected by adversities like viewpoint variation, illumination and occlusion, and is made difficult by objects being small, sparse and non-uniform in high-resolution aerial images.
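The benchmark in this paper uses MMDetection (Section 3), but as a self-contained illustration of two-stage detection, the sketch below runs a COCO-pretrained Faster R-CNN from torchvision on a dummy aerial-sized tile; detecting the four farm pond classes would additionally require fine-tuning on the OD/IS dataset.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two-stage detector: a region proposal network generates candidate boxes,
# which a second stage then classifies and refines.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained weights
model.eval()

image = torch.rand(3, 768, 1024)                    # stand-in for a 1024 x 768 tile
with torch.no_grad():
    predictions = model([image])                    # one dict per input image

boxes = predictions[0]["boxes"]                     # (N, 4) boxes in xyxy format
scores = predictions[0]["scores"]                   # (N,) confidence scores
keep = scores > 0.5                                 # simple confidence threshold
print(boxes[keep])
```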
2 FARM POND CHANGE
DETECTION (FPCD) DATASET
FPCD is a novel publicly available change detection
dataset that focuses on changes associated with irri-
gation structures from India. The details of the multi-
temporal images including the district, village name
and the time-interval over which the images were col-
lected are summarized in Table 5.
Table 2: Temporal object instances and their respective change classes (image grid not reproduced here). Row 1 corresponds to T0 images, row 2 to T1 images, and row 3 to the change masks. Columns: (a, b) No change, (c) Farm pond constructed, (d) Farm pond demolished, (e) Farm pond dried and (f) Farm pond wetted.
Table 3: Class distribution of change objects.

Change Class          | No. of Instances
Farm pond constructed | 431
Farm pond demolished  | 39
Farm pond dried       | 47
Farm pond wetted      | 99
2.1 Dataset Collection Technique
The images were collected by following the steps
given below:-
1. Area and Timestamp Selection: The ground
truth information about the farm pond locations
and the list of villages were collected from dif-
ferent sources like news articles, reports and web-
pages along with Jalyukt Shivar data(Maharashtra
Remote Sensing Application Centre (MRSAC),
2015). The location of the Farm ponds would
span across the different districts of Maharashtra,
India. The villages were then rigorously filtered
and chosen based on the presence or absence of
farm ponds, and bi-temporal pairs were decided
by visual inspection of various timestamps.
2. Grid Formation: To collect images of a fixed size for each village, grids of latitude and longitude at a fixed Google Maps zoom level were needed. We selected zoom level 18, which provides sub-meter resolution (see the sketch after this list). Grids of size 1024x768 pixels were created in geo-location dimensions, i.e. latitudes and longitudes, using the village boundaries provided by the Indian Village Boundaries Project (Data{Meet}, 2019).
3. Image Collection: The Google Earth Pro desktop software was used for collecting historical imagery. For each grid cell, the map view was set to cover the grid cell boundaries. The bi-temporal images were then saved by setting the timestamps one at a time using the Google Earth Historical Imagery tool.
Table 4: Class distribution of object instances in the OD/IS dataset.

Annotation Class        | No. of Instances
Wet farm pond (lined)   | 287
Wet farm pond (unlined) | 293
Dry farm pond (lined)   | 90
Dry farm pond (unlined) | 668
Table 5: Villages and timestamps for the FPCD dataset.

District   | Village Name        | Timestamps             | #Image Pairs
Nasik      | Hadap Sawargaon     | Feb-2019, Feb-2021     | 25
Akola      | Akhatwada           | Mar-2007, Mar-2018     | 24
Akola      | Ghusar              | April-2013, Mar-2018   | 124
Amravati   | Nardoda             | Nov-2009, Jan-2021     | 39
Aurangabad | Kumbephal           | Feb-2014, Mar-2020     | 62
Beed       | Kumbhephal          | April-2012, Mar-2019   | 37
Buldhana   | Bhivgaon            | Jan-2014, Jan-2019     | 8
Hingoli    | Gondala             | Feb-2014, Mar-2020     | 46
Jalgaon    | Pimpalgaon Block    | Oct-2013, Nov-2020     | 17
Jalna      | Dawargaon/Bhatkheda | April-2012, April-2019 | 16
Latur      | Sumthana            | Jan-2013, Jan-2021     | 15
Nanded     | Chainpur            | Jan-2014, Mar-2019     | 56
Osmanabad  | Ambejawalga         | Jan-2014, Feb-2019     | 74
Parbhani   | Kinhola Block       | May-2010, April-2020   | 62
Wardha     | Kakaddara           | Jan-2013, Feb-2019     | 17
Washim     | Bhoyata             | April-2013, Feb-2021   | 46
Yavatmal   | Wadhona Pilki       | April-2014, April-2019 | 26
Total images: 694
4. Object & Instance Annotation: Once all the image pairs for each village were collected, we annotated the objects individually in each image. Depending on the type of farm pond, there are four main classes, as given in Table 4. We used the annotation tool LabelMe (Russell et al., 2008) to annotate and label the above-mentioned classes, and then converted the annotations into the COCO (Lin et al., 2014) format required for the object detection and instance segmentation tasks, thus forming the Farm Pond OD/IS dataset.
5. Change Mask Generation: In this step, we take each bi-temporal pair of images and the corresponding farm pond annotations. Based on the location and farm pond category, we generate different types of change masks with the change classes given in Table 2. More details on the types of changes are given in Subsection 2.2.
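As referenced in the grid formation step, the ground resolution at a given zoom level can be estimated with the standard Web Mercator formula; the sketch below, which assumes Web Mercator tiling and an illustrative Maharashtra latitude, confirms that zoom level 18 yields sub-meter pixels and derives the ground footprint of one 1024x768 grid cell.

```python
import math

def ground_resolution_m_per_px(lat_deg: float, zoom: int) -> float:
    """Web Mercator ground resolution in meters per pixel."""
    # 156543.034 m/px is the equatorial resolution at zoom 0 for 256-pixel tiles.
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)

res = ground_resolution_m_per_px(19.0, zoom=18)   # ~0.56 m/px around latitude 19 N
print(f"~{res:.3f} m/px at zoom 18")

# Ground footprint of one 1024 x 768 grid cell at that resolution.
print(f"cell covers ~{1024 * res:.0f} m x {768 * res:.0f} m")
```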
2.2 Dataset Details
A total of 694 images of size 1024 x 768 pixels at zoom level 18 were collected from Google Earth using the technique described in Subsection 2.1. Regions of Maharashtra in India were chosen, since it is a largely groundwater-dependent region in western India. The images collected at zoom level 18 are at a very high resolution, finer than 1 meter. The villages and their respective timestamps are given in Table 5. Most of the villages have timestamps during the months of January to April; the minimum year difference between bi-temporal images is 2 years and the maximum is 9 years, with the earliest imagery from 2007 and the latest from 2021. The FPCD dataset consists of image pairs,
change masks, and object annotations of farm ponds as polygons in the COCO (Lin et al., 2014) format.
For farm pond change detection, we identify four change classes: farm pond constructed, farm pond demolished, farm pond dried and farm pond wetted. Consider the bi-temporal pair T0-T1, with T0 being the image captured at the older timestamp and T1 the image captured at the newer timestamp. We identify a binary change as farm pond constructed when there is no farm pond in the T0 image but a farm pond is observed at the same location in the T1 image. Likewise, we identify a binary change as farm pond demolished when a farm pond existed at a location in the T0 image and, in the T1 image, there is no farm pond at the same location or it has been replaced by different terrain. We identify a binary change as farm pond dried when a wet farm pond in the T0 image becomes dry at the same location in the T1 image due to an absence of water or a low water level. We identify a binary change as farm pond wetted when a dry farm pond in the T0 image is filled with water, or the water reaches the surface level, at the same location in the T1 image.
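A sketch of how these four change classes could be derived from per-pixel category rasters of the two timestamps is given below; the integer encodings and the helper name are our own illustrative assumptions, not the authors' annotation tooling.

```python
import numpy as np

# Illustrative per-pixel category codes in each timestamp's annotation raster.
NONE, WET, DRY = 0, 1, 2    # no farm pond / wet farm pond / dry farm pond

def change_classes(t0: np.ndarray, t1: np.ndarray) -> dict:
    """Derive boolean masks for the four FPCD change classes."""
    return {
        "constructed": (t0 == NONE) & (t1 != NONE),  # pond appears in T1
        "demolished":  (t0 != NONE) & (t1 == NONE),  # pond gone in T1
        "dried":       (t0 == WET)  & (t1 == DRY),   # wet farm pond becomes dry
        "wetted":      (t0 == DRY)  & (t1 == WET),   # dry farm pond becomes wet
    }
```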
Some farmers pump groundwater to fill their farm ponds instead of using rain-fed surface runoff, leading to further water loss by evaporation. This depletes groundwater and makes the practice of farm pond based irrigation unsustainable (Prasad et al., 2022). Grouping the farm pond constructed and farm pond demolished classes thus helps in identifying the increase or decrease in farm ponds. This helps stakeholders like researchers and policy makers monitor the impacts of such an increase or decrease, such as changes in agricultural patterns and impacts on groundwater. We classify this task as Task-1.
In certain dry and semi-arid regions with uncertain climatic conditions, yearly and seasonal rainfall is erratic. Agricultural officers and researchers often correlate climatic conditions and groundwater levels with such changes to draw further inferences, eventually leading to the interventions needed for agricultural sustainability. We therefore group farm pond dried and farm pond wetted into Task-2. We also combined the above grouped changes to provide overall statistics; we call this Task-3. Experimental details of Task-1, Task-2 and Task-3 are given in Section 3. The change class distribution is given in Table 3.
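Under these definitions, each task mask reduces to an OR over the relevant class masks; a sketch, continuing the hypothetical change_classes helper above:

```python
def task_masks(cc: dict) -> dict:
    """Group change-class masks into one binary mask per task."""
    return {
        "task1": cc["constructed"] | cc["demolished"],   # pond count changes
        "task2": cc["dried"] | cc["wetted"],             # pond state changes
        "task3": (cc["constructed"] | cc["demolished"]
                  | cc["dried"] | cc["wetted"]),         # union of Task-1 and Task-2
    }
```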
Table 6: Distribution of positive and negative image pairs for the change detection tasks.

Split      | Task 1 Neg | Task 1 Pos | Task 2 Neg | Task 2 Pos | Task 3 Neg | Task 3 Pos
Train      | 363        | 237        | 534        | 66         | 329        | 271
Validation | 52         | 42         | 83         | 11         | 45         | 49
Table 7: Binary change detection benchmark on the farm pond constructed/demolished task (Task 1). Each cell gives Precision / Recall / F-Score.

Model      | ResNet-18                | ResNet-50                | ResNet-101
UNet       | 0.8174 / 0.8537 / 0.7360 | 0.9331 / 0.5533 / 0.5534 | 0.9051 / 0.7072 / 0.666
UNet++     | 0.8467 / 0.8439 / 0.7668 | 0.8465 / 0.8439 / 0.7668 | 0.9123 / 0.7969 / 0.7669
MANet      | 0.8347 / 0.5536 / 0.5539 | 0.9998 / 0.5531 / 0.5532 | 0.9894 / 0.5596 / 0.5539
LinkNet    | 0.8558 / 0.8251 / 0.7529 | 0.8580 / 0.7991 / 0.7226 | 0.9029 / 0.7554 / 0.7233
FPN        | 0.8833 / 0.8106 / 0.7567 | 0.8559 / 0.8446 / 0.7788 | 0.8582 / 0.8764 / 0.8003
PSPNet     | 0.9763 / 0.5855 / 0.5840 | 0.9763 / 0.5855 / 0.5840 | 0.9496 / 0.6178 / 0.6172
PAN        | 0.8572 / 0.8037 / 0.7250 | 0.8546 / 0.8383 / 0.7591 | 0.9176 / 0.7907 / 0.7685
DeepLabV3  | 0.8474 / 0.8428 / 0.7776 | 0.9138 / 0.7917 / 0.7720 | 0.8891 / 0.8188 / 0.7894
DeepLabV3+ | 0.8998 / 0.8085 / 0.7870 | 0.8953 / 0.8151 / 0.7717 | 0.8563 / 0.8348 / 0.7771
UPerNet    | 0.8843 / 0.8148 / 0.7783 | 0.9055 / 0.8209 / 0.7855 | 0.9131 / 0.8389 / 0.8067
BiT        | 0.9252 / 0.8283 / 0.8704 | -                        | -
3 EXPERIMENTS AND
EVALUATION
Tasks like instance segmentation and image segmentation (Lin et al., 2014) involve a pipeline similar to that of change detection. We used the change detection framework of (Kaiyu Li, 2021) for most of our CD experiments. This PyTorch-based framework supports many models, encoders and deep architectures. In this change detection pipeline, the bi-temporal images are encoded into feature vectors using two encoders, which can be of siamese or non-siamese type. The encoded feature vectors are fused by concatenating, summing, subtracting or taking the absolute difference of the vectors, and the resulting feature vector is decoded into an output that is compared with the ground truth mask. For data augmentation, we apply simple transforms like flipping, scaling and cropping at the global image level. Unlike in other tasks, for change detection the same transform has to be applied to both multi-temporal images.
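A minimal sketch of this encode-fuse-decode pattern, assuming a shared (siamese) encoder and concatenation fusion, is shown below; it is a simplification for illustration, not the cited framework's code, and it also demonstrates applying the same flip to both temporal images.

```python
import torch
import torch.nn as nn

class SiameseConcatCD(nn.Module):
    """Toy encode-fuse-decode change detector with concatenation fusion."""
    def __init__(self, in_ch=3, feat_ch=16):
        super().__init__()
        # Shared (siamese) encoder applied to both timestamps.
        self.encoder = nn.Sequential(nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU())
        # Decoder maps fused features to a one-channel change logit map.
        self.decoder = nn.Conv2d(2 * feat_ch, 1, 3, padding=1)

    def forward(self, t0, t1):
        f0, f1 = self.encoder(t0), self.encoder(t1)
        fused = torch.cat([f0, f1], dim=1)   # alternatives: f0 + f1, f0 - f1, (f0 - f1).abs()
        return self.decoder(fused)           # logits, compared against the ground truth mask

# The same augmentation must be applied to both timestamps, e.g. a synchronized flip:
t0, t1 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
if torch.rand(()) < 0.5:
    t0, t1 = torch.flip(t0, dims=[-1]), torch.flip(t1, dims=[-1])
logits = SiameseConcatCD()(t0, t1)           # shape (1, 1, 256, 256)
```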
3.1 Evaluation Criteria
We report precision, recall and F-score as the metrics for comparing performance on the different change detection tasks under various encoder backbone settings, defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives.
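Computed from pixel-wise counts over binary masks, these metrics might look as follows; a small sketch assuming NumPy arrays of zeros and ones.

```python
import numpy as np

def precision_recall_fscore(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Pixel-wise precision, recall and F-score for binary change masks."""
    tp = np.sum((pred == 1) & (gt == 1))   # true positives
    fp = np.sum((pred == 1) & (gt == 0))   # false positives
    fn = np.sum((pred == 0) & (gt == 1))   # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_score = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f_score
```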
Table 8: Binary change detection benchmark on the farm pond dried/wetted task (Task 2). Each cell gives Precision / Recall / F-Score.

Model      | ResNet-18                 | ResNet-50                 | ResNet-101
UNet       | 0.9973 / 0.9094 / 0.9171  | 0.9893 / 0.9091 / 0.9082  | 0.9907 / 0.9128 / 0.9131
UNet++     | 0.9813 / 0.9188 / 0.9112  | 0.978 / 0.9181 / 0.90573  | 0.9785 / 0.9147 / 0.9065
MANet      | 0.9826 / 0.9075 / 0.897   | 0.9927 / 0.9023 / 0.9052  | 0.9927 / 0.8966 / 0.902
LinkNet    | 0.9935 / 0.8867 / 0.8893  | 0.994 / 0.8974 / 0.9014   | 0.9918 / 0.91 / 0.9119
FPN        | 0.9923 / 0.9057 / 0.9063  | 0.9936 / 0.9125 / 0.9125  | 0.9949 / 0.9125 / 0.9133
PSPNet     | 0.9939 / 0.9054 / 0.907   | 0.9948 / 0.9023 / 0.9052  | 0.9926 / 0.9054 / 0.9064
PAN        | 0.9907 / 0.9146 / 0.9116  | 0.9905 / 0.9132 / 0.9113  | 0.9812 / 0.9135 / 0.901
DeepLabV3  | 0.9814 / 0.9066 / 0.8971  | 0.9326 / 0.9246 / 0.8698  | 0.9687 / 0.9096 / 0.8888
DeepLabV3+ | 0.9873 / 0.9081 / 0.908   | 0.9918 / 0.9126 / 0.9109  | 0.9767 / 0.9246 / 0.912
UPerNet    | 0.9936 / 0.91367 / 0.9147 | 0.9957 / 0.9115 / 0.911   | 0.9929 / 0.9106 / 0.9096
BiT        | 0.8458 / 0.8254 / 0.8352  | -                         | -
Table 9: Comparison between training with all image pairs and training with only positive image pairs on Task 3. Each cell gives Precision / Recall / F-Score.

Model      | All pairs, ResNet-18     | All pairs, ResNet-50     | Positives only, ResNet-18 | Positives only, ResNet-50
UNet       | 0.8627 / 0.7778 / 0.7444 | 0.864 / 0.7794 / 0.7442  | 0.798 / 0.7933 / 0.7049   | 0.8163 / 0.7967 / 0.7253
UNet++     | 0.8961 / 0.7795 / 0.7579 | 0.8653 / 0.8081 / 0.7575 | 0.7884 / 0.8008 / 0.7013  | 0.8441 / 0.8482 / 0.768
MANet      | 0.7868 / 0.821 / 0.741   | 0.8451 / 0.7586 / 0.7246 | 0.7115 / 0.7925 / 0.6388  | 0.7944 / 0.7054 / 0.6145
LinkNet    | 0.8592 / 0.7872 / 0.7268 | 0.7868 / 0.8604 / 0.7502 | 0.7667 / 0.8024 / 0.6871  | 0.8753 / 0.7627 / 0.7236
FPN        | 0.8775 / 0.7856 / 0.7653 | 0.9192 / 0.7591 / 0.7696 | 0.9067 / 0.6843 / 0.6909  | 0.8403 / 0.8094 / 0.7442
PSPNet     | 0.8749 / 0.5652 / 0.5519 | 0.9571 / 0.5767 / 0.5919 | 0.8518 / 0.589 / 0.5564   | 0.8783 / 0.6212 / 0.601
PAN        | 0.8804 / 0.7252 / 0.7048 | 0.7228 / 0.6935 / 0.5435 | 0.752 / 0.7977 / 0.6575   | 0.6464 / 0.7536 / 0.5753
DeepLabV3  | 0.9329 / 0.7462 / 0.7652 | 0.8775 / 0.8257 / 0.7952 | 0.7656 / 0.8276 / 0.6813  | 0.7744 / 0.8611 / 0.7169
DeepLabV3+ | 0.8232 / 0.7953 / 0.7238 | 0.9522 / 0.7118 / 0.7519 | 0.8046 / 0.8214 / 0.7315  | 0.8168 / 0.8045 / 0.7296
UPerNet    | 0.8634 / 0.7801 / 0.7515 | 0.9198 / 0.773 / 0.7704  | 0.8414 / 0.8043 / 0.7569  | 0.8473 / 0.8103 / 0.7575
BiT        | 0.8840 / 0.8647 / 0.8741 | -                        | 0.8710 / 0.8758 / 0.8734  | -
We report the bounding box mean average precision (bbox mAP) and segmentation mean average precision (segm mAP) as the performance metrics for comparing the various object detection and instance segmentation methods under different backbone settings in the benchmark. Intersection over Union (IoU) measures the overlap between ground truth and predicted bounding boxes and helps determine whether a detection is a true positive. Average Precision (AP) is obtained by interpolating the precision at each recall level r, taking the maximum precision at any recall greater than or equal to r. Finally, the mean of the AP over all classes gives the mean Average Precision (mAP), the metric used to compare different detectors.
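For reference, the IoU of two axis-aligned boxes can be computed as in the sketch below (boxes in xyxy format); a detection counts as a true positive when its IoU with a matching ground-truth box exceeds the chosen threshold.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])    # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])    # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# mAP(0.5:0.95) averages AP over IoU thresholds 0.5, 0.55, ..., 0.95.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))     # 25 / 175 = ~0.143
```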
3.2 Experiments
We benchmarked various object detection (OD) and instance segmentation (IS) methods on the farm pond OD/IS dataset. We use an existing encoder-decoder based change detection framework (Kaiyu Li, 2021) for the binary change detection tasks corresponding to farm pond construction/demolition (Task 1) and farm pond dried/wetted (Task 2). The encoders are of siamese type, and the features from each branch are fused by concatenation. A few experiments were conducted to analyze the effect of various components of the change detection pipeline. We combined the masks from Task 1 and Task 2 to understand the impact of multiple change classes on performance; combining the tasks leads to masks with more change objects than either task alone (see Fig. 2). The train-validation split for each task is given in Table 6. We split the farm pond dataset for the OD/IS task in an 80/20 train/test proportion. We used the MMDetection (Chen et al., 2019) framework to implement the various methods in the OD/IS benchmark. We used ResNet-50, ResNet-101 and MobileNetv2 backbones, pretrained primarily on ImageNet, COCO and Cityscapes, for the object detection experiments, and ResNet-50, ResNet-101 and Swin Transformer pretrained backbones for the instance segmentation experiments.

Figure 2: T0 and T1 input image pair in the top row, and the corresponding masks for Task 1, Task 2 and Task 3 from left to right in the bottom row.
4 RESULTS AND OBSERVATIONS
The results of change detection on the farm pond construction/demolition task (Task-1) with ResNet-18, ResNet-50 and ResNet-101 encoders are given in Table 7. The BiT model, despite not having the best precision or recall values, performs best in terms of F-score, with 0.8704. MANet, despite the highest precision, scores poorly on recall, leading to a poor F-score, while FPN achieves the highest recall. FPN's recall is best with ResNet-101 and is consistently better than with its smaller counterparts, ResNet-18 and ResNet-50.
The results of change detection on the farm pond dried/wetted task (Task-2) with ResNet-18, ResNet-50 and ResNet-101 encoders are given in Table 8. UNet has the best F-score of 0.9171 and the highest precision with ResNet-18. With the ResNet-50 encoder, FPN has the best F-score of 0.9125, though DeepLabV3 has the best recall and UPerNet the best precision. With ResNet-101, DeepLabV3+ has the highest recall, with comparatively lower precision than the best performing model. Overall, Task-2 scores higher irrespective of the encoder architecture. This may be because farm ponds are present at the same locations in both temporal images wherever a change is registered, and also because the number of positive pairs is much smaller than the number of negative pairs for this task.
The results of combining the change detection tasks (Task-1 and Task-2) into Task-3, with ResNet-18 and ResNet-50 encoders, are given in Table 9. With the ResNet-18 encoder, BiT has the best F-score of 0.8741 and the highest recall. With the ResNet-50 encoder, PSPNet has the best precision but poor recall, which hurts its F-score.
We also conducted an ablation study on Task 3 to check whether the absence of negative examples in the training set affects the change detection task. BiT again achieves the best F-score with ResNet-18 as the encoder, but with a decrease in precision compared to the best model on the complete dataset, despite better recall. The F-scores show a similar trend, suggesting that even with a high negative-to-positive ratio in the original data, performance remains comparable when negative images are absent from the training set. Similarly, UNet++ achieves an F-score that trails its counterpart trained on the entire dataset by only a few points; this further supports the idea that positive samples drive learning, while negative images may increase the model's robustness.
We now analyze performance on the object detection and instance segmentation tasks on the farm pond OD/IS dataset. The object detection results of various existing models are compared in Table 11. For the object detection task, the Probabilistic Anchor Assignment (PAA) model achieves the best performance of 0.575 bbox mAP(0.5:0.95) with the ResNet-50 backbone. Empirical Attention with the attention component 0010 achieves the best mAP under IoU(0.75). Guided Anchoring with the FasterRCNN model performs best on bbox mAP(0.5) using the ResNet-50 backbone. For the instance segmentation task, many models can generate segmentation masks along with bounding box results, and the COCO metrics for both are given in Table 10. CascadeRCNN performs best with 0.577 bbox mAP(0.5:0.95) using ResNet-50 and 0.694 bbox mAP(0.75) using ResNet-101, indicating a robust detector. The Swin Transformer has the best bbox mAP(0.5) with its own backbone. For instance segmentation, Deformable ConvNetsv2 with the MRCNN model and a ResNet-50 backbone performs best on segm mAP(0.5:0.95), while MaskRCNN with ResNet-50 performs best on segm mAP(0.75), and the Swin Transformer performs best under the segm mAP(0.5) metric.
Table 10: Instance segmentation results on the OD/IS dataset.

Method                         | Backbone         | bbox mAP (0.5:0.95) | bbox mAP (0.5) | bbox mAP (0.75) | segm mAP (0.5:0.95) | segm mAP (0.5) | segm mAP (0.75)
Mask RCNN                      | ResNet-50        | 0.558 | 0.834 | 0.68  | 0.59  | 0.853 | 0.696
Swin Transformer               | Swin Transformer | 0.563 | 0.851 | 0.667 | 0.59  | 0.874 | 0.671
Cascade RCNN                   | ResNet-50        | 0.577 | 0.81  | 0.688 | 0.575 | 0.829 | 0.693
Cascade RCNN                   | ResNet-101       | 0.563 | 0.813 | 0.694 | 0.572 | 0.827 | 0.685
Hybrid Task Cascade            | ResNet-50        | 0.533 | 0.785 | 0.663 | 0.541 | 0.805 | 0.643
Deformable ConvNets v2 (MRCNN) | ResNet-50        | 0.571 | 0.833 | 0.673 | 0.591 | 0.859 | 0.691
5 CONCLUSION
The availability of high-resolution images due to advances in remote sensing has led to the use of deep learning models for change detection. In this paper, we introduced FPCD, a publicly available dataset for change detection tasks on a class of minor irrigation structures, farm ponds. We also introduced a small-scale public dataset for object detection and instance segmentation of four farm pond categories. Future work can address the limited data availability and class imbalance in the dataset.
Table 11: Object detection results on the OD/IS dataset.

Method              | Model Name      | Backbone    | bbox mAP (0.5:0.95) | bbox mAP (0.5) | bbox mAP (0.75)
Grid RCNN           | Grid RCNN       | ResNet-50   | 0.56  | 0.553 | 0.551
Deformable DETR     | 2-stg Def. DETR | ResNet-50   | 0.526 | 0.774 | 0.624
Faster RCNN         | Faster RCNN     | ResNet-50   | 0.555 | 0.832 | 0.646
Empirical Attention | AC-0010         | ResNet-50   | 0.553 | 0.838 | 0.687
Empirical Attention | AC-1111         | ResNet-50   | 0.557 | 0.829 | 0.663
GFL                 | GFL             | ResNet-50   | 0.553 | 0.815 | 0.653
GH SSD              | GHM             | ResNet-50   | 0.528 | 0.787 | 0.629
Guided Anchoring    | Faster RCNN     | ResNet-50   | 0.566 | 0.843 | 0.672
Guided Anchoring    | RetinaNet       | ResNet-50   | 0.547 | 0.832 | 0.637
PAA                 | PAA             | ResNet-50   | 0.575 | 0.832 | 0.686
YOLOv3              | YOLOv3          | MobileNetv2 | 0.437 | 0.729 | 0.485
REFERENCES
Bourdis, N., Marraud, D., and Sahbi, H. (2011). Con-
strained optical flow for aerial image change detec-
tion. In 2011 IEEE IGARSS.
Chaurasia, A. and Culurciello, E. (2017). Linknet: Exploit-
ing encoder representations for efficient semantic seg-
mentation. 2017 IEEE VCIP.
Chen, H. and Shi, Z. (2020). A spatial-temporal attention-
based method and a new dataset for remote sensing
image change detection. Remote Sensing, 12(10).
Chen, K., Wang, J., Pang, and et. al. (2019). MMDetec-
tion: Open MMLab Detection Toolbox and Bench-
mark. arXiv:1906.07155.
Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H.
(2017). Rethinking atrous convolution for semantic
image segmentation. ArXiv, abs/1706.05587.
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and
Adam, H. (2018). Encoder-decoder with atrous sep-
arable convolution for semantic image segmentation.
In Proceedings of ECCV.
Data{Meet} (2019). Indian village boundaries project. URL: http://projects.datameet.org/indian_village_boundaries/. Online; accessed 13-June-2021.
Daudt, R. C., Le Saux, B., Boulch, A., and Gousseau, Y.
(2018). Urban change detection for multispectral earth
observation using convolutional neural networks. In
2018 IEEE IGARSS.
Fan, T., Wang, G., Li, Y., and Wang, H. (2020). Ma-net:
A multi-scale attention network for liver and tumor
segmentation. IEEE Access, 8:179656–179665.
Hao Chen, Z. Q. and Shi, Z. (2021). Remote sensing image
change detection with transformers. IEEE TGRS.
Ji, S., Wei, S., and Lu, M. (2019). Fully convolutional
networks for multisource building extraction from an
open aerial and satellite imagery data set. IEEE TGRS.
Kaiyu Li, Fulin Sun, X. L. (2021). Change detection pytorch. URL: https://github.com/likyoo/change_detection.pytorch.
Kumar, R., Dabral, R., and Sivakumar, G. (2021).
Learning unsupervised cross-domain image-to-image
translation using a shared discriminator. CoRR,
abs/2102.04699.
Li, B., Liu, Y., and Wang, X. (2019). Gradient harmonized
single-stage detector. In AAAI Conference on Artifi-
cial Intelligence.
Li, H., Xiong, P., An, J., and Wang, L. (2018). Pyramid
attention network for semantic segmentation. arXiv
preprint arXiv:1805.10180.
Li, X., Yang, J., and et. al (2020). Generalized Focal Loss:
Learning Qualified and Distributed Bounding Boxes
for Dense Object Detection. arXiv:2006.04388.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. (2014). Microsoft COCO: Common objects in context.
Liu, M., Chai, Z., Deng, H., and Liu, R. (2022). A
cnn-transformer network with multi-scale context ag-
gregation for fine-grained cropland change detection.
IEEE Journal of Selected Topics in Applied Earth Ob-
servations and Remote Sensing.
Lu, X., Li, B., Yue, Y., Li, Q., and Yan, J. (2019). Grid
r-cnn. In Proceedings of the IEEE Conference on
CVPR.
Maharashtra Remote Sensing Application Centre (MRSAC) (2015). Jalyukt Shivar, Water Conservation Department, Mantralaya, Mumbai. Source: http://mrsac.maharashtra.gov.in/jalyukt/. Online; accessed 2019-09-15.
Prasad, P., Damani, O. P., and Sohoni, M. (2022). How can
resource-level thresholds guide sustainable intensifi-
cation of agriculture at farm level? a system dynamics
study of farm-pond based intensification. Agricultural
Water Management.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. CoRR, abs/1804.02767.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In IEEE TPAMI, number 6.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:
Convolutional networks for biomedical image seg-
mentation. In MICCAI.
Russell, B. C., Torralba, A., Murphy, K. P., and Freeman,
W. T. (2008). Labelme: a database and web-based
tool for image annotation. IJCV.
Shao, R., Du, C., Chen, H., and Li, J. (2021). Sunet: Change
detection for heterogeneous remote sensing images
from satellite and uav using a dual-channel fully con-
volution network. Remote Sensing.
Shen, L., Lu, Y., Chen, H., Wei, H., Xie, D., Yue, J., Chen,
R., Lv, S., and Jiang, B. (2021). S2looking: A satel-
lite side-looking dataset for building change detection.
Remote Sensing, 13(24).
Shi, Q., Liu, M., Li, S., Liu, X., Wang, F., and Zhang, L.
(2022). A deeply supervised attention metric-based
network and an open aerial image dataset for remote
sensing change detection. IEEE TGRS.
Tian, S., Zheng, Z., Ma, A., and Zhong, Y. (2020).
Hi-ucd: A large-scale dataset for urban semantic
change detection in remote sensing imagery. CoRR,
abs/2011.03247.
Tundia., C., Kumar., R., Damani., O., and Sivakumar., G.
(2022). The mis check-dam dataset for object detec-
tion and instance segmentation tasks. In VISAPP.
Tundia, C., Tank, P., and Damani, O. (2020). Aiding Irri-
gation Census in Developing Countries by Detecting
Minor Irrigation Structures from Satellite Imagery. In
GISTAM.
Van Etten, A., Hogan, D., Manso, J. M., Shermeyer, J.,
Weir, N., and Lewis, R. (2021). The multi-temporal
urban development spacenet dataset. In Proceedings
of the IEEE/CVF CVPR.
Wu, C., Zhang, L., and Du, B. (2017). Kernel slow feature
analysis for scene change detection. IEEE TGRS.
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. (2018).
Unified perceptual parsing for scene understanding. In
Proceedings of the ECCV.
Zhan, Y., Fu, K., Yan, M., Sun, X., Wang, H., and Qiu,
X. (2017). Change detection based on deep siamese
convolutional network for optical aerial images. IEEE
Geoscience and Remote Sensing Letters.
Zhang, C., Yue, P., Tapete, D., Jiang, L., Shangguan, B.,
Huang, L., and Liu, G. (2020). A deeply supervised
image fusion network for change detection in high res-
olution bi-temporal remote sensing images. ISPRS
Journal of Photogrammetry and Remote Sensing.
Zhang, M., Xu, G., Chen, K., Yan, M., and Sun, X. (2019).
Triplet-based semantic relation learning for aerial re-
mote sensing image change detection. IEEE Geo-
science and Remote Sensing Letters.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyra-
mid scene parsing network. In Proceedings of the
IEEE conference on CVPR.
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N., and
Liang, J. (2018). Unet++: A nested u-net architecture
for medical image segmentation. In Deep learning in
medical image analysis and multimodal learning for
clinical decision support.