Center: To generate pixels in the multi-scale case, we can also condition on the sub-sampled image pixels (in light blue).
Right: Connectivity diagram inside a masked convolution. In the first layer, each RGB channel is connected to the previous channels and to the context, but is not connected to itself. In the next layer, the channel is also connected to itself.
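The channel connectivity described in this caption can be illustrated with the binary masks used in masked convolutions. The sketch below is a minimal NumPy illustration, not the implementation referenced here; the function name and the grouping of channels into R, G, B are assumptions made for the example. With first_layer=True the center connection to the same channel is removed (first-layer behaviour); with first_layer=False it is kept (later layers).

```python
import numpy as np

def build_channel_mask(kernel_size, in_ch, out_ch, first_layer):
    """Binary mask for a masked convolution over RGB-ordered channels.

    Spatial positions strictly after the center (in raster order) are
    blocked; at the center, each output channel may only see input
    channels ordered before it (R -> G -> B), plus itself when
    first_layer is False. Hypothetical helper for illustration only.
    """
    k = kernel_size
    mask = np.ones((k, k, in_ch, out_ch), dtype=np.float32)
    center = k // 2
    # Block rows below the center and positions right of the center
    # within the center row (the "future" context).
    mask[center + 1:, :, :, :] = 0.0
    mask[center, center + 1:, :, :] = 0.0
    # Restrict channel-to-channel connections at the center pixel.
    for i in range(in_ch):
        for o in range(out_ch):
            in_group = i * 3 // in_ch     # 0 = R, 1 = G, 2 = B
            out_group = o * 3 // out_ch
            allowed = in_group < out_group if first_layer else in_group <= out_group
            if not allowed:
                mask[center, center, i, o] = 0.0
    return mask

# Example: 3x3 first-layer mask for 3 input and 3 output channels.
mask_a = build_channel_mask(3, 3, 3, first_layer=True)
```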
3 SYSTEM AND ARCHITECTURAL MODEL
The architectural model for understanding imagery in the context of language can be divided into three processes:
1. Detecting objects and text areas.
2. The overall system architecture, i.e. the relationship between object features, subject features, caption features and context features.
3. The object and caption region proposal network, as represented in the simulation below.
3.1 Objects and Text Areas
As shown in Figure 4 below, the process begins by extracting frames from the video and detecting objects and text areas on a scene graph. The Caption Region, Relationship Region, and Object Region are then determined. Next, feature extraction produces a Caption Feature, Relationship Feature and Object Feature; the Relationship Feature, Caption Feature and Relationship Context Feature are combined and processed with the CCN (Caption Context Network) method to produce a Caption Feature and Caption Context Feature used in Caption Generation. Finally, the Subject Feature, Relationship Feature and Object Feature are combined and processed with the RCN (Relationship Context Network) method to produce Relationship Detection, and the object features are extracted to produce Object Detection (D. Shin and I. Kim, 2018).
Figure 4: Object Detection Network and Text Region.
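The data flow in Figure 4 can be summarized in a short Python sketch. Every callable passed to the function below (detect_regions, extract_features, ccn, rcn, generate_caption, detect_objects) is a hypothetical placeholder naming one stage described above; none of these names come from the cited work, and the sketch only shows how the stages connect.

```python
def process_frame(frame, detect_regions, extract_features,
                  ccn, rcn, generate_caption, detect_objects):
    """Sketch of the Figure 4 pipeline; all stage callables are placeholders."""
    # 1. Detect caption, relationship and object regions on the scene graph.
    caption_regions, relation_regions, object_regions = detect_regions(frame)

    # 2. Extract one feature vector per region type.
    caption_feat = extract_features(frame, caption_regions)
    relation_feat = extract_features(frame, relation_regions)
    object_feat = extract_features(frame, object_regions)
    subject_feat = extract_features(frame, object_regions)  # subject side of each relation (assumed)

    # 3. CCN: fuse relationship and caption features into a caption feature
    #    and caption context feature, then generate the caption.
    caption_feat, caption_context = ccn(relation_feat, caption_feat)
    caption = generate_caption(caption_feat, caption_context)

    # 4. RCN: fuse subject, relationship and object features for
    #    relationship detection.
    relations = rcn(subject_feat, relation_feat, object_feat)

    # 5. Object detection from the object features.
    objects = detect_objects(object_feat)
    return caption, relations, objects
```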
3.2 Dataset Exploration
The dataset used in this system runs on the TensorFlow platform, with the Inception v3 architecture and an encoder-decoder CNN method. The TensorFlow dataset is a benchmark dataset for object detection and image segmentation. Object detection is performed by regression (object bounding) and classification (S. Ren, K. He, R. Girshick, 2017).
1. Setting up the library / data library (the research uses PyTorch)
2. Downloading the data
3. Loading the data
4. Distribution of objects in the TensorFlow dataset
5. Utility functions (object bounding area), as sketched below
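A minimal sketch of these steps is given below, assuming the COCO 2017 detection split available through TensorFlow Datasets as a stand-in benchmark; the exact dataset name is not given in the text, and the bbox_area helper is a hypothetical utility added for illustration.

```python
import collections
import tensorflow_datasets as tfds

# Steps 1-3: set up the library, then download and load the data
# (tfds.load downloads the dataset on first use).
ds, info = tfds.load("coco/2017", split="train", with_info=True)

# Step 4: distribution of objects (how often each class label appears),
# counted over a small sample for exploration.
label_counts = collections.Counter()
for example in ds.take(1000):
    label_counts.update(example["objects"]["label"].numpy().tolist())

# Step 5: utility function for the object bounding area.
def bbox_area(bbox):
    """Area of a normalized [ymin, xmin, ymax, xmax] bounding box."""
    ymin, xmin, ymax, xmax = bbox
    return max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
```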
3.3 Preprocessing
At this stage there are four processes, namely:
1. Image Representation
A sentence description can refer to objects and their attributes, so objects in each image are detected with Girshick's RCNN method (L. R. Jácome-Galarza, 2020). The CNN is pre-trained on ImageNet and fine-tuned on 200 classes, using the top detected locations in addition to the entire image and computing the representation from the pixels I_b within each bounding box as follows:
$v = W_m[\mathrm{CNN}_{\theta_c}(I_b)] + b_m$, (1)
where CNN_{θ_c}(I_b) converts the pixels inside the bounding box I_b into the 4096-dimensional activation of the fully connected layer just before the classifier (S. Bai and S. An, 2018).
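A minimal numerical sketch of Eq. (1) is shown below; the embedding size h = 512 and the randomly initialized weights are arbitrary placeholders standing in for the learned parameters and for the 4096-dimensional CNN activation.

```python
import numpy as np

h = 512                                   # assumed embedding dimension
cnn_activation = np.random.randn(4096)    # stands in for CNN_theta_c(I_b)
W_m = np.random.randn(h, 4096) * 0.01     # learned projection matrix (random here)
b_m = np.zeros(h)                         # learned bias (zeros here)

# Eq. (1): project the bounding-box activation into the h-dimensional
# multimodal embedding space.
v = W_m @ cnn_activation + b_m
```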
2. Sentence Representation
To establish the inter-modal relationship, the words of the sentence are represented in the same h-dimensional embedding space as the image regions. The simplest approach might be to project each individual word directly into this embedding. However, this approach does not consider word order and context information in the sentence. To solve this problem, a Bidirectional Recurrent Neural Network (BRNN) is used to compute the word representations. The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each into an h-dimensional vector, where the representation of each word is enriched by a variably-sized context around that word (S. Aditya, Y. Yang, C. Baral, 2017). Using the index t = 1 ... N to indicate the position