encoder for fake news detection. They have used
different deep learning techniques to achieve
better results (Khattar D, 2019). Pan et al. have given
a survey on fake news detection using knowledge
graphs (Ravi K, 2015).
3 METHODOLOGY
In this paper, fake news detection follows a machine
learning approach whose stages include data collection,
data preprocessing and so on. Data preprocessing comprises
different techniques such as cleaning, tokenization, etc.
3.1 Machine Learning based Approach
This approach predicts whether news is fake, and it
relies on training datasets as well as test
datasets. Different machine learning algorithms are
trained on the dataset, and the trained models are then
used for specific purposes. Two learning approaches are
used for training a model: the supervised learning
method and the unsupervised learning method
(Jeff Z., 2018).
3.1.1 Supervised Learning Approach
This approach is used when there is a finite number of
predefined classes, such as positive and negative. It uses
a labelled dataset for training. Decision tree,
artificial neural network, random forest,
regression, logistic regression, support vector
machine, nearest neighbour and Naïve Bayes are
among the supervised learning algorithms.
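As an illustration of supervised learning on a labelled dataset, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The toy headlines, the labels "fake" and "real", and the function names are hypothetical and are not taken from the dataset or implementation used in this paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = word_counts[label][word] + 1
            denominator = total_words + len(vocabulary)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set (illustration only).
TRAIN = [
    ("shocking miracle cure doctors hate", "fake"),
    ("celebrity secretly replaced by clone", "fake"),
    ("government publishes annual budget report", "real"),
    ("city council approves new budget", "real"),
]

model = train_naive_bayes(TRAIN)
print(predict(model, "miracle cure clone"))
print(predict(model, "council budget report"))
```

The labelled examples play the role of the training dataset; in practice a separate test dataset would be held out to measure accuracy.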
3.1.2 Unsupervised Learning Method
This method does not require labelled datasets, and it
works on document-level sentiment analysis (SA). The aim is to
identify the semantic orientation of a given phrase.
Partitioning clustering is an example of an unsupervised
learning algorithm.
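To make partitioning clustering concrete, the sketch below implements k-means, one well-known partitioning algorithm, in plain Python. The choice of k-means, the two-dimensional toy points (standing in for document feature vectors) and the deterministic initialisation from the first k points are assumptions made for illustration only.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Partition points into k clusters; returns (assignments, centroids)."""
    # Assumption: initialise centroids with the first k points so the
    # result is deterministic. Real implementations usually randomise this.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignments, centroids


# Hypothetical 2-D feature vectors forming two visible groups.
POINTS = [
    (1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
    (7.9, 8.3), (0.9, 1.1), (8.2, 7.7),
]

assignments, centroids = kmeans(POINTS, k=2)
print(assignments)
```

No labels are supplied: the grouping emerges only from the distances between the points.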
3.2 Data Collection
This is the initial and an important phase of
fake news detection. Nowadays, there are
various freely available public data sources,
such as Twitter datasets for analysis.
Apart from this, data can be acquired from different
sources on the World Wide Web, social media sites like Twitter,
Facebook and Instagram, online blogging sites and
many more. These websites contain a large amount
of data that can be used for analysis. The dataset used here
contains two parts, i.e. fake news and real
news. It includes 21418 records of
true news and 23503 records of fake
news from the Kaggle website (Khattar D, 2019).
This dataset is used for the detection of fake news
using different machine learning approaches.
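A two-part dataset of this kind can be merged into one labelled collection as sketched below. The in-memory CSV strings stand in for the downloaded files, and the column names `title` and `text`, the labels and the helper name `load_labelled` are assumptions for illustration; the actual file layout is not described in this paper.

```python
import csv
import io

# Hypothetical stand-ins for the two parts of the dataset. A real run
# would read the downloaded files instead of these strings.
TRUE_CSV = (
    "title,text\n"
    "Budget approved,The council approved the budget.\n"
)
FAKE_CSV = (
    "title,text\n"
    "Miracle cure,Doctors hate this trick.\n"
    "Clone scandal,A star was replaced.\n"
)


def load_labelled(csv_text, label):
    """Parse CSV text and attach the given class label to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"title": row["title"], "text": row["text"], "label": label}
        for row in reader
    ]


# Merge both parts into a single labelled dataset.
dataset = load_labelled(TRUE_CSV, "real") + load_labelled(FAKE_CSV, "fake")

# Count the records per class to check the balance of the dataset.
counts = {
    label: sum(1 for record in dataset if record["label"] == label)
    for label in ("real", "fake")
}
print(counts)
```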
3.3 Data Pre-processing
Data preprocessing includes different
essential phases such as data cleaning, data
formatting and many more. The data sources contain
raw information that is preprocessed by applying
data formatting and cleaning processes (Shu K,
2017). The available preprocessing techniques include
tokenization, stemming, feature
extraction, POS (part of speech) tagging, stop word
removal and so on. In this research paper, we used
preprocessing techniques to clean the dataset. The
details are as follows:
3.3.1 Tokenization
It is the procedure of breaking sentences into
phrases, symbols, words and other meaningful
tokens. This process is carried out with
open source tools such as natural language
processing tokenizers.
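A minimal tokenizer can be sketched with a regular expression, as shown here. This is a simplified stand-in and not the tool used in this paper; the function name `tokenize` and the example sentence are hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation symbols."""
    # \w+ matches a run of letters, digits or underscores (a word);
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("Breaking: fake news spreads fast!"))
```

Dedicated tokenizers handle cases this pattern does not, such as contractions, URLs and hyphenated words.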
3.3.2 Stemming
A sentence or document contains different forms of
a word, like organize, organizing and organizes;
stemming is the procedure of reducing such
derivationally related forms to a common stem.
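The idea can be illustrated with a deliberately naive suffix-stripping stemmer. This is not the Porter algorithm or any stemmer used in this paper; the suffix list, the minimum stem length and the name `naive_stem` are assumptions chosen so that the three example words collapse to one stem.

```python
# Suffixes are tried in order, so longer ones come before shorter ones.
SUFFIXES = ("ing", "es", "ed", "s", "e")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for form in ("organize", "organizing", "organizes"):
    print(form, "->", naive_stem(form))
```

All three forms reduce to the same stem, so a classifier can treat them as one feature. A production stemmer applies many more rules to avoid over-stripping.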
3.3.3 Stop Word Removal
Sentences contain stop words. Examples are the
articles ‘a’ and ‘the’ and the pronouns ‘he’, ‘they’ and ‘it’.
Such words add complexity to
the process of sentiment analysis. The process of
removing them is called stop word
removal.
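Stop word removal amounts to filtering tokens against a fixed list, as in the sketch below. The list contains only the five examples named above; real stop word lists are much longer, and the function name and sample tokens are hypothetical.

```python
# Only the examples mentioned in the text; real lists are far longer.
STOP_WORDS = {"a", "the", "he", "they", "it"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = ["They", "said", "the", "report", "was", "a", "hoax"]
print(remove_stop_words(tokens))
```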
3.3.4 Feature Extraction
This procedure extracts the most relevant
features from text to perform the sentiment analysis task.
Feature extraction is part of the classification
task. We select different features from the text and train
different models using classification methods.
Numerical features and binary features are the feature
vector categories that represent the frequency of
occurrences. Several text features are given below: