Fake news has a significant impact on our social life, particularly in the political world. Fake news detection research is still in its early stages, which makes it challenging for researchers: resources such as datasets and published literature surveys on fake news detection remain scarce. In this paper, we discuss different machine learning techniques used for the detection of fake news.
Keywords: Fake news, Naïve Bayes, Random Forest, Logistic Regression, SVM
Researcher Myles Anderson observes that people are well connected to reviews: customer opinions are a primary source of information when deciding whether to buy a product. According to Anderson, 88% of customers trust online reviews as much as personal recommendations. Fake news detection is therefore highly significant in our social life, and machine learning offers a way to address the problem. Writer Lemann N describes how and when fake news first occurred; in his view, people falling for misinformation is nothing new.
In general, fake news can be categorized into three groups. The first group is fake news proper: news that is completely fabricated by the writers of the articles. The second group is fake satire news, whose main purpose is to provide humor to readers. The third group is poorly written news articles, which contain some degree of real news but are not entirely accurate.
In short, such articles may use, for example, quotes from political figures to report a fully fake story. Detecting fake news is recognized today as a complex task, comparable to detecting the fake product reviews that spread easily on social media.

The intention and impact of fake news are not difficult to understand, but we cannot measure the promotion or publicity created by spreading fake reviews. For example, both product owners and customers are affected by fake reviews, and it is difficult to identify the whole unit that is affected by them.
Shlok Gilda notes that in 2016, disinformation in American politics became a subject of attention, particularly around the election of Donald Trump as president of the United States, and "fake news" became the common parlance for the issue. He explores natural language processing techniques for the detection of fake news. In the post-trust era, facts and evidence are replaced by personal emotion and belief, and the nature of news shifts toward an emotion- and belief-based market.
Social media and the internet are the easiest and most comfortable sources of information, and people place a great deal of trust in them. In our day-to-day life, mass media plays a powerful role and has a strong impact on society. This is an advantage that some parties seek to exploit: to reach certain goals, mass media may present information in a distorted way. The main intention of fake news websites is to influence public opinion, especially political opinion. Fake news is a universal topic as well as a universal challenge. Mykhailo Granik and Volodymyr Mesyura described a simple fake news detection method using an artificial intelligence algorithm, the Naïve Bayes classifier.
N-gram model: The N-gram model comes from natural language processing and is mostly used for text categorization; it has word-based and character-based variants. In online fake news detection, word-based N-grams are used because they generate features for the documents and also represent the context of the document. From the collection of words in a story, unigrams and bigrams are extracted, and the extracted terms are weighted using Term Frequency and Inverse Document Frequency (TF-IDF).
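As an illustration (not part of the original study), word-based unigram and bigram extraction can be sketched in a few lines of Python; the sample sentence is hypothetical:

```python
from collections import Counter

def word_ngrams(text, n):
    """Return word-based n-grams from a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

story = "the president said the claim was false"
unigrams = word_ngrams(story, 1)   # single words
bigrams = word_ngrams(story, 2)    # adjacent word pairs

print(Counter(unigrams)["the"])    # → 2
print(bigrams[:2])                 # → ['the president', 'president said']
```

In a full pipeline, these extracted n-grams would then be weighted with the TF-IDF scheme described later in this paper.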
Logistic Regression: Logistic regression is much like linear regression; the difference is that logistic regression is used for classification while linear regression is used for predicting continuous values. After linear regression, it is one of the most widely used machine learning algorithms.

Like linear regression, the algorithm first computes a linear combination of the independent variables; this predicted score can lie anywhere between negative and positive infinity. The sigmoid function (a non-linear, S-shaped function) then maps this score to a probability between 0 and 1, where 0 means "no" and 1 means "yes" for the output class variable. A cost function is used to fit the class values: the squared-error cost of linear regression is not suitable here, so a logarithmic (log-loss) function is used to calculate the cost of misclassification.
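The mapping from an unbounded linear score to a class probability, and the logarithmic cost, can be sketched as follows (the feature values and weights are hypothetical, not from the paper):

```python
import math

def sigmoid(z):
    """Squash the linear score z, which ranges over (-inf, +inf), into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p_pred):
    """Logarithmic cost for one prediction: confident mistakes are penalized heavily."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

# A linear score from (hypothetical) article features and learned weights: w1*x1 + w2*x2 + bias.
z = 0.8 * 1.0 + (-0.3) * 2.0 + 0.1
p = sigmoid(z)                  # probability that the article belongs to class 1 ("yes")
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)       # → 0.574 1
```

Note how `log_loss(1, 0.1)` is far larger than `log_loss(1, 0.9)`: the logarithmic cost grows sharply as a prediction for the true class approaches zero, which is exactly why it replaces the squared-error cost.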
Naïve Bayes: Naïve Bayes is a simple technique for constructing classifiers. Naïve Bayes classifiers are a family of simple probabilistic classifiers in machine learning, based on applying Bayes' theorem. The basic idea of the Naïve Bayes classifier is to find the probability of each class assigned to a text by using the joint probabilities of words and classes. The probability is found with the following formula:

P(class | text) = P(text | class) × P(class) / P(text)
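A minimal sketch of this idea, assuming a tiny hypothetical corpus (the documents, labels, and test phrase below are invented for illustration), multiplies word probabilities per class in log space with add-one smoothing:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate P(class) and per-class word counts for P(word | class) with add-one smoothing."""
    classes = set(labels)
    priors, word_counts, totals = {}, {}, {}
    vocab = set(w for d in docs for w in d.split())
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / len(docs)
        word_counts[c] = Counter(w for d in class_docs for w in d.split())
        totals[c] = sum(word_counts[c].values())
    return priors, word_counts, totals, vocab

def predict_nb(model, doc):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    priors, word_counts, totals, vocab = model
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for w in doc.split():
            score += math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = ["shocking secret cure", "officials confirm report", "shocking miracle cure"]
labels = ["fake", "real", "fake"]
model = train_nb(docs, labels)
print(predict_nb(model, "shocking cure"))  # → fake
```

The log-space sum is the standard trick for computing the joint word-and-class probability without numeric underflow.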
Random Forest: The Random Forest algorithm is used for both classification and regression problems, which is one of its main advantages. As the name suggests, it creates a forest from a number of decision trees; the algorithm belongs to the supervised learning family. Intuitively, the forest becomes more robust when there are more trees: the higher the number of trees, the higher the accuracy. Each decision tree is a training model used to predict the class (value) of the target variable, and its decision rules are learned from the training data. Using multiple trees reduces the risk of overfitting, and Random Forest maintains accuracy even when a large portion of the data is missing.
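The ensemble idea can be sketched compactly. This is not the full Random Forest algorithm: for brevity each "tree" is reduced to a one-level decision stump (a single feature/threshold split), but the bootstrap sampling and majority voting follow the description above. The toy data is hypothetical.

```python
import random
from collections import Counter

def best_stump(X, y):
    """Find the single (feature, threshold, flip) split that misclassifies the fewest points."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            pred = [1 if row[f] >= t else 0 for row in X]
            err_raw = sum(p != label for p, label in zip(pred, y))
            err = min(err_raw, len(y) - err_raw)  # flipping which side is class 1 is allowed
            if err < best_err:
                best, best_err = (f, t, err_raw > err), err
    return best

def stump_predict(stump, row):
    f, t, flip = stump
    p = 1 if row[f] >= t else 0
    return 1 - p if flip else p

def random_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    """Majority vote across the ensemble."""
    return Counter(stump_predict(s, row) for s in forest).most_common(1)[0][0]

# Toy data: feature 0 is informative (label 1 when it is >= 6); feature 1 is noise.
X = [[1, 7], [2, 1], [3, 9], [6, 2], [7, 8], [9, 3]]
y = [0, 0, 0, 1, 1, 1]
forest = random_forest(X, y)
print([forest_predict(forest, row) for row in X])
```

Because every stump sees a different bootstrap sample, individual stumps disagree near the class boundary, and the vote averages out their errors; this is the overfitting-reduction property mentioned above.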
TF and IDF: Term Frequency (TF) and Inverse Document Frequency (IDF) form a weighting scheme commonly used in information retrieval tasks. The goal of this model is to ignore the exact ordering of words and map each document into a vector space, while retaining information about the occurrences of each word. Computing TF-IDF requires three steps.
Tokenization: The first step is to tokenize the text. For this we use the nltk library, which contains natural language processing algorithms written in Python. Tokenizing a document itself has two steps: the text is first split into sentences, and each sentence is then split into individual words; stop words are ignored during information extraction. Model the vector space: after tokenization, the next step is to compute the document frequency of each term, i.e. in how many documents the term appears. Compute TF-IDF: finally, we compute the inverse document frequency and combine it with the term frequency.

In the vector space, not all terms are equally important; we can weight each term by its probability of occurrence.
Term frequency: TF(d, t) — the number of times term t appears in the description of item d.

Inverse document frequency: IDF(t) — a term that occurs in many descriptions is scaled down.
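The three steps above can be sketched in plain Python (whitespace tokenization stands in for the nltk tokenizer mentioned earlier, and the three documents are hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by its in-document frequency, scaled down by how many documents contain it."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]     # step 1: tokenization
    df = Counter()                                    # step 2: document frequency per term
    for words in tokenized:
        for term in set(words):
            df[term] += 1
    vectors = []                                      # step 3: TF * IDF per document
    for words in tokenized:
        tf = Counter(words)
        vectors.append({t: (tf[t] / len(words)) * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["fake news spreads fast", "real news checks facts", "fake reviews mislead buyers"]
vectors = tf_idf(docs)
# "news" appears in two of the three documents, so it is scaled down relative to "spreads".
print(vectors[0]["news"] < vectors[0]["spreads"])  # → True
```

This shows the scaling behavior stated above: the more descriptions a term occurs in, the larger its document frequency and the smaller its IDF weight.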
Support Vector Machine: Support vector machines are among the most popular and most talked-about machine learning algorithms. A hyperplane is chosen to split the two data classes with the maximum margin between the nearest points. From the classification outcomes, four counts are computed: True Positive, True Negative, False Positive, and False Negative, and from these we derive four evaluation measures: accuracy, precision, recall, and F-measure. Let us see how to calculate these evaluation measures through the following equations.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Precision = TP / (TP + FP) (2)

Recall = TP / (TP + FN) (3)

F-measure = (2 × Precision × Recall) / (Precision + Recall) (4)
A True Positive is an item identified correctly and a True Negative is one rejected correctly; a False Positive is an item identified incorrectly and a False Negative is one rejected incorrectly.
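Equations (1)-(4) can be computed directly from the four counts; the counts below are hypothetical, chosen only to exercise the formulas:

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F-measure from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # equation (1)
    precision = tp / (tp + fp)                    # equation (2)
    recall = tp / (tp + fn)                       # equation (3)
    f_measure = 2 * precision * recall / (precision + recall)  # equation (4)
    return accuracy, precision, recall, f_measure

# Hypothetical counts from a classifier evaluated on 100 test articles.
acc, prec, rec, f1 = evaluate(tp=40, tn=35, fp=10, fn=15)
print(acc, prec, round(rec, 3), round(f1, 3))  # → 0.75 0.8 0.727 0.762
```

Note that precision and recall pull in opposite directions (raising the decision threshold usually trades recall for precision), which is why the F-measure, their harmonic mean, is reported alongside them.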
Fake news detection is a universal problem, and it is very difficult to identify which contents are facts and which are rumors. Here we use different supervised machine learning algorithms to detect fake news. Our future work includes proposing an optimal algorithm to detect fake news using a combination of these algorithms.