Text Classification of BBC News

Abstract

A short text differs substantially from a traditional long document: its brevity and conciseness hinder the application of conventional machine learning and data mining algorithms to short text classification. Following traditional artificial intelligence methods, short text classification can be divided into three steps: pre-processing, feature selection, and classifier comparison. For feature selection, we examined the performance and robustness of TF-IDF weighting, and we deliberately chose Naive Bayes as the classifier.

After that, we compared and analysed the classifiers horizontally with each other and vertically with feature selections. With the rapid growth in the number of short texts, how to automate short text classification effectively has become a problem that needs to be solved in the information domain. Based on the characteristics of short text, we propose using Naive Bayes, a classification algorithm based on the improvement of currently integrated classifiers.

The traditional Naive Bayes classifier is used as the base classifier to train the classification models.

Compared with several individual classifiers, our Naive Bayes method achieves excellent results on a variety of classification evaluation metrics. On this basis, the BBC news dataset is classified using the Naive Bayes algorithm. Many people read BBC news, but each reader has different interests, such as technology, sports, business, politics, and entertainment.

Acknowledgement

It is a matter of great pleasure for me to submit this seminar report on "SHORT TEXT CLASSIFICATION OF BBC NEWS" as a part of the curriculum for the Master of Technology (Computer Engineering) of Savitribai Phule University of Pune.

I am thankful to my guide Prof. Guide name, Assistant Professor/Associate professor/ Professor in Computer Engineering Department for his/her constant encouragement and able guidance. I am also thankful to Dr. B. S. Karkare, Principal of VIIT Pune, and Dr. S.R. Sakhare, Head of Computer Department, for their valuable support. I take this opportunity to express my deep sense of gratitude towards those who have helped me in various ways in preparing my seminar. Last but not least, I am thankful to my parents, who encouraged and inspired me with their blessings.

Introduction

Background

Online social media and news sites have recently emerged as a medium of information sharing and communication. Blogging, status updates, social networking, watching the news, and video sharing are some of the ways in which people do this. Popular online social media like Facebook, Orkut, or Twitter, and news sites like BBC News, CNN, and FOX News allow users to post or view short messages on their homepage. The social media sites are often referred to as micro-blogging sites, and the message is called a status update. News updates from the BBC are organized into different categories of data. News often relates to an event and spreads information rapidly, grouped by topic of interest such as business, technology, entertainment, personal thoughts, and opinions. News can contain text, emotion, links, or a combination of these. News has recently gained a lot of importance due to its ability to disseminate information.

Short Text Classification

Motivation and Social Impact

The BBC news dataset is classified to make it easier for users to understand. Classifying data manually into categories is easy only when the dataset is very small; for a large amount of data it is often impractical. Because classifying a large dataset by hand is cumbersome, algorithms are used for short text classification. This work proposes to classify BBC news, which is multiclass and multi-label.

Objectives and Outcomes

The short text classification task consists of learning models for a given set of classes and applying these models to new, unseen documents for class assignment. It is mainly a supervised classification task, where a training set consisting of documents with already assigned classes is provided, and a testing set is used to evaluate the models. Short text classification is shown in Figure 1, including the pre-processing steps, which consist of document representation and space reduction/feature extraction, and the learning/evaluation procedure, such as Naive Bayes. Great relevance has deservedly been given to learning procedures in short text classification. However, there must be a pre-processing stage before the learning process. Pre-processing alters the input space used to represent the documents that are ultimately included in the training and testing sets, which machine learning algorithms use to learn classifiers that are evaluated afterwards.
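The document-representation step of pre-processing can be illustrated with TF-IDF weighting, the feature weighting named in the abstract. The sketch below is a minimal plain-Python version, assuming whitespace tokenization and the unsmoothed form idf = log(N / df). The function name `tf_idf` and the three sample headlines are made up for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply additional smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        weighted.append({
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


# Illustrative headlines only; these are not taken from the BBC dataset.
docs = [
    "shares rise as market rallies".split(),
    "team wins the cup final".split(),
    "market falls as shares slide".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, a term that occurs in only one headline (such as "rise") receives a higher weight than a term shared by two headlines (such as "market"), which is the behaviour that makes TF-IDF useful for separating categories.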

Mathematical Model of Problem Solved

Confusion Matrix:

Figure 1.4.1: Confusion Matrix

True Positive (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.

True Negative (TN): We predicted no, and they don't have the disease.

False Positive (FP): We predicted yes, but they don't actually have the disease. This is also known as a "Type I error."

False Negative (FN): We predicted no, but they actually do have the disease. This is also known as a "Type II error."

Accuracy:

Figure: Accuracy

Accuracy is a measure of how close a measured or calculated value is to its actual value. For a classifier, it is the ratio of correctly predicted observations (TP + TN) to all observations.

Precision and Recall:

Figure 1.4.3: Precision and Recall

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.

F1-Score:

Figure: F1-Score

The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
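The four confusion-matrix counts determine all of these metrics. The following is a minimal sketch of that arithmetic; the function name `metrics_from_counts` and the example counts are illustrative assumptions, not values from this report.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 combines precision and recall as 2PR / (P + R).
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only.
scores = metrics_from_counts(tp=40, tn=45, fp=5, fn=10)
```

For a multiclass problem such as the five BBC categories, the same counts would be taken per class (one class against the rest) and the per-class scores reported side by side.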

Literature Survey

Existing Techniques

An Improved Information Retrieval Approach to Short Text Classification

Twitter acts as an important medium of information sharing and communication. Because tweets are limited to 140 characters, they do not provide sufficient word occurrences, so classification methods that use traditional approaches like Bag-of-Words have limitations. The proposed system used an intuitive approach to determine class labels with a set of features. The system is able to classify incoming tweets into three generic categories: News, Movies, and Sports. These categories are diverse and cover most of the topics that people usually tweet about. In the reported experiments, the proposed technique outperforms the existing models in terms of accuracy, precision, recall, and support.

Short Text Classification Improved by Learning Multi-Granularity Topics

Understanding the rapidly growing volume of short text is essential. A short text differs from traditional documents in its sparsity and shortness, which hinders the application of conventional text mining and machine learning algorithms. Two major approaches have been exploited to enrich the representation of short text. One is to fetch contextual information for a short text and add more text directly; the other is to derive latent topics from an existing large corpus and use them as features to enrich the representation. The latter approach is elegant and efficient in most cases. However, topics of a single granularity are usually not sufficient to set up effective feature spaces. This work moves forward along this direction by proposing a method that leverages topics at multiple granularities, which can model the short text more precisely.

Implementation

Flow of Work:

STEP 1: The features extracted for the classes are stored in files.

STEP 2: The BBC news item to be classified and the feature sets are fed into the system.

STEP 3: The BBC news item is then disambiguated. Disambiguation involves tokenizing the news, converting the tokens to a single case, removing stop words, lemmatizing the tokens using WordNet, stemming the tokens, and finally Part of Speech tagging the stemmed tokens.

STEP 4: A loop executes on each word in the BBC news item. A POS-tagged word is selected and all senses of that word are looked up.

STEP 5: If a sense is not a noun or verb, it is ignored and the loop skips to the next sense.

STEP 6: Loop over all other words in the same news item and find their senses.

STEP 7: The definitions of all the senses are then derived from WordNet.

STEP 8: The senses of a given word are then compared with the senses of the remaining words. An overall score is computed, and the maximum score is carried forward.

STEP 9: The senses which give these maximum scores are then returned.

STEP 10: Steps 4 to 9 are also executed on the feature sets.

STEP 11: The senses of the feature sets and the words of the news item are then evaluated against each other.

STEP 12: The feature set which gives the maximum similarity with the BBC news item is considered the correct feature set. The class of that feature set is then extracted and the news item is assigned to that class.
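The overall shape of this flow (pre-process the news item, score it against each class's feature set, pick the maximum) can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure above: it replaces the WordNet sense comparison of STEPs 4 to 11 with plain token overlap, and it omits lemmatizing, stemming, and POS tagging. The stop-word list, the feature sets, and the function names `preprocess` and `classify_by_overlap` are all hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "for", "as"}


def preprocess(text):
    """Tokenize, lower-case and drop stop words (a subset of STEP 3)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def classify_by_overlap(text, feature_sets):
    """Return the class whose feature set shares the most tokens with the text.

    Token overlap stands in for the sense-similarity score of STEPs 4-11.
    """
    tokens = set(preprocess(text))
    scores = {label: len(tokens & features)
              for label, features in feature_sets.items()}
    return max(scores, key=scores.get), scores


# Hypothetical feature sets; the report stores the real ones in files (STEP 1).
feature_sets = {
    "business": {"market", "shares", "profit", "bank"},
    "sport": {"match", "team", "goal", "cup"},
    "tech": {"software", "mobile", "internet", "computer"},
}

label, scores = classify_by_overlap(
    "The team scored a late goal to win the cup.", feature_sets
)
```

Swapping the overlap count for a sense-based similarity would keep the same structure: only the scoring inside `classify_by_overlap` changes, while the final "take the maximum" step stays as in STEP 12.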

Data collection and Data sets

The dataset has a single top-level class named BBC, which contains the categories Entertainment, Technology, Business, Politics, and Sports. Each category contains the news files for that category, each holding the text of a BBC news item.

Software Requirements

1) Language Used: Python

Python is a high-level, interpreted, general-purpose programming language. It was created by Guido van Rossum and released in 1991. It is used for web development (server-side), software development, mathematics, system scripting, etc.

What can Python do? Python can be used on a server to create web applications. It can be used alongside software to create workflows. Python can connect to database systems, and it can also read and modify files. Python can also be used to handle big data and perform complex mathematics.

Python can be used for rapid prototyping or for production-ready software development. Why Python? Python works on different platforms, such as Windows, Mac, Linux, and Raspberry Pi. It has a simple syntax similar to the English language, and that syntax allows developers to write programs in fewer lines than some other programming languages. Python runs on an interpreter system, meaning that code can be executed as soon as it is written, so prototyping can be very quick. Python can be written in a procedural, functional, or object-oriented way.
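As a small illustration of Python's file handling applied to the dataset layout described under Data collection and Data sets, the sketch below reads one folder per category and returns the texts with their labels. It assumes each category is a sub-folder holding `.txt` files; the function name `load_news`, the folder names, and the file name `001.txt` are hypothetical, and the demo builds a throwaway directory rather than touching the real dataset.

```python
import tempfile
from pathlib import Path


def load_news(root):
    """Read every .txt file under root/<category>/ and return (texts, labels)."""
    texts, labels = [], []
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        for news_file in sorted(category_dir.glob("*.txt")):
            # errors="replace" avoids crashing on unexpected byte sequences.
            texts.append(news_file.read_text(encoding="utf-8", errors="replace"))
            labels.append(category_dir.name)
    return texts, labels


# Demo on a temporary directory with two made-up categories.
with tempfile.TemporaryDirectory() as tmp:
    for category, body in [("business", "Shares rose."), ("sport", "The team won.")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "001.txt").write_text(body, encoding="utf-8")
    texts, labels = load_news(tmp)
```

Using the folder name as the label keeps the loader independent of the number of categories, so the same function would work if further categories were added.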

Platform: Jupyter Notebook

It is an open-source web application that allows you to create and share documents containing equations, visualizations, live code, and narrative text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, etc.

The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. The Jupyter notebook has two components:

A web application: a browser-based tool for interactive authoring of documents that combine explanatory text, mathematics, computations, and their rich media output.

Notebook documents: a representation of all content visible in the web application, including inputs and outputs of the computations, explanatory text, images, mathematics, and rich media representations of objects.

Results obtained

Labels and their counts:

Testing Data:

Tfidf Vectorizer:

Result as accuracy using the Naive Bayes algorithm, with precision, recall, f1-score, and support.
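To make the classification step concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing. It is not the notebook code behind the results listed above: it works on raw token counts rather than TF-IDF weights, and the class name `MultinomialNaiveBayes`, the four training examples, and their labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, tokenized_docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(tokenized_docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.term_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            # log P(class) + sum of log P(term | class);
            # terms never seen in training are skipped.
            score = math.log(doc_count / n_docs)
            for t in tokens:
                if t in self.vocab:
                    score += math.log((counts[t] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, not drawn from the BBC dataset.
train_docs = [
    ["shares", "market", "profit"],
    ["bank", "market", "shares"],
    ["team", "goal", "cup"],
    ["match", "team", "win"],
]
train_labels = ["business", "business", "sport", "sport"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["market", "shares", "rise"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a single unseen word-class pair from forcing a class probability to zero.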

Results and Discussion

Discussion on Result Obtained

News items are harder to classify than longer texts. This is primarily because they contain few word occurrences, which makes it difficult to capture the semantics of such messages. Hence, traditional approaches do not perform as well as expected when applied to news. Here we classify news efficiently based on some attributes, which makes it easy to find news related to a given topic. The method used to classify news is a supervised method, as it requires a source of labelled news. Using the Naive Bayes algorithm, it achieves an accuracy of about 96%.

Comparison of Results (with other researchers)

Existing short text classification work is on Twitter tweets, using a class with different attributes or categories such as movies, news, and sports. That work uses the Naive Bayes algorithm for short text classification and reports an accuracy of 60%.

By comparison, our short text classification of BBC news reaches close to 96% accuracy using the same Naive Bayes algorithm. The dataset has attributes such as business, sport, politics, entertainment, and technology, each containing text files.

Conclusion and Future Work

For short text classification, we use a BBC news class that contains several attributes, each containing text files related to that attribute. We classify the BBC news with the Naive Bayes algorithm and calculate the accuracy, support, precision, recall, f1-score, etc. Compared with existing short text classification, we obtain the highest accuracy, close to 96%. The future scope is to classify this BBC news using other classification algorithms, calculate precision, recall, support, and f1-score, and try to achieve the highest accuracy.