Prediction in the stock market has been a major research question for many years. Any system that could reliably predict stock market trends would make its users wealthy. The purpose of prediction is to reduce the uncertainty associated with investment decision-making. Stock prices, however, tend to follow a random walk, which implies that the best prediction available for tomorrow's value is today's value. Unquestionably, forecasting stock indices is very difficult because of this randomness.
There are two types of analysis used for prediction: technical and fundamental. Technical analysis applies machine learning to historical stock-price data, while fundamental analysis applies sentiment analysis to social media data. Social media data has a greater impact today than ever before and can aid in predicting the trend of the stock market. The method involves collecting news and social media data and extracting the sentiments expressed by individuals.
What are Big Data and Data Analysis?
Big Data can be defined as large or highly complex data sets that are difficult to analyse and process using traditional methods of data processing and analysis. The difficulties may include data capture, storage, search, transfer and sharing, analysis, visualization, metadata management and data privacy.
Data analysis can be defined as the methodology used to process and analyse raw data in order to make sense of it. Businesses accumulate large volumes of quantitative and qualitative data, which can be highly valuable if analysed and interpreted in the right manner, helping to develop useful insights and results.
Data Analysis and Big Data in the Stock Market
Data analysis and Big Data are on the cusp of completely transforming how stock markets function and how investors make their buying, selling and investment decisions. These technologies are growing rapidly across industries, and the financial sector is not far behind in adopting them.
Need for the Study
The main motive for forecasting stock prices is that of the investor, who is interested in maximizing financial wealth by investing in financial assets such as equity shares, bonds and debentures that are traded on stock exchanges. Trading on the stock market requires accurate and timely inputs. The magnitude of data generated within the stock market on a daily basis is impossible for human beings to manage, analyse and make sense of, owing to its sheer volume and the speed at which it is produced from various sources.
Big data for investment is no longer just a big firms' game. Though investors still need to know the ins and outs of the stock market, big data analytics is the winning ticket for competing against the giants. It helps investors, from small retail investors to foreign institutional investors and mutual funds, to make well-informed investment decisions based on scientific thinking and a rational approach.
Review of Literature
An extensive review of the literature on forecasting stock prices using big data has been carried out to identify the research gap and to survey big data analytics in the stock market. Etzioni (1976) forecasted the movements of stock prices and explained the difficulties of predicting financial markets. One of the original works describing the application of evolutionary algorithms to stock trading can be found in Oussaidène et al. (1996, 1997). The usefulness of a model of this type depends upon its ability to forecast future changes in stock market valuations. Kavitha S., Raja Vadhana P. and Nivi A. N., in their study on big data analytics in financial markets, found that finance is a new sector where big data technologies such as Hadoop and NoSQL are making their mark, enabling analysts to make predictions from financial data. Such analysis requires both regular and historical information from a specific stock exchange.
Hayes et al. (2000) and Ranganathan and Samarah (2001) estimated how stock prices fluctuate with announcements of enterprise resource planning implementations. Furthermore, Hendricks examined stock market reactions to implementations of enterprise resource planning, supply chain management and customer relationship management systems, respectively.
Prit Modi, Shaival Shah and Himani Shah (2019) concluded that big data analytics can be used in many domains for accurate prediction and analysis of large amounts of data. As big data analytics is one of the emerging information systems of today, organizations expect to achieve business excellence by implementing it. This paper describes the relationship between big data analysis and the stock market in order to understand its volatile nature. The study also elaborates a framework (Hadoop) for big data analysis based on proper assessment of fundamentals, operational efficiency, investor sentiment, market performance and probable future business prospects of stock market behaviour.
This is a qualitative, descriptive paper based on articles, research materials and reports on big data analysis and the stock market. The analysis is divided under three headings: the nature of big data, the application of big data in the stock market, and the framework used in big data analysis.
Data and analysis have always been an integral part of trading and investing. A conventional investor goes through volumes of company annual reports, news, stock price charts and other forms of data before making a decision. Data is classified as structured (numerical or tabular) or unstructured (text, images, etc.). Unstructured data does not fit a pre-determined model; it is the data gathered from social media sources, which helps institutions gather information on customer needs. Structured data is information already managed by the organization in databases and spreadsheets. Both forms of data must be actively managed in order to support better business decisions. Structured data is usually what a beginner trader would start experimenting with.
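To make the distinction concrete, the toy Python sketch below (with made-up tickers and values) contrasts structured data, which can be queried directly, with unstructured data, which needs preprocessing before it is usable.

```python
import pandas as pd

# Structured data: numerical/tabular, fits a pre-determined model (toy values)
prices = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "close": [148.5, 331.2],
    "volume": [5_200_000, 1_900_000],
})

# Unstructured data: free text from social media, no fixed schema (toy examples)
tweets = [
    "Great quarterly numbers from AAA, margins improving!",
    "BBB guidance looks weak, expecting a sell-off tomorrow.",
]

# Structured data can be filtered and aggregated directly; unstructured text
# first needs tokenisation and sentiment scoring before it can inform a trade.
print(prices[prices["volume"] > 2_000_000])
```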
Technology is growing at an exponential rate, and the volume of data being processed today is immense. A recent report estimates that the total data in the world will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025, a CAGR of 61%. This is where algorithms, armed with machine learning and data analytics, come in. Machines can inexhaustibly monitor data and news feeds, learn from them and act upon them. For an algorithmic trader, the ability to process vast amounts of data at speed and scale is a primary edge.
Algorithmic Trading & Machine Learning
Algorithmic trading has become synonymous with big data owing to the growing capabilities of computers. The automated process allows computers to carry out financial trades at speeds and frequencies that no human trader can match. Institutions can tailor algorithms to incorporate massive amounts of data, leveraging large volumes of historical data to backtest strategies and thereby reduce investment risk. Because algorithms can be built on both types of data, incorporating real-time news, social media and stock data in one algorithmic engine can generate better trading decisions. While human decision-making can be influenced by information, emotion and bias, algorithmic trades are executed solely on financial models and data.
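As an illustration of backtesting a strategy against historical data, the sketch below evaluates a simple moving-average crossover rule in pandas. The rule itself is a hypothetical example, not one proposed here, and it assumes a Series of daily closing prices is available; the shift avoids look-ahead bias.

```python
import pandas as pd

def backtest_ma_crossover(close: pd.Series, fast: int = 20, slow: int = 50):
    """Backtest a moving-average crossover rule on historical daily closes.

    Returns (strategy_return, buy_and_hold_return) as cumulative returns.
    """
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    # Hold the stock while the fast average is above the slow one; shift(1)
    # so that today's signal is only applied to tomorrow's return.
    position = (fast_ma > slow_ma).astype(float).shift(1).fillna(0.0)
    daily_ret = close.pct_change().fillna(0.0)
    strategy = (1 + position * daily_ret).prod() - 1
    buy_hold = (1 + daily_ret).prod() - 1
    return strategy, buy_hold
```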
Today, machine learning and data analytics are making trading more systematic. They complement and act as catalysts for each other, improving the ability to identify opportunities and reducing trading costs. Leveraging data in trading comes in two flavours, corresponding to the research and implementation phases described below.
A human trader typically uses anywhere from five to eight technical indicators to help predict where a stock will go in the next period. In the machine learning context, these indicators are called 'features' or 'independent variables'. The idea is to train the machine by giving it features that are believed to have a relationship with the dependent variable, be it the next period's return, volatility, or something else. With today's computing power, one can easily feed it hundreds of features from different data sources, and the machine learning algorithm will figure out the relationships on its own. Then, when new data is given, it uses what it has learnt to make predictions. This, in a nutshell, is what machine learning algorithms do in the stock market.
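To make this concrete, the sketch below derives a few common technical indicators as features and trains a model to predict the next period's return. The particular indicators and the random-forest model are illustrative assumptions, not prescriptions; the same pattern extends to hundreds of features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def make_features(close: pd.Series) -> pd.DataFrame:
    """Derive technical indicators ('features') from a daily closing-price series."""
    df = pd.DataFrame({"close": close})
    df["return_1d"] = close.pct_change()
    df["ma_ratio"] = close / close.rolling(20).mean()     # price vs. 20-day average
    df["volatility"] = df["return_1d"].rolling(10).std()  # 10-day return volatility
    df["momentum"] = close.pct_change(5)                  # 5-day momentum
    return df

def train_predictor(close: pd.Series) -> RandomForestRegressor:
    df = make_features(close)
    df["target"] = df["return_1d"].shift(-1)  # dependent variable: next day's return
    df = df.dropna()
    X, y = df[["ma_ratio", "volatility", "momentum"]], df["target"]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y)  # the algorithm figures out the feature/target relationship
    return model     # model.predict(new_features) then gives next-period forecasts
```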
During the research phase, the ability to evaluate and learn effectively from huge amounts of data gives investors an edge. During the implementation phase, these tools provide the ability to react quickly to changing market conditions.
Financial analysis alone is not enough to explain share prices and share price behaviour. Financial analysis integrated with external factors, such as social and economic trends, the political environment, and consumer behaviour and preferences, gives rise to more stable financial models, since these factors have the potential to impact the share price of a particular stock or of stocks within a particular industry.
Natural Language Processing
Natural language processing (NLP) is a field at the intersection of linguistics, computer science and artificial intelligence, concerned with the interactions between computers and human languages. It addresses how to program computers to process and analyse large amounts of natural language data. Using NLP, machines can analyse and learn from unstructured data and text, for example to create trading strategies based on sentiment analysis.
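As a small example, the sketch below scores two toy headlines with NLTK's VADER sentiment analyzer, one common off-the-shelf choice (the text does not prescribe a particular tool).

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

headlines = [  # toy examples of unstructured text about a hypothetical firm
    "Company X beats earnings estimates and raises full-year guidance",
    "Regulator opens probe into Company X accounting practices",
]

for text in headlines:
    # The compound score lies in [-1, 1]: above 0 is positive, below 0 negative.
    score = analyzer.polarity_scores(text)["compound"]
    label = "positive" if score > 0 else "negative"
    print(f"{label:>8} ({score:+.2f}): {text}")
```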
Information Discovery
Data discovery is the collection and analysis of data from various sources to gain insight from hidden patterns and trends. It is the first step in fully harnessing an organization's data to inform critical business decisions. In the data discovery process, data is gathered, combined and analysed in a sequence of steps, with the goal of making messy and scattered data clean, understandable and user-friendly. Big data makes possible kinds of information discovery that were hitherto impossible, such as estimating supermarket sales ahead of quarterly results announcements by counting footfalls through analysis of satellite images of parking lots.
Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology. It is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. In short, an RL algorithm is concerned with finding the maxima or minima of whatever you are trying to optimise. The machine is allowed to explore historical data and figure out the most optimal policy by itself; when new data arrives, it follows this strategy and, ideally, arrives at an optimal outcome. Big data helps because it provides more examples for the RL agent to explore and learn from. To use a non-trading example, AlphaGo, Google's AI software that beat the best human Go players, played millions of games on its own to learn highly optimized strategies.
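The toy sketch below shows the core Q-learning update on a deliberately simplified 'trend' environment. The two states, two actions and reward dynamics are entirely hypothetical; a real trading agent would face a far richer state space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: state 0 = downtrend, 1 = uptrend; action 0 = hold, 1 = buy.
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))     # action-value table
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # learning rate, discount, exploration rate

def step(state, action):
    # Hypothetical dynamics: trends persist 70% of the time, and buying
    # pays off in an uptrend but loses money in a downtrend.
    next_state = state if rng.random() < 0.7 else 1 - state
    reward = (0.01 if state == 1 else -0.01) if action == 1 else 0.0
    return next_state, reward

state = 0
for _ in range(10_000):
    # epsilon-greedy: usually exploit the best known action, sometimes explore
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Q-learning update toward reward plus discounted best future value
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print("learned policy:", Q.argmax(axis=1))  # expected: hold in downtrend, buy in uptrend
```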
Social Media Algorithm
Another application of big data for investing with market mood and sentiment is the Finsents website, http://www.finsents.com/. This website helps investors scan financial sources and content for the mood of the market. A trader can use this tool to find the mood of a particular sector, such as real estate or commodities, and then develop a trading algorithm on Quantopian.com based on that sentiment. This is powerful, as an individual can build a personal trading algorithm driven by market sentiment derived from big data software.
Market Sentiment Trading Platform
Hedge funds and institutional investors with large amounts of capital are already mining the web and trading on data from social media and blogging websites. The best way for an average retail investor is to do the same. A hedge fund called Derwent Capital has developed a trading platform named DCM Dealer, with an interface that allows retail investors to trade on market sentiment drawn from Facebook, Twitter and other social media sites. The interface helps retail investors review market sentiment and trade the market, individual equities or sectors of their choosing.
The proposed framework for portfolio optimization using Hadoop can be explained as a five-step process: (a) Data Envelopment Analysis (DEA), (b) validation of selected stocks, (c) stock clustering, (d) stock ranking, and (e) optimization. All firms listed on a particular stock exchange are the initial input to the framework, and the output is a set of stocks that maximizes return and minimizes risk. The abstract framework for portfolio optimization is shown in figure 1.
Data Envelopment Analysis is used to narrow the sample of firms by identifying the efficient ones. To validate these firms as potential candidates for portfolio optimization, the latest information about each company is retrieved from recent news articles and tweets and processed with text mining to extract the sentiments about the company in its current context. The validated efficient firms are clustered into different groups to assist the diversification of the portfolio. This is followed by ranking the stocks within each cluster and then by asset weighting using optimization algorithms. Each step is explained below.
Data Envelopment Analysis (DEA) is a non-parametric linear-programming technique that calculates an efficiency score for each Decision-Making Unit (DMU) based on a given set of inputs and outputs. DMUs with a score of 1 are considered efficient. Apart from its applicability in manufacturing, DEA can be used for stock selection; in the stock market, the stocks form the DMUs. Four input parameters, namely total assets, total equity, cost of sales and operating expenses, and two output parameters, namely net sales and net income, are considered. The stocks with a score of 1 pass to the second stage.
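A sketch of input-oriented CCR DEA follows, using the four inputs and two outputs listed above. SciPy's linear-programming solver is assumed here for illustration; any LP solver would do.

```python
import numpy as np
from scipy.optimize import linprog

def dea_ccr_scores(X, Y):
    """Input-oriented CCR efficiency scores via linear programming.

    X: (n_firms, n_inputs) array, e.g. total assets, total equity,
       cost of sales, operating expenses.
    Y: (n_firms, n_outputs) array, e.g. net sales, net income.
    """
    n, m_in = X.shape
    m_out = Y.shape[1]
    scores = np.empty(n)
    for o in range(n):
        # Variables: output weights u (m_out) followed by input weights v (m_in).
        c = np.concatenate([-Y[o], np.zeros(m_in)])          # maximise u . y_o
        A_eq = np.concatenate([np.zeros(m_out), X[o]])[None, :]
        b_eq = [1.0]                                         # normalise v . x_o = 1
        A_ub = np.hstack([Y, -X])                            # u . y_j <= v . x_j for all j
        b_ub = np.zeros(n)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (m_out + m_in))
        scores[o] = -res.fun
    return scores  # firms scoring (close to) 1 are the efficient DMUs
```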
Hadoop Framework for Sentiment Analysis
This stage involves processing unstructured data using Hadoop MapReduce and complements the previous stage. Events such as a change of government, a change of management or the declaration of dividends affect market sentiment, which is not a criterion captured by quantitative analysis.
As a first step, online news articles and tweets about the efficient firms are collected. Tweets can be obtained through the Twitter API, though it is limited to 1500 tweets. The ease of use and failover of Hadoop MapReduce make it a popular choice for processing big data efficiently. Tweets and news articles are processed using text mining to obtain the positive and negative sentiments about each firm, and the MapReduce infrastructure speeds up the distributed text-mining process. Figure 2 shows the MapReduce framework for distributed text mining: the company tweets and news articles are distributed to different Map processors to produce intermediate data, which the Reduce processors aggregate into the final result. The firms with positive sentiment are chosen for the next stage.
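A minimal MapReduce-style sentiment count might look like the sketch below, written with the mrjob library, which can run locally or on Hadoop Streaming. The tiny word lists and the tab-separated "company<TAB>text" input format are illustrative assumptions; a production system would use proper text mining as described above.

```python
from mrjob.job import MRJob

POSITIVE = {"gain", "profit", "growth", "beat", "upgrade"}   # toy lexicon
NEGATIVE = {"loss", "decline", "fraud", "miss", "downgrade"}

class MRSentimentCount(MRJob):
    def mapper(self, _, line):
        # Each input line is assumed to be "<company>\t<tweet or headline>".
        company, text = line.split("\t", 1)
        for word in text.lower().split():
            if word in POSITIVE:
                yield company, 1
            elif word in NEGATIVE:
                yield company, -1

    def reducer(self, company, sentiments):
        # Aggregate the per-word scores: a positive net score marks the
        # firm as a candidate for the next stage.
        yield company, sum(sentiments)

if __name__ == "__main__":
    MRSentimentCount.run()
```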
Stock Clustering
In this stage, the correlation coefficients of the stocks are calculated, and the stocks are assigned to clusters based on these coefficients. The greater the number of clusters, the greater the diversification. The goal in choosing the number and quality of clusters is to maximize similarity within each cluster and to minimize similarity between clusters; many clustering algorithms can be used. This process reduces portfolio risk through diversification, as the resulting clusters consist of firms with similar business activity.
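One possible realization of this stage, sketched below, turns the correlation matrix into a distance matrix and applies hierarchical clustering; the distance transform and the choice of average linkage are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_stocks(returns: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Cluster stocks by return correlation.

    returns: (n_days, n_stocks) matrix of daily returns (assumed available).
    Returns an array of cluster labels, one per stock.
    """
    corr = np.corrcoef(returns, rowvar=False)  # pairwise correlation matrix
    dist = np.sqrt(0.5 * (1.0 - corr))         # high correlation -> small distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```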
Stock Ranking
The appropriate stocks from each cluster must now be chosen. The stocks in each cluster can be ranked using an Artificial Neural Network (ANN). Up to the previous stage, only internal factors of the firms were measured; at this stage, external factors such as the Gross Domestic Product (GDP) growth rate and the interest rate are considered. An ANN is a processing model consisting of three layers: an input layer, a hidden layer and an output layer. The inputs to the ANN can be the GDP growth rate and the interest rate, and the output the future return on investment. This yields a ranking of the stocks within each cluster.
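A toy sketch of this ranking step follows, using scikit-learn's MLPRegressor as the input-hidden-output network; the macroeconomic figures and returns are made-up values purely for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy training data: [GDP growth %, interest rate %] per past period,
# paired with the return on investment observed afterwards (made-up values).
X = np.array([[7.1, 6.50], [6.8, 6.25], [4.2, 5.15], [8.0, 6.00]])
y = np.array([0.12, 0.09, -0.02, 0.15])

# One hidden layer gives the three-layer structure described above.
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
model.fit(X, y)

# Predict future returns for the stocks in a cluster and rank them.
candidates = np.array([[6.5, 6.0], [5.0, 5.5]])
predicted = model.predict(candidates)
ranking = np.argsort(-predicted)  # indices of stocks, best first
print(ranking, predicted)
```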
Optimization
The previous stage suggests that the investor might simply choose the top stocks from each cluster, but one problem remains: how much to invest in each stock? Previous studies used a simple (equal) stock-weighting method, which is primitive; the ranked stocks should instead be weighted so as to maximize return and minimize risk. Markowitz's mean-variance model can be used at this stage. The distribution of the stocks in a portfolio is formed at the end of this stage, and the top three performing portfolios are recommended to the investor.
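A minimal sketch of the mean-variance step appears below: long-only weights that trade off expected return against portfolio variance, solved with SciPy. The risk-aversion parameter and the solver choice are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def mean_variance_weights(mu: np.ndarray, cov: np.ndarray,
                          risk_aversion: float = 1.0) -> np.ndarray:
    """Long-only Markowitz weights: maximise mu.w - risk_aversion * w.cov.w.

    mu: expected returns of the ranked stocks; cov: their covariance matrix.
    """
    n = len(mu)

    def objective(w):  # negated utility, since scipy minimises
        return risk_aversion * w @ cov @ w - mu @ w

    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]  # fully invested
    bounds = [(0.0, 1.0)] * n                                       # no short selling
    w0 = np.full(n, 1.0 / n)                                        # start from equal weights
    res = minimize(objective, w0, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x
```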
The proposed framework integrates structured data from databases (stock prices, balance-sheet data, etc.) with unstructured data from online news articles and tweets. Considering qualitative factors (the management of firms, etc.) alongside quantitative factors (financial ratios) provides better alternatives for forming a portfolio. The top three generated portfolios give investors a choice.