Summarizing text is a challenging task, especially when it involves paraphrasing the text to produce a concise and meaningful summary, a task known as abstractive text summarization. In this project, an encoder-decoder network architecture is used to perform abstractive text summarization by generating headlines for the given articles. An attention mechanism is used to capture the important concepts introduced in the input text. The input is fed to the neural network in the form of GloVe word embeddings.
The model performs well on the dataset used, in comparison to some baseline extractive summarization models.
In this decade, the amount of text data available at our disposal is practically unlimited. Thus, to find a relevant piece of information, we would ideally have to go through all of that data, which is an impossible task. An alternative approach is to look at summaries of the available information and then decide which text to read in full.
Text summarization is an important tool to have, but it comes with its own set of challenges, such as deciding which parts of the text are relevant and which are not. Many existing summarization systems focus on selecting those lines from the text which best summarize it. This form of summarization is called extractive summarization. Even though this kind of summarization essentially works like a highlighter, there are many cases that fool an extractive summarizer.
For example, an extractive summarizer that selects portions from the text based solely on the frequency of the words may generate a summary that does not contain the gist of the text.
Abstractive summarization, on the other hand, paraphrases the text such that the generated summary contains the main idea of the text. This project uses a neural network architecture called the encoder-decoder network to generate an abstractive summary of the text given as input to the system, based on  and .
A text can be considered a sequence of words. The input text sequence is converted to vector format using GloVe embeddings and then passed to the neural network. An attention mechanism is also used to preserve the relation between neighboring words in the generated sequence.
Recently, many researchers have tried to solve the problem of abstractive summarization and produce concise and meaningful summaries. In , the authors focus on deleting parts of sentences to perform sentence compression and generate a grammatically correct summary. They use a noisy-channel approach to decide which parts are less relevant and should be deleted. McKeown et al. have used similarities across related documents to generate an abstractive summary in , using linguistic analysis and machine learning to select parts of the documents that have similar contexts. Jing et al.  use a Hidden Markov Model approach to compose a summary by decomposing a human-generated summary. In a human-generated summary, they identify whether the source of each important concept in the summary is the given document, and if so, where in the document the concept appears. They use this pattern information to generate a concise summary. In , Banko et al. use statistical models to select the important concepts and the ordering of those concepts, and use those models to generate a concise summary.
A concept known as Phrase-based Machine Translation is deployed by  to simplify sentences using a corpus of phrases. Devlin et al.  use a Neural Network Language Model (NNLM) and a context window to perform machine translation. Their system generates one word of the sequence at a time. Chandar et al.  use an autoencoder for the task of machine translation and map input to output using the bag-of-words representation. Nallapati et al.  use an encoder-decoder recurrent neural network with attention to generate abstractive summaries. They store keywords by using information such as part-of-speech tags, entity tags and word statistics along with the word embeddings. Instead of classifying unseen words as out of vocabulary, they also model rare words using a switching generator-pointer. The system used in this project uses an encoder-decoder network with LSTM units and an attention layer.
In Section 3, we explore the architecture of the system, and in Section 4 we present its results. In Section 5, we discuss future work and the conclusions derived from our experiments.
The main idea behind text summarization is that given an input body of text containing m words, the summarizer should generate an n-word sequence of text, such that n < m. The generated sequence of text should preserve the gist of the input text. In this project, a neural network model is used to identify the sequence of n words in its vocabulary that best represents the essence of the text. A Recurrent Neural Network (RNN) is used for this task, as RNNs are exceptionally good at handling sequential and time-based data. The neural network model takes a variable-length input and outputs a variable-length sequence. However, the maximum lengths of the input and output sequences are fixed. Therefore, any sequence shorter than the maximum length is padded with an empty-sequence token, 0. Input sequences are terminated with an end-sequence token, 1. The vocabulary of the system is limited to a certain size, and out-of-vocabulary words are marked as unknowns.
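The padding, end-of-sequence, and unknown-word handling described above can be sketched as follows. The specific token ids (0 for padding, 1 for end of sequence) follow the text; the id chosen for unknown words and the toy vocabulary are illustrative assumptions.

```python
# Sequence preparation sketch: pad with 0, terminate with 1,
# and map out-of-vocabulary words to an unknown token.
PAD, EOS, UNK = 0, 1, 2  # UNK's id is an assumption for illustration

def encode(words, vocab, max_len):
    """Map words to ids, append the end token, and pad to max_len."""
    ids = [vocab.get(w, UNK) for w in words][: max_len - 1]
    ids.append(EOS)
    ids += [PAD] * (max_len - len(ids))
    return ids

vocab = {"the": 3, "cat": 4, "sat": 5}
print(encode(["the", "cat", "sat"], vocab, 6))  # [3, 4, 5, 1, 0, 0]
print(encode(["the", "dog", "sat"], vocab, 6))  # unseen "dog" -> UNK
```

Sequences longer than the maximum are truncated before the end token is appended, so every encoded sequence has exactly the fixed length.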
For the neural network system, we convert the input text into a vector of embeddings using the GloVe model. These embeddings are fed to the encoder network, which creates the context and attention vectors to keep track of the important concepts as well as the context seen in the input sequence. These vectors are then passed to the decoder network, which generates an output sequence of words by passing along the information of the previously generated word and using a beam-search algorithm to select the responses that have the highest likelihood.
For comparison of the performance of the abstractive summarizer, three extractive summarization techniques are used: Line-based Summarization, Frequency-based Summarization, and Ratio-based Summarization.
This type of extractive summarization is based on the idea that most of the information in a piece of text is contained in the first, the middle, or the end part of the text. Therefore, the best summary according to this idea is the line(s) from that section of the text. This project uses a first-line-based extractive summarizer as one of the baseline models.
The frequency-based summarizer selects the line from the input text that contains the most frequent words. For the selection of a valid line, stop words are excluded from the frequency evaluation.
This extractive summarizer generates as output a line whose length is a fixed ratio of the length of the input text.
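The three extractive baselines above can be sketched in a few lines each. The tokenization and the tiny stop-word list are simplifying assumptions; the actual baselines may differ in these details.

```python
# Hedged sketches of the three extractive baselines described above.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}  # toy list

def line_based(text):
    """Line-based baseline: return the first line as the summary."""
    return text.splitlines()[0]

def frequency_based(text):
    """Frequency-based baseline: return the line whose non-stop words
    are the most frequent across the whole text."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    freq = Counter(words)
    def score(line):
        return sum(freq[w] for w in line.lower().split()
                   if w not in STOP_WORDS)
    return max(text.splitlines(), key=score)

def ratio_based(text, ratio=0.1):
    """Ratio-based baseline: return a prefix whose length is a fixed
    ratio of the input length (0.1 in this project's experiments)."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * ratio))])
```

For example, `ratio_based` with the project's ratio of 0.1 returns roughly the first tenth of the input words.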
The first challenge associated with text data is to convert it into a suitable numerical representation to be fed to the encoder-decoder network. A naive way of approaching this problem is to map each word in the vocabulary to a unique number or id. While such a representation is simple to understand and implement, it has certain disadvantages. Such a mapping must be reproducible between runs. If we circumvent this problem by ordinally sorting the data, we introduce a few problems. Firstly, there is an artificial notion of distance between words, e.g. the word 'zebra' is further than the word 'antelope' from the word 'animal'. As hinted by the previous example, the naive encoding also ignores contextual relationships between words. Finally, since the number of words in our vocabulary may number in the hundreds of thousands or even millions, words that occur later in the ordering get weighted more. The distance problem may be solved by treating each word in the same way that categorical variables are traditionally treated: through one-hot encoding. In one-hot encoding, if we have n words in our vocabulary, we introduce n columns in the data, all set to 0. Ordering the words as before, to represent the i-th word, we set the i-th column to 1. While elegant for a few words, this method quickly runs into the curse of dimensionality. Moreover, for any given data row, the large number of columns consists mostly of 0s. Word embeddings are a statistical solution to the problem of representing words as numbers. We use the textual data itself to drive the process of obtaining a suitable representation.
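The contrast between the naive id mapping and one-hot encoding can be made concrete with the zebra/antelope/animal example from the text. The three-word vocabulary here is of course an illustrative assumption.

```python
# Naive id mapping vs. one-hot encoding, as discussed above.
vocab = ["animal", "antelope", "zebra"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Represent a word as an n-dimensional 0/1 vector (n = vocab size)."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

# Naive ids impose an artificial distance: |id(zebra) - id(animal)| = 2
# while |id(antelope) - id(animal)| = 1, though neither word is
# semantically "closer" to "animal" than the other.
print(word_to_id["zebra"] - word_to_id["animal"])  # 2
# One-hot vectors are all equidistant, but require n columns per word,
# almost all of which are 0 -- the dimensionality problem noted above.
print(one_hot("zebra"))  # [0, 0, 1]
```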
GloVe  is a popular embedding technique. First, a matrix of word co-occurrences is formed. Entries corresponding to words that occur together have higher values, while pairs of words that do not occur together are assigned 0. Then, a log-bilinear regression model is used to learn associations between pairs of words in the vocabulary. Only the non-zero entries of the matrix are examined, and therefore the computational cost is low. The resulting embedding has the property of grouping similar or contextually related words closer together within the overall word space.
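Pretrained GloVe vectors are distributed as plain text, one word per line followed by its vector components, and loading them into a lookup table is straightforward. A sketch, using a two-line toy sample in place of a real embedding file:

```python
# Parse GloVe-format lines ("word v1 v2 ... vd") into a dictionary.
import io

def load_glove(lines):
    """Return a {word: vector} dictionary from GloVe-format lines."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Toy stand-in for a real GloVe file; real vectors have 50-300 dims.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")
vectors = load_glove(sample)
print(vectors["cat"])  # [0.1, 0.2, 0.3]
```

The resulting dictionary can then be used to build the embedding-layer weight matrix fed to the encoder.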
The encoder is an RNN consisting of 3 units of Long Short-Term Memory (LSTM) . The input text is first encoded into a distributed representation using an embedding layer whose weights are the vocabulary of the system in the form of GloVe embeddings. The words of the input sequence, in their encoded form, are fed one at a time to the LSTM layers, where they are combined and a contextual representation of the input sequence is saved. This system uses an attention-based encoder so that the relation between neighboring words can be recorded. The encoder is able to distinguish the stop words from the important words.
The decoder network takes as input the output of the encoder network, which is the context of the input sequence, and generates the output sequence. During the generation of each component of the output sequence, the last component generated by the decoder is fed back to the decoder to generate the next component of the sequence. The components of the sequence are generated by considering their contextual probability with respect to the previously generated component.

Table 1: Performance Scores of the baseline models

Summarizer       METEOR Score   ROUGE Score
Line-based       0.0364         0.0295
Frequency-based  0.0321         0.0499
Ratio-based      0.0321         0.0499

3.3.3 Attention

The attention mechanism proposed by Bahdanau et al.  is essential to keep track of the important concepts that have been seen by the network. The words in the input sequence are assigned weights based on their importance to the network. This helps the system remember keywords that are encountered in the input sequence.
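The weighting idea behind attention can be sketched as follows: each encoder state receives a weight from a softmax over alignment scores, and the context passed to the decoder is the weighted sum of the encoder states. The dot-product scoring used here is a simplifying assumption; Bahdanau et al. score each decoder/encoder state pair with a small feed-forward network instead.

```python
# Minimal attention sketch: softmax-weighted sum of encoder states.
import math

def attention(decoder_state, encoder_states):
    """Return attention weights and the resulting context vector."""
    # Alignment scores (dot product is an assumption, see lead-in).
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    # Softmax turns scores into weights that sum to 1.
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    weights = [x / total for x in exp]
    # Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

The largest weight goes to the encoder state most aligned with the current decoder state, which is how important input words stay influential during generation.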
The summaries of the input text are generated using the beam-search algorithm. Beam search generates the k most probable summaries. For each position in the output sequence of at most n words, it searches for the k most likely words given the previously generated word. At each iteration, it keeps the k sub-sequences with the highest probabilities.
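The search procedure above can be sketched directly. The toy bigram table standing in for the decoder's next-word distribution is an assumption for illustration; in the real system, these probabilities come from the decoder network.

```python
# Beam search: keep the k most probable sequences at each step.
import math

def beam_search(next_probs, start, max_len, k):
    """Expand every beam with every next word, then keep the top k
    sequences by accumulated log-probability."""
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for word, p in next_probs(seq[-1]).items():
                candidates.append((seq + [word], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

def next_probs(word):
    # Toy stand-in for the decoder's conditional distribution.
    table = {
        "<s>": {"stocks": 0.6, "markets": 0.4},
        "stocks": {"rise": 0.7, "fall": 0.3},
        "markets": {"rally": 0.9, "dip": 0.1},
        "rise": {"</s>": 1.0}, "fall": {"</s>": 1.0},
        "rally": {"</s>": 1.0}, "dip": {"</s>": 1.0},
    }
    return table[word]

best, _ = beam_search(next_probs, "<s>", 2, 2)[0]
print(best)  # ['<s>', 'stocks', 'rise']
```

With k = 1 this degenerates to greedy decoding; a larger k lets the search recover sequences whose first word is not the single most likely one.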
In the project, we have generated a headline sequence for a news article dataset. The dataset  consists of 2225 articles falling into 5 different categories: Business, Entertainment, Politics, Sports and Technology. The article headlines were used as the desired output. The number of words in the articles fed to the neural network system was limited to 40, and the maximum number of words in the headline was 10. Apart from the encoder-decoder network, three extractive baseline methods were used to generate summaries. In the ratio-based extractive summarizer, the ratio used to select a valid summary line is 0.1. The ROUGE and METEOR scores for the baseline models are very low, as seen in Table 1. A few examples of the three baselines in Table 2 show that while the baseline extractive models sometimes manage to select a line containing the essence of the text, they are not consistent. The encoder-decoder network uses a learning rate of 0.00001 and a vocabulary of the 40,000 most frequent words in the dataset. The network has been trained for 700 epochs. Table 3 shows the results for a few cases generated by the neural network model. In many cases, the system is able to generate exactly the same headline as the actual headline. However, in some cases it is observed that the generated headline is very different from the actual headline. Even in some of these cases, a relation can be found between the words in the article content and the words produced by the system.
The abstractive model uses a 40-word input text to generate a 10-word headline. The next step to improve this project is to increase the length of the input passed to the neural network model. Also, the vocabulary used for the system is limited and is unable to handle new words outside of the vocabulary effectively. Handling this limitation is another logical next step. The results show that the baseline models used in this paper perform poorly on the given dataset, whereas the abstractive text summarizer performs well at generating a summary in the form of a headline. It is observed that the model is able to form word associations between certain concepts and is quite capable of paraphrasing the input text.