This paper explores how Human-Computer Interaction (HCI) technology can address a real-life problem that individuals face. Hand gestures play a vital role in conveying messages and emotions during communication. In this study, we propose a new method to help speech-impaired people who find it difficult to convey their messages and information to others. The system converts American Sign Language into text and then converts that text into speech in a real-time environment.
This helps a hearing person understand the message a speech-impaired person is trying to convey through hand gestures. Earlier approaches such as the convexity-defect method addressed the problem, but inconsistency in managing convexity points produced inaccurate results, so it is not a good method. This paper proposes a convolutional neural network to improve accuracy, together with a new background-subtraction method to discriminate the hand from the background.
Human-Computer Interaction (HCI) is an important and common part of our daily life. For the last few decades, the keyboard and mouse played the main role in this interaction[1]. These are traditional devices; gesture recognition is now the most desirable method for HCI[2]. Human gestures are a natural and instinctive means of communication, whether passing information to other human beings, guiding robots to do certain jobs, or interacting with machines such as computers[3].
Gesture recognition, however, is a complex task. Some researchers have used gloves or sensors such as a Kinect sensor to capture hand gestures and motions[4]. But extra equipment like digital gloves or sensors reduces the naturalness of the gesture, even though it provides a practical solution.
Sign language is one of the most widely used communication systems for deaf-mute people. Sign languages can involve hand shapes, facial expressions, or even body movements[5]. Expressive hand gestures are used to convey feelings and emotions. Many models have been suggested for gesture recognition in HCI, each differing from the others in one way or another, including deep neural networks, fuzzy systems, and the Hidden Markov Model (HMM). The HMM is a statistical model shown to be efficient for spatio-temporal time series[6]. Detecting the hand in motion is the primary step in a computer-vision system and is done by background subtraction, which needs to be as fast and simple as possible[7].
We propose a model that translates American Sign Language, a means of communication for speech-impaired people, into text using a convolutional neural network. Deep learning methods such as ANNs, CNNs, and RNNs have brought much success to the field of human gesture recognition. A key benefit of CNNs is that they require no hand-designed features from the programmer, so they are less prone to human error[8]. Setting a hand histogram is a new method of background subtraction. Letting users create their own gestures is probably the most dynamic feature of this project, since anyone can add a gesture according to their own comfort. Text-to-speech conversion is done with the Python pyttsx3 library, which speaks the output text aloud so it can be easily heard.
Converting sign-language hand gestures into text, and that text into speech, goes a long way toward overcoming the barrier between speech-impaired people and the rest of the world.
Our main aim is for the camera to recognize hand gestures in real time and produce the corresponding text or message with good accuracy. What matters most is recognizing the hand gesture accurately and precisely and converting it to the correct text; text-to-speech conversion is secondary. Since this paper's approach uses a convolutional neural network, it needs a dataset of images of the different American Sign Language hand gestures. Our dataset contains 44 gestures, with 2,400 grayscale images of size 50x50 pixels for each gesture. The 44 gestures comprise the 26 alphabet gestures and 10 number gestures of American Sign Language, plus some other gestures. Some gesture images are shown in Fig. 1. Every gesture in the dataset is labelled with its respective text or message. What if a person wants to add a new gesture, or a new message with a gesture of their own? The solution is described in section 2.2.
Fig. 1 Some of the gestures in the dataset
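For concreteness, the label layout of the dataset described above can be sketched as follows (the variable names are illustrative assumptions, not identifiers from the project):

```python
import numpy as np

NUM_GESTURES = 44          # 26 letters + 10 digits + some other gestures
IMAGES_PER_GESTURE = 2400  # 50x50 grayscale images per gesture

# One integer label per image: gesture 0 repeated 2400 times, then gesture 1, ...
labels = np.repeat(np.arange(NUM_GESTURES), IMAGES_PER_GESTURE)

total_images = NUM_GESTURES * IMAGES_PER_GESTURE  # 105,600 images in all
```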
The approach, then, is to create a CNN model, train it on the dataset of gesture images (each with a predefined text or message), and then convert the predicted text into the respective speech without using any internet connection. Before using it in real life, however, the important step is background subtraction: separating the hand from the background. This is done with a histogram technique, described in section 2.1.
Background subtraction is the first and most important step for good hand-gesture recognition; it handles skin-colour segmentation and sets the background. It is done by creating a histogram. Histogram creation is a dynamic way of subtracting everything in the frame except skin colour. The module starts by streaming video from the device camera, with OpenCV as its backbone. It draws 50 (5x10) green squares on the screen; the palm should be placed so that the squares cover only the palm. The result is a thresholded image. The original video stream and its thresholded hand-histogram image are shown in Fig. 2.
Fig. 2 Setting the hand histogram
Make sure the background is not the same colour as skin. If the background colour does match the skin colour, gloves can be worn so the hand can still be detected.
Gestures can be defined as physical movements of the hands, arms, face, and body made with the intent to convey information or meaning. A gesture is a symbol of emotional expression or physical behaviour, a non-verbal phrase of action. In this study the CNN model is trained on 44 gestures, but there is also an option for users to create and train gestures of their own. Here too the OpenCV library for Python is used: OpenCV creates the window for the video stream and captures the images, while sqlite3 serves as the database that stores them. When creating a gesture, the program asks for the gesture number and the gesture label (the text or message for that gesture), as shown in Fig. 3.
Fig. 3 Creating gestures
To create a particular gesture, the user places a hand in the green window created with the help of the library. A counter runs up to 1200, meaning 1,200 images are snapped and stored in the database.
To enrich the dataset and improve accuracy, each image is also flipped, so a newly created gesture ends up with 2,400 images. After adding a new gesture and flipping its images, the gesture's images must be loaded into the system once so that they are available at training time; loading need not be repeated unless another new gesture is added. All the gestures can be viewed together, as shown in Fig. 4.
Fig. 4 All gesture images
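The capture-and-flip doubling step above can be sketched in a few lines. Here the mirror is a left-right flip; the exact flip axis used by the project is an assumption, as are the function and variable names:

```python
import numpy as np

def augment_with_flips(images):
    """Double a gesture's image set by adding a mirrored copy of each image."""
    return images + [np.fliplr(img) for img in images]

# 1,200 captured frames (random stand-ins for 50x50 grayscale snapshots)
captured = [np.random.randint(0, 256, (50, 50), np.uint8) for _ in range(1200)]
dataset = augment_with_flips(captured)   # 1,200 originals + 1,200 mirrors = 2,400
```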
As discussed earlier, the approach in this paper has two parts: recognizing the hand gesture and predicting the correct text or message, and then converting that text into speech. The predicted text is converted to speech with the "pyttsx3" library, a text-to-speech conversion library for Python. pyttsx3 works completely offline, runs seamlessly, and supports multiple TTS engines.
All the gesture images are labelled and stored in the gesture directory; the model now has to be trained on them. We use tf.keras, a high-level API for building and training models in TensorFlow. The train_images and train_labels arrays form the training set, the data the model learns from; the model is then tested against the test set, the test_images and test_labels arrays. Each image is a 50x50 NumPy array. Before training, the first step is to configure the network's layers and compile the model. Most of deep learning consists of chaining together simple layers, and most layers, like tf.keras.layers.Dense, have parameters that are learned during training. The compile step specifies the loss function, the optimizer, and the metrics. Training then feeds the training data to the model, from which it learns to associate images with labels; finally, the model is asked to make predictions on the test set. Training is started by calling the "model.fit" method.
One reason to use Keras is that tf.keras models are optimized to make predictions on a batch, or collection, of examples at once. So even when classifying a single image, we need to add it to a batch.
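The build, compile, and predict steps just described can be sketched with tf.keras as follows. The layer sizes and the optimizer are illustrative assumptions, not the exact architecture used in this paper; only the 50x50 grayscale input and the 44 output classes come from the text:

```python
import numpy as np
import tensorflow as tf

def build_model(num_classes=44):
    # A small CNN over 50x50 grayscale gesture images
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(50, 50, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',  # integer gesture labels
                  metrics=['accuracy'])
    return model

model = build_model()
# model.fit(train_images, train_labels, epochs=...)  # the training step from the text

# Predicting a single 50x50 image: add a batch dimension first
single = np.zeros((50, 50, 1), np.float32)
probs = model.predict(np.expand_dims(single, 0), verbose=0)
```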
In the experiment, the hand-gesture dataset is used to assess the performance of the above method. The dataset is a collection of images of 44 gestures, with 2,400 images per gesture, giving 105,600 images in total, each 50x50 pixels. All 44 gestures are shown in Figure 4: from left to right and top to bottom, they consist of the 26 alphabet and 10 number gestures of American Sign Language plus some other gestures. Once the model is trained on the images, a hand gesture can be recognized, its label predicted, and the voice generated for it. Training ran on a laptop with 8 GB of RAM, an Intel Core i5 processor, and an NVIDIA graphics card with 4 GB of memory.
As a final check, we can physically verify whether a gesture yields the expected text. When the hand is placed in the green box, the predicted text appears on the screen, and the appearing text is simultaneously converted into speech. An image of this is shown in Figure 5.
Fig. 5 Testing the model
For the total of 105,600 images, the classification results can be summarized with a confusion matrix, shown in Figure 6. The last row and the first column of the matrix hold the gesture labels; the other entries record the number of gesture images predicted as each corresponding label. For example, in the second row, the number 396 lies in the column corresponding to the label of gesture 2. As the confusion matrix shows, the above methodology performs well and gives good accuracy.
Fig. 6 Confusion matrix
The classification report sheds further light on the accuracy. With our proposed method, the precision, recall, and F-score are all 1.00 (refer to Figure 7), the accuracy is about 99.90%, and the validation accuracy is about 99.994%.
In this paper, a simple new approach to hand-gesture recognition is proposed. The hand is detected against the background by a background-subtraction method, and a CNN improves the accuracy of gesture recognition while helping to discriminate the hand from the background. The Python library pyttsx3 converts the recognized text into speech easily and accurately. Our method's performance was evaluated on a dataset of 105,600 images of 44 gestures; it performs well and is fully suitable for real-time application.