This paper explores how Human-Computer Interaction (HCI) technology can address a real-life problem that individuals face. Hand gestures play a vital role in conveying messages and emotions during communication. In this study, we propose a new method to help speech-impaired people who find it difficult to convey their messages and information to others. The system converts American Sign Language into text and then converts that text into speech in a real-time environment.
This helps a hearing person understand the message a speech-impaired person is trying to convey through hand gestures. Earlier approaches such as the convexity-defect method addressed the problem, but inconsistency in managing convexity points produced inaccurate results, so it is not a good method. This paper proposes a convolutional neural network to improve accuracy, together with a new background-subtraction method to discriminate the hand from the background.
Human-Computer Interaction (HCI) is an important and common part of our daily life. For the last few decades, the keyboard and mouse played the main role in this interaction[1]. These are traditional devices; gesture recognition is now the most desirable method for HCI[2]. Human gestures are a natural and instinctive means of communication, whether passing information to other human beings, guiding robots to do certain jobs, or interacting with machines such as computers[3].
Gesture recognition, however, is a complex task. Some researchers have used gloves or sensors such as a Kinect sensor to capture hand gestures and motions[4]. But extra equipment like digital gloves or sensors reduces the naturalness of the gesture, even though it provides a practical solution.
Sign language is one of the most widely used communication systems for deaf-mute people. Sign languages can involve hand shapes, facial expressions, or even body movements[5]. Expressive hand gestures are used to convey feelings and emotions. Many models have been suggested for gesture recognition in HCI, each differing from the others in one way or another, including deep neural networks, fuzzy systems, and the Hidden Markov Model (HMM). The HMM is a statistical model shown to be efficient for spatio-temporal time series[6]. Detecting the hand in motion is the primary step in a computer-vision system and is done by background subtraction, which needs to be as fast and simple as possible[7].
We propose a model that translates American Sign Language, a means of communication for speech-impaired people, into text using a convolutional neural network. Deep learning methods such as ANNs, CNNs, and RNNs have brought much success to the field of human gesture recognition. A key benefit of CNNs is that they require no hand-designed features from the programmer, so they are less prone to human error[8]. Setting a hand histogram is a new method of background subtraction. Letting users create their own gestures is probably the most dynamic feature of this project, since anyone can add a gesture according to their own comfort. Text-to-speech conversion is done with the Python pyttsx3 library, which speaks the output text aloud so it can be easily heard.
Converting sign-language hand gestures into text, and that text into speech, goes a long way toward overcoming the barrier between speech-impaired people and the rest of the world.
Our main aim is for the camera to recognize hand gestures in real time and produce the corresponding text or message with good accuracy. What matters most is recognizing the hand gesture accurately and precisely and converting it to the correct text; text-to-speech conversion is secondary. Since this paper's approach uses a convolutional neural network, it needs a dataset of images of the different American Sign Language hand gestures. Our dataset contains 44 gestures, with 2,400 grayscale images of size 50x50 pixels for each gesture. The 44 gestures comprise the 26 alphabet gestures and 10 number gestures of American Sign Language, plus some other gestures. Some gesture images are shown in Fig. 1. Every gesture in the dataset is labelled with its respective text or message. What if a person wants to add a new gesture, or a new message with a gesture of their own? The solution is described in section 2.2.
Fig. 1 Some of the gestures in the dataset
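For concreteness, the label layout of the dataset described above can be sketched as follows (the variable names are illustrative assumptions, not identifiers from the project):

```python
import numpy as np

NUM_GESTURES = 44          # 26 letters + 10 digits + some other gestures
IMAGES_PER_GESTURE = 2400  # 50x50 grayscale images per gesture

# One integer label per image: gesture 0 repeated 2400 times, then gesture 1, ...
labels = np.repeat(np.arange(NUM_GESTURES), IMAGES_PER_GESTURE)

total_images = NUM_GESTURES * IMAGES_PER_GESTURE  # 105,600 images in all
```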
The approach, then, is to create a CNN model, train it on the dataset of gesture images (each with a predefined text or message), and then convert the predicted text into the respective speech without using any internet connection. Before using it in real life, however, the important step is background subtraction: separating the hand from the background. This is done with a histogram technique, described in section 2.1.
Background subtraction is the first and most important step for good hand-gesture recognition; it handles skin-colour segmentation and sets the background. It is done by creating a histogram. Histogram creation is a dynamic way of subtracting everything in the frame except skin colour. The module starts by streaming video from the device camera, with OpenCV as its backbone. It draws 50 (5x10) green squares on the screen; the palm should be placed so that the squares cover only the palm. The result is a thresholded image. The original video stream and its thresholded hand-histogram image are shown in Fig. 2.
Fig. 2 Setting the hand histogram
Make sure the background is not the same colour as skin. If the background colour does match the skin colour, gloves can be worn so the hand can still be detected.
Gestures can be defined as physical movements of the hands, arms, face, and body made with the intent to convey information or meaning. A gesture is a symbol of emotional expression or physical behaviour, a non-verbal phrase of action. In this study the CNN model is trained on 44 gestures, but there is also an option for users to create and train gestures of their own. Here too the OpenCV library for Python is used: OpenCV creates the window for the video stream and captures the images, while sqlite3 serves as the database that stores them. When creating a gesture, the program asks for the gesture number and the gesture label (the text or message for that gesture), as shown in Fig. 3.
Fig. 3 Creating gestures
To create a particular gesture, the user places a hand in the green window created with the help of the library. A counter runs up to 1200, meaning 1,200 images are snapped and stored in the database.
To enrich the dataset and improve accuracy, each image is also flipped, so a newly created gesture ends up with 2,400 images. After adding a new gesture and flipping its images, the gesture's images must be loaded into the system once so that they are available at training time; loading need not be repeated unless another new gesture is added. All the gestures can be viewed together, as shown in Fig. 4.
Fig. 4 All gesture images
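The capture-and-flip doubling step above can be sketched in a few lines. Here the mirror is a left-right flip; the exact flip axis used by the project is an assumption, as are the function and variable names:

```python
import numpy as np

def augment_with_flips(images):
    """Double a gesture's image set by adding a mirrored copy of each image."""
    return images + [np.fliplr(img) for img in images]

# 1,200 captured frames (random stand-ins for 50x50 grayscale snapshots)
captured = [np.random.randint(0, 256, (50, 50), np.uint8) for _ in range(1200)]
dataset = augment_with_flips(captured)   # 1,200 originals + 1,200 mirrors = 2,400
```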
As discussed earlier, the approach in this paper has two parts: recognizing the hand gesture and predicting the correct text or message, and then converting that text into speech. The predicted text is converted to speech with the "pyttsx3" library, a text-to-speech conversion library for Python. pyttsx3 works completely offline, runs seamlessly, and supports multiple TTS engines.
All the gesture images are labelled and stored in the gesture directory; the model now has to be trained on them. We use tf.keras, a high-level API for building and training models in TensorFlow. The train_images and train_labels arrays form the training set, the data the model learns from; the model is then tested against the test set, the test_images and test_labels arrays. Each image is a 50x50 NumPy array. Before training, the first step is to configure the network's layers and compile the model. Most of deep learning consists of chaining together simple layers, and most layers, like tf.keras.layers.Dense, have parameters that are learned during training. The compile step specifies the loss function, the optimizer, and the metrics. Training then feeds the training data to the model, from which it learns to associate images with labels; finally, the model is asked to make predictions on the test set. Training is started by calling the "model.fit" method.
One reason to use Keras is that tf.keras models are optimized to make predictions on a batch, or collection, of examples at once. So even when classifying a single image, we need to add it to a batch.
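The build, compile, and predict steps just described can be sketched with tf.keras as follows. The layer sizes and the optimizer are illustrative assumptions, not the exact architecture used in this paper; only the 50x50 grayscale input and the 44 output classes come from the text:

```python
import numpy as np
import tensorflow as tf

def build_model(num_classes=44):
    # A small CNN over 50x50 grayscale gesture images
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(50, 50, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',  # integer gesture labels
                  metrics=['accuracy'])
    return model

model = build_model()
# model.fit(train_images, train_labels, epochs=...)  # the training step from the text

# Predicting a single 50x50 image: add a batch dimension first
single = np.zeros((50, 50, 1), np.float32)
probs = model.predict(np.expand_dims(single, 0), verbose=0)
```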
In the experiment, the hand-gesture dataset is used to assess the performance of the above method. The dataset is a collection of images of 44 gestures, with 2,400 images per gesture, giving 105,600 images in total, each 50x50 pixels. All 44 gestures are shown in Figure 4: from left to right and top to bottom, they consist of the 26 alphabet and 10 number gestures of American Sign Language plus some other gestures. Once the model is trained on the images, a hand gesture can be recognized, its label predicted, and the voice generated for it. Training ran on a laptop with 8 GB of RAM, an Intel Core i5 processor, and an NVIDIA graphics card with 4 GB of memory.
As a final check, we can physically verify whether a gesture yields the expected text. When the hand is placed in the green box, the predicted text appears on the screen, and the appearing text is simultaneously converted into speech. An image of this is shown in Figure 5.
Fig. 5 Testing the model
For the total of 105,600 images, the classification results can be summarized with a confusion matrix, shown in Figure 6. The last row and the first column of the matrix hold the gesture labels; the other entries record the number of gesture images predicted as each corresponding label. For example, in the second row, the number 396 lies in the column corresponding to the label of gesture 2. As the confusion matrix shows, the above methodology performs well and gives good accuracy.
Fig. 6 Confusion matrix
The classification report sheds further light on the accuracy. With our proposed method, the precision, recall, and F-score are all 1.00 (refer to Figure 7), the accuracy is about 99.90%, and the validation accuracy is about 99.994%.
In this paper, a simple new approach to hand-gesture recognition is proposed. The hand is detected against the background by a background-subtraction method, and a CNN improves the accuracy of gesture recognition while helping to discriminate the hand from the background. The Python library pyttsx3 converts the recognized text into speech easily and accurately. Our method's performance was evaluated on a dataset of 105,600 images of 44 gestures; it performs well and is fully suitable for real-time application.