A Study on Applications of Semi-Supervised Learning


Abstract

Aim: The aim of this paper is to survey the different areas of application of Semi-Supervised Learning and its various algorithms, along with its advantages over well-known and state-of-the-art machine learning methods.

Background: Among machine learning algorithms, supervised learning algorithms work on data whose output is already present, known as labeled data, while unsupervised learning algorithms work on data whose output is not present, known as unlabeled data. Obtaining labeled data can be a tedious and time-consuming task, which motivates the use of semi-supervised learning algorithms, which work on both labeled and unlabeled data and give higher accuracy. This paper provides a comprehensive survey of various applications of semi-supervised learning, such as social data, cyber security, medical data and educational data.

Methods: A comparative study is performed using semi-supervised learning algorithms in each field of application and, based on the results obtained, the performance of each learning method is discussed. A final conclusion is drawn accordingly.

Results and Discussion: By applying a semi-supervised learning algorithm to each of the given datasets, which comprise both labeled and unlabeled data, it is demonstrated that it performs considerably better than the other algorithms; each set of outputs is discussed and the conclusion is derived.

Moreover, how promising and advantageous this learning method can be in different areas, and how beneficial the combination of labeled and unlabeled data in a dataset can prove to be, is examined.

Introduction

When we conventionally approach machine learning, we make use of a dataset that has its own set of correct outputs associated with it, also known as labeled data.


The labeled dataset is a part of the training set. The training process through labeled data is therefore said to be under 'supervision'. The labeled data is already classified into categories and the machine learning algorithm is applied directly to it. Then, everything that the machine learning algorithm has learned from the labeled dataset is applied to the unlabeled dataset, which is the test set. The test set is not a part of the training set. The test set is the set that does not have its outputs associated with it, and the learnings of the machine learning algorithm are applied to the test set to produce the set of outputs as accurately as possible.

However, although the results obtained using a supervised learning approach are accurate, they are time-consuming and may not be applicable in areas where time is a luxury. In addition, collecting labeled data is expensive. In the other approach to machine learning, we make use of a dataset that does not have outputs associated with it, also known as an unlabeled dataset. The user has to apply the machine learning algorithm directly to the dataset and produce the correct output to be associated with each unlabeled instance in the dataset. The machine learning algorithm has no prior training, and the most suitable algorithm is chosen from a range of algorithms and applied to the dataset to get the most accurate results. Here the dataset is itself the test set.

However, although the results obtained using unsupervised learning are fast and save time, they are not as accurate as those of supervised learning. In the real world, data is present as a combination of both labeled and unlabeled data. In reality, there is a lot of unlabeled data and only limited labeled data, so it would be expensive if the user tried to label all the data present. The best method would therefore be somewhere in the middle of supervised and unsupervised learning. Such a learning method is called semi-supervised learning. Semi-supervised learning is a machine learning approach in which we make use of unlabeled data, along with limited labeled data, so as to direct the output for the unlabeled data toward a particular direction within the fixed classes.

The most suitable machine learning algorithm is chosen for the dataset and is applied to the labeled set present in it. The learnings of the algorithm are then applied to the unlabeled data in the dataset. This learning approach addresses the limitations of both the supervised and unsupervised learning approaches: the dataset comprises a limited number of labeled instances, which makes it less expensive, and since the algorithm uses the labeled data to guide the classification of the unlabeled data, the outputs are the most accurate. Over time, many semi-supervised learning approaches have been developed and used in applications such as Twitter sentiment analysis, facial landmark detection, intrusion detection, improving writer identification, and outlier detection.

Tweet Sentiment Analysis

Twitter is a social media platform used to send and receive messages from the user's friends. These messages are called 'tweets' and are an abundance of data for 'sentiment analysis' for machine learning enthusiasts. Usually, sentiment analysis takes place using supervised learning, but since it is expensive and time-consuming to acquire labeled data, we are motivated to use semi-supervised learning for this process.

Since there is an overwhelming amount of 'unannotated tweets' and only a limited amount of 'annotated tweets' as a dataset, the semi-supervised learning (SSL) method will be used for the sentiment analysis of the unannotated and annotated tweets, because this fits the basic definition of how the SSL method works. The work of Thakkar and Patel [2013] was based on labeled data, but here the SSL approach to tweet sentiment analysis will be used. (Silva, Coletta, & Hruschka, 2016)

The labeled dataset is given by:

dl = {(xi, yi) | (xi, yi) ∈ X × Y, i = 1, . . . , l}

The unlabeled dataset is given by:

du = {xj | xj ∈ X, j = l + 1, . . . , l + u}

dl: labeled dataset

du: unlabeled dataset

X: input space

Y: label space

A classifier f is trained from the collection of both labeled and unlabeled data, thus making it better than a classifier trained from the labeled dataset alone.

Topic-based methods: Sentences also provide information about the topic around which they are framed, and this can change the meaning of keywords depending on the sentiment conveyed by the sentence. Many methods overlook this aspect of sentiment analysis. For example, the word 'amazing' in the sentences 'this book is amazing' and 'this mountain climb is amazingly dangerous' has different meanings in the two sentences, subject to the topic it is used in.

The performance of the topic-based approach depends on two factors: the confidence threshold and cluster interference.

If labeled tweets are very limited and the confidence threshold is high, the learning process is hampered, and if the threshold is low, the model learns wrong classes. No significant learning was acquired with 1%, 5%, 10%, 20% or 40% of the training set labeled, setting the confidence threshold from 0.4 to 0.9.

Actual learning began only when 2M unlabeled tweets were used and the threshold was set to 0.96; this produced good classification models.
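As a rough illustration of this setup, the sketch below marks unannotated tweets with the label -1 and uses scikit-learn's SelfTrainingClassifier so that only pseudo-labels above a high confidence threshold are added; the toy tweets, the TF-IDF/logistic-regression pipeline and the reuse of the 0.96 threshold are illustrative choices, not the exact configuration of Silva et al. (2016).

```python
# Minimal self-training sketch for tweet sentiment (illustrative only).
# The tiny toy tweets below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

tweets = [
    "this phone is amazing",            # annotated positive
    "worst service ever",               # annotated negative
    "absolutely love the new update",   # unannotated
    "the battery dies too fast",        # unannotated
]
labels = [1, 0, -1, -1]  # -1 marks unannotated tweets (du); 0/1 are annotated (dl)

# The base classifier must expose predict_proba so pseudo-labels can be
# filtered by confidence; only predictions above the threshold are added.
base = LogisticRegression(max_iter=1000)
model = make_pipeline(
    TfidfVectorizer(),
    SelfTrainingClassifier(base, threshold=0.96),  # high threshold, as in the study
)
model.fit(tweets, labels)
print(model.predict(["the camera is amazing"]))
```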

Tweet sentiment analysis was conducted using semi-supervised learning, motivated by the fact that labeled tweets are expensive and difficult to obtain while unlabeled tweets are widely available at low cost. It was concluded that a topic-based model needs to be included in the sentiment analysis, as the general meanings of specific keywords change according to the sentences or phrases they are used in. (Silva et al., 2016)

Big Social Data

There is a surplus of unstructured information that can be obtained from sources such as social media, online communities, blogs and various other forms of media. Based on this unstructured information, brands are created and shaped by marketers. The many practical purposes of this process give rise to big data analysis, in which relevant knowledge is filtered from huge chunks of available data using graph mining and natural language processing methods.

Many tools were used for data processing, but it was found that semi-supervised learning methods, such as biased SVM (bSVM) and biased Regularized Least Squares (bRLS), can learn from a dataset that contains a lot of unlabeled data and a limited amount of labeled data. (Hussain & Cambria, 2018)

Regularization theory is the driving principle behind modern classification methods. A function f belongs to a Reproducing Kernel Hilbert Space (RKHS). For any dataset X, a square matrix is built that is symmetric and positive definite.

Rreg = Remp[f] + λΩ[f]

λ: a positive parameter

Remp[f]: loss function

Ω[f]: regularization operator

For further details, refer to (Hussain & Cambria, 2018).
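To make the two terms concrete, the following sketch evaluates a regularized risk of this form for a simple linear model, with a squared-error empirical loss standing in for Remp[f] and the squared norm of the weights standing in for Ω[f]; this is a generic illustration of the trade-off, not the bSVM/bRLS formulation of Hussain and Cambria (2018).

```python
# Illustrative regularized risk R_reg = R_emp[f] + lambda * Omega[f]
# for a linear model f(x) = X @ w. Squared error plays the role of the
# empirical loss and ||w||^2 the role of the regularizer (assumptions made
# only for this sketch, not the exact operators of the cited paper).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                         # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 0.1                                            # the positive parameter lambda

def regularized_risk(w):
    emp = np.mean((X @ w - y) ** 2)                  # R_emp[f]: empirical loss
    omega = np.dot(w, w)                             # Omega[f]: regularization term
    return emp + lam * omega

w = np.zeros(3)
for _ in range(500):                                 # plain gradient descent on R_reg
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
    w -= 0.05 * grad

print("R_reg at the fitted w:", regularized_risk(w))
```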

When the proposed method was compared with the available state-of-the-art methods, the accuracy of emotion recognition and polarity detection was generally higher than that of the other methods. The comparison table was formed after extracting close to 12,000 statements and manually labeling them as positive and negative.

Every day, many websites are filled with new reviews, comments and opinions, and the retrieval and filtration of such data has become a vital means for tasks such as social media marketing, product positioning and financial market prediction. Because it was apparent that the state-of-the-art methods were still not accurate enough and require a lot of training, semi-supervised reasoning served as a good solution for big social data analysis. Results showed that semi-supervised reasoning was clearly more accurate in both emotion recognition and polarity detection, and consequently can be used for tackling problems based on big social data analysis.

Cyber Security

Facial Landmark Detection

There are various areas such as face recognition, face reconstruction and detection of facial attributes such as smile, age and gender of the person. All these areas come under the study of face analysis. For the study, the procedure applied was face alignment, which is commonly known as Facial Landmark Detection. A variety of algorithms were designed for the procedure and they are divided into two categories: Template Fitting and Regression Methods.

The traditional methods of facial landmark detection function perfectly, but only in theory. The real world is filled with extreme poses and unfavorable lighting conditions, and this is where the traditional methods fail. When a human looks at a face, there are no methods such as template fitting or regression methods. Since both the methods fail in real life applications, the onus is upon us to choose a method which is in the middle of these two, similar to how a human brain works. This is known as the multi-task system. For this method to work, the algorithm was applied to training model faces and then this learning was applied to the test faces, in the template fitting method. Regression methods helped detect various landmarks on the faces simultaneously.

For the semi-supervised case, each image has two types of ground truth: the bounding-box ground truth and the landmark coordinates inside the bounding box.

For positive samples, we convert the coordinates of the landmarks to a vector (xc, yc, w, h), where the subscript c indicates the center coordinates of the bounding box. For negative samples, we ignore the vector and only assign them a background label.

j ∈ {eyebrow, eye, nose, mouth} indicates the landmark category. li and li* indicate the predicted classification and ground-truth classification of proposal i. ti is the predicted vector representing the offset between the ith proposal and its corresponding ground-truth bounding box, and ti* is the true offset between them. The remaining four terms belong to the landmark task and have a similar form:

λj Σi wij · Llandmark(pij, pi*)

Here,

wij is the weight vector of the ith proposal,

pij is the predicted landmark relative coordinates according to the ith proposal and

pi* is the relative coordinates of the ground truth with regard to the ith proposal box

The loss weight is given by λj.

λcls = λreg = 1 and λeyebrow = λeye = λnose = λmouth = 2 were set in order to focus on the main task of landmark detection.

L = Σi [ λcls Lcls(li, li*) + λreg li* Lreg(ti, ti*) + Σj λj wij · Llandmark(pij, pi*) ]

For more information, refer to (Tang, Guo, Shen, & Du, 2018).
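As a purely numerical illustration of how the weighted terms combine for a single proposal, the sketch below uses placeholder squared-error losses and made-up shapes; the actual classification, regression and landmark losses in Tang et al. (2018) differ, so only the weighting structure should be read from it.

```python
# Toy combination of the multi-task loss terms with the weights from the text:
# lambda_cls = lambda_reg = 1, lambda_j = 2 for each landmark category.
# The per-term losses below are simple squared errors used only as placeholders.
import numpy as np

lambda_cls, lambda_reg = 1.0, 1.0
lambda_landmark = {"eyebrow": 2.0, "eye": 2.0, "nose": 2.0, "mouth": 2.0}

def sq(a, b):
    return float(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def proposal_loss(l, l_star, t, t_star, p, p_star, w):
    """Weighted loss for one proposal i (placeholder per-term losses)."""
    loss = lambda_cls * sq(l, l_star)                 # classification term
    loss += lambda_reg * l_star * sq(t, t_star)       # box regression, positives only
    for j, lam_j in lambda_landmark.items():          # four landmark terms
        loss += lam_j * w[j] * sq(p[j], p_star[j])
    return loss

# One positive proposal (l_star = 1) with made-up numbers:
print(proposal_loss(
    l=0.8, l_star=1.0,
    t=[0.1, 0.2, 0.9, 1.1], t_star=[0.0, 0.0, 1.0, 1.0],
    p={j: [0.5, 0.5] for j in lambda_landmark},
    p_star={j: [0.4, 0.6] for j in lambda_landmark},
    w={j: 1.0 for j in lambda_landmark},
))
```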

The results showed that the semi-supervised model for facial landmark detection, the multi-task algorithm, can locate landmarks and detect errors more accurately than the other methods used for the same purpose. The use of unannotated data along with limited annotated data significantly improved the accuracy of facial landmark detection.

The semi-supervised algorithm for facial landmark detection was the first attempt at combining an object detection system with the facial landmark detection task. This combination allowed landmarks to be predicted and facial components to be detected simultaneously. The results showed that the semi-supervised facial landmark detection algorithm was more accurate than any of the state-of-the-art algorithms for facial landmark detection. (Pise & Kulkarni, 2009)

Multilayer Clustering Intrusion Detection

An intrusion detection and prevention system is a system responsible for detecting, monitoring and identifying intrusions, thus acting as an integral component of a company's security infrastructure.

Intrusion detection and prevention systems are classified into two categories based on their capacity to identify known and unknown intrusions. Rule-based intrusion detection and prevention systems operate on a pre-defined set of rules, so they are experts at identifying known intrusions but are less capable when it comes to unknown intrusions; the larger the dataset, the more time-consuming such a system becomes. An anomaly-based intrusion detection and prevention system builds a model that it considers normal, and when there is any deviation from the normal model, the system considers it an intrusion. But in a large dataset, there are going to be a lot of similarities between the normal set and the malicious activity. So we are going to use a system that uses the best of both the aforementioned systems. Hence, a semi-supervised model shall be designed.

Data clustering and weighted Euclidean distance:

Let x1, x2, . . . , xn be a set of instances in d-dimensional space.

K is a predefined number of clusters. The K-Means algorithm minimizes the objective function

F(x1, x2, . . . , xn) = Σk Σxi∈ck ||xi − x̄k||²

where ck denotes the kth cluster,

x̄k = (1/nk) Σxi∈ck xi

is the center of the kth cluster, nk is the number of instances in the kth cluster, and || · || denotes the Euclidean norm used by the K-Means algorithm.

For more information, refer to (Al-Jarrah, Al-Hammdi, Yoo, Muhaidat, & Al-Qutayri, 2018).
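The objective above can be seen in action with scikit-learn's KMeans, whose inertia_ attribute is exactly this sum of squared Euclidean distances to the cluster centers; the sketch below uses synthetic data and does not reproduce the weighted distance or the multilayer structure of the SMLC model.

```python
# K-Means objective F = sum_k sum_{xi in ck} ||xi - xbar_k||^2 on toy data.
# sklearn's inertia_ is this objective with the plain Euclidean norm; the
# weighted variant would rescale each feature before clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 4)),   # "normal traffic" cluster
    rng.normal(loc=3.0, scale=0.5, size=(20, 4)),    # "attack-like" cluster
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("objective F (inertia):", km.inertia_)

# The same objective computed by hand from the definition in the text:
F = sum(np.sum((X[km.labels_ == k] - c) ** 2) for k, c in enumerate(km.cluster_centers_))
print("objective F (manual):", F)
```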

The Kyoto 2006+ dataset contains real network traffic data. Nearly 425L are known attack sessions, and 4.25L are unknown attack sessions.

From the Kyoto 2006+ dataset taken as an illustration, it was found that an intrusion detection and prevention system, particularly the SMLC model, can work on large volumes of data efficiently and can beat the other conventional algorithms.

The proposed SMLC model has been tested on various datasets, and the results show that this model outperformed the semi-supervised training model even though it used less labeled data. (Al-Jarrah et al., 2018)

Medical Fields

The Electronic Health Record may contain information that is not correct with respect to a patient at a given encounter. Outlier detection applied to the dataset given by the Electronic Health Records will be used to discover such inaccuracies.

A semi-supervised learning approach will be applied to construct data distributions, and these distributions will then be compared with the distributions in the Electronic Health Records to detect outliers.

Assume a set of unlabeled data, {x(1), x(2), …}, where x(i) ∈ Rn. An auto-encoder applies backpropagation to reconstruct the input values by setting the target values y(i) to be equal to the input values x(i). In other words, the auto-encoder aims to learn a function hw,b(x) that approximates x̂ with the least possible amount of distortion from x. It compresses the input data into a low-dimensional latent subspace and then decompresses it so as to minimize the reconstruction error.

ε(i) = ( Σj=1..n ( xj(i) − x′j(i) )² )^0.5

For more information, refer to (Estiri & Murphy, 2019).
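The reconstruction-error idea can be sketched with a small network trained to reproduce its own input, a rough stand-in for the auto-encoders of Estiri and Murphy (2019); records whose error ε(i) is unusually large are flagged as outliers. The bottleneck size, the synthetic data and the percentile cutoff are arbitrary choices made only for the illustration.

```python
# Reconstruction-error outlier detection sketch: train a small network to
# reproduce its input, then flag records with large epsilon(i). Using
# MLPRegressor with X as its own target is a simple stand-in for an
# auto-encoder; the 2-unit hidden layer acts as the low-dimensional code.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))          # toy clinical observation records
X[:5] += 8.0                           # a few implausible records ("outliers")

Xs = StandardScaler().fit_transform(X)
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
ae.fit(Xs, Xs)                         # target = input, i.e. reconstruction

eps = np.sqrt(((Xs - ae.predict(Xs)) ** 2).sum(axis=1))   # epsilon(i) per record
cutoff = np.percentile(eps, 99)        # arbitrary cutoff for the sketch
print("flagged records:", np.where(eps > cutoff)[0])
```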

Non-parametric hypothesis testing on entire encoders was performed, based on the features that formed them.

Results showed that the best encoder performance was that of the one with the lowest J index of 0.999, which implies that there is at least one encoder having a J index greater than 0.999. It was also found that simple super-encoders gave excellent outlier detection outputs.

It was found that the best outlier detector was a single encoder, a semi-supervised encoder. On average, the J index was 0.9968, so there will be at least one encoder with a J index value greater than 0.999.

Text Classification with Laplacian SVMs: Application to Cancer Case Management

With the adoption of technology in the medical field, such as Electronic Medical Records (EMRs), a lot of labeled and unlabeled data has been produced and gathered. Utilizing this vast amount of data can prove to be revolutionary in healthcare delivery and research. In common text classification applications, due to the abundance of unlabeled data over labeled data, semi-supervised learning has proved to perform better than its supervised counterpart.

The motivation behind this study is cancer case management. There have been delays in cancer diagnosis because of abnormal (unlabeled) data showing up in the reports, which can be reduced using semi-supervised learning.

To use unlabeled data, semi-supervised learning makes assumptions, which can be divided into two categories: Low-Density Separation (LDS) and the Manifold Paradigm.

For the manifold model, let G = (V, E) be a graph with vertices V = {v1, . . . , vn}, where n = l + u. W = (wij) is the adjacency matrix of G, where the weight wij represents the similarity between vertices vi and vj. If wij = 0, the vertices vi and vj are not connected by an edge. The approach involves creating an edge between each vertex and its k nearest neighbours; σ controls the rate of decay of the edge weights.
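A minimal construction of such a graph is sketched below: each instance is connected to its k nearest neighbours, and the edge weights are taken from a Gaussian (heat) kernel, which is one common choice consistent with σ controlling the rate of decay; the exact weighting used by Garla et al. (2013) may differ.

```python
# Build the k-nearest-neighbour similarity graph W used by manifold methods.
# The Gaussian weight exp(-||xi - xj||^2 / (2 sigma^2)) is an assumption made
# for this sketch, not necessarily the exact kernel of the cited paper.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))            # toy document vectors (n = l + u rows)
k, sigma = 3, 1.0

# Distances to the k nearest neighbours; zero entries mean "no edge".
dist = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
W = np.where(dist > 0, np.exp(-dist ** 2 / (2 * sigma ** 2)), 0.0)
W = np.maximum(W, W.T)                 # symmetrise so that w_ij = w_ji

print(W.round(3))
```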

When Natural Language Processing (NLP) tools review the data, they may classify some data as 'cancer alerts', which may be false positives. To keep the false positives in check, a cancer care facilitator is used as part of routine case management. This improves the performance of the cancer alert coding system.

Around 1500 instances were split into a training reference and a testing reference. The initial reference standard used gave 49% positive cancer alerts, compared to 77% positives given by the proposed reference standard. Laplacian SVMs were used with Gaussian kernels, and are controlled by kernel parameters, regularization parameters, parameters used to measure intrinsic geometry, and the amount of unlabeled training data. 5%-25% of unlabeled data was added to the Laplacian SVMs to observe changes, and it was found that the value of macro-F1 increased from 0.741 to 0.756 (adding 5%) and to 0.773 (adding 25%).

Clinical text classification is a critical step towards utilizing unlabeled data. Semi-supervised learning methods proved to be more efficient in recognizing false positives and preventing cancer alert delays. The algorithms may also improve classifier performance on other clinical text classification tasks. (Garla et al., 2013)

Educational Data

The quicker the academic staff identify low performers, the faster they can implement different learning strategies (such as seminars, remedial classes, tests and training material) for their improvement.

The best approach to analyzing such situations regarding students' performance is given by semi-supervised learning, as compared to other machine learning methods. In this application, the Tri-Training algorithm of semi-supervised learning is used. In combination with this, the Nearest Neighbor Rule is used in order to eliminate incorrect classifications and detect noisy data.
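A heavily simplified sketch of the Tri-Training idea is given below: three classifiers are trained on bootstrap samples of the labeled data, and an unlabeled example is added to one classifier's training set when the other two agree on its label. The error-rate safeguards of the full algorithm and the Nearest Neighbor noise filtering are omitted, and the data is synthetic.

```python
# Heavily simplified Tri-Training sketch (safeguards omitted for brevity).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 4))                      # labeled student records (toy)
y_lab = (X_lab[:, 0] > 0).astype(int)                 # 1 = pass, 0 = fail (toy rule)
X_unl = rng.normal(size=(270, 4))                     # roughly 90% unlabeled, as above

# Three classifiers trained on bootstrap samples of the labeled data.
boots = [resample(X_lab, y_lab, random_state=s) for s in range(3)]
clfs = [DecisionTreeClassifier(random_state=s).fit(Xb, yb)
        for s, (Xb, yb) in enumerate(boots)]

for _ in range(3):                                    # a few teaching rounds
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        pj, pk = clfs[j].predict(X_unl), clfs[k].predict(X_unl)
        agree = pj == pk                              # the two "teachers" agree
        X_new = np.vstack([boots[i][0], X_unl[agree]])
        y_new = np.concatenate([boots[i][1], pj[agree]])
        clfs[i] = DecisionTreeClassifier(random_state=i).fit(X_new, y_new)

# Final prediction by majority vote of the three classifiers.
votes = np.stack([c.predict(X_unl) for c in clfs])
y_pred = (votes.sum(axis=0) >= 2).astype(int)
print("predicted fail/pass counts:", np.bincount(y_pred))
```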

A dataset of 344 students is taken and a table of several attributes is made, including general information such as gender, age, domestic status, children, working time, computer knowledge and occupation. An education module was created which included written examinations, face-to-face sessions and the final examinations. Every student with a grade greater than 5 (on a scale of -1 to 10, where -1 means no assignment submission) can appear for the final examination and will not require any different learning strategies.

The performance is observed over a series of four tests (TEST1, TEST2, TEST3 and TEST4). The presence or absence in the contact sessions will be attributed as 1 (present) or 0 (absent).

The dataset that was used for evaluating student performance included 10% labeled data and 90% unlabeled data. The proportion of correctly predicted instances is given by ACCURACY, as follows.

ACCURACY= ((TP + TN)/n)*100%

Here, TP: the number of students that pass and are classified as pass

TN: the number of students that fail and are classified as fail

n: the number of instances (students)

For detailed information, refer to (Bellatreche & Manolopoulos, 2015).
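For completeness, the accuracy computation on hypothetical counts (not taken from the cited evaluation) looks as follows:

```python
# ACCURACY = ((TP + TN) / n) * 100% computed on hypothetical counts.
TP = 140   # students that pass and are classified as pass (hypothetical count)
TN = 160   # students that fail and are classified as fail (hypothetical count)
n = 344    # number of instances (students), matching the dataset size above

accuracy = (TP + TN) / n * 100
print(f"ACCURACY = {accuracy:.1f}%")
```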

The results indicate that good predictive accuracy can be achieved using the Tri-Training algorithm in comparison to a well-known supervised learning algorithm such as a decision tree.

Identifying low performance students as quickly as possible will allow the academic staff to develop strategies as per students’ personal needs.

Conclusion

Some of the semi-supervised learning approaches applied to various applications such as social data, cyber security, clinical data and educational data were analyzed and studied. By understanding the drawbacks associated with both of the traditional methods, that is, the supervised and unsupervised learning methods, a learning method that provides the advantages of both and is more accurate was used.

References

  1. Al-Jarrah, O. Y., Al-Hammdi, Y., Yoo, P. D., Muhaidat, S., & Al-Qutayri, M. (2018). Semi-supervised multi-layered clustering model for intrusion detection. Digital Communications and Networks, 4(4), 277–286. https://doi.org/10.1016/j.dcan.2017.09.009
  2. Bellatreche, L., & Manolopoulos, Y. (2015). Model and data engineering: 5th international conference, MEDI 2015 Rhodes, Greece, september 26–28, 2015 proceedings. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9344, 259–270. https://doi.org/10.1007/978-3-319-23781-7
  3. Estiri, H., & Murphy, S. N. (2019). Semi-supervised encoding for outlier detection in clinical observation data. Computer Methods and Programs in Biomedicine, (xxxx), 1–16. https://doi.org/10.1016/j.cmpb.2019.01.002
  4. Garla, V., Taylor, C., & Brandt, C. (2013). Semi-supervised clinical text classification with Laplacian SVMs: An application to cancer case management. Journal of Biomedical Informatics, 46(5), 869–875. https://doi.org/10.1016/j.jbi.2013.06.014
  5. Hussain, A., & Cambria, E. (2018). Semi-supervised learning for big social data analysis. Neurocomputing, 275, 1662–1673. https://doi.org/10.1016/j.neucom.2017.10.010
  6. Pise, N. N., & Kulkarni, P. (2009). A Survey of Semi-Supervised Learning Methods, (124), 30–34. https://doi.org/10.1109/cis.2008.204
  7. Silva, N. F. F. Da, Coletta, L. F. S., & Hruschka, E. R. (2016). A Survey and Comparative Study of Tweet Sentiment Analysis via Semi-Supervised Learning. ACM Computing Surveys, 49(1), 1–26. https://doi.org/10.1145/2932708
  8. Tang, X., Guo, F., Shen, J., & Du, T. (2018). Facial landmark detection by semi-supervised deep learning. Neurocomputing, 297, 22–32. https://doi.org/10.1016/j.neucom.2018.01.080