Advancing Cancer Detection through LCS Algorithm and SVM Analysis

Abstract

Cancer is one of the most feared diseases in the world. Its incidence has risen alarmingly, and breast cancer occurs in one out of eight women. Predicting cancers plays a vital role not only in understanding the human genome but also in finding effective prevention and treatment of tumors. This paper proposes a novel technique that predicts the disease from mutations. We compare the patient's protein with the protein of the disease-related gene; if the two proteins differ, we conclude that a malignant mutation is present.

We found that the LCS algorithm is a simple and efficient algorithm for aligning a pair of sequences. Furthermore, we studied machine learning approaches in detail to determine the best approach for training and testing on the dataset. We chose Support Vector Machines (SVM) because they gave the best results, with about 98% accuracy. Finally, we created a user-friendly website that accepts an input sequence and reports whether the given sequence is malignant or benign.

Introduction

Breast cancer is a disorder in which malignant (cancer) cells develop in breast tissues.

India continues to have a poor breast cancer survival rate, with just 66.1 per cent of women diagnosed with the disease surviving between 2010 and 2014, according to a report in The Lancet. The key explanation for the low breast cancer survival levels in India is a lack of awareness about cancer and its care. Cases tend to reach doctors in the third and fourth stages, when treatment is complicated.

Routine breast cancer screening among Indian women is very limited, whereas cancer screening is a regular part of healthcare in Western countries. Breast cancer is the most common cancer among Indian women, with a prevalence as high as 25.8 per 100,000 women and a mortality rate of 12.7 per 100,000 women, according to the Union health ministry.

According to estimates, at least 1,797,900 women in India could have breast cancer by 2020. Women with certain risk factors are more likely to develop breast cancer than others. A risk factor is anything that may increase the likelihood of developing a condition. Some risk factors can be avoided (such as drinking alcohol), but most cannot (such as having a family history of breast cancer). Having a risk factor does not mean that a woman will develop breast cancer.

A medical diagnosis is a pattern recognition/classification problem, where the doctor has to come up with an output (disease) based on an input (symptoms). There are many classification, machine learning and pattern recognition methods that can be applied to develop tools that solve this pattern recognition problem. By and large, developing classification models using such methods is a two-step process. Firstly, a well-classified training set of data is used to train the model and derive parameters that optimize the prediction accuracy of that training set. After training, the model and its parameters are used on another well classified validation set of data to test prediction accuracy on data that were not used for training. This validation ensures that the classifier did not memorize the data from the training set. Instead, it learnt the characteristics from that data set that enable it to correctly characterize new data entries. In general, the larger the training and evaluation sets are, the better the future predictions of the method will be.
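The two-step process described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the 30% hold-out fraction, the helper names `train_validation_split` and `accuracy`, and the one-parameter threshold "model" are all made up for the example and are not the system described in this paper.

```python
import random


def train_validation_split(samples, labels, validation_fraction=0.2, seed=0):
    """Shuffle the labelled data and hold out a fraction of it for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(indices) * validation_fraction))
    validation = [(samples[i], labels[i]) for i in indices[:n_val]]
    train = [(samples[i], labels[i]) for i in indices[n_val:]]
    return train, validation


def accuracy(predict, dataset):
    """Fraction of (sample, label) pairs for which predict(sample) == label."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)


# Toy data (made up for illustration): one numeric feature per sample,
# labelled 1 when the feature is greater than 5 and 0 otherwise.
samples = [float(v) for v in range(11)]
labels = [1 if v > 5 else 0 for v in samples]
train, validation = train_validation_split(samples, labels, validation_fraction=0.3)

# Step 1: "train" a one-parameter model using the training set only.
# The threshold sits midway between the two class means.
mean_0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
mean_1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
threshold = (mean_0 + mean_1) / 2


def predict(x):
    return 1 if x > threshold else 0


# Step 2: measure accuracy on data the model has not seen during training.
print("training accuracy:  ", accuracy(predict, train))
print("validation accuracy:", accuracy(predict, validation))
```

A large gap between the two printed numbers would suggest that the model memorized the training set instead of learning characteristics that carry over to new entries.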

A biological sequence can be represented as a sequence of symbols. When biologists find a new sequence, they want to know which other sequences it is closely related to. Sequence comparison has been used successfully to establish the connection between cancer-causing genes and a gene involved in normal development and growth. The paper (1) proposes a new model that seeks to avoid running local alignment on target DNA sequences that are unlikely to match. Using a linear-time multi-pattern exact string matching algorithm, a collection of random patterns (subsequences) from the query sequence is scanned against all target sequences in the database. Target DNA sequences with a significantly low exact-matching score are excluded from the dynamic-programming-based alignment.

The algorithm proposed by (3) uses a reduced amino acid alphabet to convert protein sequences into integer sequences and uses n-grams to reduce the length of the sequence. The Smith-Waterman algorithm is then used to measure the similarity between two sequences. The article (2) presents three pattern matching algorithms, namely FLPM, PAPM and LFPM, that are designed to accelerate searches in large DNA sequences. The proposed algorithms improve performance by using word-level processing and by searching for the least frequent word of the pattern in the sequence.
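For readers unfamiliar with Smith-Waterman, the following is a minimal sketch of how a local alignment score can be computed with dynamic programming. The scoring values (match = 2, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for illustration; they are not taken from (3), and the reduced-alphabet and n-gram steps of that work are not reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0: an alignment may start anywhere
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment never drops below 0; it restarts instead.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared core "ACG" scores 3 * 2 = 6
```

Only two rows of the table are kept because the sketch returns the score alone; recovering the aligned region itself would require the full table and a traceback.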

In this paper, we study the available pattern matching algorithms in detail and determine which is best suited for an efficient breast cancer prediction system. We found that the LCS algorithm is a simple and efficient algorithm for aligning a pair of sequences. We chose Support Vector Machines (SVM) for training and testing because they gave the best results, with about 98% accuracy. Finally, we created a user-friendly website that accepts an input sequence and reports whether the given sequence is malignant or benign.

Literature Survey

According to Harshitha [4], supervised learning methods are used to obtain the attributes that define cancer and to categorize cancer images from standard mammogram images. The supervised system is initially trained by extracting 13 features from a database of 30 images each. The features derived from the image under test are compared with the features extracted from the database images to detect and predict cancer tumors in the image. The random forest algorithm proposed by Bin Dai [5] is used to analyze clinical case diagnosis of breast cancer. The random forest algorithm can combine the characteristics of different feature values, and the combined results of multiple decision trees can be used to improve prediction accuracy. Because it is an ensemble learning method built on random trees, the results of several weak classifiers can be combined to produce accurate classification results. In that paper, a random forest algorithm is used to analyze breast cancer diagnosis cases and obtains high prediction accuracy, which has practical significance for assisting clinical diagnosis.

Panuwat Mekha [6] presents a comparison of classification algorithms for breast cancer based on tumor cells. The work focuses on using deep learning algorithms to classify types of breast cancer with several activation functions (Tanh, Rectifier, Maxout and ExpRectifier) and compares them with various machine learning methods, such as Naïve Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), Vote (DT+NB+SVM), Random Forest (RF) and AdaBoost. The experimental data were downloaded from the Breast Cancer Wisconsin dataset and analyzed with the machine learning tool RapidMiner using ten-fold cross-validation. The highest accuracy, 96.99%, was obtained with deep learning using the ExpRectifier activation function.

The paper by C L Nithya [7] implements the Find-S and Candidate Elimination algorithms. Both algorithms are examined and compared with respect to accuracy. The paper concludes by taking training examples, running the Find-S and Candidate Elimination methods on them and validating the outcome. After classification, the system predicts whether breast cancer is present or not. Ensemble learning models are developed using a combination of different machine learning models. The machine learning models mainly used [8][9] are Support Vector Machine, Logistic Regression, Decision Tree, K-Nearest Neighbor, Naïve Bayes, etc. The ensemble produces a minimum average accuracy of 98%. These results show that combining different models [10] can give a better result than relying on a single model.

For example, Naveen [11] reports that Decision Tree and KNN models give 100% accuracy. The decision tree model gives 100% accuracy when the train-test split is 90:10 and 300 bagged trees are used. KNN gives a maximum accuracy of 100% for k = 1 to 7 in seven runs, with 90% of the data used for training and 10% for testing; here k is the number of nearest neighbors. He also evaluated the predictions using accuracy, the confusion matrix and the classification report. The aim is to build a generally accurate and efficient machine learning model so that, based on the prediction, patients can begin treatment at an early stage. The paper by Parag Singhal [12] aims to develop a tool for early prediction of breast cancer with the highest possible accuracy and a low error rate. This was done by applying machine learning algorithms with the help of an Artificial Neural Network (ANN) on the Wisconsin Breast Cancer (Diagnostic) Dataset. Test results show that the ANN gives accuracy up to 98% with a low error rate. The experiment was conducted in the Dev-C++ environment and implemented in C.

Proposed Method

Our approach combines bioinformatics techniques and machine learning models to predict cancer presence and type. We employ the LCS algorithm for sequence alignment, comparing patient sequences with known cancer gene sequences to identify mutations. Support Vector Machines (SVM) are chosen for their high classification accuracy, providing a robust model that predicts malignancy with about 98% accuracy.

Bioinformatics Techniques:

  • FASTA for database searches.
  • CLUSTALW for diagnosing gene mutations related to cancer risk.
  • LCS Algorithm for estimating sequence similarity, chosen for its simplicity and efficiency.

Estimating similarity between sequences, be they DNA, RNA, or protein sequences, is at the core of many problems in molecular biology. An important approach to this problem is computing the Longest Common Subsequence (LCS) between two strings S1 and S2, i.e. the longest ordered list of symbols common to S1 and S2. For instance, when S1=abba and S2=abab, the LCSs are abb and aba. The LCS has been used in various areas, such as text analysis, pattern recognition, file comparison, efficient tree matching and so on. Biological applications of the LCS and similarity measurement are varied: sequence alignment in comparative genomics, phylogenetic construction and analysis, fast search in huge biological sequences, compression and efficient storage of the rapidly growing genomic data sets, and re-sequencing a set of strings given a target string, an important step in efficient genome assembly.
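The LCS of two strings is usually computed with dynamic programming. The sketch below is a standard textbook formulation, not necessarily the exact implementation used in this work; the function names `lcs_length` and `all_lcs` are chosen for illustration. It reproduces the abba/abab example from the paragraph above.

```python
from functools import lru_cache


def lcs_length(s1, s2):
    """Length of the longest common subsequence, in O(len(s1) * len(s2)) time."""
    prev = [0] * (len(s2) + 1)  # LCS lengths for the previous prefix of s1
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)  # extend the LCS by the shared symbol
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop a symbol from s1 or s2
        prev = curr
    return prev[-1]


def all_lcs(s1, s2):
    """Every distinct LCS string. Can be exponential; meant for short inputs only."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == 0 or j == 0:
            return frozenset({""})
        if s1[i - 1] == s2[j - 1]:
            return frozenset(x + s1[i - 1] for x in solve(i - 1, j - 1))
        candidates = solve(i - 1, j) | solve(i, j - 1)
        longest = max(len(x) for x in candidates)
        return frozenset(x for x in candidates if len(x) == longest)

    return set(solve(len(s1), len(s2)))


print(lcs_length("abba", "abab"))       # 3
print(sorted(all_lcs("abba", "abab")))  # ['aba', 'abb']
```

For long DNA or protein sequences only `lcs_length` is practical, since it keeps two rows of the table in memory and never enumerates the subsequences themselves.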

Machine Learning Model

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both regression and classification problems, although it is mainly used for classification. In this algorithm, each data object is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding a hyperplane that separates the two classes. Support vectors are simply the coordinates of individual observations, and the Support Vector Machine is the boundary (a hyperplane, or a line in two dimensions) that best separates the two classes.

Initially, the SVM maps the input vectors to a higher-dimensional feature space and defines the hyperplane that separates the data points into two classes. The margin between the decision hyperplane and the instances nearest to the boundary is maximized. The resulting classifier generalizes well and can therefore be used for the accurate classification of new samples.
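To make the margin idea concrete, here is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. It is an assumption-laden illustration, not the model used in this paper: it works directly in the input space (no kernel mapping to a higher-dimensional space), the labels are +1 and -1, the six two-dimensional points are made up, and the learning rate, regularization strength and epoch count are arbitrary choices.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 or -1. A sample contributes to the update only when
    it lies inside the margin, i.e. when y * (w . x + b) < 1.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w, which favours a wider margin.
            w = [wi - learning_rate * reg * wi for wi in w]
            if margin < 1:
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Side of the hyperplane w . x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (made up for illustration).
points = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
print("weights:", w, "bias:", b)
print("predictions:", [predict(w, b, x) for x in points])
```

The points that still sit on or inside the margin near the end of training are the ones that determine the boundary, which is the role the support vectors play in the description above.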

The main components are:

  1. USER
  2. TRAINING PHASE
  3. TESTING PHASE
  4. OUTPUT

USER: The user is a person who uses the developed system. The system is designed to give users all the support they require. This prediction system can be used by doctors, surgeons and cancer patients, and can also be distributed to hospitals.

TRAINING PHASE: In this phase, we create an appropriate model and train it on the data. The sequences are taken from the NCBI database. Different types of cancer sequences are given to the system, which is trained so that it can correctly identify whether a sequence given as input by the user is a cancer sequence or not. Furthermore, the system can be trained to determine whether the cancer is benign or malignant. The input sequence is a combination of the bases A, T, G and C (Adenine, Thymine, Guanine and Cytosine).

TESTING PHASE: In this phase, the developed system is tested with various input test sequences given by the user. This is the phase where we determine whether the developed model gives the desired output, and it also helps to determine the efficiency and accuracy of the system. As in training, the input sequence is a combination of the bases A, T, G and C.
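Because both phases expect sequences made up of A, T, G and C only, user input can be checked before it reaches the model. The helper below is a hypothetical sketch; the name `normalize_sequence` and the decision to ignore whitespace and letter case are assumptions for illustration and are not described in the original system.

```python
VALID_BASES = frozenset("ATGC")


def normalize_sequence(raw):
    """Upper-case the input, drop whitespace, and reject anything but A, T, G, C."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("sequence is empty")
    invalid = sorted(set(sequence) - VALID_BASES)
    if invalid:
        raise ValueError(f"invalid characters in sequence: {invalid}")
    return sequence


print(normalize_sequence("atg cat\ngca"))  # ATGCATGCA
```

Rejecting malformed input early keeps the classifier from producing a confident-looking label for text that is not a DNA sequence at all.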

OUTPUT: The classification model gives one of two outputs:

  • POSITIVE (1) : This means that the given input sequence is a cancer sequence.
  • NEGATIVE (0) : This means that the given input is a normal or non-cancer sequence.

Here, the model created in the training phase is fed to the SVM prediction step in the testing phase. We detect the similarities between two sequences and produce an output that indicates the percentage of similarity. Depending on the value obtained, we can say whether the cancer is malignant or benign.
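One way to turn the LCS into a percentage of similarity is sketched below. The normalization (LCS length divided by the length of the longer sequence), the 90% cut-off, and the names `similarity_percent` and `label_from_similarity` are assumptions made for illustration; the text above does not specify how the percentage is computed or which value separates the two outputs, and in the described system the final decision is made by the SVM.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence (two-row dynamic programming)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_percent(patient, reference):
    """LCS length as a percentage of the longer sequence's length."""
    longest = max(len(patient), len(reference))
    if longest == 0:
        return 100.0  # two empty sequences are treated as identical
    return 100.0 * lcs_length(patient, reference) / longest


def label_from_similarity(patient, reference, threshold=90.0):
    """Return 1 (POSITIVE) if the patient sequence is at least `threshold`
    percent similar to a known cancer reference sequence, else 0 (NEGATIVE).
    The threshold is a hypothetical value, not one reported in this paper."""
    return 1 if similarity_percent(patient, reference) >= threshold else 0


print(similarity_percent("abba", "abab"))  # 75.0
print(label_from_similarity("ATGCATGCAT", "ATGCATGCAT"))  # 1
```

In practice the similarity value could also be passed to the classifier as one feature among several, instead of being thresholded directly.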

Evaluation and Experimental Outcomes

Our findings indicate that the LCS algorithm outperforms traditional methods such as Smith-Waterman and Needleman-Wunsch in execution time and accuracy. The SVM in our model also predicts cancer efficiently, and the model is made available through a user-friendly website interface for sequence input and result output.

Conclusion

The proposed method offers a significant advancement in the early detection of cancer, particularly breast cancer, through the application of sequence alignment and machine learning techniques. Our approach not only facilitates accurate cancer prediction but also provides a convenient platform for users to access this predictive tool, potentially leading to earlier diagnosis and treatment.

Future Scope

Future directions include expanding this methodology to detect other chronic diseases, integrating the model into mobile applications for wider accessibility, and exploring the potential of immunotherapy in cancer treatment. Our research lays the groundwork for utilizing the body's immune system to combat cancer more effectively, marking a promising avenue for medical innovation.

Table 1: Algorithm Comparison and Performance

Algorithm          Average Execution Time   Accuracy
LCS                Low                      High
Smith-Waterman     Medium                   Medium
Needleman-Wunsch   Medium                   Medium
FASTA              Very Low                 Very High

This table compares the performance of various sequence alignment algorithms in qualitative terms, showing that the LCS algorithm is rated faster and more accurate than Smith-Waterman and Needleman-Wunsch for cancer prediction. Our research underscores the potential of combining bioinformatics techniques and machine learning for enhancing cancer detection, offering a pathway to reducing mortality through early diagnosis and targeted treatment strategies.

Updated: Feb 21, 2024
Cite this page

Advancing Cancer Detection through LCS Algorithm and SVM Analysis. (2024, Feb 21). Retrieved from https://studymoose.com/document/advancing-cancer-detection-through-lcs-algorithm-and-svm-analysis
