Evaluating Machine Learning Models for Early Breast Cancer Detection

Written by Abigail Hodder (Reporter)

A research group at King Khalid University (Saudi Arabia) has identified a machine learning model that could help clinicians make quicker and more effective decisions about breast cancer patients. The researchers compared the ability of five core models to accurately identify early signs of breast cancer from real images. The group identified one candidate that outperformed the rest, showing potential to streamline clinics and improve breast cancer patient outcomes.

AI In Cancer: Exciting But Still Room for Improvement  

AI has the power to transform the landscape of healthcare and has advanced in leaps and bounds over the past decade.

Indeed, the accelerated rate of research is promising, but scientists still face several problems when selecting the optimal type of model to use.  

For example, while deep learning models can help scientists to identify complicated patterns in biological data, they require a large amount of data that might be unavailable in some areas.  

Machine learning (ML) models, on the other hand, typically require less data; they are easier to build, potentially making them more accessible to clinics. 

Finding The Most Suitable Candidate  

With this in mind, researchers at King Khalid University set out to test how ML could be used to detect breast cancer.  

Badar Almarri and his group investigated five ML tools, each built on different principles:

1. K-Nearest Neighbors

Rather than learning a pattern in advance, this ‘lazy learning’ algorithm simply stores the training dataset and only consults it when making predictions.

The stored data then serves as a point of comparison to make these predictions. When prompted, the AI calculates the distance (for example, Euclidean distance) between new datapoints and the datapoints in the original dataset. It then identifies the ‘k’ closest ‘neighbors,’ where ‘k’ is a hyperparameter that must be manually set.

These models work well with small datasets and can adapt quickly to new data, but they are slow and can be expensive when working with large datasets, since the machine must compare each new datapoint to all stored data.

Additionally, the model’s performance can be highly affected by the value of ‘k.’
If ‘k’ is small (e.g., 1), then the AI will compare only the single closest neighbor to the new datapoint. If that stored datapoint is anomalous, this could impede the accuracy of its predictions.
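
As a minimal sketch of this idea (not the authors’ exact pipeline), a k-nearest neighbors classifier can be fitted to the breast cancer features bundled with scikit-learn. The dataset and settings below are assumptions for illustration; the n_neighbors parameter plays the role of ‘k’ described above.

```python
# Minimal sketch: k-nearest neighbors on the scikit-learn breast cancer
# features (assumed here for illustration; not the study's exact data or setup).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so Euclidean distances are not dominated by large-valued attributes.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 'k' (n_neighbors) is the hyperparameter discussed above; a small k can be
# swayed by anomalous neighbors, while a larger k smooths the decision.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))
```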

2. Decision Trees

These break down datasets into smaller subsets, built from nodes that take one of three forms:

  1. Roots: The initial question that the machine considers, such as ‘are these breast tissue cells malignant?’ 
  2. Branches: Further questions that help the AI reach a conclusion. In this case, the AI would ask about features of cell nuclei, like perimeter, radius, and texture, that indicate malignancy. 
  3. Leaves: The final decision of the model, i.e., ‘the cell is not cancerous.’ 

Thanks to this nodal system, decision trees capture a lot of information, building rich representations of entire datasets. Furthermore, their structure resembles flowcharts, making them easy to understand and interpret. 

On the other hand, because a tree can keep splitting until its nodes capture the quirks of individual datapoints, these models are prone to overfitting.

This high specificity makes overfitting a particular problem with large datasets; however, it can be mitigated by ‘pruning,’ where redundant branches are removed from the tree.
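
A rough sketch of a decision tree with simple pruning controls is below; the dataset and hyperparameter values (max_depth, ccp_alpha) are illustrative assumptions, not the configuration reported in the study.

```python
# Sketch: a decision tree with simple pruning controls (illustrative values only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# max_depth and ccp_alpha both act as pruning: they stop the tree from
# memorizing every training point, which curbs overfitting.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=42)
tree.fit(X_train, y_train)

print("Decision tree accuracy:", tree.score(X_test, y_test))
# The fitted tree reads like a flowchart of root, branches, and leaves.
print(export_text(tree, feature_names=list(data.feature_names)))
```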

3. Random Forest

Built from many decision trees to create an ‘ensemble,’ this type of model uses bootstrapped data for each tree, i.e., sampling with replacement.

At each node, the tree then considers only a small subset of attributes. In this instance, each tree might be trained on individual features of nuclei (size, texture, smoothness). Predictions based on these specific features combine to produce a final decision.

Because random forests average out the decisions from many different decision trees, these models are less prone to overfitting, but they can be costly and time-consuming to train and, because of their complexity, tricky to interpret.
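
The ensemble idea can be sketched in a few lines: many bootstrapped trees, each considering only a random subset of features at each split, vote on the final class. The hyperparameters below are illustrative assumptions rather than the study’s settings.

```python
# Sketch: a random forest as an ensemble of bootstrapped decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees; max_features controls the random subset of
# attributes each split may consider; bootstrap=True resamples with replacement.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", bootstrap=True, random_state=42
)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))
```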

4. Gaussian Naïve Bayes

These models calculate the probability of outputs by assuming that continuous data follows a ‘normal’ (Gaussian) distribution and that all features in the dataset are independent (‘naive’), i.e., the attributes have no influence over each other.

N.B. Naive Bayes models can also handle categorical (non-continuous) data with other distributions.  

In the context of this study, Naive Bayes might calculate the likelihood that a patient’s cells are malignant, given that some of their features meet the criteria for malignancy.

This type of ML is efficient, easy to understand, and works well with small datasets. However, the independence assumption does not always hold true, especially in biological scenarios. For instance, in this study, a cell’s radius is directly proportional to its perimeter, and therefore these features are not independent.  

Despite this, Naive Bayes models can perform well even if the independence assumption is not reflective of real life.   
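
A minimal Gaussian Naive Bayes sketch, again using the scikit-learn breast cancer features as an assumed stand-in for the study’s data, shows how the model turns feature values into class probabilities.

```python
# Sketch: Gaussian Naive Bayes, which models each feature with a normal
# distribution and assumes features are independent given the class.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

nb = GaussianNB()
nb.fit(X_train, y_train)

# predict_proba returns class probabilities, e.g., how likely a sample's
# nuclei features are under the 'malignant' versus 'benign' class.
print("Naive Bayes accuracy:", nb.score(X_test, y_test))
print("Class probabilities for first test sample:", nb.predict_proba(X_test[:1]))
```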

5. Support Vector Classifiers (SVC)

These aim to find the best possible boundary (hyperplane) between data classes. The hyperplane is positioned to leave the largest possible margin between the closest data points from each class (the support vectors).

In this study, support vectors might refer to cells with features that sit on the border between malignancy and benignity, i.e., it is not immediately clear whether they are cancerous or non-cancerous.

SVC performs well with high-dimensional data and when there is a clear line of separation between classes. However, it can be challenging to configure, as the chosen hyperparameters, namely the kernel and the regularization parameter (C), can significantly affect how effectively the data is separated into discrete classes.

While SVCs are deterministic, in that a particular set of conditions will always lead to the same output, they can be complex, although not as “black-box” as neural networks.  
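
A short SVC sketch illustrates the kernel and regularization parameter (C) discussed above; the values chosen here are assumptions for illustration, not tuned settings from the paper.

```python
# Sketch: a support vector classifier; kernel and C are the hyperparameters
# mentioned above (values here are illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling matters for SVCs; C trades margin width against misclassification,
# and the kernel determines the shape of the separating boundary.
svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svc.fit(X_train, y_train)
print("SVC accuracy:", svc.score(X_test, y_test))
# The borderline samples that define the margin are the support vectors.
print("Support vectors per class:", svc.named_steps["svc"].n_support_)
```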

Results

Each of these models was trained on images of cancerous and non-cancerous cells, learning from the physical characteristics of the nuclei in each class.

Once training was complete, the group evaluated the ability of these five models to correctly identify malignant or benign cells.

Random forest outperformed the rest, with an accuracy of 92.55%. 
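
For readers who want to reproduce the general workflow, below is a sketch comparing the five model families on a single held-out split. It will not reproduce the reported 92.55%, since the dataset, preprocessing, and tuning here are assumed for illustration.

```python
# Sketch: comparing the five model families on one held-out split
# (illustration only; data and settings differ from the study).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Classifier": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}")
```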

Implications For the Future 

Despite these results, the group emphasizes that the future of AI-guided medicine should stretch beyond black-and-white diagnoses. They stress the need for models that can identify key factors (like certain genetic traits) that influence a disease’s development, though such models require careful optimization and powerful computers.

This study highlights the value of more traditional machine learning algorithms over deep learning models, particularly in settings where efficiently integrating AI into clinics could help clinicians make quick treatment decisions and make a significant impact on patients’ lives.