How to build a basic breast cancer model — machine learning
Breast cancer is a huge killer among women worldwide. An estimated 500,000 women died from it in 2018 alone. Take about nine and a half football fields and fill them end to end with women. That's about how many people die from breast cancer in a single year. That's insane!
But guess what. There’s more.
About 90,000 of those 508,000 breast cancer patients are misdiagnosed. This is a huge problem, because women are dying every day from something that could be prevented.
I felt that the current methods for diagnosing breast cancer are not as efficient as they should be. I wanted to change that, so I looked for a way to improve the process using machine learning (ML).
ML is exactly what the name suggests: a machine's ability to learn, using classification and data-mining methods. These methods are an effective way to classify data, especially in the medical field.
To learn more about training ML models, check out my other article.
Some main recommended screening guidelines:
Mammography: The most important screening test for breast cancer is a mammogram. A mammogram is an X-ray of the breast. It can detect breast cancer up to two years before you or your doctor would ever notice it.
Women age 40–45 or older who have an average risk of breast cancer should get a mammogram once a year.
Women at high risk should have yearly mammograms along with an MRI starting at age 30.
Risk Factors for Breast Cancer:
These are some of the main risk factors for breast cancer. Although most cases of breast cancer cannot be linked to one specific cause, these factors are still important to know.
Age. As you age, your risk of getting breast cancer increases. Almost 80% of breast cancer patients are women over the age of 50.
Having a personal history of breast cancer. A woman who has had breast cancer in one breast has a higher risk of developing cancer in her other breast.
Having a family history of breast cancer. If a patient's mother, sister, daughter, or other blood-related female relatives have had breast cancer, especially at a young age (before 40), her risk of getting cancer is higher.
Genetic factors. Some women carry certain genetic mutations, for example changes to the BRCA1 and BRCA2 genes, that increase the risk of developing breast cancer. Other changes in a patient's genes may raise the risk as well.
Menstrual and reproductive history. The older a woman is when she has her first child, the greater her risk of breast cancer.
Also at higher risk are:
- Women who menstruate for the first time before age 12
- Women who go through menopause after the age of 55
- Women who haven’t had children
Data Preparation
If you read my other article (linked above), you know that the first step is always organizing and preparing the data.
We will use the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository.
Attribute information:
- ID number
- Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for the nucleus of each cell:
- radius (mean of distances from the center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter² / area − 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” − 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
Goals:
The goal is to observe which features are most useful for predicting whether a tumor is malignant or benign. This will also help us later when we choose a model and select its hyperparameters.
Data Exploration
I used Spyder to work on this dataset. First, import the necessary libraries and load our dataset into Spyder.
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing our cancer dataset
#(column 0 is the ID, column 1 is the diagnosis, columns 2-31 are the 30 features)
dataset = pd.read_csv('cancer.csv')
X = dataset.iloc[:, 2:32].values  #the 30 real-valued features
Y = dataset.iloc[:, 1].values     #the diagnosis column (M or B)
Then, we can examine the data using the pandas head() method
dataset.head()
Next, we can find the dimensions of the data set using the pandas 'shape' attribute:
print("Cancer data set dimensions : {}".format(dataset.shape))Cancer data set dimensions : (569, 32)
The data set has 569 rows and 32 columns. 'Diagnosis' is the column we will predict; it is M for malignant or B for benign (after encoding, 1 means malignant and 0 means benign). Of the 569 patients, 357 are labeled "B" (benign) and 212 are labeled "M" (malignant).
Being able to visualize the data is an important part of data science. It helps us understand what the data is saying, and it makes it easier to explain to other people.
Python has several useful visualization libraries, such as Matplotlib and Seaborn. I will be using pandas' built-in visualization, which sits on top of matplotlib, to look at the distribution of the features.
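For example, a quick set of histograms of the ten "mean" features shows how each one is distributed. This is just a small sketch, and it assumes the CSV uses column names like radius_mean and texture_mean, which may differ in your copy of the data:
import matplotlib.pyplot as plt

#Histograms of the ten "mean" features
#(assumes column names like radius_mean, texture_mean, ...)
mean_columns = [col for col in dataset.columns if col.endswith('_mean')]
dataset[mean_columns].hist(bins = 20, figsize = (12, 10))
plt.tight_layout()
plt.show()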
Missing data points:
If there are any missing data points in the data set, you can find them with these pandas methods:
dataset.isnull().sum()
dataset.isna().sum()
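Some exports of this data set include an extra, completely empty column (often named "Unnamed: 32"). That is an assumption about your particular CSV, but if the counts above show such a column, a simple fix is to drop it and redo the X and Y split:
#Drop any column that is entirely empty (e.g. a stray "Unnamed: 32" column
#that appears in some exports of this dataset), then redo the X and Y split
dataset = dataset.dropna(axis = 1, how = 'all')
X = dataset.iloc[:, 2:32].values
Y = dataset.iloc[:, 1].values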
Categorical Data
Categorical data are variables that have label values instead of numeric values. The number of possible values is usually limited to a fixed set. For example, users tend to be described by country, gender, age group etc.
I will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python. It converts categorical data, or text data, into numbers that predictive models can understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
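As a quick sanity check, you can print the classes the LabelEncoder found. It orders labels alphabetically, so B (benign) becomes 0 and M (malignant) becomes 1, matching the encoding described earlier:
#LabelEncoder orders labels alphabetically: 'B' (benign) -> 0, 'M' (malignant) -> 1
print(labelencoder_Y.classes_)  #['B' 'M']
print(Y[:5])                    #the first few encoded labels, now 0s and 1s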
Splitting the dataset
The data we use is usually split into two datasets: training data and testing data. The training data set has a known output, and the model learns on this data so that it can generalize to other data later on. The test data set is used to test the model's predictions.
To do this, we will use the train_test_split method from the scikit-learn library.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
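A quick look at the shapes confirms the split: with test_size = 0.25, roughly 75% of the 569 rows go to training and 25% to testing:
#Verify the split: about 75% of the rows for training, 25% for testing
print(X_train.shape, X_test.shape)  #roughly (426, 30) and (143, 30)
print(Y_train.shape, Y_test.shape)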
Feature Scaling
Most of the time, your data set will have features that vary widely in magnitude, units, and range. We need to bring all of the features to the same scale. To do this we use something called scaling: transforming your data to fit into a certain range, like 0–1 or 0–100.
I’m going to use StandardScaler method from SciKit-Learn library.
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
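StandardScaler standardizes each feature as z = (x - mean) / std, so after scaling the training features should have a mean close to 0 and a standard deviation close to 1. A quick check:
#After standardization each training feature should have mean ~ 0 and std ~ 1
print(X_train.mean(axis = 0).round(2))  #values close to 0
print(X_train.std(axis = 0).round(2))   #values close to 1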
Model Selection
In my opinion, model selection is the best part of this process.
Most data scientists tend to use a bunch of different machine learning algorithms, but at a high level they can all be classified into two groups:
- Supervised learning
- Unsupervised learning
Supervised learning is a setting in which both the input data and the desired output data are provided. The data are labeled so the model has a basis for classifying future data. Supervised learning problems can be grouped into regression problems and classification problems.
A regression problem is when the output variable is a continuous value, like a salary or a weight.
A classification problem is when the output variable is a category. For example, filtering emails: the two categories would be spam and not spam. Or, in our case, classifying whether a tumor is malignant or benign.
Unsupervised learning uses information that is neither classified nor labeled and lets the algorithm act on that information by itself.
In this dataset, the outcome variable is either M (malignant) or B (benign), so we will use a classification algorithm from supervised learning.
There are many different classification models in machine learning:
- Logistic Regression
- Nearest Neighbor
- Support Vector Machines
- Kernel SVM
- Naïve Bayes
- Decision Tree Algorithm
- Random Forest Classification
So, let’s start by applying all of the algorithms:
We will use the sklearn library to import all of these classification algorithms.
For example, we use the LogisticRegression method from the linear_model module to apply the Logistic Regression algorithm:
#Using Logistic Regression Algorithm on the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)

#Using the KNeighborsClassifier method of the neighbors class to use the Nearest Neighbor algorithm
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)

#Using the SVC method of the svm class to use the Support Vector Machine algorithm
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, Y_train)

#Using the SVC method of the svm class to use the Kernel SVM algorithm
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)

#Using the GaussianNB method of the naive_bayes class to use the Naive Bayes algorithm
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)

#Using the DecisionTreeClassifier method of the tree class to use the Decision Tree algorithm
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)

#Using the RandomForestClassifier method of the ensemble class to use the Random Forest Classification algorithm
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
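Note that the snippet above reuses the same classifier variable, so each new model overwrites the previous one. If you want to keep all seven models around, one option (just a sketch, not part of the original code) is to store them in a dictionary and fit and score them in a loop; the next steps walk through the same evaluation one model at a time:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

#Keep all seven models in one place so none of them gets overwritten
models = {
    'Logistic Regression': LogisticRegression(random_state = 0),
    'Nearest Neighbor': KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2),
    'Support Vector Machines': SVC(kernel = 'linear', random_state = 0),
    'Kernel SVM': SVC(kernel = 'rbf', random_state = 0),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
    'Random Forest': RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0),
}

#Fit each model on the training set and print its accuracy on the test set
for name, model in models.items():
    model.fit(X_train, Y_train)
    accuracy = accuracy_score(Y_test, model.predict(X_test))
    print("{}: {:.3f}".format(name, accuracy))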
Now we can predict the test set results and check the accuracy with each of our models:
Y_pred = classifier.predict(X_test)
To check the accuracy, we need to import the confusion_matrix method from the metrics class. A confusion matrix is a table that shows how many predictions were correct and how many were misclassified for each class.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
We will use classification accuracy to compare our models. Classification accuracy is the ratio of the number of correct predictions to the total number of input samples.
To count the correct predictions, we look at the confusion matrix, sum the values on its diagonal (the correct predictions), and divide by the total number of predictions.
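In code, that calculation looks like the sketch below. np.trace sums the diagonal of the matrix, and sklearn's accuracy_score gives the same number directly:
import numpy as np
from sklearn.metrics import accuracy_score

#Correct predictions sit on the diagonal of the confusion matrix
accuracy = np.trace(cm) / np.sum(cm)
print("Accuracy from the confusion matrix: {:.3f}".format(accuracy))

#Same number using sklearn's built-in helper
print("Accuracy from accuracy_score: {:.3f}".format(accuracy_score(Y_test, Y_pred)))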
After applying the different classification models, we got the accuracies with different models below:
1. Logistic Regression — 95.8%
2. Nearest Neighbor — 95.1%
3. Support Vector Machines — 97.2%
4. Kernel SVM — 96.5%
5. Naive Bayes — 91.6%
6. Decision Tree Algorithm — 95.8%
7. Random Forest Classification — 98.6%
In the end, you can see that Random Forest Classification gave the best results for our data set. That won't always be the case, though; not every model works well with every data set. Before choosing a model, you always need to analyze the data and then pick the model that fits it.
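Since one of our goals was to see which features are most useful, the random forest can also tell us that through its feature_importances_ attribute. This is only a sketch; it assumes the column layout described earlier, where the 30 feature columns sit right after the ID and diagnosis columns:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

#Refit the random forest and rank the 30 features by importance
#(feature_names assumes the 30 feature columns follow the ID and diagnosis columns)
feature_names = dataset.columns[2:32]
forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
forest.fit(X_train, Y_train)

importances = pd.Series(forest.feature_importances_, index = feature_names)
print(importances.sort_values(ascending = False).head(10))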
As you may have seen, it's not exactly perfect yet, but people are working on it, and hopefully models like this will be available in the future for doctors to use everywhere.
Now, remember those nine and a half football fields of women? Well, thanks to models like these, in a few years that number should shrink a lot, and we won't be losing so many people to misdiagnosis.
Feel free to email me any questions you have.
Don’t forget to leave a clap on this article and let me know your thoughts in the comments below. Feel free to check me out on Linkedin as well!