Cervical cancer affects hundreds of thousands of women worldwide each year.
With the introduction of the HPV vaccine this number has dropped drastically, but the disease remains prevalent
and often goes undetected, especially in developing countries.
In this project we aim to identify behavioral patterns that put women at a greater risk of contracting cervical cancer.
We will use a behavioral dataset containing information about age, sexual health, and the results of four diagnostic tests to predict the presence of cancerous cells in a patient's cervix. Feature
importance will be leveraged to construct risk profiles (low, medium, and high risk) and to validate our
model against the treatment schematics gynecologists use as a course of action. Generating
such risk profiles will help healthcare institutions and doctors identify at-risk individuals who should
receive immediate attention and a more tailored health plan.
The dataset we used is the “Cervical Cancer (Risk Factors) Data Set,” which can be found
in the UCI Machine Learning Repository.
The dataset contains 32 features and 858 instances, each corresponding
to a patient. There are 10 numerical features (age, number of sexual partners, etc.); the
rest are categorical features relating to clinical markers (STD type, diagnosis, etc.). Given the sensitive nature of
the data, it is not surprising that 11.73% of the values are missing.
The dataset has 4 target variables (Hinselmann, Schiller, Cytology, and Biopsy), all of which are
diagnostic tests for cervical cancer. These target variables do not record the underlying test
metrics; rather, they are Boolean variables, with 1 indicating a positive result for the
presence of cervical cancer. There are instances where the 4 target variables do not agree, which
makes correctly annotating a sample challenging. We plan to conduct an extensive literature review
to factor in the reliability of each of these tests. If we take a new target variable to be the
logical OR of these 4, the dataset is highly imbalanced.
We can plot the number of positive and negative samples for each of the four target variables
to show that the dataset is highly imbalanced:
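The counts behind such a plot can be computed as below. This is a minimal sketch on randomly generated stand-in labels, since the raw CSV is not reproduced here; the 7% positive rate is an illustrative assumption, not the dataset's true rate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
targets = ["Hinselmann", "Schiller", "Cytology", "Biopsy"]

# Stand-in for the real labels: four Boolean test results per patient
# (858 patients, ~7% positive rate -- an illustrative assumption).
df = pd.DataFrame({t: rng.binomial(1, 0.07, size=858) for t in targets})

# Combined label: positive if ANY of the four tests is positive (logical OR).
df["any_positive"] = df[targets].max(axis=1)

# Row 0 = negative counts, row 1 = positive counts, per target.
counts = df[targets + ["any_positive"]].apply(pd.Series.value_counts).fillna(0)
print(counts)
```

Even after taking the OR of the four tests, positives remain a small minority, which is what motivates the resampling techniques used later.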
This is the data structure:
Due to the nature of the dataset, which concerns cancer, and considering patients' privacy,
it is not surprising that we have to handle a great number of missing values.
We plotted the following bar chart to show the percentage of missing values for each feature:
We can see there is a huge number of missing values in STDs: Time since first diagnosis and STDs: Time since last diagnosis, so it makes sense to drop these features. Moreover, there is a clear pattern in which no information is given about contraceptives, IUDs, or any type of STD. This is probably because those patients were not comfortable disclosing the information.
From this we draw the following conclusions:
1. We decided to drop variables with more than 50% missing values: STDs: Time since first diagnosis and STDs: Time since last diagnosis.
2. Superfluous columns such as IUD, Hormonal Contraceptives, Smokes, and STDs were dropped, as their counterparts contain both the occurrence and its duration (e.g., IUD (years)).
3. Approximately 13% of the missing values followed the same pattern, with missing entries in all variables related to the patient's sexual health, including contraceptives; this is probably due to privacy concerns.
Missing values for numerical and categorical features were imputed differently.
1. Numerical features pertaining to a patient's sexual health, like age at first sexual encounter, were imputed using the mean calculated for specific age brackets.
2. Missing values for the different STDs were imputed by looking at each feature's plurality label: if the majority class accounted for more than 98% of the values, the missing values were filled with the plurality label; otherwise, a KNNImputer (k=30) was used.
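The two imputation rules can be sketched as follows. This uses a toy stand-in frame: the column names, age-bracket edges, and the condylomatosis feature are illustrative assumptions, while the 98% threshold and the KNNImputer with k=30 come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
n = 200
# Toy stand-in; "Age" and "First sexual intercourse" mirror the real columns.
df = pd.DataFrame({
    "Age": rng.integers(15, 60, size=n),
    "First sexual intercourse": rng.normal(17, 2, size=n).round(),
    "STDs:condylomatosis": rng.binomial(1, 0.05, size=n).astype(float),
})
df.loc[rng.choice(n, 20, replace=False), "First sexual intercourse"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "STDs:condylomatosis"] = np.nan

# 1. Numerical: fill with the mean of the patient's age bracket
#    (bracket edges are an assumption for illustration).
bracket = pd.cut(df["Age"], bins=[14, 24, 34, 44, 60])
df["First sexual intercourse"] = (
    df.groupby(bracket)["First sexual intercourse"]
      .transform(lambda s: s.fillna(s.mean()))
)

# 2. Categorical STDs: plurality label if the majority exceeds 98%,
#    otherwise KNNImputer with k=30.
col = "STDs:condylomatosis"
share = df[col].value_counts(normalize=True).iloc[0]
if share > 0.98:
    df[col] = df[col].fillna(df[col].mode().iloc[0])
else:
    imputer = KNNImputer(n_neighbors=30)
    df[[col, "Age"]] = imputer.fit_transform(df[[col, "Age"]])
    df[col] = df[col].round()  # snap averaged neighbors back to 0/1 labels

print(df.isna().sum().sum())  # no missing values remain
```

Rounding the KNN output is needed because the imputer averages the 30 nearest neighbors, producing fractional values for a Boolean feature.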
We created a correlation table to see how the different features correlate with our four target variables. As the table shows, the four target variables are highly correlated with each other. Besides, Dx:HPV and STDs: Number of diagnosis correlate strongly with the four target variables, which suggests that HPV and other STDs (sexually transmitted diseases) are important indicators of cervical cancer.
We can also tell from the correlation table that several groups of features are highly correlated (correlation > 0.6), e.g., Smokes (years) and Smokes (packs/year); STDs: Number of diagnosis and STDs (number); and Dx:CIN, Dx:Cancer, Dx:HPV, and Dx. Hence, we chose to drop Dx, Dx:Cancer, STDs (number), and Smokes (packs/year).
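The drop rule above can be sketched as follows, on synthetic stand-in data; the 0.6 threshold mirrors the rule described, while the upper-triangle scan is one common way to pick a single feature from each correlated pair, not necessarily the project's exact code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
years = rng.exponential(5, size=n)
df = pd.DataFrame({
    "Smokes (years)": years,
    # Nearly a linear function of the years column -> highly correlated.
    "Smokes (packs/year)": years * 0.8 + rng.normal(0, 0.5, size=n),
    "Age": rng.integers(15, 60, size=n),
})

# Drop one feature from every pair with |correlation| > 0.6.
corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.6).any()]
df = df.drop(columns=to_drop)
print(to_drop)  # ['Smokes (packs/year)'] on this toy data
```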
We plotted scatterplots to show trends among different numerical features.
We also visualized the relationships between the four target variables and two key features, Dx:HPV and STDs: Number of diagnosis, using violin plots.
To deal with the highly imbalanced dataset, we used the following two techniques:
1. Applied stratified splitting when dividing the dataset into train, validation, and test sets.
2. Processed the data using Synthetic Minority Oversampling Technique (SMOTE).
We plotted the confusion matrix to visualize how our model predicts on the test data. We also graphed the feature importances to check whether our models make sense intuitively.
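These two diagnostics can be reproduced in a few lines; this is a sketch on synthetic data, and a real analysis would use the project's processed features and tuned model instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Confusion matrix: rows = true class, columns = predicted class,
# so cm[1, 0] counts the costly false negatives.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)

# Feature importances, largest first -- a sanity check on the model.
order = np.argsort(clf.feature_importances_)[::-1]
print(order[:5], clf.feature_importances_[order[:5]].round(3))
```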
In this section we will focus on comparing models only for predicting the cytology test result, also known as a Pap test, since it is the most widely used screening method for cervical cancer. Plots comparing models for the other target variables can be found in the code.
A table with all metrics, such as negative predictive accuracy (NPA), positive predictive accuracy (PPA), and F1 score, for all models and target variables is given below.
Since the data is highly imbalanced and the cost of failing to identify a high-risk patient (a false negative) is greater than that of falsely flagging a low-risk patient as high risk (a false positive), we rank the models on the basis of their ROC curves.
We also plotted the Precision-Recall (PR) curve for all models.
We conclude that the best-performing model is Random Forest, since it has the greatest ROC-AUC and maintains relatively high precision at high recall.
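The ranking procedure amounts to the following. This is a sketch on synthetic data with an illustrative logistic-regression baseline; the model list and hyperparameters are assumptions, not the project's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Fit each candidate and score it by ROC-AUC on the held-out set.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC-AUC = {results[name]:.3f}")
```

ROC-AUC uses the predicted probabilities rather than hard labels, so the ranking reflects how well each model separates the classes across all decision thresholds.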