Cervical cancer affects hundreds of thousands of women worldwide each year.
With the introduction of the HPV vaccine this number has dropped drastically, but the disease remains prevalent
and often goes undetected, especially in developing countries.
In this project we aim to identify behavioral patterns that put women at a greater risk of contracting cervical cancer.
We will use a behavioral dataset containing information about age, sexual health, and the results of four diagnostic tests to predict the presence of cancerous cells in a patient's cervix. Feature
importance will be leveraged to construct risk profiles (low, medium, and high risk) and to validate our
model against the treatment schematics gynecologists use as a course of action. Generating
such risk profiles will help healthcare institutions and doctors identify at-risk individuals who should
receive immediate attention and a more tailored health plan.
The dataset we used is the “Cervical Cancer (Risk Factors) Data Set,” which can be found
in the UCI Machine Learning Repository.
The dataset contains 32 features and 858 instances, each corresponding
to a patient. There are 10 numerical features (age, number of sexual partners, etc.); the
rest are categorical features relating to clinical markers (STD type, diagnosis, etc.). Given the sensitive nature of
the data, it is not surprising that 11.73% of the values are missing.
The dataset has 4 target variables (Hinselmann, Schiller, Cytology, and Biopsy), all of which are
diagnostic tests for cervical cancer. These target variables do not record the underlying test
metrics; rather, they are Boolean variables, with 1 indicating a positive result for the
presence of cervical cancer. There are instances where the 4 target variables do not agree, which
makes correctly annotating a sample challenging. We plan to conduct an extensive literature review
to factor in the reliability of each of these tests. If we take a new target variable to be the
logical OR of these 4, the dataset is highly imbalanced.
We can plot the number of positive and negative samples for each of the four target variables
to show that the dataset is highly imbalanced:
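The counts behind such a plot can be computed as below. This is a minimal sketch on randomly generated stand-in labels, since the raw CSV is not reproduced here; the 7% positive rate is an illustrative assumption, not the dataset's true rate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
targets = ["Hinselmann", "Schiller", "Cytology", "Biopsy"]

# Stand-in for the real labels: four Boolean test results per patient
# (858 patients, ~7% positive rate -- an illustrative assumption).
df = pd.DataFrame({t: rng.binomial(1, 0.07, size=858) for t in targets})

# Combined label: positive if ANY of the four tests is positive (logical OR).
df["any_positive"] = df[targets].max(axis=1)

# Row 0 = negative counts, row 1 = positive counts, per target.
counts = df[targets + ["any_positive"]].apply(pd.Series.value_counts).fillna(0)
print(counts)
```

Even after taking the OR of the four tests, positives remain a small minority, which is what motivates the resampling techniques used later.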
This is the data structure:
Due to the nature of the dataset, which concerns cancer, and considering patients' privacy,
it is not surprising that we have to handle a great number of missing values.
We plotted the following bar chart to show the percentage of missing values for each feature:
We can see there is a huge number of missing values in STDs: Time since first diagnosis and STDs: Time since last diagnosis, so it makes sense to drop these features. Moreover, there is a clear pattern in which no information is given about contraceptives, IUDs, or any type of STD. This is probably because those patients were not comfortable disclosing the information.
From this we draw the following conclusions:
1. We decided to drop variables with more than 50% missing values: STDs: Time since first diagnosis and STDs: Time since last diagnosis.
2. Superfluous columns such as IUD, Hormonal Contraceptives, Smokes, and STDs were dropped, as their counterparts contain both the occurrence and its duration (e.g., IUD (years)).
3. Approximately 13% of the missing values followed the same pattern, with missing entries in all variables related to the patient's sexual health, including contraceptives; this is probably due to privacy concerns.
Missing values for numerical and categorical features were imputed differently.
1. Numerical features pertaining to a patient's sexual health, like age at first sexual encounter, were imputed using the mean calculated for specific age brackets.
2. Missing values for the different STDs were imputed by looking at each feature's plurality label: if the majority class accounted for more than 98% of the values, the missing values were filled with the plurality label; otherwise, a KNNImputer (k=30) was used.
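The two imputation rules can be sketched as follows. This uses a toy stand-in frame: the column names, age-bracket edges, and the condylomatosis feature are illustrative assumptions, while the 98% threshold and the KNNImputer with k=30 come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
n = 200
# Toy stand-in; "Age" and "First sexual intercourse" mirror the real columns.
df = pd.DataFrame({
    "Age": rng.integers(15, 60, size=n),
    "First sexual intercourse": rng.normal(17, 2, size=n).round(),
    "STDs:condylomatosis": rng.binomial(1, 0.05, size=n).astype(float),
})
df.loc[rng.choice(n, 20, replace=False), "First sexual intercourse"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "STDs:condylomatosis"] = np.nan

# 1. Numerical: fill with the mean of the patient's age bracket
#    (bracket edges are an assumption for illustration).
bracket = pd.cut(df["Age"], bins=[14, 24, 34, 44, 60])
df["First sexual intercourse"] = (
    df.groupby(bracket)["First sexual intercourse"]
      .transform(lambda s: s.fillna(s.mean()))
)

# 2. Categorical STDs: plurality label if the majority exceeds 98%,
#    otherwise KNNImputer with k=30.
col = "STDs:condylomatosis"
share = df[col].value_counts(normalize=True).iloc[0]
if share > 0.98:
    df[col] = df[col].fillna(df[col].mode().iloc[0])
else:
    imputer = KNNImputer(n_neighbors=30)
    df[[col, "Age"]] = imputer.fit_transform(df[[col, "Age"]])
    df[col] = df[col].round()  # snap averaged neighbors back to 0/1 labels

print(df.isna().sum().sum())  # no missing values remain
```

Rounding the KNN output is needed because the imputer averages the 30 nearest neighbors, producing fractional values for a Boolean feature.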
We created a correlation table to see how the different features correlate with our four target variables. As the table shows, the four target variables are highly correlated with each other. Besides, Dx:HPV and STDs: Number of diagnosis correlate strongly with the four target variables, which suggests that HPV and other STDs (sexually transmitted diseases) are important indicators of cervical cancer.
We can also tell from the correlation table that several groups of features are highly correlated (correlation > 0.6), e.g., Smokes (years) and Smokes (packs/year); STDs: Number of diagnosis and STDs (number); and Dx:CIN, Dx:Cancer, Dx:HPV, and Dx. Hence, we chose to drop Dx, Dx:Cancer, STDs (number), and Smokes (packs/year).
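The drop rule above can be sketched as follows, on synthetic stand-in data; the 0.6 threshold mirrors the rule described, while the upper-triangle scan is one common way to pick a single feature from each correlated pair, not necessarily the project's exact code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
years = rng.exponential(5, size=n)
df = pd.DataFrame({
    "Smokes (years)": years,
    # Nearly a linear function of the years column -> highly correlated.
    "Smokes (packs/year)": years * 0.8 + rng.normal(0, 0.5, size=n),
    "Age": rng.integers(15, 60, size=n),
})

# Drop one feature from every pair with |correlation| > 0.6.
corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.6).any()]
df = df.drop(columns=to_drop)
print(to_drop)  # ['Smokes (packs/year)'] on this toy data
```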
We plotted scatterplots to show trends among different numerical features.
We also visualized the relationships between the four target variables and two key features, Dx:HPV and STDs: Number of diagnosis, using violin plots.
To deal with the highly imbalanced dataset, we used the following two techniques:
1. Applied stratified splitting when dividing the dataset into train, validation, and test sets.
2. Processed the data using Synthetic Minority Oversampling Technique (SMOTE).
We plotted the confusion matrix to visualize how our model predicts on the test data. We also graphed the feature importances to check whether our models make sense intuitively.
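These two diagnostics can be reproduced in a few lines; this is a sketch on synthetic data, and a real analysis would use the project's processed features and tuned model instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Confusion matrix: rows = true class, columns = predicted class,
# so cm[1, 0] counts the costly false negatives.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)

# Feature importances, largest first -- a sanity check on the model.
order = np.argsort(clf.feature_importances_)[::-1]
print(order[:5], clf.feature_importances_[order[:5]].round(3))
```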
In this section we will focus on comparing models only for predicting the cytology test result, also known as a Pap test, since it is the most widely used screening method for cervical cancer. Plots comparing models for the other target variables can be found in the code.
A table with all metrics, such as negative predictive accuracy (NPA), positive predictive accuracy (PPA), and F1 score, for all models and target variables is given below.
Since the data is highly imbalanced and the cost of failing to identify a high-risk patient (a false negative) is greater than that of falsely flagging a low-risk patient as high risk (a false positive), we rank the models on the basis of their ROC curves.
We also plotted the Precision-Recall (PR) curve for all models.
We conclude that the best-performing model is Random Forest, since it has the greatest ROC-AUC and maintains relatively high precision at high recall.
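The ranking procedure amounts to the following. This is a sketch on synthetic data with an illustrative logistic-regression baseline; the model list and hyperparameters are assumptions, not the project's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Fit each candidate and score it by ROC-AUC on the held-out set.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC-AUC = {results[name]:.3f}")
```

ROC-AUC uses the predicted probabilities rather than hard labels, so the ranking reflects how well each model separates the classes across all decision thresholds.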