Periods happen every month for most women, but we are so busy with our daily tasks that we tend to forget our period dates. Many women, like me, have such inconsistent cycles that it is hard to predict the next menstruation. This can make it harder to plan our work or enjoy life. Keeping track of previous period dates and analyzing the historical data may help.
In this project, I dived into the menstrual cycle data. I sought to answer the questions:
- Who are the clients, and what menstrual patterns do they follow?
- Can we predict future cycle length to help clients with confidence?
- Which personal and cycle factors shape these clusters and predictions the most?
Models and results from this project can be used by menstruation tracking apps or websites to identify distinct user groups based on their cycle patterns and to predict what's coming next. Using unsupervised learning techniques such as dimensionality reduction and clustering, I segmented users into groups and uncovered cluster profiles with descriptive analysis and classifier-based explanation. Further, I employed predictive modeling to predict cycle length from personal and cycle-related features. The results of both the supervised and unsupervised learning tasks were explained with Shapley (SHAP) values to map the influential features to clustering and predictive decisions.
The dataset consists of 1665 rows, corresponding to 1665 clients, and 80 columns, including ClientID, CycleNumber, Group, CycleWithPeakorNot, LengthofCycle, LengthofLutealPhase and so on. Features are both continuous and categorical. Categorical features can be further divided into ordinal variables (e.g. MensesScoreDayOne, MensesScoreDayTwo, MensesScoreDayThree), nominal variables (EstimatedDayofOvulation, ReproductiveCategory, FirstDayofHigh) and binary variables (Group, CycleWithPeakorNot, IntercourseInFertileWindow, UnusualBleeding).
There were quite a lot of missing values across features and some rows with many missing values. I decided to remove features with more than 50% missing values and rows with more than 30% missing values. The final data shape is 1663 x 24 (1663 clients and 24 variables).
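The filtering rule above can be sketched with pandas as follows. This is a minimal sketch, not the project's actual code: the helper name `drop_sparse`, the toy values and the `MostlyEmpty` column are hypothetical, and dropping columns before rows is an assumption about the order of the two steps.

```python
import numpy as np
import pandas as pd


def drop_sparse(df, col_thresh=0.5, row_thresh=0.3):
    """Drop columns, then rows, whose share of missing values exceeds a threshold."""
    # Share of missing values per column; keep columns at or below the threshold.
    col_missing = df.isna().mean(axis=0)
    kept = df.loc[:, col_missing <= col_thresh]
    # Share of missing values per row, computed on the remaining columns only.
    row_missing = kept.isna().mean(axis=1)
    return kept.loc[row_missing <= row_thresh]


# Toy frame with made-up values; "MostlyEmpty" is a hypothetical column.
toy = pd.DataFrame({
    "LengthofCycle": [29, 27, np.nan, 30],
    "LengthofMenses": [5, np.nan, np.nan, 6],
    "MostlyEmpty": [np.nan, np.nan, np.nan, 1.0],
    "CycleNumber": [1, 2, np.nan, 4],
})
cleaned = drop_sparse(toy)
print(cleaned.shape)
```

In this toy frame, `MostlyEmpty` is 75% missing and is dropped, and the two rows that are more than 30% missing in the remaining columns are dropped as well.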
Among numerical features, only two pairs are highly correlated: TotalDayofFertility vs TotalNumberofHighDays, and TotalMensesScore vs LengthofMenses. Because I use dimensionality reduction prior to clustering, and Lasso and tree-based predictive models that can cope with multicollinearity, there was no need to remove highly correlated features.
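One way to list highly correlated pairs is to scan the upper triangle of the absolute correlation matrix. The sketch below assumes Pearson correlation and a cutoff of 0.8; neither is stated in this write-up, and the helper name `correlated_pairs`, the `Noise` column and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up values: the first two columns move together, the third does not.
toy = pd.DataFrame({
    "TotalDayofFertility": [1, 2, 3, 4, 5],
    "TotalNumberofHighDays": [2, 4, 6, 8, 11],
    "Noise": [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```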
Most of the numerical features are right-skewed. Nominal variables are not evenly distributed. For example, most of the values in EstimatedDayofOvulation fall on the 13th, 14th, 15th and 16th days, which is expected. Ordinal variables like menses scores are also imbalanced across days. For instance, 48% and 60% of the clients gave a score of 3 on the first and second days of their period respectively, while lower scores dominate on later days. This is expected, as women tend to experience heavier bleeding and more severe symptoms on the first days, with the impact decreasing towards the end of the period.
There is also imbalance in the binary variables, e.g. 91.3% of the clients experienced a peak during their cycle, 61.8% did not have sexual intercourse during the fertile window and 94% had usual bleeding.
Outliers are detected in a number of features, especially TotalNumberofPeakDays.
Dimensionality reduction with UMAP was applied first to denoise the dataset and map the original features to a lower-dimensional space, which is useful for visualization during data exploration and for evaluating clustering results. UMAP builds a non-linear representation based on manifold learning, aims to preserve local structure together with much of the global structure, and can be integrated into a predictive pipeline.
A K-means model was trained on the UMAP-transformed features to segment clients based on their personal and menstrual behaviors. One requirement of partitional clustering algorithms like K-means is that the number of clusters must be defined in advance. I used Optuna, a hyperparameter optimization framework, to search for the best number of clusters with the objective of maximizing the Silhouette score, which ranges from -1 to 1. The higher the Silhouette score, the more compact and well-separated the clusters are.
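The idea of choosing the number of clusters by Silhouette score can be sketched with scikit-learn alone. Note the substitutions: this sketch uses a plain loop over candidate values instead of Optuna, and synthetic 2-D blobs stand in for the UMAP embedding. The helper name `best_k` and the search range are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for the UMAP-transformed features (assumption).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)


def best_k(X, k_values, seed=0):
    """Fit K-means for each candidate k and return the k with the highest Silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


k, scores = best_k(X, range(2, 9))
print(k, round(scores[k], 3))
```

An exhaustive loop is enough when the only hyperparameter is the number of clusters; a tuning framework becomes more useful once the dimensionality reduction settings are searched jointly with it.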
The final K-means model identified five clusters of clients, with a Silhouette score of 0.4. The clustering results were then interpreted using descriptive analysis, with summary statistics and visualization, and classifier-based explanation. While descriptive analysis profiles the clusters by discovering the characteristics of each one, the classifier-based approach maps the original features to the cluster assignments, identifying the features that most influence the algorithm's decisions.
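A classifier-based explanation can be sketched as follows: treat the cluster labels as a target, fit a classifier on the original features, and inspect which features it relies on. This is only an illustration under assumptions. It uses a random forest and its impurity-based importances, whereas this project explains the classifier with SHAP values, and the two features `signal` and `noise` are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: "signal" separates two groups, "noise" carries no structure.
signal = np.concatenate([rng.normal(0, 0.5, 100), rng.normal(5, 0.5, 100)])
noise = rng.normal(0, 0.5, 200)
X = np.column_stack([signal, noise])

# Step 1: cluster the data.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: train a classifier to reproduce the cluster assignments.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Step 3: read off which features the classifier needed to separate the clusters.
importances = dict(zip(["signal", "noise"], clf.feature_importances_))
print(importances)
```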
The descriptive analysis shows that features that differentiate clients between clusters are estimated day of ovulation, number of days of intercourse, total menstruation score, first day of high, total days of fertility, length of cycle, total number of high days and total fertility formula.
From the descriptive analysis, we can uncover the patterns and characteristics of each client cluster:
- Cluster 0:
- Most days of intercourse, high chance of having sexual activity during the fertile window
- Highest number of peak days, highest likelihood of having peak during cycles
- Above-average first day of high
- Low total menses score, indicating less suffering from symptoms or heavy bleeding and low total number of high days
- Moderate total days of fertility
- Moderate cycle length
- High chance of h
Clients from cluster 0 are probably women with regular cycles and frequent sexual activity. They monitored their menstrual patterns for fertility or contraception. These women also experienced mild or moderate symptoms.
- Cluster 1:
- Very high cycle number, indicating long tracking history
- Low fertility indicators: least days of fertility, low total fertility formula, lowest menses score, least chance of having intercourse during fertile window and highest chance of unusual bleeding
- Shortest cycle length
Clients from cluster 1 have monitored their menstruation for a long time and have less fertile cycles. They are probably approaching menopause or simply not interested in fertility.
- Cluster 2:
- Highest fertility indicators: estimated day of ovulation, total fertility formula and total number of high days, total days of fertility and cycle length
- Least days of intercourse
- Lowest total number of peak days
- Low total menses score
- Short tracking history
- Least chance of experiencing peak during cycle
- Highest chance of having intercourse during fertile window
Clients in cluster 2 have long, fertile cycles but may experience irregular cycles, Polycystic Ovary Syndrome (PCOS) symptoms or naturally long cycles.
- Cluster 3:
- Moderate indicators: estimated day of ovulation, number of days of intercourse, total days of fertility, cycle length, moderate chance of having intercourse during fertile window
- Heavily suffering from menstruation: high total menses score
- Highest chance of having usual bleeding
Clients in cluster 3 have average cycles but experience discomfort or heavy symptoms during menstruation.
- Cluster 4:
- Low indicators overall: estimated day of ovulation, number of days of intercourse, cycle length, first day of high, total days of fertility
- Slightly high total menses score
- Moderate chance of having intercourse in fertile window
- Low chance of experiencing unusual bleeding
This group of clients demonstrates less frequent tracking habits, unclear cycle patterns and less distinct behavior overall.
Classifier-based explanation reveals that total fertility formula, cycle length, length of menstruation, group membership, total menstruation score, estimated day of ovulation, total number of peak days, cycle number and total days of fertility are the most influential features in clustering decisions.
Specifically, cluster 0 shows short cycle length, low total fertility, a high number of days of intercourse, a high first day of high and a high menses score. Clients in cluster 0 tend to have shorter cycles and high intercourse frequency. They are probably actively tracking cycles for conception or contraception.
Cluster 1 has a high cycle number, short cycle length, a low fertility formula, and a high menses score and length of menses. This group corresponds to experienced users with shorter, less fertile cycles and possibly mild symptoms. They may be older clients, premenopausal women or those who are not interested in fertility.
Cluster 2 shows high values for fertility formula, peak days, high days and total fertility days, long cycles and a low chance of having a peak during the cycle. They also show mild luteal phase and menses lengths. These women have long, highly fertile cycles; they are possibly younger or highly fertile women, or women with irregular patterns.
Cluster 3 has a high total menses score, a high length of menses, a high fertility formula, and low peak days and fertility days. These clients tend to experience more intense periods. Their cycle lengths and fertility indicators are moderate. This cluster is driven primarily by menstrual experience rather than fertility or tracking behavior.
Cluster 4 shows a low cycle number, a low fertility formula, menses score and cycle length, a slightly high luteal phase length and high post peak. The clients in this cluster appear to have weaker or less distinct patterns. They are probably newer app users, casual trackers or users with unclear cycle markers.
For predictive modeling of cycle length, I explored four supervised models: Lasso, Random Forest, XGBoost and CatBoost. The data was first split into training and test sets. Missing values were imputed with the median (continuous variables) and mode (categorical variables) computed on the training set. For Lasso, an extra standardization step was added so that regularization is applied equally across features. Categorical variables were encoded. CatBoost was the exception: since it can handle missing values and encode categorical features natively, I fed it the raw data.
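The Lasso preprocessing described above can be sketched as a scikit-learn pipeline, which guarantees that the imputation medians and scaling parameters are learned from the training set only. The data here is synthetic, only continuous features are shown, and the `alpha` value and split ratio are assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic continuous features and a made-up "cycle length" target.
X = rng.normal(size=(200, 3))
y = 28 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Median imputation -> standardization -> Lasso, all fitted on the training set.
model = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)

# The medians used to fill the test set come from the training data.
train_medians = model.named_steps["simpleimputer"].statistics_
pred = model.predict(X_test)
print(train_medians, pred[:3])
```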
The models were evaluated using RMSE. CatBoost was the best-performing model, with an RMSE of 1.37 on the test set.
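For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the target (days of cycle length here). A minimal sketch with made-up numbers, not the project's data:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted cycle lengths in days.
actual = [28, 30, 27, 31]
predicted = [29, 29, 27, 33]
print(rmse(actual, predicted))
```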
The SHAP value explanation reveals that EstimatedDayofOvulation, LengthofLutealPhase, TotalFertilityFormula, CycleNumber and NumberofDaysofIntercourse were the most important variables shaping the predictions.
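To make the notion of a Shapley value concrete, the sketch below computes exact Shapley values by brute force for a tiny hypothetical model. It is a conceptual illustration only: SHAP libraries use much faster model-specific algorithms, and the toy model, the baseline values and the helper name `shapley_values` are assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)

    def value(present):
        # Evaluate the model with only the features in `present` taken from x.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model: predicted cycle length = ovulation day + luteal phase length.
def toy_model(z):
    ovulation_day, luteal_length = z
    return ovulation_day + luteal_length


x = [18, 13]         # the client being explained (made-up values)
baseline = [15, 13]  # a reference client (made-up values)
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that lets a SHAP plot attribute each prediction to individual features.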
The SHAP summary plot can be read as follows:
- Each dot represents a client.
- The X-axis represents the SHAP value, showing the impact on the model's prediction of cycle length. Positive values push the predicted cycle longer, while negative values push it shorter.
- The color (from blue to pink) corresponds to feature value (blue = low, pink = high)
From the SHAP summary plot, we learn that:
- Estimated day of ovulation has the strongest impact. Ovulation timing shifts the whole cycle length.
- A longer luteal phase indicates a longer cycle, while a shorter luteal phase lowers the prediction.
- Total fertility formula likely captures overall fertility indicators. A higher fertility formula slightly increases the predicted cycle length.
- Clients with more cycles tracked (cycle number) may get slightly shorter predicted cycles. This makes sense, as the longer you track, the more stable your data becomes.
- More days of intercourse slightly reduce the predicted cycle length. This may relate to more fertile or shorter cycles, or to better tracking quality.
- Understanding the data better: I couldn't find more information about the data (how it was collected, what some features like Group mean, etc.). Understanding the data better would help with data preparation and modeling choices.
- Exploring other clustering models (other partitional models, hierarchical models, ensembles) or pipelines (feature selection or dimensionality reduction integrated into the clustering algorithm to extract more cluster-friendly features).
- Besides cycle length prediction, exploring predictions of, for example, ovulation length or ovulation day.
- Experimenting with other missing value imputation techniques rather than the simple imputation.
- Most importantly, involving experts in result interpretation! Let the experts evaluate whether the results are interpretable and genuinely insightful.













