{"id":291,"date":"2024-10-23T01:12:39","date_gmt":"2024-10-23T01:12:39","guid":{"rendered":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/?page_id=291"},"modified":"2025-07-14T17:41:14","modified_gmt":"2025-07-14T17:41:14","slug":"hadlock-lab-project","status":"publish","type":"page","link":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/projects\/hadlock-lab-project\/","title":{"rendered":"Hadlock Lab Project"},"content":{"rendered":"<div id=\"pl-291\"  class=\"panel-layout\" ><div id=\"pg-291-0\"  class=\"panel-grid panel-no-style\" ><div id=\"pgc-291-0-0\"  class=\"panel-grid-cell\" ><div id=\"panel-291-0-0-0\" class=\"so-panel widget widget_text panel-first-child panel-last-child\" data-index=\"0\" >\t\t\t<div class=\"textwidget\"><p><b>Background<\/b>:<\/p>\n<p>Axial spondyloarthritis (axSpA) is a chronic autoimmune disorder characterized by joint and tissue inflammation in a patient\u2019s spine. This leads to pain and discomfort, primarily in the lower back, limited mobility, and impaired ability to carry out daily tasks.\u00a0If left untreated, axSpA can worsen over time, leading to fusion of joints, increasingly limited mobility, and a higher risk of developing comorbidities.<\/p>\n<p>Currently, diagnosis relies on a combination of methods, including imaging, blood tests, physical examination, and genetic profiling. However, many of these criteria are nonspecific\u00a0and not commonly tested for, making early diagnosis challenging. The advent of electronic health records (EHRs) provides an opportunity for healthcare providers to draw upon banks of comprehensive patient records spanning multiple years, if not decades. 
In this study, we used EHR data acquired from the National Institutes of Health\u2019s (NIH) All of Us database, which contains EHR data for over 1.2 million patients across the United States.<\/p>\n<p>Whole-genome sequencing (WGS) techniques are relatively recent but greatly expand the analyses that can be done with genetic data. With WGS, numerous aspects of a patient\u2019s genetic makeup can be examined, including mutations, insertions, deletions, and other variants. We also aimed to incorporate WGS data in this project.<\/p>\n<p><b>Research Goal:<\/b><\/p>\n<p>The aim of this project was to improve early prediction of axSpA by applying machine learning models to data from the NIH All of Us database. We worked to develop risk models that predict the occurrence of axSpA at\u00a0the onset of primary symptoms such as back pain, (1) using only data that exists for most individuals, particularly in EHRs, and (2) using genomics data.\u00a0We aimed to use markers that are commonly measured for a majority of patients. By contrast, CRP and ESR, two commonly used diagnostic markers, are typically measured only if a clinician suspects inflammation, meaning most patients will not have a measurement present in their records. In this study, we combine clinical data with whole-genome sequencing to develop comprehensive risk models, to identify potential novel indicative biomarkers, and to support clinical decision-making and potentially earlier diagnoses of axSpA.<\/p>\n<p><b>Cohort:<\/b><\/p>\n<p>The inclusion criteria for the study required that all patients had documented low back pain or sacral pain in their electronic health record (EHR) at some point during their medical history. Patients also needed to have documented measurements of at least one of approximately 900 blood and urine biomarkers, recorded at least 2 weeks prior to the first mention of pain. 
Our case group consisted of patients who were subsequently diagnosed with axSpA, and our control group of patients who were not.<\/p>\n<p><b>Data:<\/b><b><br \/>\n<\/b>Blood and urine biomarkers:\u00a0Blood and urine biomarker measurements recorded at least two weeks prior to the onset of low back\/sacral pain were included for each patient. When a patient had multiple measurements of a biomarker over time, the values were averaged. Biomarkers with high missingness (&gt;60% or &gt;70%, depending on the threshold tested) were excluded.<\/p>\n<p>Condition history:\u00a0Any pre-existing health conditions noted in a patient\u2019s records were also included in the analysis. For this stage of the study, conditions were included if they fell into one of the following categories:<\/p>\n<ul>\n<li>Other autoimmune disorders (e.g. multiple sclerosis, rheumatoid arthritis)<\/li>\n<li>Disorders known to be comorbid with axSpA (e.g. uveitis, Crohn\u2019s disease)<\/li>\n<li>Commonly acquired illnesses (e.g. influenza, pneumonia)<\/li>\n<li>Cardiovascular conditions<\/li>\n<li>Hypertensive disorders<\/li>\n<\/ul>\n<p>Survey responses:\u00a0The All of Us database records patient responses to survey questions. For this analysis, responses to survey questions that fell into the following categories were included:<\/p>\n<ul>\n<li>History of substance use (smoking, alcohol, e-smoking products)<\/li>\n<li>Experiences in healthcare settings<\/li>\n<li>Ex. 
\u201cHow often are you treated with less respect than other people when you go to a doctor\u2019s office or other healthcare provider?\u201d<\/li>\n<li>Self-reported ratings of mental health, physical health, and autonomy<\/li>\n<li>Family history of conditions (heart\/bowel\/autoimmune disease, etc.)<\/li>\n<\/ul>\n<p><b>Cohort Flow Diagram:<\/b><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-831\" src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM-300x202.png\" alt=\"\" width=\"443\" height=\"298\" srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM-300x202.png 300w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM-1024x691.png 1024w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM-768x518.png 768w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM-1536x1037.png 1536w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM-272x182.png 272w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/07\/Screenshot-2025-07-14-at-10.40.24-AM.png 1606w\" sizes=\"auto, (max-width: 443px) 100vw, 443px\" \/><\/p>\n<p><b>Methods \u2013 Risk Modeling:<\/b><\/p>\n<p>Classification<\/p>\n<p>For the initial phase of risk modeling, a classification approach was employed to differentiate between case and control patients. The model used blood and urine biomarker data filtered according to the completeness at various thresholds as mentioned above. 
The classifier of choice was XGBoost, chosen for three primary reasons:<\/p>\n<ul>\n<li>XGBoost is highly effective when working with tabular data.<\/li>\n<li>XGBoost can handle missing values directly. In this study, missing values were represented as NaNs, and XGBoost\u2019s capacity to process these NaNs without requiring imputation was a significant advantage.<\/li>\n<li>XGBoost is robust to multicollinearity, which was a key concern in this dataset. Because some features represented similar physiological measurements that had been obtained using different techniques or reported in different units, multicollinearity between some features was expected.<\/li>\n<\/ul>\n<p>Class weights and decision threshold tuning were both employed.<\/p>\n<p>For time-to-event risk modeling, we chose a Cox Proportional-Hazards (CoxPH) model. The CoxPH model estimates the effect of different variables (covariates) on the risk of an event occurring (in this case, being diagnosed with axial spondyloarthritis). We fit both models on blood and urine biomarkers, patient condition history, and survey responses.<\/p>\n<p>&nbsp;<\/p>\n<p><b>Methods \u2013 Whole Genome Sequencing\u00a0<\/b><\/p>\n<p>The genomics analysis was designed to align with the case and control cohorts used for the EHR data analysis. However, because of the high cost of extracting whole genome sequencing data in NIH All of Us, it was not feasible to include all 4,000 control patients in the genomics analysis.\u00a0As a result of the refined criteria, along with the additional requirement for short-read whole genome sequencing data, the genomics case cohort consisted of 128 patients, while the control cohort was reduced to 579 patients.<\/p>\n<p>For the current phase of the study, the analysis focused exclusively on the gene ERAP1 because of its established relationship with ankylosing spondylitis (AS), a form of axSpA. 
This gene serves as a starting point, and additional genes will be incorporated.<\/p>\n<p><b>Genomics Data Encoding and Processing<\/b><\/p>\n<p><b>Base Pair Selection:<\/b>\u00a0The analysis concentrated on the base pairs in the gene ERAP1 that were also included in the variant call format (VCF) data. For each patient, 1,528 base pairs within this gene were used for analysis.<\/p>\n<p><b>Global Allele Mapping:<\/b>\u00a0Each base pair was associated with a global allele list, which represents all alleles observed for that base pair within the dataset.<\/p>\n<p><b>Categorical Encoding:<\/b>\u00a0For each patient, each base pair was encoded categorically according to its mapping to the global allele list. This encoding translated each base pair\u2019s genetic variation into a categorical format based on the alleles present in the global dataset.<\/p>\n<p><b>Patient Cohort:<\/b>\u00a0The analysis included data from 705 patients, with 1,528 base pairs included for each patient.<\/p>\n<p>XGBoost was employed for classification between case and control patients, for the same reasons it was used with the EHR data. Class weights and decision threshold tuning were both used.<\/p>\n<p><b>Methods \u2013 Causal Analysis<\/b><\/p>\n<p>&nbsp;<\/p>\n<p><b>SHAP-Based Feature Selection<\/b><\/p>\n<p>Features were evaluated for how often the sign of their SHAP value aligned with the true class label. This alignment, where positive SHAP values matched positive AS diagnoses and negative values matched non-AS, was used to assess feature importance by focusing on consistent contributions to correct predictions.<\/p>\n<p><b>Derived Metrics for Feature Ranking:<\/b><\/p>\n<ol>\n<li><b>SHAP-AUC:<\/b>\u00a0Each feature\u2019s SHAP sign was used in a \u201cmini-model\u201d to predict AS or non-AS. 
The AUC of these mini-models assessed each feature\u2019s discriminative alignment.<\/li>\n<li><b>SHAP-Recall:<\/b>\u00a0Recall evaluated a feature\u2019s ability to detect AS cases, prioritizing sensitivity by minimizing false negatives.<\/li>\n<\/ol>\n<p><b>Recursive Feature Elimination Using SHAP-Derived Metrics:<\/b>\u00a0Features were ranked by SHAP-AUC and SHAP-Recall. The lowest-ranked features were iteratively removed, with models retrained on each new subset. This approach improved model performance more than traditional Gini-based selection.<\/p>\n<p><b>Improved Performance with SHAP Rankings:<\/b>\u00a0SHAP-AUC emphasized discriminative capability, while SHAP-Recall enhanced sensitivity. Features ranked highly by these metrics contributed to both predictive power and sensitivity, improving model performance.<\/p>\n<p><b>Preliminary Results<\/b><\/p>\n<p><b>Time-to-event modeling (CoxPH):<\/b><\/p>\n<p>Performance for the CoxPH model was measured using the concordance index, or c-index, which measures how well the ordering of the model\u2019s risk scores agrees with the ordering of the recorded times to event for patients in the cohort. For example, if Patient A had a longer time to event than Patient B, Patient A\u2019s risk score should be smaller. A c-index of 1 indicates perfect ranking, and 0.5 corresponds to random ordering. The model obtained a c-index of 0.76, indicating moderately high predictive power. The statistically significant covariates (p &lt; 0.05) are included below. 
The column exp(coef) corresponds to the hazard ratio.<\/p>\n<p>Most statistically significant covariates (p &lt;0.05), as identified by CoxPH model, with hazard ratios.<\/p>\n<p><b>Classification<\/b><\/p>\n<p>Binary Classification \u2013 EHR Model<\/p>\n<p>Classification Performance<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-822\" src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.01-PM-300x239.png\" alt=\"\" width=\"300\" height=\"239\" srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.01-PM-300x239.png 300w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.01-PM-1024x815.png 1024w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.01-PM-768x611.png 768w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.01-PM.png 1284w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><i>Performance of XGBoost Classifier on Validation Set at Various Decision Thresholds and Feature Completeness Thresholds<\/i><\/p>\n<p>The best performing binary classification model achieved an AUC of 0.73, using only features that were not missing in at least 60% of the data.<\/p>\n<p>Feature Importance<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-823\" src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.37-PM-252x300.png\" alt=\"\" width=\"252\" height=\"300\" 
srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.37-PM-252x300.png 252w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.37-PM-861x1024.png 861w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.37-PM-768x913.png 768w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.50.37-PM.png 974w\" sizes=\"auto, (max-width: 252px) 100vw, 252px\" \/><\/p>\n<p><i>Gini Importances of Blood and Urine Markers in Best Performing XGBoost Model (60% completeness threshold; 0.73 AUC)<\/i><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-824\" src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.18-PM-139x300.png\" alt=\"\" width=\"139\" height=\"300\" srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.18-PM-139x300.png 139w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.18-PM-473x1024.png 473w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.18-PM.png 592w\" sizes=\"auto, (max-width: 139px) 100vw, 139px\" \/><\/p>\n<p><strong>Feature Importance<\/strong><\/p>\n<p><b>SHAP Analysis \u2013 EHR Model<\/b><\/p>\n<p>&nbsp;<\/p>\n<p><b><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-825\" src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.48-PM-300x187.png\" alt=\"\" width=\"300\" 
height=\"187\" srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.48-PM-300x187.png 300w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.48-PM-1024x639.png 1024w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.48-PM-768x479.png 768w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.51.48-PM.png 1122w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/b><\/p>\n<p><b>SHAP Summary Plot (only top features \u2013 not showing all features)<\/b><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-826\" src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.26-PM-285x300.png\" alt=\"\" width=\"285\" height=\"300\" srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.26-PM-285x300.png 285w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.26-PM-974x1024.png 974w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.26-PM-768x808.png 768w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.26-PM.png 1126w\" sizes=\"auto, (max-width: 285px) 100vw, 285px\" \/><\/p>\n<p><b>SHAP Analysis \u2013 Genomics Model\u00a0<\/b><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-827\" 
src=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.59-PM-300x198.png\" alt=\"\" width=\"300\" height=\"198\" srcset=\"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.59-PM-300x198.png 300w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.59-PM-768x508.png 768w, https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-content\/uploads\/sites\/12\/2025\/06\/Screenshot-2025-06-30-at-1.54.59-PM.png 1022w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><b>SHAP Summary Plot (only top features \u2013 not showing all features)<\/b><\/p>\n<p>&nbsp;<\/p>\n<p><b>SHAP-AUC-Based Recursive Feature Elimination\u00a0<\/b><\/p>\n<p>AUC of Decision-Threshold-Optimized Binary Classifier at Various Numbers of Features Used \u2013 Using Recursive Feature Elimination Based on Individual Feature AUCs Assigned Using SHAP Values<\/p>\n<\/div>\n\t\t<\/div><\/div><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Background: Axial spondyloarthritis (axSpA) is a chronic autoimmune disorder characterized by joint and tissue inflammation in a patient\u2019s spine. 
This leads to pain and discomfort, primarily in the lower back, limited mobility, and impaired ability to carry out daily tasks.\u00a0If left untreated, axSpA can worsen over time, leading to fusion of joints, increasingly limited mobility, [&hellip;]<\/p>\n","protected":false},"author":96,"featured_media":0,"parent":77,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-291","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/pages\/291","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/users\/96"}],"replies":[{"embeddable":true,"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/comments?post=291"}],"version-history":[{"count":7,"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/pages\/291\/revisions"}],"predecessor-version":[{"id":833,"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/pages\/291\/revisions\/833"}],"up":[{"embeddable":true,"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/pages\/77"}],"wp:attachment":[{"href":"https:\/\/baliga.systemsbiology.net\/see-interns\/hs2024\/wp-json\/wp\/v2\/media?parent=291"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}