December 12, 2024
Abstract
INTRODUCTION: The aging society urgently requires scalable methods to monitor cognitive decline and identify social and psychological factors indicative of dementia risk in older adults.
METHODS: Our machine learning models captured facial, acoustic, linguistic, and cardiovascular features derived from remote video conversations with 39 individuals with normal cognition or Mild Cognitive Impairment (MCI) and classified cognitive status, social isolation, neuroticism, and psychological well-being.
RESULTS: Our model distinguished a Clinical Dementia Rating (CDR) of 0.5 (vs. 0) with an area under the receiver operating characteristic curve (AUC) of 0.78, social isolation with an AUC of 0.75, neuroticism with an AUC of 0.71, and negative affect scales with an AUC of 0.79.
DISCUSSION: Our findings demonstrate the feasibility of remotely monitoring cognitive status, social isolation, neuroticism, and psychological well-being. Speech and language patterns were more useful for quantifying cognitive impairment, whereas facial expression and cardiovascular patterns using remote photoplethysmography were more useful for quantifying personality and psychological well-being.
The number of older adults living with Alzheimer’s disease and related dementias (ADRD) is expected to rise to nearly 13 million in the U.S. by the year 2060 [1], [2]. Mild Cognitive Impairment (MCI) often precedes ADRD and is characterized by cognitive decline greater than expected for an individual’s age and education level, while the person remains capable of performing daily activities independently [3], [4]. Along with cognition, aspects of emotional well-being such as anxiety, loneliness, and depressive symptoms potentially have bidirectional links with cognitive impairment, owing to shared neurobiological and behavioral mechanisms that impair brain function [3]–[13]. Additionally, mood disorders, social isolation, and negative emotions often co-occur with or even precede MCI and can further accelerate cognitive decline [12], [14], [15]. Cognitive impairment and psychological well-being significantly affect the lives of older adults, highlighting the importance of early detection, monitoring, and intervention to maintain their quality of life.
Clinical tools such as the Montreal Cognitive Assessment (MoCA) and Clinical Dementia Rating (CDR), which assess a range of cognitive functions, have become widely accepted as standardized methods for evaluating and monitoring cognitive impairment [16], [17]. However, traditional cognitive assessments are not sensitive enough to identify MCI or monitor progression during the early MCI stage [18]–[20]. In standard geriatric care, subtle behavioral changes related to psychological well-being, such as reduced social engagement and declining mental health, are often overlooked. These factors, however, can signal early dementia risk and offer opportunities for timely intervention [5], [21], highlighting the urgent need for innovative approaches to monitor them [22]–[26]. The anticipated shortage of care services further exacerbates this situation: over 1 million additional direct care workers will be needed by 2031, and the U.S. must nearly triple its number of geriatricians by 2050 [1]. The current global shortage of mental health professionals and geriatricians further widens disparities in care, especially in underserved regions [27], [28]. Even developed countries face such challenges, with twenty states in the U.S. identified as "dementia neurology deserts" [29], let alone developing countries.
The recent widespread adoption of video conferencing platforms for telehealth [30] presents an opportunity to use these tools to remotely prescreen for cognitive impairment and identify associated factors, including social isolation and psychological well-being, in older adults [31]–[34]. Recent advancements in artificial intelligence (AI) have spurred research into using telehealth platforms to quantify psychologically relevant behaviors in individuals with MCI through facial, audio, and text analysis [35]–[37]. Despite the active research in computerized interview analysis [32]–[34], relatively few publications focus on automatically quantifying cognitive abilities, social isolation, and psychological well-being in older adults through remote interviews.
Furthermore, with the recent emergence of generative AI, so-called foundation models trained on vast amounts of video, audio, and text data from internet sources, it is imperative to explore whether these innovations can be effectively leveraged to quantify behaviors in older adults [38]–[40]. Advances in machine learning for video analysis also enable non-contact assessment of cardiovascular features, such as remote photoplethysmography (rPPG) from facial videos [41], which could serve as another valuable modality for assessing cognitive impairment and psychological well-being in older adults. For example, rPPG allows the extraction of heart rate variability (HRV) features that have been linked to anxiety and other negative emotions [42], [43].
This work aims to investigate the feasibility of quantifying the psychological well-being, social network, and cognitive ability of individuals living with normal cognition or MCI, using digital markers extracted from facial, acoustic, linguistic, and cardiovascular patterns detected by foundation AI models pretrained on large-scale internet datasets. By objectively quantifying various modalities associated with cognitive decline through a scalable, remote, and automated assessment system, we expect this work to provide a step toward enhancing accessibility, reducing disparities in mental health and dementia services [44], and promoting evidence-based therapeutics [45].
| Type | Measure | Normal Cognition (NC) | MCI | Combined (NC+MCI) |
|---|---|---|---|---|
| Demographics | Counts | | | |
| | Sex (F/M) | /2 | /8 | /10 |
| | Race (C/A/O) | /1/0 | /3/1 | /4/1 |
| | Age | ± 4.39 | ± 4.81 | 80.69 ± 4.6 |
| | Education (years) | ± 2.12 | ± 2.54 | 15.44 ± 2.34 |
| Cognitive Ability | MoCA | ± 2.51 | ± 3.57 | 24.54 ± 3.61 |
| | CDR (no / questionable dementia) | /3 | /16 | /19 |
| Social Network, Personality, Psychological Well-being | LSNS-6 | ± 6.08 | ± 5.71 | 14.13 ± 5.86 |
| | Neuroticism | ± 9.16 | ± 7.89 | 16.44 ± 8.41 |
| | Negative affect | ± 8.29 | ± 12.82 | 48.57 ± 10.93 |
| | Social satisfaction | ± 10.70 | ± 13.12 | 49.02 ± 11.77 |
| | Psychological well-being | ± 7.17 | ± 11.12 | 50.87 ± 9.99 |
The data source for this project was the Internet-Based Conversational Engagement Clinical Trial (I-CONECT) (NCT02871921) [46], [47]. This behavioral intervention aimed to enhance cognitive function by providing social interaction (conversational interaction) to older subjects living in social isolation, motivated by accumulating evidence that social isolation is a risk factor for dementia [48]. The study recruited older adults (>75 years old) with MCI or normal cognition from two sites: Portland, Oregon, which focused on Caucasian participants, and Detroit, Michigan, which focused on African American participants. The experimental group participated in 30-minute video chats with trained conversational specialists four times a week, along with weekly 10-minute phone check-ins, for 6 months; the control group received only the weekly 10-minute phone check-ins. Conversations were semi-structured with predetermined themes each day, ranging from historical events to leisure activities, using pictures to prompt conversation. Participants with severe depressive symptoms (GDS-15 >= 7) [49] were excluded, as were those with a clinical diagnosis of dementia. Clinical diagnoses were made through a consensus process involving neurologists and neuropsychologists, using the National Alzheimer’s Coordinating Center (NACC) Uniform Data Set Version 3 (UDS-3) [50], [51]. Inclusion criteria required participants to be socially isolated according to at least one of the following: (1) a score of 12 or less on the 6-item Lubben Social Network Scale (LSNS-6) [52], (2) engaging in conversations lasting 30 minutes or more no more than twice a week, based on self-report, or (3) responding “often” to at least one question on the 3-item UCLA Loneliness Scale [53]. The intervention results showed that global cognitive function improved significantly in the intervention group (i.e., the video-chat group) compared with the control group after 6 months of intervention, with a Cohen’s d of 0.73. The topline results of this behavioral intervention were published earlier [47], [54].
The study encountered the COVID-19 pandemic during trial recruitment, and cognitive tests were administered by telephone during this period; the MoCA was replaced with the Telephone MoCA. Of the 94 subjects randomized into the intervention group, 52 had an in-person MoCA (as opposed to the Telephone MoCA). Among them, 39 participants with all the required data available, including transcripts, personality measures, and the NIH Toolbox Emotional Battery assessment (discussed later), were used in the current study. The demographic characteristics of these participants are shown in Table 1.
In this study, we aimed to classify participants according to cognitive assessments derived from various scales. The MoCA scores were dichotomized into ‘high’ and ‘low’ categories using a cutoff of 24, the median score of our participants; a high score indicates better cognitive function. The Normal Cognition (vs. MCI) assessment is the binary encoding of clinician evaluations (NACC UDS V3, Form D1: Clinical Diagnosis Section 1), assigning a value of 1 to indicate normal cognition and 0 to indicate MCI. Regarding the CDR, participants had scores of either 0, indicating no cognitive impairment, or 0.5, indicating questionable or very mild dementia; these scores were dichotomized accordingly.
Social network and psychological well-being assessments included the LSNS-6 [52] for the amount of social interaction, neuroticism from the NEO Five-Factor Inventory [55], and the NIH Toolbox Emotional Battery (NIHTB-EB) [56]. The latter has three composite scores: negative affect, social satisfaction, and psychological well-being [57]. For the 6-item version of the LSNS used here, a cutoff score of 12 is the suggested threshold for defining social isolation [58]. For neuroticism, our participants’ median score of 16 was used to dichotomize participants into groups with higher or lower negative emotional reactivity to stressful stimuli. From the NIHTB-EB, the negative affect, social satisfaction, and psychological well-being composite scores [57] were dichotomized at their medians of 44.10, 48.66, and 53.70, respectively, to group participants into high- and low-score groups.
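As a minimal illustrative sketch of this outcome dichotomization (the example scores and column names below are hypothetical; the cutoffs follow the text):

```python
import pandas as pd

# Hypothetical per-participant scores; cutoffs follow the text above.
scores = pd.DataFrame({
    "moca": [22, 26, 24, 28],
    "lsns6": [10, 15, 12, 20],
    "neuroticism": [18, 12, 16, 25],
})

labels = pd.DataFrame({
    # MoCA: high (>24) vs. low (<=24), split at the cohort median of 24
    "moca_high": (scores["moca"] > 24).astype(int),
    # LSNS-6: scores of 12 or less indicate social isolation
    "socially_isolated": (scores["lsns6"] <= 12).astype(int),
    # Neuroticism: split at the cohort median of 16
    "neuroticism_high": (scores["neuroticism"] > 16).astype(int),
})
print(labels)
```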
Our proposed multimodal analysis framework uses facial, acoustic, linguistic, and cardiovascular patterns to quantify the cognitive function and psychological well-being of participants during remote interviews (i.e., semi-structured conversations). The participant segments of the interview recordings are extracted for facial, vocal, linguistic, and cardiovascular analysis. The extracted time-series multimodal features are aggregated over the video using temporal pooling or a Hidden Markov Model (HMM). The video-level features are processed with binary classifiers (logistic regression and/or gradient boosting) to classify the dichotomized (high or low) rating scales of the cognitive, social network, personality, and psychological well-being assessments. The overall pipeline is shown in Figure 1.
Our video data record both the participant’s and the moderator’s activities during the remote interview, which requires segmenting the participant portion of the recording for analysis. The recording displays an on-screen indicator of the current speaker, either the moderator or the participant ID (starting with "C" followed by 4 digits). We used optical character recognition (OCR), namely EasyOCR [60]–[62], and kept only the frames indicating the participant ID for further analysis. From the participant video segments, we used RetinaFace [63], a face detection model, to detect, track, and segment participants’ faces. For the language analysis, participants’ speech was transcribed with transcription models [64] developed specifically for older adults.
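A minimal sketch of the OCR-based speaker segmentation step, assuming the on-screen speaker indicator contains a participant ID of the form "C" followed by 4 digits; the file path and frame-sampling rate are hypothetical:

```python
import re
import cv2
import easyocr

reader = easyocr.Reader(["en"], gpu=False)
participant_id = re.compile(r"\bC\d{4}\b")

def participant_frame_indices(video_path: str, every_n_frames: int = 30) -> list[int]:
    """Return indices of frames whose OCR text contains a participant ID."""
    cap = cv2.VideoCapture(video_path)
    keep, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # readtext returns (bounding box, text, confidence) triples
            texts = [text for _, text, _ in reader.readtext(frame)]
            if any(participant_id.search(text) for text in texts):
                keep.append(idx)
        idx += 1
    cap.release()
    return keep

frames = participant_frame_indices("interview_recording.mp4")
```

The retained frames would then be passed to the face detector and downstream feature extractors.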
We extracted generic facial expression features using DINOv2 [38], a vision foundation model used here for facial representation, yielding 1024-dimensional visual embeddings. We also extracted facial emotion, landmarks, and action units with facial analysis pipelines used in previous mental health studies [65]–[67]. Facial emotion comprised 7 categories: neutral, happy, sad, surprised, fearful, disgusted, and angry [66]. Overall, our facial features were extracted at a 1 Hz sampling rate.
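A minimal sketch of the frame-level DINOv2 embedding extraction, assuming the ViT-L/14 variant (which produces 1024-dimensional embeddings) and a preprocessed 224x224 face crop; the input tensor here is a dummy placeholder:

```python
import torch

# Load the ViT-L/14 DINOv2 backbone from torch.hub (assumed variant).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
model.eval()

with torch.no_grad():
    face_crop = torch.rand(1, 3, 224, 224)   # stand-in for one face frame sampled at 1 Hz
    embedding = model(face_crop)             # class-token embedding, shape (1, 1024)
print(embedding.shape)
```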
rPPG signals capture physiological information through subtle variations in skin color resulting from blood volume changes in peripheral blood vessels; they are extracted from video recordings of a person’s face. We extracted rPPG signals using the pyVHR package [68], an rPPG extraction framework. To estimate heart rate from the rPPG signals, we analyzed their power spectral density in six-second windows, advancing one second at a time. The final HRV features were derived by taking the 5th, 25th, 50th, 75th, and 95th quantiles of the estimated beats per minute, representing statistical properties of HRV during the interview.
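A minimal sketch of this heart-rate summary step, assuming a 30 Hz rPPG trace (in practice produced by pyVHR; the synthetic signal and frequency band below are illustrative):

```python
import numpy as np
from scipy.signal import welch

fs = 30.0                                   # video frame rate (Hz), assumed
t = np.arange(0, 60, 1 / fs)
rppg = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)  # ~72 bpm toy signal

win, hop = int(6 * fs), int(1 * fs)         # 6-second windows advanced by 1 second
bpm = []
for start in range(0, rppg.size - win + 1, hop):
    seg = rppg[start:start + win]
    freqs, psd = welch(seg, fs=fs, nperseg=seg.size)
    band = (freqs >= 0.7) & (freqs <= 4.0)  # plausible heart-rate band (42-240 bpm)
    bpm.append(60.0 * freqs[band][np.argmax(psd[band])])

# Video-level HRV summary: 5th/25th/50th/75th/95th quantiles of estimated bpm
features = np.quantile(bpm, [0.05, 0.25, 0.50, 0.75, 0.95])
print(features)
```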
We first downsampled the audio to 16 kHz for acoustic feature extraction. We then extracted generic acoustic features from vocal tone every 20 ms using the WavLM [39] model, a foundation model for human speech analysis. We also extracted hand-crafted statistical acoustic features, such as spectral energy and entropy, every 100 ms using the pyAudioAnalysis package [69]; these features have been effective for depression and anxiety analysis [65].
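A minimal sketch of the WavLM embedding extraction, assuming the Hugging Face "microsoft/wavlm-base-plus" checkpoint (the specific WavLM variant is an assumption); the waveform is a placeholder for 16 kHz participant audio:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

waveform = np.random.randn(16000 * 5).astype(np.float32)   # 5 s of dummy 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state               # one embedding every ~20 ms
print(hidden.shape)                                           # (1, n_frames, 768)
```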
We encoded each participant’s entire interview transcript using LLaMA-65B [40] to capture the high-level context of the text as an 8192-dimensional representation. We also extracted 7 emotions (neutral, happiness, sadness, surprise, fear, disgust, and anger) and positive and negative sentiment at the utterance level using RoBERTa models [59], [70], [71]. Both are large language foundation models (LLMs).
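A minimal sketch of utterance-level sentiment and emotion scoring with RoBERTa-based classifiers; the specific checkpoints below are assumptions for illustration, since the exact models behind [59], [70], [71] are not named here:

```python
from transformers import pipeline

# Assumed checkpoints: a RoBERTa sentiment model and a DistilRoBERTa 7-emotion model.
sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment-latest")
emotion = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base",
                   top_k=None)

utterance = "I haven't been going out much lately, but I enjoyed our talk today."
print(sentiment(utterance))   # positive / negative / neutral scores
print(emotion(utterance))     # scores over the 7 emotion categories
```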
Once all modality features are extracted, they are aggregated over time to represent the entire interview for each participant. Specifically, we computed statistical features, such as the average and standard deviation, over the entire video sequence. We also trained a two-state Hidden Markov Model (HMM) with Gaussian observation models to capture the temporal dynamics of each biomarker time series using the SSM package [72]. The dimensionality of the observations was determined by the length of the feature set corresponding to the participant with the longest sequence. The statistical features used in our study are detailed in Figure 1.
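A minimal sketch of this two-state Gaussian HMM aggregation, using hmmlearn as a stand-in for the SSM package and a synthetic feature time series:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Synthetic 4-dimensional feature series with two regimes (placeholder data).
series = np.concatenate([rng.normal(0.0, 1.0, (60, 4)),
                         rng.normal(2.0, 1.0, (60, 4))])

hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
hmm.fit(series)

# Video-level descriptors derived from the fitted HMM: state means, transition
# probabilities, and the fraction of time spent in each state.
states = hmm.predict(series)
video_features = np.concatenate([
    hmm.means_.ravel(),
    hmm.transmat_.ravel(),
    np.bincount(states, minlength=2) / states.size,
])
print(video_features.shape)
```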
We applied a late-fusion approach for classification, which was more effective than early-fusion approaches in previous studies [65]. For each modality, we first used a logistic regression classifier with L2 regularization or a gradient-boosting classifier to classify the dichotomized ratings (high vs. low) for cognitive impairment and the other outcomes. We then aggregated the classification scores from all modalities using a majority-voting or average-score approach. Majority voting produces the final prediction by voting over the classification results of all modalities; the average-score approach averages the predicted class probabilities across modalities and selects the class with the higher average probability.
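A minimal sketch of this late-fusion scheme with per-modality L2-regularized logistic regression and both fusion rules; features and labels are synthetic placeholders, and fitting and scoring on the same data is for brevity only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=39)                       # dichotomized outcome (high/low)
modalities = {
    "linguistic": rng.normal(size=(39, 16)),
    "acoustic": rng.normal(size=(39, 16)),
    "facial": rng.normal(size=(39, 16)),
}

probs = []
for name, X in modalities.items():
    clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X, y)
    probs.append(clf.predict_proba(X)[:, 1])          # probability of the "high" class

avg_prob = np.mean(probs, axis=0)                     # average-score fusion
average_prediction = (avg_prob >= 0.5).astype(int)

votes = (np.array(probs) >= 0.5).astype(int)          # majority-voting fusion
majority_prediction = (votes.sum(axis=0) > len(probs) / 2).astype(int)
```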
To evaluate our multimodal analysis system, we used participant-independent 5-fold cross-validation with 20 repetitions. For each fold, we used 64%, 16%, and 20% of participants as training, validation, and testing splits, respectively. As evaluation metrics, we used the area under the receiver operating characteristic curve (AUROC or AUC) and accuracy, following previous work [65]. We also report the standard deviation of performance across all cross-validated models to characterize the variability of model performance.
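A minimal sketch of this evaluation protocol using repeated stratified 5-fold cross-validation; the data are synthetic placeholders and the validation split used for model selection is omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X, y = rng.normal(size=(39, 16)), rng.integers(0, 2, size=39)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print(f"AUC: {np.mean(aucs):.2f} +/- {np.std(aucs):.2f}")
```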
We evaluated unimodal and multimodal fusion models in various combinations to understand the relevance of each modality and the interplay between modalities for assessing cognitive function and psychological well-being in older adults. For multimodal fusion, we explored 1) acoustic and language fusion, 2) facial and cardiovascular fusion, and 3) all modalities combined. In addition to majority voting and average scores, we also studied the selected-score voting approach, which was effective for late fusion in previous work [65]. Unlike majority voting, which considers all modalities, selected voting only includes the classification results of modalities that achieved AUC > 0.5 on the validation set. This excludes noise from classifiers performing poorly due to irrelevant features during multimodal fusion.
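A minimal sketch of the selected-score voting rule; the per-modality validation AUCs and test-set probabilities below are illustrative placeholders:

```python
import numpy as np

# Only modalities with validation AUC > 0.5 contribute to the fused decision.
val_auc = {"linguistic": 0.68, "acoustic": 0.61, "facial": 0.47, "cardio": 0.52}
test_probs = {
    "linguistic": np.array([0.7, 0.4, 0.8]),
    "acoustic": np.array([0.6, 0.3, 0.7]),
    "facial": np.array([0.2, 0.9, 0.1]),
    "cardio": np.array([0.55, 0.45, 0.6]),
}

selected = [m for m, auc in val_auc.items() if auc > 0.5]   # drop weak modalities
fused = np.mean([test_probs[m] for m in selected], axis=0)
prediction = (fused >= 0.5).astype(int)
print(selected, prediction)
```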
We also evaluated classification performance using demographic variables, namely years of education, gender, age, and race, to assess their predictive value for cognitive impairment and the other outcomes. Years of education and age were used as continuous real-valued variables, and gender and race were represented as one-hot encoded categorical variables. We separately evaluated an age-only classifier in addition to the full demographic classifier, following previous work using age for prescreening MCI [73]. We also included demographic variables in all multimodal fusion analyses, assuming this information would typically be available from participants’ input in real-world deployment scenarios.
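A minimal sketch of this demographic encoding, with hypothetical example rows:

```python
import pandas as pd

# Age and years of education stay continuous; gender and race are one-hot encoded.
demo = pd.DataFrame({
    "age": [78, 82, 85],
    "education_years": [14, 16, 12],
    "gender": ["F", "M", "F"],
    "race": ["C", "A", "O"],
})
X_demo = pd.get_dummies(demo, columns=["gender", "race"])
print(X_demo)
```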
Our experimental results for quantifying cognitive function are shown in [table:uni_cog] for each modality and [table:multi_cog] for the multimodal fusion analysis. The results for quantifying social network, neuroticism, and psychological well-being are shown in [table:uni_psych] for each modality and [table:multi_psych] for multimodal fusion.
Our results show that differentiating high vs. low MoCA scores (4th column in [table:uni_cog] & [table:multi_cog]) is best achieved using the linguistic modality, LLaMA-65B, alone (0.64 AUC and 0.63 accuracy). This was followed by acoustic features (0.63 AUC and 0.58 accuracy) and demographic variables (0.62 AUC and 0.59 accuracy). Facial features and HRV features derived from rPPG were not useful indicators for quantifying MoCA. All multimodal fusion approaches in [table:multi_cog] underperformed the best unimodal approach using the linguistic feature, LLaMA-65B.
Quantifying MCI diagnosis (5th column in [table:uni_cog] & [table:multi_cog]) was also most effective when using language-based sentiment and emotion features (0.66 AUC and 0.69 accuracy), followed by the acoustic and language fusion model (0.66 AUC and 0.63 accuracy).
For CDR (6th column in [table:uni_cog] & [table:multi_cog]), the acoustic and language fusion model performed the best (0.78 AUC and 0.74 accuracy).
Our results showed that language-based emotion and sentiment features were most effective for quantifying the social network score, LSNS (4th column in [table:uni_psych] & [table:multi_psych]), with 0.75 AUC and 0.73 accuracy. Facial emotion, landmark, and action unit features showed the second-best performance with 0.6 AUC and 0.59 accuracy. When fusing modalities, only the all-modality fusion with majority voting matched the facial model, with 0.6 AUC and 0.56 accuracy.
For quantifying neuroticism (5th column in [table:uni_psych] & [table:multi_psych]), multimodal fusion of all modalities performed effectively with 0.71 AUC and 0.65 accuracy. This was followed by features from facial emotion, landmarks, and action units (0.69 AUC and 0.66 accuracy), which contributed the most when fusing multiple modalities.
Facial and cardiovascular fusion performed effectively when quantifying negative affect (6th column in [table:uni_psych] & [table:multi_psych]), with 0.79 AUC and 0.75 accuracy. Cardiovascular features alone achieved 0.76 AUC and 0.67 accuracy, contributing the most when fused with facial features.
Facial emotion, landmark, and action unit features were most useful when quantifying social satisfaction (7th column in [table:uni_psych] & [table:multi_psych]) during remote interviews (0.68 AUC and 0.63 accuracy).
When quantifying overall psychological well-being (8th column in [table:uni_psych] & [table:multi_psych]), facial emotion, landmark, and action unit features were most effective, with 0.66 AUC and 0.61 accuracy. This was followed by cardiovascular features (0.62 AUC and 0.60 accuracy), but fusing facial and cardiovascular features showed no improvement over cardiovascular features alone.
Overall, identifying low MoCA scores and a clinical diagnosis of MCI was challenging for our multimodal analysis system, yielding 0.64 and 0.66 AUC, respectively. This is possibly due to the narrow range of MoCA scores in this group: the majority (68%) of our subjects had MoCA scores between 21 and 28, with a median of 24, leaving limited variability to detect features that distinguish MoCA ≤ 24 from > 24. Conversely, quantifying CDR (0 vs. 0.5) proved more effective, achieving an AUC of 0.78. The CDR assesses functional outcomes in daily life (e.g., forgetfulness of events, functionality in shopping, and participation in volunteer or social groups), reflecting cognitive performance but not necessarily tied to cognitive testing scores. These functional aspects are likely better captured by acoustic and linguistic features from remote interviews than by measures such as the MoCA and clinical MCI diagnoses. It is also worth noting that, while age has been a significant factor for predicting MoCA scores in the general population [74], [75], it showed no predictive capacity in our analysis. This discrepancy may be due to the participants’ narrow age range (overall, 80.69 ± 4.6 years) shown in Table 1.
Our analysis indicates that language- and audio-based approaches were more effective for quantifying cognitive impairment than facial and HRV features. This is consistent with previous work reporting the effectiveness of speech and text analysis for quantifying cognitive impairment [76]–[81]. The proposed pipeline could potentially serve marginalized communities that lack internet connectivity sufficient for video transmission, such as rural areas or low- and middle-income countries [82]–[84], by prescreening older adults at high risk of cognitive impairment using only acoustic and linguistic features. Regarding the facial analysis, previous work reported that facial expressions are significantly different and heterogeneous among individuals with MCI [85], [86]. We suspect this heterogeneity caused the models based on facial features to underperform.
Overall, quantifying social isolation (LSNS) was most effective using language features. This is consistent with prior research indicating the utility of text sentiment analysis in understanding the social health of individuals [81], [87], [88]. The other scales of psychological well-being required facial video to quantify, and cardiovascular measures were most effective at quantifying negative affect, similar to findings from mental health studies across various populations using wearable-based cardiovascular health monitoring [89]–[91]. Negative affect, including mood disturbances, anxiety, and depression, can be an early sign of dementia; these emotional changes often precede noticeable cognitive decline and may be linked to neurodegenerative processes in the brain [92]. Our study demonstrates that contactless cardiovascular measures have the potential to quantify behavioral symptoms that often accompany MCI, which calls for further exploration.
Our findings show that facial, acoustic, and linguistic features from foundation models (DINOv2, WavLM, and LLaMA-65B) significantly underperformed, with an average absolute decrease of 22% in AUC compared with the best models using facial expression, acoustic, and language-based emotion features. These foundation models are trained to capture generic facial appearance, acoustic waveforms, and language embeddings from large-scale internet data covering individuals with diverse demographics and contexts. We suspect such features, when used directly, are not designed to capture the specific behavioral patterns of cognitive impairment or psychological well-being in older adults, especially in the MCI population. In future work, we will study transfer learning of these foundation models for capturing behavioral patterns in older subjects with MCI [93].
The proposed work studies the association of facial, acoustic, linguistic, and cardiovascular patterns with cognitive impairment and the associated social and psychological well-being in older adults with normal cognition or MCI. Our study demonstrated that features extracted from remotely conducted conversations can detect a broad range of symptoms linked to cognitive decline, including social engagement and emotional well-being. This remote assessment approach holds promise for the early identification of individuals at risk for cognitive decline, ultimately creating opportunities for timely interventions to slow or prevent further deterioration. Our study has several limitations.
First, this study includes subjects with normal cognition and MCI, which helps to quantify and contrast the behavioral characteristics related to the early signs and symptoms of MCI. However, the proposed method was evaluated on a small number of participants (N=39), mostly White. Our findings may not fully generalize to larger populations with diverse gender, racial, and ethnic backgrounds. Also, comprehensively quantifying cognitive impairment requires including various subtypes of MCI, such as amnestic or non-amnestic and single-domain or multi-domain MCI [94]–[96]. We excluded individuals diagnosed with ADRD, whose behavioral patterns could differ significantly from those examined in this study [97], [98]. Individuals also experience various comorbidities that influence behavior and the progression of cognitive impairment [99]–[101]. Our future studies therefore need to expand to diverse ethnic, racial, and gender groups with a larger number of participants and a wider range of conditions associated with aging, MCI, and ADRD to test the generalizability of our findings.
Second, the model was validated on conversations captured during the 1st week of the 48-week intervention, i.e., a cross-sectional analysis. With cognitive decline, older adults can manifest behavior changes over time [102], [103]. Our next exploration will take advantage of all the data and aim to identify longitudinal changes in outcomes using the features examined here. For this, we will explore model adaptation methods to tackle potential model degradation due to behavior changes over time [104]. Moreover, model personalization needs to be studied, as each patient’s rate of cognitive decline and behavioral change could depend on their personality and background [105], [106].
Third, a few video recordings had significantly lower quality, with low resolution or pixelation due to weak internet connectivity. The effect of such connectivity issues on model bias and performance degradation needs to be quantified, as internet connectivity can be less than ideal when the system is deployed in low-resource communities.
Fourth, we utilized the baseline interview data recorded during the first week of the trial to capture behaviors closest in time to the assessments of our subjects’ cognitive function and psychological well-being. However, participants were still becoming familiar with the study procedures during the initial sessions. As a result, they may have exhibited heightened nervousness or unfamiliarity, potentially influencing their behavior and responses in ways that do not fully represent their typical cognitive or emotional state. This consideration was not factored into our analysis. Future studies will explore data from later sessions to understand and mitigate this potential bias.
We will also explore other methods for modeling the temporal dynamics of facial, cardiovascular, audio, and language features. In this work, we mainly used the HMM, which showed varying performance across modalities. Contrary to our original hypothesis that modeling temporal dynamics would increase performance, we observed changes in absolute AUC ranging from -29% ([table:uni_cog], Acoustic) to +30% ([table:uni_psych], Acoustic) after adding HMM-based features. We chose a two-state HMM with Gaussian observations to ensure model convergence when trained on our dataset. However, this model may be too simple and suboptimal for modeling the complex temporal dynamics of facial, cardiovascular, audio, and language variation over the entire interview in relation to the various rating scales used in this study. To investigate this, we will explore state-of-the-art sequential models, such as recurrent neural networks, to extract temporal features in future work [107].
Hyeokhyen Kwon, Salman Seyedi, Bolaji Omofojoye, and Gari Clifford are partially funded by the National Institute on Deafness and Other Communication Disorders (grant # 1R21DC021029-01A1). Hyeokhyen Kwon and Gari Clifford are also partially supported by the James M. Cox Foundation and Cox Enterprises, Inc., in support of Emory’s Brain Health Center and Georgia Institute of Technology. Gari Clifford is partially supported by the National Center for Advancing Translational Sciences of the National Institutes of Health (NIH) under Award Number UL1TR002378. Gari Clifford and Allen Levey are partially funded by NIH grant #R56AG083845 from the National Institute on Aging. Hiroko Dodge is funded by NIH grants and serves as the CEO of the I-CONECT Foundation, a 501(c)(3) non-profit organization. The I-CONECT study received funding from the NIH: R01AG051628 and R01AG056102. The authors extend their gratitude to the participants of the I-CONECT study.