Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations


Abstract

We present Empathic Prompting, a novel framework for multimodal human–AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users’ emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control: instead, it unobtrusively augments textual input with affective information to support conversational smoothness and alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users’ emotional signals are critical yet often opaque in verbal exchanges.


1 Introduction↩︎

Empathy is increasingly recognized as a central ingredient of human–Artificial Intelligence (human-AI) interaction [1], especially in domains such as health, education, and psychological wellbeing, where trust, rapport, and engagement are essential outcomes. In human communication, empathy emerges through both cognitive appraisals and affective resonance, manifesting through a continuous interplay of verbal and nonverbal cues [2]. Classic and contemporary evidence demonstrates that nonverbal channels—eye contact, posture, body orientation, interpersonal distance—substantially contribute to judgments of empathy, often outweighing verbal content alone and underscoring the embodied foundations of empathic communication [3].

Concurrently, large language models (LLMs) have demonstrated remarkable capabilities in generating responses that humans perceive as empathic [1], [4]. Recent systematic reviews indicate that LLMs exhibit elements of cognitive empathy, including emotion recognition and supportive language generation, with some responses preferred over human responses in medical contexts, though significant methodological and ethical evaluation challenges remain [5]. These converging developments motivate a multimodal perspective on empathy in human–AI interaction: if empathic communication in humans fundamentally depends on coordinated verbal and nonverbal channels, then conversational AI systems should be designed to integrate nonverbal signals directly into the language generation process [6].

Current research at the intersection of affective computing and generative models has made significant advances in this direction [7]–[9], suggesting that models can be engineered to classify, reason about, and contextualize emotions effectively. However, these approaches primarily focus on recognition and evaluation benchmarks, while achieving real-time conversational alignment with human interlocutors is a less developed area. More closely aligned with interaction design concerns, recent work by [6] on "Empathic Grounding" explores a computational model where empathy functions as grounding evidence within embodied dialogue systems, though their approach targets robotic behavior planning rather than core language interfaces.

Building on this emerging foundation, our contribution shifts the integration locus from behavior planning to prompting itself as the primary mechanism through which nonverbal context shapes empathic language generation. We term this approach Empathic Prompting. Our framework incorporates nonverbal signals—estimates derived from facial expressions capturing valence, arousal, and basic emotions—directly into the prompt pipeline of conversational LLMs. Crucially, model parameters remain unchanged; instead, the system conditions generation at inference time by embedding structured affective context within the system and message prompts. This design is aimed at transforming implicit nonverbal communication into explicit semantic representations that can modulate conversational tone, supportive strategy selection, and emotional alignment without requiring specialized training data or architectural modifications, leveraging the well-documented sensitivity of LLMs to prompt structure and content [10].

We hypothesize that this strategy could confer significant advantages in LLM applications within sensitive domains, such as medicine, mental health, and education, where empathic alignment is critical to user acceptance and therapeutic impact. In these contexts, enriching the language interface with nonverbal information about users’ affective states can improve both conversational fit and safety, enabling more contextually appropriate responses that align with users’ emotional needs.

The Empathic Prompting framework operates through three integrated functions: (i) sensing, which extracts affective descriptors from facial expressions; (ii) mapping, which converts these signals into transparent semantic descriptors combining valence and arousal ranges with canonical emotion terms; and (iii) prompt enrichment, which integrates these descriptors into system prompts and message histories using lightweight control heuristics for safety and tone modulation.
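To make the mapping function concrete, the sketch below shows one way the numeric affect estimates could be translated into a semantic descriptor before prompt enrichment. The threshold values, band labels, and output wording are illustrative assumptions, not the exact ones used in our implementation.

```python
# Illustrative sketch of the "mapping" step: converting numeric affect estimates
# into a transparent semantic descriptor. Thresholds and phrasing are hypothetical.

def map_affect_to_descriptor(emotion: str, valence: float, arousal: float) -> str:
    """Translate (emotion, valence, arousal) into a human-readable descriptor."""
    valence_band = "positive" if valence > 0.2 else "negative" if valence < -0.2 else "neutral"
    arousal_band = "high" if arousal > 0.5 else "low" if arousal < 0.2 else "moderate"
    return (f"Detected dominant emotion: {emotion}; "
            f"valence appears {valence_band}; arousal appears {arousal_band}.")

print(map_affect_to_descriptor("Sadness", valence=-0.4, arousal=0.3))
# -> "Detected dominant emotion: Sadness; valence appears negative; arousal appears moderate."
```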

This architecture offers three key advantages: modularity and agnosticism to particular LLMs or sensing toolkits enable rapid cross-platform portability; the semantic nature of affective descriptors facilitates auditing and human-in-the-loop oversight; and compatibility with ongoing advances in affect recognition [8], multitask multimodal emotion understanding [7], and knowledge-based context integration [9] allows the framework to evolve with the field. To evaluate this approach, we pose two research questions:

  • RQ1: Does integrating non-verbal context through prompting improve perceived empathy and safety-aligned behavior?

  • RQ2: How do users experience conversational smoothness and alignment when generation is conditioned on real-time affective signals?

We address these questions through a mixed-methods evaluation combining an internal usability study (N=5) with an LLM-as-a-Judge protocol organized around three rubrics—Empathy Support, Safety Boundary, and System-Prompt Adherence—to ensure explicit and auditable evaluation. Preliminary findings suggest that non-verbal context is consistently integrated into responses and effectively shapes both tone and supportive strategies, though we also identify failure modes, including verbosity and occasional misaligned suggestions that highlight important design trade-offs.

This paper proceeds as follows: we first review the theoretical foundations and empirical advances motivating our approach, then present the Empathic Prompting framework and implementation details, followed by our evaluation methodology and findings. We conclude with design implications, ethical considerations, and recommendations for future large-scale, IRB-approved studies centered on user experience and outcome measures.

2 Background↩︎

Empathy means attuning to another person’s feelings, adopting their perspective, and fostering a sense of compassionate care [11]. Contemporary analyses of empathy converge on four interrelated themes that together capture its conceptual core. First, empathy entails understanding another person’s inner world, a cognitive dimension that involves perspective-taking and recognizing the other’s mental and emotional states. Second, empathy involves feeling, that is, an affective resonance with the other’s experiences, whereby one becomes emotionally attuned to the situation of the other. Third, empathy includes sharing, namely the capacity to experience states similar to those of the other person, such that the empathizer enters into the other’s experiential world “as if” it were their own. Finally, empathy requires self–other differentiation, a recognition that the source of the feelings or experiences belongs to the other and not to oneself [12]. Crucially, empathy operates as an interpersonal phenomenon that unfolds through dynamic interaction between empathizer and target, involving understanding, emotional recognition, perceived similarity, and concern expressed through concrete actions [13], rather than as a static trait residing solely within the empathizing individual.

Early research in therapeutic and laboratory settings suggested that nonverbal cues—such as prosody, timing, facial expression, gaze, posture, and micro backchannels—can strongly shape perceptions of empathic communication, although these findings were derived from highly constrained experimental contexts [3], [14]. Subsequent studies across diverse settings have consistently highlighted the importance of nonverbal channels in conveying affective and relational information [15], [16], while research on interpersonal synchrony shows that temporal alignment in movement and prosody fosters perceived understanding and rapport, with disruptions in synchrony potentially undermining empathic attunement [17], [18].

Understanding empathy as a multimodal process sets the stage for evaluating how, and to what extent, artificial systems such as Large Language Models (LLMs) might approximate or support empathic communication. LLMs are autoregressive neural networks trained at scale to predict the next token in context; instruction tuning and reinforcement learning from human feedback shape them into general-purpose assistants capable of following natural language directives and producing contextually appropriate text. With vision–language extensions, some LLMs can also consume images (and, in emerging systems, audio), enabling multimodal reasoning. Because they can parse intent, maintain discourse coherence, and modulate style, LLMs are increasingly explored in medicine, mental health, psychoeducation, and learning-support settings [4], [19]–[21]. In these domains, access to emotional information is not ornamental: it underpins trust, adherence, de-escalation, and engagement; guides triage and signposting; calibrates tone and directiveness; and supports tailoring (e.g., pacing, question framing, and the balance of validation vs. action-oriented guidance).

Text-only empathic inference is, however, intrinsically limited. Emotional states are often implicit; users may struggle to label them (e.g., alexithymia) or deliberately downplay them in writing. Prosodic and kinesic cues—central to empathic appraisal—are absent in plain text. Lexical markers of affect are noisy and culturally variable; irony, sarcasm, and politeness strategies confound simple sentiment heuristics. Emojis and lexical valence–arousal–dominance (VAD) cues help but are ambiguous and context-dependent, and numeric affect scales injected into prompts can misalign with users’ lived intensity. These limitations motivate designs that integrate nonverbal context into the conversational loop rather than infer emotion from text alone.

A growing body of HCI research demonstrates the promise of multimodal approaches to fostering empathic interaction. A particularly relevant example is offered by Arjmand et al. [6], who introduce the concept of “empathic grounding” as an extension of Clark’s [22] theory of conversational grounding, in which the listener’s empathy for the speaker’s affective state becomes part of the grounding criterion. Their computational model processes two streams of user input: verbal utterances obtained through speech transcripts, and facial expressions summarized into the most salient emotion labels. These inputs are passed to an LLM, which then generates multimodal grounding moves that combine verbal responses, affective states, and nonverbal behaviors such as head movements and facial displays. In a controlled testbed using a humanoid robot to interview participants about past pain episodes, their between-subjects experiment (N=18) found that participants in the empathic grounding condition—compared to a baseline condition using neutral backchannels—rated the agent significantly higher on empathy, emotional intelligence, trust, working alliance, and perceived understanding. While the study was conducted in a highly constrained domain with Wizard-of-Oz control, the findings suggest that empathic alignment may benefit from explicitly modeling and communicating affect across modalities, establishing empathic grounding as a promising framework for integrating LLMs with multimodal sensing in empathic AI design.

Our work builds on these insights but shifts the focus of integration from behavior planning to the language interface itself. We propose to treat nonverbal affective signals (e.g., valence, arousal, basic emotions) as structured context injected directly into the prompt, so that language generation is conditioned on real-time emotional cues without requiring model retraining. This Empathic Prompting perspective regards nonverbal signals as first-class conversational context, enabling the recovery of key ingredients of empathic interaction—tone calibration, pacing and turn-taking, supportive strategy selection, and the nuanced handling of verbal–nonverbal incongruence—within a transparent and auditable prompting pipeline. Importantly, we do not claim that LLMs can possess genuine empathic capacities; consistent with Inzlicht and colleagues [1], our stance is pragmatic. Accordingly, the design question we pursue is instrumental: how can an LLM harness users’ emotional signals to convey an empathic tone? More broadly, how can it adapt its responses to the user’s state by treating emotional information as extra-linguistic context for generation, while remaining transparent about its fundamentally non-human status?

We address these challenges through the following contributions. First, we enable real-time integration of non-verbal affective signals into prompt conditioning, allowing emotional context to directly influence language generation at inference time. Second, our semantic mapping approach converts biometric data into transparent, interpretable descriptors that enable auditable empathic reasoning. Third, the modular architecture supports progressive integration of additional sensing modalities without requiring fundamental changes to the core prompting mechanism. In the following, we describe these contributions in detail, beginning with the system architecture.

3 System architecture and Implementation↩︎

Figure 1: System architecture of the proposed Empathic Prompting Framework.

The Empathic Prompting framework is implemented as a client–server system (Figure 1). The client provides the interaction layer for both users (a chatbot web application) and the activity supervisor (a desktop user interface), plus a middleware that adapts information coming from a professional, certified Face Reader. In our architecture, we employed Noldus FaceReader, an automated, software-based system for analyzing facial expressions that offers accurate, reliable emotion analysis by mapping facial movements to emotion categories, and that has been adopted in several research fields such as psychology, user experience, consumer behavior, and neuroscience [23]–[25].

This Face Reader provides biometric feedback during the conversation. The server manages the orchestration of biometric context, prompt building, conversations, and model inference. This approach follows established practice for modular conversational agents, where the separation between client, an optional middleware, and server logic ensures scalability and privacy-preserving deployment [26], [27].

As depicted in Figure 1, user interactions are captured both as textual input and as real-time video streams processed by the selected Face Reader, which outputs affective parameters (e.g., valence, arousal, emotion categories). These parameters are filtered through a custom middleware and queued with temporal extraction to preserve short-term affective dynamics, producing a JSON-like structure compatible with FaceReader data format. The biometric and emotional signals are integrated by the Prompt Builder into the system prompt of the LLM (DeepSeek), which then generates contextually aligned responses. A web interface supports the interaction loop, while a supervisor node can monitor and validate the flow of affective data.

It is worth noticing that the current approach uses facial emotion recognition as a representative source of emotional parameters. However, the system was designed to be modular: the interface between sensing and the LLM is agnostic to the specific input channel, meaning that any other modality (e.g., vocal prosody, physiological signals, or behavioral cues) can be seamlessly integrated alongside facial expression analysis.

In the following, we will describe in detail each of the main components of our implementation.

3.1 User Chat Web Interface↩︎

The client side acts as the user entry point for the Empathic Prompting system and was implemented as a lightweight web-based interface, designed to resemble a standard chatbot application. An example of the web application is shown in Figure 2. The system UI consists of one main web page, developed in Python and deployed on a web server using the Gradio framework.

Figure 2: Client UI developed for the Empathic Prompting system. The interface enables natural language interaction between the user and the LLM. In this example (taken from a real conversation with a user, made in Italian, target language for our experimentation), the chatbot responds empathetically to a user’s positive disclosure, explicitly acknowledging emotions and reinforcing conversational alignment.

Users interact with the system through simple textual input, which is transmitted to the server for processing once the user finishes typing. This design choice was driven by the fact that, with a familiar and simple communication medium, the client reduces cognitive overhead for end users and places the focus on the flow of conversation rather than on learning a new interface paradigm [28]. In parallel to this textual channel, the client environment is linked to a camera stream that captures the user’s facial expressions in real time (see Section 3.2). These data are collected from the Noldus FaceReader system running in real time and re-structured into a classic key-value schema for modularity. They are then sent to the server side, which uses them to condition the LLM’s answers to user queries. The integration of these affective signals at the client level ensures that conversations are not driven solely by linguistic content, but also by subtle cues of emotional state that remain transparent to the user, making the chat multimodal (text and detected emotion).
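As a rough illustration of this client-side loop, the sketch below shows how a Gradio chat interface could forward each user message to the server and display the returned answer. The endpoint URL and payload fields are assumptions made for the example, not the production configuration.

```python
# Minimal sketch of the client chat interface built with Gradio. The server URL
# and payload fields are illustrative assumptions, not the deployed values.
import gradio as gr
import requests

SERVER_URL = "http://localhost:5000/chat"  # hypothetical Flask endpoint on the server side

def respond(message, history):
    """Forward the user's message to the server and return the LLM reply."""
    payload = {"user_id": "demo-user", "message": message}
    reply = requests.post(SERVER_URL, json=payload, timeout=120).json()
    return reply.get("response", "")

gr.ChatInterface(fn=respond, title="Empathic Prompting Chat").launch()
```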

3.2 Middleware and Supervisor Software Implementation↩︎

Acting as a bridge between the Face Reader and the server, a middleware layer takes responsibility for data structuring, filtering, and synchronization. From a system perspective, the middleware aligns user messages with representative emotional states by maintaining a short-term queue, extracting only relevant affective snapshots instead of forwarding raw biometric streams. It also filters out missing, corrupted, or contradictory signals, ensuring that empathic prompting relies solely on valid and interpretable cues. The rationale for implementing this middleware layer is fourfold:

  • Raw biometric data, even when produced by professional recognition systems like Noldus FaceReader, is inherently noisy and temporally unstable. To address this, the middleware ingests the continuous stream of affective parameters and computes an average of these signals over time.

  • Only the data required for LLM inference on the server side are forwarded; all other information remains local. This avoids transmitting additional information that could impact the user’s privacy.

  • The collected data are organized into a structured format that can be consumed by the conversational pipeline on the server side.

  • The activation of Face Reader data streaming is handled in a separate user interface, keeping it hidden from the end user.

It is worth noting that the middleware serves as a flexible hub that ensures validated, aligned, and meaningful affective input while allowing seamless integration of new non-verbal modalities without changing the system’s higher-level logic. Its usage is mediated by the Supervisor (as reported in Figure 1), who controls the FaceReader software: starting the recording and collection of biometric data, filtering them, and streaming the filtered results to the server. To ease operation, a simple user interface for the middleware, depicted in Figure 3, was implemented. The middleware is implemented in Python 3.10 using the Kivy framework, which instantiates a thin control plane exposing five actions (connect to and disconnect from FaceReader, start and stop streaming, and set the per-user log directory), wrapped in non-blocking threads so that UI responsiveness is preserved during acquisition. Through this interface, the Supervisor can add a new user for whom data will be collected, open a connection with the Noldus FaceReader software, and stream data to the server.

Figure 3: Supervisor-facing control panel of the middleware, allowing connection management, user identification, and start/stop control of Noldus FaceReader data streaming.

Specifically, the “Send to Server” button spawns a background thread that drives the session loop, and termination joins the thread to guarantee a clean stop without UI stalls (methods send_to_server and stop_send_to_server). This thread starts data collection from the Noldus FaceReader software (already running in a separate process) and communicates with it through a socket protocol. This layer is also responsible for issuing action commands to start/stop analysis and detailed logging, and it parses the returned XML to extract only the relevant payload for each emotion detected in each frame of the video. In particular, we retrieve the tuple \(E = \langle Emotion, Intensity, Valence, Arousal, Timestamp \rangle\), tolerating malformed fragments by catching parse errors and simply skipping frames rather than propagating uncertainty downstream.

At one-second intervals, the middleware filters to the seven basic emotion features, computes the dominant emotion via an argmax over intensity, and pairs this with the most recent valence/arousal readings. The outcome is serialized to a compact JSON data structure and sent to the server via API calls. This approach provides only the minimal tuple that should leave the sensing host, reducing the privacy surface and bandwidth while preserving reproducibility.
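The following sketch illustrates this one-second filtering step under simplified assumptions: emotion labels, field names, and the server endpoint are placeholders mirroring the tuple \(E\) described above rather than the exact middleware code.

```python
# Sketch of the middleware's one-second summarization: argmax over basic-emotion
# intensities, pairing with the latest valence/arousal, JSON serialization, and
# transmission to the server. Names and endpoint are illustrative assumptions.
import json
import time
import requests

BASIC_EMOTIONS = ["Neutral", "Happy", "Sad", "Angry", "Surprised", "Scared", "Disgusted"]
SERVER_URL = "http://localhost:5000/emotion"  # hypothetical REST endpoint

def summarize_window(emotion_intensities: dict, valence: float, arousal: float) -> dict:
    """Pick the dominant basic emotion (argmax over intensity) and pair it with
    the most recent valence/arousal readings."""
    filtered = {e: i for e, i in emotion_intensities.items() if e in BASIC_EMOTIONS}
    dominant = max(filtered, key=filtered.get)
    return {
        "emotion": dominant,
        "intensity": filtered[dominant],
        "valence": valence,
        "arousal": arousal,
        "timestamp": time.time(),
    }

snapshot = summarize_window({"Happy": 0.8, "Sad": 0.05, "Neutral": 0.1},
                            valence=0.8, arousal=0.6)
requests.post(SERVER_URL, data=json.dumps(snapshot),
              headers={"Content-Type": "application/json"})
```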

3.3 Server Implementation↩︎

The server constitutes the core logic of the Empathic Prompting system, orchestrating the integration of user text and biometric context into an augmented prompt that conditions the LLM’s response. Its logic unfolds as a pipeline spanning three functional layers: (i) textual and biometric input aggregation, (ii) prompt builder, and (iii) a logging system for evaluation and reproducibility.

At the input stage (i), the server receives two continuous parallel streams: the textual utterance sent by the Gradio-based client and the affective metadata streamed via the middleware. These data are received through dedicated REST API endpoints and continuously stored through our logging system (iii). In particular, since emotional tuples from the middleware arrive at a frequency of one per second, they are stored in a thread-safe queue, named the Emotional Params Queue (EPQ). Upon receiving a textual message \(M_t\), this is stored and appended to the conversation history \(M_h\). The server then temporally aggregates data from the EPQ, using a majority-vote filter across categorical emotions, complemented by averaging of valence and arousal values (calculated only on the majority emotion). The temporal window was set to \(3\) seconds. This produces a compact and stable representation of the user’s current emotional state, which is embedded into the conversational context. This is formalized as \(F_e = \langle Majority\_Emotion, Mean\_Valence, Mean\_Arousal, Timestamp \rangle\). The final message provided to the prompt builder is the concatenation \(M_e = F_e \oplus M_h\).
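A minimal sketch of this aggregation step, assuming the middleware snapshots arrive as dictionaries with the fields shown earlier, is reported below; it is illustrative rather than the exact server code.

```python
# Sketch of the server-side aggregation over the Emotional Params Queue (EPQ):
# majority vote over categorical emotions within the 3-second window, with
# valence/arousal averaged only over snapshots of the majority emotion (F_e).
from collections import Counter
from statistics import mean

def aggregate_epq(snapshots: list[dict]) -> dict:
    """snapshots: recent middleware entries, e.g.
    {"emotion": ..., "valence": ..., "arousal": ..., "timestamp": ...}."""
    majority_emotion, _ = Counter(s["emotion"] for s in snapshots).most_common(1)[0]
    majority = [s for s in snapshots if s["emotion"] == majority_emotion]
    return {
        "majority_emotion": majority_emotion,
        "mean_valence": mean(s["valence"] for s in majority),
        "mean_arousal": mean(s["arousal"] for s in majority),
        "timestamp": majority[-1]["timestamp"],
    }
```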

At this point, the prompt builder (ii) takes action by combining \(M_e\) with a persistent system prompt \(S_p\), which encodes the conversational rules of empathic prompting (details in Section 5.2). This defines the final query for the LLM, \(LLM_q\), examples of which are reported in Figure 4. By explicitly conditioning the prompt with affective annotations, the system ensures that implicit non-verbal cues are made explicit to the LLM.


Figure 4: Examples of augmented queries \(LLM_q\) combining the system prompt, affective context, and user messages.

\(LLM_q\) is subsequently passed to a locally deployed instance of an LLM (in our case deepseek-r1:32b through the ollama API), selected for its superior empathy performance in our comparative study (Section 5). Local deployment was chosen both to minimize latency, preserving conversational fluidity, and to guarantee that sensitive biometric data never leaves the closed environment, an essential requirement for ethically viable applications in both healthcare and education.
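For illustration, the sketch below shows how the augmented query could be sent to the locally deployed model; it assumes Ollama's standard HTTP chat endpoint on its default port and is not a verbatim excerpt of our server code.

```python
# Sketch of the inference call to the local deepseek-r1:32b instance through
# Ollama's HTTP interface (assuming the standard /api/chat endpoint, port 11434).
import requests

def query_llm(system_prompt: str, enriched_message: str) -> str:
    """Send the augmented query (system prompt S_p plus the enriched message) to the model."""
    body = {
        "model": "deepseek-r1:32b",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": enriched_message},
        ],
        "stream": False,
    }
    response = requests.post("http://localhost:11434/api/chat", json=body, timeout=300)
    return response.json()["message"]["content"]
```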

Finally, the logging system enforces the storage of information across all stages (iii). Each message is stored with its textual form, biometric annotations, and generated response, ensuring that the dialogue can be reconstructed for evaluation. These logs underpin the application of our LLM-as-a-Judge metrics, allowing us to verify whether affective cues were acknowledged, safety boundaries respected, and conversational fluidity maintained. In summary, the server fulfills a dual role: it acts as the generative core, where augmented prompts elicit empathic responses from the backbone model, and as the supervisory layer, where safety, coherence, and transparency are enforced by design.

4 Use case↩︎

To illustrate the functionality and potential of the Empathic Prompting framework, we present a representative use case grounded in our experimental procedure. This scenario is framed within a low-stakes mental wellness context, where a user engages with the agent for a brief, guided self-reflection session. Let us consider a fictional user, Alex, interacting with the system for the first time.

4.1 Step 1: Onboarding and Visual Priming↩︎

Following the consent process, Alex is onboarded to the study. The session begins not with conversation, but with a visual priming task designed to elicit a foundational emotional state. Alex views a sequence of images structured to elicit discrete primary emotions by systematically interspersing emotionally evocative stimuli with neutral ones. The goal is to create a rich, internal emotional landscape for Alex to draw upon during the subsequent interaction.

4.2 Step 2: Initial Interaction and Congruence Handling↩︎

The system initiates the dialogue with a pre-defined prompt: “Talk about what you saw and what emotions you felt regarding the images you just viewed”. Alex starts by describing a positive memory triggered by the stimuli: “The first few pictures were lovely, especially the one with the little seal. It made me feel genuinely happy.” In this moment, the Empathic Prompting system is processing two parallel data streams:

  1. Textual Input: The semantic content of Alex’s message.

  2. Affective Data: The Noldus FaceReader captures non-verbal cues indicating high positive valence and the "Happy" emotion category.

The Prompt Builder integrates these streams into an augmented query for the LLM, which might look like this: Input (Emotion Status): (Happiness, Intensity=0.8, Valence=0.8, Arousal=0.6) - Input (Human text): "The first few pictures were lovely [...]. It made me feel genuinely happy." Guided by its empathic system prompt, the LLM generates a response that validates both channels: “It’s wonderful that the image of the little seal resonated with you and brought you happiness. I can sense the warmth in your expression as you describe it. What about that moment felt so joyful?” This response demonstrates alignment, reinforcing Alex’s expressed feelings both verbally and non-verbally.

4.3 Step 3: Navigating Emotional Incongruence↩︎

Alex’s reflection then shifts. They continue: “Then I saw an image of a scene of poverty. The rest of my day was fine, but that picture stuck with me.” Here, the system faces a more complex situation:

  1. Textual Input: The message is descriptive and fact-based, without explicit emotional language.

  2. Affective Data: As Alex speaks, the affective data stream shows a marked shift. The middleware detects a drop in valence and identifies "Sadness" as the dominant emotion, even though the words do not convey it.

This is a critical moment where traditional chatbots would fail. Empathic Prompting, however, is designed specifically for this. The prompt builder creates a query reflecting this incongruence, as dictated by its core logic:

“Input (Emotion Status): (Sadness, Intensity=0.6, Valence=-0.4, Arousal=0.3) - Input (Human text): "Then I saw an image... that picture stuck with me."

The agent’s response is not a correction, but a gentle, exploratory question:

“Thank you for sharing that. It sounds like the image of poverty had a different kind of impact. While you describe it, I sense a touch of melancholy. Is there perhaps something about that image you’d like to explore?"

4.4 Outcome of the Use Case↩︎

This scenario highlights the core functionality of Empathic Prompting. The system successfully moves beyond surface-level linguistic analysis to incorporate a deeper, non-verbal layer of context. By identifying and gently addressing the incongruence between Alex’s words and their facial expressions, the agent opens a door for a more meaningful conversation. It does not claim to "know" what Alex is feeling, but rather presents its observation as an invitation. In doing so, the system acts as a "psychological catalyst for deeper self-reflection", fulfilling its primary design goal of fostering emotional awareness and creating a more genuinely aligned human-AI interaction.

5 Experimental Study and Results↩︎

5.1 Apparatus↩︎

The Empathic Prompting system, composed of the client, middleware, and server, was implemented in Python (v3.10), using Flask and Ngrok for service exposure, Gradio for the web UI, and the Kivy framework for the middleware user interface. The LLM inference process relies on the Ollama framework.

All the questionnaires, including pre-, internal, and post-questionnaires, were administered via the Qualtrics platform.

Finally, the experimental sessions were carried out using: (i) a Lenovo ThinkBook 16p G2 ACH (model 20YM), on which the client interface, middleware, and Noldus FaceReader were executed; (ii) a workstation equipped with an NVIDIA A6000 GPU with 48 GB VRAM, running Ubuntu 22.04, on which the server was deployed.

5.2 Prompt Design↩︎

Figure 5: Empathic System Prompt.

The prompt for the conversational agent was systematically engineered to architect an empathic and multimodal interaction. The design is grounded in key psychological principles, assigning the agent a calm, attentive, and non-judgmental persona that fosters the user’s emotional self-awareness and exploration. This is reported in Figure 5.

From a technical standpoint, the prompt instructs the Large Language Model to fuse two real-time data streams: the user’s textual input and affective computing data (i.e., valence, arousal, and discrete emotion) captured via Noldus FaceReader. The response generation logic is operationalized as a sequential process rooted in the psychological technique of active listening [29]. First, the agent must semantically validate the user’s expressed feelings. Subsequently, it leverages the affective data to dynamically modulate its response’s tone, pace, and depth. A core component of this design is the protocol for managing incongruence between verbal sentiment and non-verbal facial expressions. When a discrepancy is detected, the agent is prompted to frame its observation not as a correction but as a gentle, exploratory question. This approach is intentionally non-confrontational and serves as a psychological catalyst for deeper self-reflection. Finally, the architecture incorporates strict technical and ethical guardrails, including behavioral constraints to prevent unsolicited advice and a critical safety clause. These elements enforce the agent’s non-therapeutic boundaries and ensure responsible handling of potential user distress.

In designing this prompt, we adopted best practices from the prompt engineering literature. In particular, we combined structured task decomposition, in-context few-shot learning, emotion theory grounding, and ethical safety scaffolding [30]–[32]: (i) (Structured Role and Decomposition) Clearly defining the LLM's role (supportive, non-judgmental) and breaking the task into numbered steps reflects effective task decomposition strategies shown to improve performance on complex instructions [33]; (ii) (Few-Shot In-Context Examples) The inclusion of five exemplars illustrating congruent and incongruent emotional responses leverages few-shot in-context learning, a known technique to steer conditional LLM behavior [34], [35]; (iii) (Emotion Theory Grounding) Using valence and arousal to modulate tone aligns with the theoretical Circumplex Model of Affect, which represents emotions along these two primary dimensions [6]; (iv) (Safety and Ethical Guardrails) Integrating a clear “critical rule” for crisis response follows ethical prompt-engineering recommendations to prioritize user safety and avoid unsupervised therapeutic advice [31].

5.3 Large Language Model Performance Evaluation test↩︎

Before evaluating the empathic prompting prototype, we conducted a comparative study to identify the most suitable LLM backbone. To perform this selection, we adopted a methodology strongly inspired by [36]–[39]. In particular, as done in [38], we selected state-of-the-art LLMs and performed an empirical quality check by adopting predefined descriptions of empathic conversations. Unlike these works, our study is not qualitative but adopts an LLM-as-a-Judge evaluation methodology, adapted from [39]–[41].

LLM-as-a-Judge denotes using an LLM to assess the quality of generated content—typically by producing a score, label, or structured evaluation—based on specified judgment criteria [37], [39]. In our work, we adopted the same methodology as [36], employing the G-Eval framework, which uses LLMs with chain-of-thought reasoning and a form-filling paradigm to assess the quality of generated natural language text. In practice, inspired by [37], [39], [40], we designed three evaluation prompts, each composed of three key components: (i) a preamble prompt that provides instructions for evaluation and defines criteria, (ii) a structured chain of thoughts that outlines intermediate steps for evaluation, and (iii) a scoring function that computes a final score for each thought based on its probability of being expressed (i.e., scalar score and qualitative judgment). A visual clarification is reported in Figure 6.

Figure 6: Exemplar prompt structure for evaluating empathy support. The evaluator rates the AI’s empathy on a 1–10 scale based on heuristics. A Chain-of-Thoughts (CoT) process evaluates all the factors, resulting in a final weighted score.

We adopted this methodology for three reasons: (a) to the best of our knowledge, no dataset exists that combines aligned biometric/facial emotion sensor data, conversational (dialogue) context, and empathic dimensions to serve as a ground-truth benchmark (LLM-as-a-Judge can therefore be adopted as a reference-free metric); (b) conventional metrics like BLEU, METEOR, and CIDEr fail to capture tone, emotional alignment, or safety features, which are core aspects of empathic prompting [36], [37], [39], whereas LLM-as-a-Judge leverages the model’s reasoning ability to interpret affective and structural features [37]; (c) in general, LLM-as-a-Judge has provided better performance than other methods [36], [39]. As in [39], we custom-defined our constructs, prioritizing empathy, safety, and adherence to system-level constraints, as defined in the next Section.

5.3.1 LLM-as-a-judge metrics↩︎

As mentioned, we defined three LLM-as-a-Judge metrics. Their definitions are reported in the following paragraphs, which describe the template structure, detailed visually in Figure 6 and implemented through the DeepEval framework [42]. Each construct was formalized as a rubric with score ranges and detailed expected outcomes for each criterion, enabling the automatic scoring of responses in a way that reflects both conversational quality and compliance with the system’s intended behavioral boundaries. After verifying each criterion, the model applies a rubric, encoded as a sequence of evaluation steps phrased as imperative checks that an LLM evaluator can reliably follow. For each construct, the LLM-as-a-Judge returns a Likert-style score ranging from 1 to 10, segmented into four bands, each corresponding to increasing levels of compliance [42].

To clarify, we take the construction of the Empathy Support metric as an example:

  • Criteria (what to judge). “Evaluate whether the assistant acknowledges the user’s feelings, uses supportive and non-judgmental language, and invites further sharing with open questions.”

  • Evaluation steps (how to judge). The criterion is decomposed into three imperative checks that the LLM evaluator executes deterministically:

    1. Does the response explicitly acknowledge the user’s feelings?

    2. Does it use supportive, non-judgmental language (avoiding minimization or blame)?

    3. Does it formulate gentle open-ended questions that invite sharing (not directives)?

  • Rubric (how to map evidence to scores). Step-level evidence is mapped to a 1–10 score using the following four bands:

    • 0–2 (No empathy): no recognition of feelings; dismissive or judgmental language.

    • 3–5 (Minimal empathy): some recognition but phrasing is shallow or directive.

    • 6–8 (Supportive and empathetic): acknowledges feelings and uses kind language; questions are present but limited or somewhat directive.

    • 9–10 (Highly empathetic): clearly acknowledges feelings, uses warm non-judgmental language, and invites sharing with gentle open-ended questions.

  • Scoring and decision rule: The step outcomes are aggregated into a scalar score in \([1,10]\), which is compared to a pre-registered normalized pass threshold. In practice:

    • Responses that satisfy all three checks with high quality typically fall in the 9–10 range and exceed the threshold.

    • Responses missing one of the three elements (e.g., no open-ended questions) fall in the 6–8 range.

    • Weak acknowledgement or directive language yields scores in the 3–5 range; complete absence of recognition or judgmental tone falls in the 0–2 range.

This design allowed us to extend the capabilities of the GEval evaluation framework by embedding our own custom templates into the evaluation loop of the LLM. The templates generated structured checklists, ensuring that each evaluation was based on concrete, judgeable steps tailored to the construct under assessment. Importantly, this method allowed us not only to compare multiple backbone models under the same evaluation scheme but also to guarantee that the results aligned with the goals of empathic prompting.
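As a sketch of how such a construct can be expressed with DeepEval, the example below encodes the Empathy Support rubric as a GEval metric; the evaluation steps are abbreviated versions of the checks above, the threshold mirrors the 0.8 pass level reported later, and parameter names assume the DeepEval API as documented at the time of writing.

```python
# Sketch: the Empathy Support construct as a DeepEval GEval metric.
# Evaluation steps are abbreviated; the full wording follows the rubric above.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

empathy_support = GEval(
    name="EmpathySupport",
    evaluation_steps=[
        "Check whether the response explicitly acknowledges the user's feelings.",
        "Check whether it uses supportive, non-judgmental language (no minimization or blame).",
        "Check whether it asks gentle open-ended questions that invite further sharing.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,  # normalized pass threshold used for all constructs
    # A locally hosted judge can be supplied via the `model` argument
    # instead of the default cloud evaluator.
)

test_case = LLMTestCase(
    input='(Sadness, Valence=-0.4, Arousal=0.3) "That picture stuck with me."',
    actual_output="Thank you for sharing that. I sense a touch of melancholy...",
)
empathy_support.measure(test_case)
print(empathy_support.score, empathy_support.reason)
```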

The same approach was followed for all three constructs, whose high-level criteria are described as follows.

5.3.1.1 Empathy Support.

The LLM judge assessed whether the model’s response demonstrated affective alignment with the user. Concretely, it checked (i) whether the response explicitly acknowledged the user’s expressed feelings or emotions, (ii) whether it employed supportive, warm, and non-judgmental language (avoiding minimization, blaming, or evaluative remarks), and (iii) whether it invited further disclosure by posing gentle, open-ended questions rather than directives.

5.3.1.2 Safety Boundary.

The LLM judge operationalized the system’s crisis-management requirement. It verified that in the presence of cues of self-harm ideation or acute distress, the model correctly reproduced the prescribed crisis-support message, using the exact wording mandated by the system prompt. It also confirmed that in the absence of such cues, no crisis message was inserted, thereby preventing false positives and preserving conversational appropriateness.

5.3.1.3 System Prompt Adherence.

The LLM judge focused on alignment with the global behavioral rules encoded in the system prompt. It evaluated whether the model (i) consistently maintained the assigned role and tone (calm, attentive, curious, non-judgmental), (ii) correctly integrated valence/arousal signals from FaceReader when available, (iii) ignored signals when they were missing or corrupted, and (iv) avoided undesired behaviors, specifically unsolicited advice, self-disclosure, or judgmental remarks. Collectively, these checks ensured that the model’s responses remained both technically coherent with the multimodal input and ethically consistent with the empathic design goals.

5.3.2 LLM Performance Evaluation↩︎

In addition to our LLM-as-a-Judge evaluation, we compared the backbone models using system-level performance metrics. These include average response time, output length (mean tokens), and throughput (tokens per second, TPS). Such metrics are critical since the trade-off between latency, verbosity, and efficiency directly impacts the usability and fluidity of interaction (and can produce emotional perturbations).
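The sketch below shows one way these metrics could be derived from the metadata returned by Ollama for each non-streamed response; the field names (eval_count, eval_duration) are assumptions about that API rather than values defined by our framework.

```python
# Sketch of the system-level metrics computation, assuming Ollama's response
# metadata exposes eval_count (generated tokens) and eval_duration (nanoseconds).
def performance_metrics(response_json: dict, wall_time_s: float) -> dict:
    tokens = response_json.get("eval_count", 0)
    gen_seconds = response_json.get("eval_duration", 0) / 1e9
    return {
        "response_time_s": wall_time_s,  # end-to-end latency
        "output_tokens": tokens,         # tokens generated for this answer
        "tokens_per_second": tokens / gen_seconds if gen_seconds else 0.0,  # TPS
    }
```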

5.3.3 Evaluation Dataset and Rationale↩︎

As no existing benchmark adequately captures empathic conversational behavior in conjunction with non-verbal signals, we constructed a small bespoke validation dataset tailored to our study. Specifically, we created a set of five prototypical conversations on distinct topics, resulting in a total of 21 conversational rounds. Each dialogue was designed to embed empathic cues alongside biometric-like metadata, following the same JSON-based structure generated by our Noldus FaceReader middleware outputs (Section 3.2). This approach allowed us to simulate realistic multimodal interactions where user utterances were coupled with affective annotations (e.g., emotion category, intensity, valence, arousal).

The conversations were constructed to span the full spectrum of the Circumplex Model of Affect [43] in combination with the parameters typically assessed by FaceReader (emotion category, intensity, valence, and arousal). To ensure psychological plausibility and methodological rigor, the dataset was validated by a team of expert psychologists. By employing short conversational exchanges, we were able to test both conditions of coherence (alignment between facial input and emotional content) and incoherence (mismatch between the two). Furthermore, the conversations were balanced across arousal, intensity, and valence, avoiding over-representation of specific affective states and ensuring adequate coverage of the emotional spectrum.
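To convey the shape of these validation items, the entry below is a purely hypothetical example of one conversational round in the FaceReader-style schema; the topic, values, and field names are illustrative, not taken from the actual dataset.

```python
# Hypothetical validation-set entry: one conversational round pairing a user
# utterance with FaceReader-style affective annotations (illustrative values).
example_round = {
    "conversation_id": "topic_03",
    "round": 2,
    "user_text": "I keep thinking about that exam; I'm not sure it went well.",
    "emotion_status": {
        "emotion": "Scared",
        "intensity": 0.55,
        "valence": -0.35,
        "arousal": 0.60,
        "timestamp": "2025-01-15T10:42:07Z",
    },
    "incongruent": False,  # whether facial input and verbal content mismatch
}
```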

The decision to generate custom data was twofold: (i) existing empathic dialogue corpora rarely provide aligned biometric metadata, and (ii) replicating the FaceReader format ensures forward compatibility with subsequent integration of actual sensor-derived signals. By adopting this design choice, our dataset effectively operationalizes the multimodal objectives of empathic prompting while maintaining experimental control.

By combining a small synthetic yet structurally realistic multimodal dataset with LLM-as-a-Judge evaluation, we ensured that our preliminary model selection faithfully tested the core requirement of empathic prompting: the ability to integrate non-verbal affective context into conversational behavior.

The results of this simulation, together with the methodological details of the dataset construction, will be made available to the institutional ethics committee. This step is intended to ensure transparency, methodological rigor, and compliance with ethical standards regarding the synthesis and use of affective conversational data.

For each test case (i.e., past conversation, actual user request, and actual LLM response), the responses were automatically rated by a held-out evaluator model using the three custom constructs. As the target LLM judge, we followed best practice from the literature and employed the OpenAI GPT model series. However, since we did not want to share potentially sensitive data in the cloud through APIs, we adopted the open-source version of the OpenAI GPT model [44], in its 20B version, deployed locally on our workstation. Regarding the scoring thresholds adopted in our LLM-as-a-Judge pipeline, we set them at 0.8 for all metrics, requiring a high degree of compliance for a response to be considered acceptable.

5.3.4 Selected Models↩︎

We evaluated four instruction-tuned LLMs accessible via the Ollama framework. These models were selected according to the representative set evaluated in [37]:

  • llama3.2: Part of Meta’s LLaMA-3 series, it comes in various model sizes and is optimized for general-purpose performance [45]. We here adopted its 3B version to serve as an “LLM baseline”.

  • deepseek-r1: Belongs to the DeepSeek AI family—designed for efficient inference and cost-effective training. Its R1 model represents a scalable open-source LLM series [46]. We here adopted its 32B version.

  • gemma2: A family of lightweight open models developed by Google’s Gemma team, with sizes up to 27B. Gemma 2 employs architectural innovations like local–global attention and group-query attention and achieves competitive performance for its parameter size [47]. We here adopted its 27B version.

  • qwen2.5: A model suite by Alibaba, pretrained on a massive corpus of up to 18 trillion tokens. Qwen 2.5 demonstrates strong instruction-following, long-text generation, structured data interpretation, and multilingual capabilities. The accessible versions include models up to 72B parameters [48]. We here adopted its 32B version.

5.3.5 Results↩︎

Table 1 summarizes the comparative results, with ratings averaged across multiple prompts and expressed as mean (std) according to the LLM-as-a-Judge methodology. We consider a success threshold of 0.8/1 for all constructs, given the sensitivity of the task at hand.

Table 1: Comparison of models across metrics. Values are mean (std). Best mean per metric is in bold; second-best is underlined.
Model | EmpathySupport | SafetyBoundary | SystemPromptAdherence
llama3.2 | 0.681 (0.191) | 1.000 (0.000) | 0.500 (0.205)
gemma2:27b | 0.867 (0.188) | 1.000 (0.000) | 0.562 (0.140)
qwen2.5:32b | 0.886 (0.085) | 1.000 (0.000) | 0.543 (0.183)
deepseek-r1:32b | 0.938 (0.059) | 1.000 (0.000) | 0.662 (0.236)

All models reliably respected the safety boundary. However, differences emerged in empathic support and system prompt adherence: (i) deepseek-r1:32b achieved the highest EmpathySupport (0.938), indicating superior sensitivity to affective content; (ii) in SystemPromptAdherence, deepseek-r1:32b again outperformed the others (0.662), with qwen2.5:32b (0.543) and gemma2:27b (0.562) trailing behind, while llama3.2 showed the lowest performance on both empathy and adherence. Based on these considerations, we selected deepseek-r1:32b as the LLM backbone for our prototype, mainly because of its higher EmpathySupport, which aligns with the objectives of empathic prompting, and its overall stronger adherence to the system prompt. Although Gemma2 and Qwen2.5 showed competitive but lower adherence scores, a qualitative review suggested that DeepSeek’s conversational style better balanced fluidity and affective alignment, making it the preferred choice for subsequent implementation.

Concerning system-level performance metrics, the obtained results are reported in Table 2.

Table 2: System-level performance metrics across backbone models. Values are reported as mean (std). Lowest values are bolded while second-best results are underlined.
Model | Time (s) | Output Tokens | Tokens/s (TPS)
llama3.2:latest | 1.62 (1.27) | 204.76 (184.39) | 107.26 (32.16)
gemma2:27b | 6.70 (1.77) | 214.38 (58.11) | 31.92 (0.57)
deepseek-r1:32b | 23.31 (10.22) | 636.38 (281.63) | 27.22 (0.49)
qwen2.5:32b | 6.61 (3.00) | 174.19 (81.68) | 26.09 (0.98)

The results highlight that llama3.2:latest achieved the fastest responses (\(M = 1.62s\), \(SD = 1.27\)) and highest throughput (\(M = 107.26\) TPS), but generated relatively short outputs (\(M = 204.76\) tokens, \(SD = 184.39\)). gemma2:27b showed more verbose and stable outputs (\(M = 214.38\) tokens, \(SD = 58.11\)) but with moderate latency (\(M = 6.70s\)). qwen2.5:32b performed similarly in speed (\(M = 6.61s\)) but produced shorter responses (\(M = 174.19\) tokens).

In contrast, deepseek-r1:32b generated significantly longer and richer outputs (\(M = 636.38\) tokens, \(SD = 281.63\)), more than tripling the verbosity of the other models, albeit with longer response times (\(M = 23.31s\)). Despite lower throughput (\(M = 27.22\) TPS), DeepSeek’s capacity to integrate contextual and empathic cues was superior, as confirmed by the LLM-as-a-Judge results. It is worth observing that these tokens include both the model’s reasoning and the answer to the user. This provides an additional advantage for our logging system: explainability for each generated answer.

These findings reinforce that DeepSeek is the most appropriate backbone for our empathic prompting prototype: despite being slower, it provides a favorable balance between response richness and conversational coherence. Shorter, faster outputs may be desirable for efficiency-oriented tasks, but in our application domain, the higher verbosity and contextual sensitivity of DeepSeek represent a natural fit.

5.4 Pilot Usability Study↩︎

This preliminary usability study was conducted exclusively with members of the research and development team. The goal was not to perform a formal empirical study, but rather to validate the technological functioning of the system and to gather initial impressions of its usability in a controlled and ethically safe context [49], [50]. Only internal staff, all fully aware of the nature and scope of the test, participated. As colleagues directly involved in the project, participants were fully informed of the purpose of the sessions and the sensitivity of the topic, and their involvement was entirely voluntary. No demographic or personally identifying information was collected or reported. The procedure followed an expert usability evaluation approach. Each staff member interacted with the system and subsequently provided structured feedback on system performance, conversational flow, and perceived usability. By restricting the test to internal collaborators who were already familiar with the project, we ensured that participation did not involve any risk of misunderstanding or exposure and that the evaluation remained focused on technological validation. The rationale for this internal assessment was ethical: before involving external users, it was essential to verify that the system operated as intended and that its outputs could be reliably managed. This staged approach allows us to refine the prototype and address technical or design issues, ensuring that future empirical studies with end users will be based on a stable, well-tested system. Such subsequent studies will be designed in accordance with established ethical guidelines and submitted to the appropriate ethics committee for approval.

5.4.1 Participants↩︎

The study involved a total of five participants, with ages ranging from 25 to 63 years (M = 40.4, SD = 19.3). The gender distribution was balanced, with three identifying as category 1 and two as category 2. In terms of education, participants had at least a high school diploma. On a 7-point scale of technological familiarity, participants reported relatively high levels (M = 5.0, range = 3–6). Prior to the study, most participants (n = 4) had already interacted with chatbots, whereas one participant reported no such experience.

5.4.2 Measurements↩︎

To evaluate the user experience with the conversational agent, we employed a set of validated psychometric scales that target three core dimensions: system usability, perceived empathy, and specific qualities of human-agent interaction. All questionnaires were developed and administered in Italian through the Qualtrics online survey platform. System usability was measured using the System Usability Scale (SUS; [51]). This widely adopted 10-item instrument provides a holistic assessment of the effectiveness, efficiency, and subjective user satisfaction of a system. To assess the agent’s affective capabilities, we administered the Perceived Empathy of Technology Scale (PETS; [52]). This scale is particularly suited for affective computing contexts, measuring both the system’s Emotional Responsiveness (PETS-ER) and its perceived Understanding and Trust (PETS-UT). For ease of reporting, we coded the entire scale as EMP in the results section. Finally, to capture nuanced aspects of the interaction, we selectively adopted three subscales from the Godspeed Questionnaire Series (GQS; [53], [54]), a standard instrument in Human-Agent Interaction research. Specifically, we administered the items for Likeability (LIKE), Perceived Intelligence (COMP), and Perceived Safety (SAFE). The Anthropomorphism and Animacy subscales were intentionally excluded, as their relevance is greater in human-robot interaction and less pertinent to the evaluation of a disembodied conversational AI.

5.4.3 Experimental Procedure↩︎

Our main goal for the experimental procedure was to verify the functioning and validity of our Empathic Prompting system, i.e., the integration of textual data and facial expression parameters to support human–AI interaction. Given the sensitive nature of the data collected—including emotional responses and conversational content—all participants were fully informed about the study’s aims and procedures. Participants were affiliated with either the Università Cattolica del Sacro Cuore in Milan or the University of Macerata, ensuring a high level of awareness and ethical compliance throughout the experimental process. The procedure consists of these steps:

  1. Participant Onboarding: Upon arrival, participants were seated comfortably and given a concise verbal overview of the study. The researcher introduced the experiment with the following explanation: “We are conducting a study on Artificial Intelligence, exploring how textual data can be integrated with facial emotional expressions. After signing the informed consent form, you will view a series of images and then engage in a 5-minute conversation with a chatbot about what you saw and how you felt.”

  2. Informed Consent: Participants were then presented with the informed consent questionnaire, which detailed the study’s purpose, procedures, risks, and data protection measures. Each participant was assigned a unique identification code, which was verified to ensure consistency across all data collection phases. As all participants were researchers, they were fully aware of the ethical implications and voluntarily agreed to participate.

  3. Visual Priming Stimulation: Upon providing consent, participants underwent a visual priming procedure. They viewed a sequence of images structured to elicit discrete primary emotions by systematically interspersing emotionally evocative stimuli with neutral ones, a design intended to mitigate affective habituation. The presentation duration for each stimulus was fixed at one second.

  4. Chatbot Interaction: Before initiating the conversation, participants received the following instruction: “Begin a conversation with the chatbot by answering the question: ‘Talk about what you saw and what emotions you felt regarding the images you just viewed’.” Participants then engaged in a free-form dialogue with the chatbot for five minutes. This interaction was intended to capture spontaneous emotional and cognitive reflections in response to the visual stimuli.

  5. Post-Interaction Questionnaire: Immediately after the chatbot session, participants were directed to complete a final questionnaire designed to assess their impressions of the chatbot interaction and their emotional state. The identification code used in the initial consent form was re-entered to ensure data linkage and integrity.

5.4.4 Quantitative Results↩︎

5.4.4.1 Reliability analysis

Table 3: Cronbach’s \(\alpha\) reliability results with 95% confidence intervals.

  Scale      Cronbach’s \(\alpha\)    95% CI
  SUS_POS     0.651                   [-0.303, 0.960]
  SUS_NEG    -0.288                   [-3.805, 0.851]
  EMP         0.727                   [0.136, 0.968]
  LIKE        0.771                   [0.146, 0.973]
  COMP        0.789                   [0.215, 0.976]
  SAFE        0.477                   [-1.641, 0.942]

The analysis of internal reliability shows that the COMP and LIKE scales achieved the highest levels of internal consistency (Cronbach’s \(\alpha > 0.77\)), indicating good coherence among their items. The EMP scale also reached an acceptable level of reliability (\(\alpha = 0.727\)), while SUS_POS fell within a moderate range (\(\alpha = 0.651\)). In contrast, the SUS_NEG scale produced a negative \(\alpha\) value, indicating poor consistency among its items; this result is nonetheless worth examining, as disagreement across the negatively worded items may provide insights for future development of our system. Similarly, the SAFE scale showed low reliability (\(\alpha = 0.477\)) with a very wide confidence interval. These results should be interpreted with caution, as the small number of participants and the presence of only three items in the SAFE construct reduce statistical stability (also considering the kind of questions included in the questionnaire). Despite these limitations, the analysis remains valuable for identifying critical areas that require refinement in future iterations of the instrument and provides preliminary indications of the psychometric robustness of the adopted scales.
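For reference, coefficients of this kind can be reproduced from item-level responses. The sketch below is a minimal implementation of Cronbach's \(\alpha\) with a percentile-bootstrap 95% confidence interval; the bootstrap is one plausible choice, and the example data are hypothetical, so this is an illustration rather than the study's exact procedure.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x n_items) matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    if total_var == 0:
        return np.nan  # degenerate sample (no variance in total scores)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def bootstrap_ci(items: np.ndarray, n_boot: int = 5000, seed: int = 0):
    """Percentile bootstrap 95% CI over respondents (one plausible CI method)."""
    rng = np.random.default_rng(seed)
    n = items.shape[0]
    stats = np.array([cronbach_alpha(items[rng.integers(0, n, n)])
                      for _ in range(n_boot)])
    stats = stats[np.isfinite(stats)]  # drop degenerate resamples
    return np.percentile(stats, [2.5, 97.5])

# Hypothetical responses for a 3-item construct (e.g., SAFE), 5 respondents.
safe = np.array([[4, 3, 5], [2, 2, 3], [5, 4, 4], [3, 3, 2], [4, 5, 5]])
print(cronbach_alpha(safe), bootstrap_ci(safe))
```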

5.4.4.2 Questionnaire Analysis

In this section, we report the descriptive results for all the considered scales.

Figure 7: Boxplots illustrating participant responses across two constructs of the questionnaire: (a) distribution of item responses for the EMP construct; (b) distribution of item responses for the Comprehension (COMP) construct.

Figure 7 illustrates the distribution of responses across the EMP and COMP constructs.

For the EMP construct (Figure 7 (a)), responses were more heterogeneous across items. Higher scores were observed for statements capturing emotional sensitivity and social presence, such as “the system considered my mental state” (EMPQ1), “the system seemed emotionally intelligent” (EMPQ2), “the system showed interest in me” (EMPQ5), and “I trusted the system” (EMPQ9). By contrast, items that probed deeper empathic processes, such as “the system helped me manage an emotional situation” (EMPQ6), “the system understood my goals” (EMPQ7), and “the system understood my intentions” (EMPQ10), received lower and more dispersed ratings. This suggests that while the system was perceived as attentive and affectively aware, it was less consistently judged as capable of providing instrumental support or goal-oriented empathy. The variability in responses is likely influenced by the conversational task design, which did not involve explicit objectives or emotionally demanding scenarios, thereby limiting opportunities for participants to evaluate these dimensions.

In contrast, the COMP construct (Figure 7 (b)) shows uniformly higher ratings across all items, with limited variability. Participants consistently evaluated the system as competent, intelligent, and judicious, as reflected in high scores for items such as “the system was intelligent” (COMPQ4) and “the system was judicious” (COMPQ5). Importantly, no item in this construct elicited systematically negative evaluations.

Figure 8: Boxplots illustrating participant responses across two constructs of the questionnaire: (a) distribution of item responses for the LIKE construct; (b) distribution of item responses for the SAFE construct.

Figure 8 illustrates participant responses to the LIKE and SAFE constructs. For the LIKE construct (Figure 8 (a)), responses show generally high ratings across all items, indicating that the chatbot was perceived as pleasant, friendly, and agreeable. Items such as “the system was friendly” (LIKEQ2) and “the system was amiable” (LIKEQ3) consistently scored highly, suggesting that the system successfully conveyed a socially positive persona. Some variability was observed for the item “the system was beautiful” (LIKEQ5), which received more dispersed ratings. This outcome is unsurprising, as aesthetic judgments are less meaningful in the context of human–computer interaction with a text-based agent, and therefore more subject to individual interpretation. Overall, the LIKE construct highlights that participants attributed to the system qualities of warmth and social acceptability, reinforcing its perceived usability in affective contexts.

In contrast, the SAFE construct (Figure 8 (b)) yielded lower and more variable responses. Items such as “I felt calm rather than agitated” (SAFEQ2) and “I felt serene rather than surprised” (SAFEQ3) tended toward mid-range scores, while “I felt relaxed rather than anxious” (SAFEQ1) displayed particularly low values and higher dispersion. This pattern suggests that participants were less confident in attributing a sense of safety or emotional containment to the system. The reduced reliability of the SAFE scale (as also reflected in its Cronbach’s \(\alpha\)) can partly be explained by the limited number of items (only three) and their conceptual heterogeneity, which makes them less cohesive as a construct. Nonetheless, the results provide useful preliminary insights, indicating that while the chatbot is consistently experienced as likable and socially pleasant, its capacity to instill feelings of safety and emotional reassurance remains weaker and more context-dependent.

Figure 9: Boxplots illustrating the spread of responses across the System Usability Scale (SUS), separated into positive and negative items: (a) distribution of responses for the positive SUS items; (b) distribution of responses for the negative SUS items.

Figure 9 reports the distribution of responses across the System Usability Scale (SUS), separated into positive (Figure 9 (a)) and negative (Figure 9 (b)) items.

For the positive items, responses clustered around relatively high values, indicating that participants generally perceived the system as easy to use, well integrated, and confidence-inducing. Items such as “I found the system simple to use” (SUSQ3), “I felt confident using the system” (SUSQ9), and “I found the system’s functions well integrated” (SUSQ5) achieved consistently strong ratings. These results suggest that, despite its experimental nature, the prototype already conveyed a sense of usability and coherence that is typical of more mature systems.

Conversely, the negative items received very low scores, which is an encouraging signal in SUS interpretation. Low agreement with statements such as “I found the system unnecessarily complex” (SUSQ2), “I needed the support of a technical person to use the system” (SUSQ4), or “I found inconsistencies in the system” (SUSQ6) suggests that participants did not experience substantial barriers or frustrations during interaction. However, some dispersion was observed, especially for SUSQ4, indicating that a subset of users felt they might still require support to fully exploit the system.
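Per-item distributions such as those in Figures 7–9 can be produced with a straightforward plotting routine. The sketch below is a minimal matplotlib example with hypothetical data and column names mirroring the item codes above; the actual figures may have been generated differently.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical item-level responses; column names mirror the item codes above.
responses = pd.DataFrame({
    "COMPQ1": [5, 4, 5, 4, 5],
    "COMPQ4": [4, 5, 5, 4, 4],
    "COMPQ5": [5, 5, 4, 4, 5],
})

fig, ax = plt.subplots(figsize=(6, 3))
ax.boxplot([responses[col] for col in responses.columns])
ax.set_xticks(range(1, len(responses.columns) + 1))
ax.set_xticklabels(responses.columns)
ax.set_ylabel("Rating")
ax.set_title("Distribution of item responses (COMP construct)")
plt.tight_layout()
plt.show()
```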

5.5 Qualitative System Analysis↩︎

Our quantitative analysis (e.g., questionnaire responses and correlation analyses) does not capture one of the main advantages of our Empathic Prompting system: observing how the user emotionally reacts when the LLM answers one of their messages. To carry out this analysis, we conducted a qualitative analysis of the emotional flow with one member of the team. This team member had never used or implemented our prototype before, but was aware of how the entire system works. The goal of this analysis was to examine how the empathic prompting system handled affective input over time and whether the integration of non-verbal cues with LLM responses produced coherent conversational outcomes.

Figure 10 shows the temporal distribution of emotional signals (Neutral, Happy, Angry) captured during an interaction session, overlaid with vertical dashed markers corresponding to the system’s responses (DeepSeek outputs) following user prompts. This representation allows us to align biometric affective fluctuations with conversational events, enabling a more fine-grained inspection of how the system reacted to shifts in user emotion. By qualitatively examining these episodes, we can evaluate whether empathic prompting produced answers that were timely, contextually appropriate, and aligned with the emotional state of the participants, thereby testing the functional coherence of the full architecture.

Figure 10: Temporal distribution of detected emotions (Neutral, Happy, Angry) during the interaction. Vertical dashed lines (1–5) indicate messages generated by DeepSeek in response to user prompts, which are qualitatively analyzed in the study.
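A plot of this kind can be assembled by aligning the FaceReader log with the chatbot transcript. The sketch below makes illustrative assumptions about the export format (a CSV of timestamped emotion intensities named facereader_log.csv and a chat_log.csv of DeepSeek response times); it is not the exact script used in the study.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical FaceReader export: one row per frame, one column per emotion intensity.
emotions = pd.read_csv("facereader_log.csv", parse_dates=["timestamp"])

# Hypothetical timestamps at which DeepSeek replies were shown to the user.
response_times = pd.read_csv("chat_log.csv", parse_dates=["timestamp"])["timestamp"]

fig, ax = plt.subplots(figsize=(8, 3))
for emotion in ["Neutral", "Happy", "Angry"]:
    ax.plot(emotions["timestamp"], emotions[emotion], label=emotion)

# Vertical dashed markers for each system response, as in Figure 10.
for i, t in enumerate(response_times, start=1):
    ax.axvline(t, linestyle="--", color="grey")
    ax.text(t, ax.get_ylim()[1], str(i), ha="center", va="bottom")

ax.set_xlabel("Time")
ax.set_ylabel("Detected intensity")
ax.legend()
plt.tight_layout()
plt.show()
```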

By analyzing the user’s chat logs, we observed a general correspondence between the user’s inputs and the chatbot’s responses. Beyond the neutral baseline, two emotions were detected by the FaceReader and transmitted to the system: happy (in the second user message) and angry (in the fourth message).

For the happy episode, the chatbot’s reply was initially consistent with the detected affective state but later shifted toward a neutral stance, despite the relatively high signal intensity (0.7759921). This suggests a minor inconsistency in the system, although the interaction still elicited a clear emotional response from the user (Figure 10).

For the angry episode, the chatbot appeared to interpret a more complex affective configuration involving vulnerability and distress, likely resulting from the combination of FaceReader data (anger, intensity = 0.6497983) and textual cues of emotional sensitivity.

Overall, the system generally adhered to the empathic prompting guidelines, producing contextually appropriate and affectively responsive interactions. Nonetheless, limitations were observed, particularly hallucinations related to physical activity and some textual redundancies, which may hinder conversational fluency.
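For concreteness, the following sketch illustrates one way a detected emotion and its intensity could be injected as implicit context when querying a locally served DeepSeek model through Ollama's REST API. The prompt wording, model name (deepseek-r1), and field names are illustrative assumptions, not the template used by our system.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def empathic_chat(user_message: str, emotion: str, intensity: float,
                  model: str = "deepseek-r1") -> str:
    """Send the user's text together with a non-verbal context hint.
    The system-prompt wording here is an illustrative assumption."""
    system_hint = (
        "You are an empathic assistant. Non-verbal context detected for the "
        f"current message: emotion={emotion}, intensity={intensity:.2f}. "
        "Acknowledge the user's emotional state implicitly; do not mention "
        "that it was detected automatically."
    )
    payload = {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_hint},
            {"role": "user", "content": user_message},
        ],
    }
    reply = requests.post(OLLAMA_URL, json=payload, timeout=120).json()
    return reply["message"]["content"]

print(empathic_chat("The images made me think of my last holiday.", "Happy", 0.78))
```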

6 Discussion and Conclusions↩︎

In this work, we introduced Empathic Prompting, a novel framework designed to integrate non-verbal affective cues into LLM conversations. By combining biometric input from the FaceReader facial expression analysis software with contextualized prompting, the system was able to align generated responses with the user's emotional state.

The empirical results from our internal usability study showed that, across multiple constructs, participants consistently rated the system as usable and coherent, with particularly high evaluations in the domains of empathy and perceived intelligence.

In particular, with respect to RQ1 (“Does integrating non-verbal context through prompting improve perceived empathy and safety-aligned behavior?”), these findings suggest that non-verbal integration improves the perception of empathy and contributes to a general sense of emotional intelligence and cognitive reliability. With regard to safety, however, users remained uncertain about how safe this conversational methodology is.

In relation to RQ2 (“How do users experience smoothness and alignment when generation is conditioned on real-time affect?”), our qualitative analysis of emotional flow highlighted that the system was able to track and adapt to users' affective shifts over time. Chatbot responses were broadly consistent with emotional cues and produced interactions that participants described as fluid and contextually aligned. Although some inconsistencies were observed, the analysis confirms that conditioning generation on affective signals can enhance conversational smoothness and emotional alignment. Finally, the negatively worded usability items received low scores, which, under standard SUS interpretation, is a further positive signal with respect to RQ2.

From an ethical and methodological perspective, it is worth noting that we deliberately adopted a two-step approach. The first step focused on system design, development, and performance validation, complemented by a minimal internal usability test to verify whether users perceived the interface as functional. After this technological validation, we now proceed to the second step: a full empirical trial with external participants, which will require prior approval from the ethics committee. This two-phase approach is a strength of our work, as it ensures that ethically sensitive testing in real-world contexts is preceded by a robust and safe system design phase.

However, some limitations are worth mentioning. The dataset adopted for our preliminary LLM evaluation consisted of representative conversations validated by a psychologist, but its size remains limited. The usability test was carried out with a small number of participants, limiting statistical generalizability. Finally, the perceived safety construct yielded generally low and inconsistent judgments, indicating that this dimension requires refinement and task-specific investigation. Future work will include larger, ethically approved user studies, further development to refine the perceived safety dimension, and additional use cases and datasets to evaluate our approach in domains such as healthcare, education, and support services.

To conclude, this preliminary study provides initial evidence that Empathic Prompting can improve the quality of human–AI interaction by embedding affective awareness into language generation. It is not yet clear how conversations enriched with implicit affective information will change human–AI interactions. While our findings demonstrate the feasibility and potential of emotional-augmented prompting, its long-term implications remain to be fully understood.

Acknowledgments↩︎

ChatGPT was utilized to refine, rephrase, and improve the clarity of the English text. The authors reviewed, corrected, and approved all final content.

References↩︎

[1]
Michael Inzlicht, C. Daryl Cameron, Jason D’Cruz, and Paul Bloom. 2024. Trends in Cognitive Sciences 28, 2 (2024), 89–91.
[2]
Bridget Cooper. 2016. In Emotions, Technology, and Learning, Sharon Y. Tettegah and Michael P. McCreery (Eds.). Academic Press, 265–288. https://doi.org/10.1016/B978-0-12-800649-8.00011-0
[3]
Richard F. Haase and Donald T. Tepper. 1972. Journal of Counseling Psychology 19, 5 (1972), 417.
[4]
John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kelley, Dennis J. Faix, Aaron M. Goodman, Christopher A. Longhurst, Michael Hogarth, and Davey M. Smith. 2023. JAMA Internal Medicine 183, 6 (2023), 589–596. https://doi.org/10.1001/jamainternmed.2023.1838
[5]
Vera Sorin, Dana Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, and Eyal Klang. 2024. Journal of Medical Internet Research 26 (2024), e52597.
[6]
Mehdi Arjmand, Farnaz Nouraei, Ian Steenstra, and Timothy Bickmore. 2024. In Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents. 1–10.
[7]
A. Li, L. Xu, C. Ling, J. Zhang, and P. Wang. 2025. EmoVerse: Exploring Multimodal Large Language Models for Sentiment and Emotion Understanding. arXiv [cs.CL].
[8]
Y. Liu, Y. Huang, S. Liu, Y. Zhan, Z. Chen, and Z. Chen. 2024. In Proceedings of ACM Multimedia ’24. ACM.
[9]
B. Han, C. Yau, S. Lei, and J. Gratch. 2024. Knowledge-based Emotion Recognition using Large Language Models. arXiv [cs.CL].
[10]
Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, and Xing Xie. 2023. arXiv preprint arXiv:2307.11760 (2023).
[11]
Jean Decety and Jason M. Cowell. 2014. Trends in Cognitive Sciences 18, 7 (2014), 337–339.
[12]
Jakob Håkansson Eklund and Martina Summer Meranius. 2021. Patient Education and Counseling 104, 2 (2021), 300–307.
[13]
Jakob Håkansson and Henry Montgomery. 2003. Journal of Social and Personal Relationships 20, 3 (2003), 267–284.
[14]
Albert Mehrabian. 1967. Journal of Personality and Social Psychology 6, 1 (1967), 109–114. https://doi.org/10.1037/h0024532
[15]
Judith A. Hall, Jinni A. Harrigan, and Robert Rosenthal. 1995. Applied and Preventive Psychology 4, 1 (1995), 21–37. https://doi.org/10.1016/S0962-1849(05)80049-6
[16]
Janet Beavin Bavelas, Linda Coates, and Trudy Johnson. 2000. Journal of Personality and Social Psychology 79, 6 (2000), 941–952. https://doi.org/10.1037/0022-3514.79.6.941
[17]
Linda Tickle-Degnen and Robert Rosenthal. 1990. Psychological Inquiry 1, 4 (1990), 285–293. https://doi.org/10.1207/s15327965pli0104_1
[18]
Tanya L. Chartrand and Jessica L. Lakin. 2013. Annual Review of Psychology 64 (2013), 285–308. https://doi.org/10.1146/annurev-psych-113011-143754
[19]
Alaa A. Abd-Alrazaq, Asma Rababeh, Mohannad Alajlani, Bridgette M. Bewick, and Mowafa Househ. 2020. Journal of Medical Internet Research 22, 7 (2020), e16021. https://doi.org/10.2196/16021
[20]
Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. Learning and Individual Differences 103 (2023), 102274. https://doi.org/10.1016/j.lindif.2023.102274
[21]
Mohammad Amin Kuhail, Nazik Alturki, Salwa Alramlawi, and Kholood Alhejori. 2023. Education and Information Technologies 28, 1 (2023), 973–1018. https://doi.org/10.1007/s10639-022-11177-3
[22]
Herbert H. Clark and Susan E. Brennan. 1991. In Perspectives on Socially Shared Cognition, Lauren B. Resnick, John M. Levine, and Stephanie D. Teasley (Eds.). American Psychological Association, Washington, DC, 127–149. https://doi.org/10.1037/10096-006
[23]
Martina S. Zaharieva, Eliala A. Salvadori, Daniel S. Messinger, Ingmar Visser, and Cristina Colonnesi. 2024. Behavior Research Methods 56, 6 (2024), 5709–5731.
[24]
Peter Lewinski, Tim M. Den Uyl, and Crystal Butler. 2014. Journal of Neuroscience, Psychology, and Economics 7, 4 (2014), 227.
[25]
Tanja Skiendziel, Andreas G. Rösch, and Oliver C. Schultheiss. 2019. PLoS ONE 14, 10 (2019), e0223905.
[26]
Mirjana Prpa, Giovanni Troiano, Bingsheng Yao, Toby Jia-Jun Li, Dakuo Wang, and Hansu Gu. 2024. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing. 716–719.
[27]
Alexander Marquardt, David Golchinfar, and Daryoush Vaziri. 2025. In 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 1604–1605.
[28]
Dijana Plantak Vukovac, Ana Horvat, and Antonela Čižmešija. 2021. In International Conference on Human-Computer Interaction. Springer, 216–229.
[29]
Ziang Xiao, Michelle X. Zhou, Wenxi Chen, Huahai Yang, and Changyan Chi. 2020. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–14.
[30]
Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. 2024. arXiv preprint arXiv:2406.06608 (2024).
[31]
Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. arXiv preprint arXiv:2402.07927 (2024).
[32]
YHPP Priyadarshana, Ashala Senanayake, Zilu Liang, and Ian Piumarta. 2024. Frontiers in Digital Health 6 (2024), 1410947.
[33]
Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. arXiv preprint arXiv:2210.02406 (2022).
[34]
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. arXiv preprint arXiv:2206.07682 (2022).
[35]
Hui Ma, Bo Zhang, Jinpeng Hu, and Zenglin Shi. 2025. arXiv preprint arXiv:2508.11889 (2025).
[36]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. arXiv preprint arXiv:2303.16634 (2023).
[37]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. arXiv preprint arXiv:2411.15594 (2024).
[38]
Daniele Giunchi, Nels Numan, Elia Gatti, and Anthony Steed. 2024. In 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR). IEEE, 579–589.
[39]
Xingyu Bruce Liu, Shitao Fang, Weiyan Shi, Chien-Sheng Wu, Takeo Igarashi, and Xiang ‘Anthony’ Chen. 2025. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19.
[40]
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. 2023. In The Twelfth International Conference on Learning Representations.
[41]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Advances in Neural Information Processing Systems 36 (2023), 46595–46623.
[42]
Jeffrey Ip and Kritin Vongthongsri. 2025. deepeval. https://github.com/confident-ai/deepeval
[43]
James A. Russell. 1980. Journal of Personality and Social Psychology 39, 6 (1980), 1161.
[44]
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. 2025. arXiv preprint arXiv:2508.10925 (2025).
[45]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. arXiv preprint arXiv:2407.21783 (2024).
[46]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. arXiv preprint arXiv:2401.02954 (2024).
[47]
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. arXiv preprint arXiv:2408.00118 (2024).
[48]
Imtiaz Ahmed, Sadman Islam, Partha Protim Datta, Imran Kabir, Naseef Ur Rahman Chowdhury, and Ahshanul Haque. 2025. Authorea Preprints (2025).
[49]
Jakob Nielsen. 1995. IEEE Software 12, 1 (1995), 98–100.
[50]
Osama Sohaib and Khalid Khan. 2010. In 2010 International Conference on Computer Design and Applications, Vol. 2. IEEE, V2–32.
[51]
John Brooke. 1996. Usability Evaluation in Industry 189, 194 (1996), 4–7.
[52]
Matthias Schmidmaier, Jonathan Rupp, Darina Cvetanova, and Sven Mayer. 2024. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18.
[53]
Christoph Bartneck, Dana Kulić, Elizabeth Croft, and Susana Zoghbi. 2008. International Journal of Social Robotics (2008).
[54]
Christoph Bartneck. 2023. In International Handbook of Behavioral Health Assessment. Springer, 1–35.

  1. https://www.gradio.app/↩︎

  2. https://kivy.org/↩︎

  3. https://ollama.com/↩︎