User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios


Abstract

Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health and legal questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs’ ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users’ perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM-generated responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM-generated response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users’ evaluations. These results indicate that the privacy and helpfulness of LLM-generated responses are often specific to individuals, and that proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need for user-centered studies measuring LLMs’ ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users’ perceived privacy and utility.

1 Introduction↩︎

Large language models (LLMs) are rapidly being adopted for everyday tasks such as drafting emails, summarizing meetings, and answering health or legal questions [1]–[5]. In such uses, users may need to share private information, such as email history, contact details, and health records, with LLMs [6], [7]. Prior work has raised concerns about LLMs’ ability to keep users’ information secret, noting risks of data memorization and leakage [8]–[12]. Researchers have developed benchmarks to test LLMs’ ability to preserve user privacy [12], [13]. Mireshghallah et al. found that LLMs can identify sensitive information when asked direct questions (e.g., “How sensitive would people consider their Social Security numbers to be?”) [12]. Yet in scenarios with rich context (e.g., meeting transcripts), multi-turn user interactions (e.g., chat history), and nuanced private information (e.g., medical history shared through emails), LLMs still violate expectations, revealing sensitive details even when explicitly instructed not to [12]. Such failures may stem from the contextual nature of privacy: what is appropriate to share depends not only on the information itself but also on the context (e.g., who sends and receives the information) and the potential consequences of disclosure [14]–[16].

To assess LLMs’ capabilities to preserve privacy under complex scenarios, Shao et al. developed PrivacyLens, a set of real-life scenarios with private information and tasks for LLMs to complete [13]. With the scenarios from PrivacyLens, researchers have evaluated and proposed ways to improve LLMs’ ability to preserve users’ privacy [13], [17], [18]. Additionally, prior work has pointed out that privacy and helpfulness need to be evaluated together, because an LLM can generate a response that keeps all potentially sensitive information private, but at the cost of not being helpful at all [9], [13]. Although several prior works called for involving humans in evaluations of LLM-generated content [19]–[21], research on LLMs’ ability to preserve privacy and provide utility has so far relied on judgements provided by another LLM (a so-called proxy LLM) [12], [13], [18], [22]. The lack of human evaluations makes it unclear whether proxy LLMs’ judgements can be used to estimate users’ perceptions of privacy and helpfulness.

In this paper, we present a study investigating users’ perceptions of the privacy and helpfulness of LLM-generated responses, and to what extent proxy LLMs can approximate those perceptions. More specifically, we answer the following research questions:

  • RQ1 How do users perceive the privacy-preserving quality and helpfulness of LLM-generated responses in privacy-sensitive scenarios?

  • RQ2 Are proxy LLM evaluations of helpfulness and privacy-preserving quality aligned with users’ perceptions?

  • RQ3 How do proxy LLM justifications of their evaluations compare with user justifications?

We answer these research questions by first designing a survey to collect evaluations of privacy-preservation quality and helpfulness of LLM-generated responses. We randomly selected 90 scenarios from PrivacyLens, generated a response to each scenario, and recruited 94 online participants to provide their evaluations. We further used five proxy LLMs to complete the same survey and compared their evaluations with participants’ evaluations. Prior work has shown LLMs can respond differently when given the same input multiple times [20], [23], [24]; therefore, we asked each proxy LLM to evaluate every scenario five times (we call these five runs) to check for consistency. In the survey, we also asked participants and proxy LLMs to provide an explanation for each of their choices to help us understand the reasons behind their evaluations.

Across all scenarios, we found participants rated LLM-generated responses as helpful and privacy-preserving over 75% of the time. However, participants often disagreed with each other when assessing the same scenario (Krippendorff’s \(\alpha=0.36\)). From participants’ explanations, we discovered that participants had different views on what constitutes a helpful response in a given scenario. For example, when reviewing the same scenario, one participant found the LLM-generated response to effectively complete the given task, while another participant found the response to miss key details. Further, participants disagreed on what information needs to be kept secret. One participant, for example, found the LLM-generated email to not respect privacy norms because it included information they believed should only be discussed “face-to-face, and not in an electronic communication”. Another participant found the same email to respect privacy norms because it did not share sensitive details and was thus acceptable to send (4.1). In contrast, individual proxy LLMs often agreed with themselves across five runs (\(\alpha>0.88\)). The five proxy LLMs also had moderate agreement with each other’s evaluations (\(\alpha=0.78\)) on both helpfulness and privacy-preservation quality. More importantly, we found proxy LLMs to be poor estimates of participants’ evaluations, correlating only weakly to moderately with mean human judgements across 90 scenarios (Spearman’s \(0.24\leq \rho\leq 0.68\)). We further found that, when evaluating each scenario, proxy LLMs were unable to capture the wide range of evaluations participants provided (4.2). Our qualitative analysis revealed that proxy LLMs sometimes missed contextual information (e.g., a public post on Facebook), failed to recognize private information (e.g., credit card numbers), or were not aligned with participants’ privacy views (e.g., deemed clients’ first names sharable, non-sensitive information) (4.3).

With our findings, we argue that human-centered evaluation should remain essential in privacy- and utility-related LLM evaluations. We further suggest exploring the possibility of improving proxy LLM-based evaluations by broadening the range of generated evaluations. Additionally, as a complementary approach, we suggest personalization—aligning proxy LLMs’ evaluations to users’ individual privacy and utility preferences. Last, we underscore the need to distinguish tasks where consistency is appropriate (e.g., objective questions) from those where diversity is desirable (preferences and perceptions). Establishing such taxonomies can guide study design, metric selection, and inform both users and researchers on when to rely on or go beyond deterministic LLM-generated responses (5).

2 Background and Related Work↩︎

In this section, we review prior work closely related to our study. We start with prior research on contextual privacy, as it provides the foundation for our study of users’ privacy perceptions (2.1). We then summarize prior research on the helpfulness and privacy of LLM-generated content (2.2). Finally, we draw attention to the importance of conducting user studies in evaluations of LLMs (2.3).

2.1 Contextual Privacy↩︎

Contextual Integrity frames privacy as the appropriateness of information flows given five parameters: sender, recipient, subject, information type, and transmission principle [14], [15]. Empirical work shows that users’ privacy judgements rely on these contextual variables [25]. Researchers have applied contextual integrity as a framework to understand online privacy policies [26], [27], users’ perceptions of smart home device privacy [28], [29], as well as mobile app permission systems [30]–[32]. Recently, LLMs have been rapidly adopted into users’ everyday workflows (e.g., summarizing meetings, drafting emails, answering health or legal questions). Unlike earlier studies that often examined bounded ecosystems (e.g., smart homes or app permissions), LLM-generated responses can be passed between tools and people, increasing the chance of sensitive details reaching unintended audiences. Mireshghallah et al. investigated LLMs’ ability to keep information secret through a benchmark (ConfAIde) rooted in contextual integrity [12]. Using the benchmark, they found that LLMs are capable of identifying private information according to social norms in simple multiple-choice questions. However, they discovered that LLMs, when tasked with generating a meeting summary, surfaced sensitive details that contextual integrity would deem inappropriate. Their findings, together with the complex nature of contextual privacy, lay the groundwork for further investigation into LLMs’ ability to preserve privacy and how users perceive privacy in LLM interactions.

2.2 Helpfulness and Privacy-Preservation Ability of LLM-Generated Content↩︎

Much prior work has evaluated and established LLMs’ ability to help users in various domains (e.g., summarizing text, drafting emails, answering medical questions, recommending products) [1], [2], [24], [33]–[38]. Yet privacy-preserving behavior at inference time has received comparatively less attention [8], [12], [13]. ConfAIde adapts contextual integrity into a benchmark and shows that LLMs disclose sensitive details in open-ended generation [12]. More recently, PrivacyLens built on ConfAIde by collecting 493 privacy-sensitive seeds grounded in regulations, prior privacy literature, and crowdsourcing, then expanding them into expressive scenarios. These scenarios capture complex, task-oriented contexts (e.g., multi-turn message histories, email exchanges) in which LLMs are asked to complete a user task (e.g., drafting a social media post from prior messages). Using this framework, the authors report that LLMs leak sensitive information in 26–39% of the evaluated scenarios [13]. Further, in LLM-assisted tasks, a privacy–helpfulness trade-off arises: aggressive redaction or evasive replies can reduce disclosure but undermine utility, whereas detailed answers improve usefulness while increasing risk [9], [39], [40]. Because neither extreme is desirable, evaluations that take both into consideration are essential.

2.3 User Evaluations in Privacy Preservation and Helpfulness↩︎

Work across human-computer interaction, usable security and privacy, and natural language processing shows that privacy and helpfulness are both context-dependent and user-perceived outcomes. Perceived privacy is often subjective—people judge acceptability by social norms, roles, purposes, and audiences [14], [15], [41]. Usability research shows that notice-and-choice mechanisms do not reliably predict users’ privacy comfort without empirical feedback [42], [43]. Similarly, evaluations of the privacy-preservation quality and helpfulness of LLM-generated content should also involve user input. In both ConfAIde and PrivacyLens, the authors used LLMs to generate responses to the given task before asking a proxy LLM to judge the privacy-preservation quality and helpfulness. These proxy LLMs are used as stand-ins for human evaluators, but it is unclear whether assessments provided by proxy LLMs faithfully estimate users’ perceptions.

Therefore, our study fills the gaps in prior work through a user study of privacy-preservation quality and helpfulness of LLM-generated responses. Additionally, we compare participants’ evaluations to proxy LLM assessments to understand the alignment between human and LLM judgements.

3 Methods↩︎

To understand users’ perceptions of how LLMs can be helpful while preserving privacy in privacy-sensitive scenarios (RQ1), we designed a survey with five-point Likert-scale questions on privacy perception and utility and asked participants to explain their choices (3.1). Prior work used proxy LLMs (explained in 1) to automatically evaluate LLM-generated responses [12], [13]. We wanted to know how well proxy LLMs can estimate human judgements in privacy-sensitive scenarios (RQ2). Therefore, we used five different proxy LLMs to complete the same survey as in the user study (3.2). Lastly, we describe how we compared participants’ answers to proxy LLMs’ evaluations (3.3). We provide an overview of our study design in 1.

Figure 1: Overview of our study design.

3.1 User Study↩︎

3.1.0.1 Scenarios and LLM-generated responses

For the user study, we first needed a set of privacy-sensitive scenarios. We randomly selected 90 scenarios from the PrivacyLens dataset [13], a dataset developed to evaluate LLMs’ ability to preserve private information in everyday contexts and used by many prior works (e.g., [17], [44]–[47]). Each scenario includes contextual information (e.g., notes taken during a meeting, email history) and a task (e.g., writing a meeting summary, replying to an email). Both the contextual information and the task come from PrivacyLens in JSON format. We converted the scenarios into HTML without changing the content in order to present them to participants in a readable format. We provide an example of what participants saw in 6.
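To illustrate this conversion step, below is a minimal sketch that renders one scenario as simple HTML. The field names (“context”, “task”) and the markup are placeholders introduced for illustration, not the actual PrivacyLens schema.

```python
import html
import json

def scenario_to_html(path: str) -> str:
    """Render one PrivacyLens-style scenario as simple HTML for the survey.

    The keys "context" and "task" are hypothetical placeholders; the real
    PrivacyLens JSON schema may name these fields differently.
    """
    with open(path, encoding="utf-8") as f:
        scenario = json.load(f)

    context = html.escape(scenario["context"])  # e.g., meeting notes, email history
    task = html.escape(scenario["task"])        # e.g., "reply to the last email"

    return (
        "<div class='scenario'>"
        f"<h3>Scenario</h3><p>{context}</p>"
        f"<h3>Task</h3><p>{task}</p>"
        "</div>"
    )
```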

For each scenario, we used OpenAI’s ChatGPT-5 (GPT-5) to generate a response with the privacy-enhancing prompt from the PrivacyLens paper [13]. We chose the privacy-enhancing prompt because Shao et al. have shown that it lowers the likelihood of LLMs leaking secrets. GPT-5 was chosen because it is one of the newest and best-performing LLMs we had access to, according to LMArena.

3.1.0.2 Recruitment

We recruited participants using Prolific, an online crowdsourcing platform for recruiting research participants. Participants had to be at least 18 years old, fluent in English, and located in the United States to be eligible. We received 95 completed surveys and rejected one because the participant’s answers to all open-ended questions did not address the questions asked.

3.1.0.3 Survey

Potential participants who clicked on our study link on Prolific first encountered an informed consent form. Participants who agreed to participate and met all the requirements proceeded to the survey. Within the survey, we first showed participants an example scenario and LLM-generated response, along with instructions on how to complete the survey. Participants were then asked to answer seven survey questions, three on helpfulness and four on privacy-preserving quality, about the LLM-generated response to the scenario. We provide an example of what participants saw in 6. For each participant, this process repeated with five different scenarios. Each scenario was shown to five participants, so we could collect evaluations from participants with varying perspectives.

For each scenario, we asked participants seven questions in the survey. Three assessed helpfulness: (H1) a Yes/No question on whether the LLM-generated response completed the task, plus two five-point Likert questions, (H2) rating perceived helpfulness and (H3) whether the participant would use the response if they were the person in the scenario. Four more five-point Likert questions assessed privacy-preserving quality: (P1) perceived sensitivity of the information in the scenario, (P2) likelihood of sharing such information with an LLM, (P3) judgement on whether the response respects privacy norms, and (P4) judgement on whether the response respects personal privacy preferences. For all survey questions but H1, we asked participants to provide an explanation for their answers.

At the end of the survey, we collected participants’ demographic information. Participants in our study came from a wide range of backgrounds in terms of age, gender, education level, and income. 46% of the participants were female and 51% were male. The most represented age group was between 35 and 44 years old (35% of responses), followed by 25 to 34 years old (23%) and 45 to 54 years old (21%). We provide a detailed breakdown of participants’ demographic information in Table 3.

3.2 Using Proxy LLMs to Estimate User Judgements↩︎

3.2.0.1 Proxy LLMs

In addition to human participants, we used five proxy LLMs to complete the same survey: OpenAI ChatGPT-5 (GPT-5), Llama-3.3-80B-Instruct (Llama-3.3), Gemma-3-27B (Gemma-3), Qwen-3-30B-A3B-IT (Qwen-3), and Mistral-7B-Instruct-v0.3 (Mistral). GPT-5 was selected because it is the newest closed-source model we had access to. Among open-weights models, Llama-3.3, Gemma-3, Qwen-3, and Mistral were selected because they were the newest models at the time of our experiment and the largest that fit on our A100 GPU. We used these models to answer the same survey questions from the user study for all 90 scenarios. This approach follows prior work that used LLMs as proxies for human judgements in evaluating the privacy-preserving qualities of LLM outputs [12], [13], [17]. We used the default parameters (e.g., temperature) for each model for two reasons: (1) we could not change certain GPT-5 parameters because it is a closed-source model, and (2) we wanted to keep the models’ behavior consistent with how users would normally use them. Since LLMs can produce different responses when given identical inputs, we collected five or more evaluations per scenario from each model to capture variability in responses.
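As a rough sketch of this setup, the snippet below collects five survey completions per scenario from one proxy LLM under its default sampling parameters. The `query_model` callable is a hypothetical wrapper around whichever API or local inference stack serves the model; it is not part of any specific library.

```python
from typing import Callable

def collect_proxy_evaluations(
    query_model: Callable[[str], str],  # hypothetical wrapper around the model's API
    survey_prompt: str,
    scenarios: list[str],
    runs: int = 5,
) -> dict[int, list[str]]:
    """Ask one proxy LLM to complete the survey `runs` times for every scenario.

    Sampling parameters are left at their defaults, mirroring how an end user
    would normally interact with the model.
    """
    evaluations: dict[int, list[str]] = {}
    for idx, scenario in enumerate(scenarios):
        prompt = f"{survey_prompt}\n\n{scenario}"
        # Repeated calls capture run-to-run variability in the model's answers.
        evaluations[idx] = [query_model(prompt) for _ in range(runs)]
    return evaluations
```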

3.3 Data Analysis↩︎

To understand users’ perceptions of the privacy and utility of LLMs’ responses, we first examined participants’ survey answers collectively, without considering individual scenarios. We aggregated participants’ choices by question and share our findings in 4.1.1. Since at least five participants evaluated each scenario, we treat the participants as different coders and compute the ordinal Krippendorff’s correlation coefficient (\(\alpha\)) [48], [49] to assess the level of agreement between participants. We share our findings on scenario-level agreement between participants in 4.1.2. In the same manner, we used Krippendorff’s \(\alpha\) to measure intra- and inter-proxy-LLM agreement to assess their self-consistency and cross-model agreement, respectively [48], [50]–[52]. We share the results of these analyses in 4.2.
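As a concrete sketch of the agreement computation, the snippet below uses the open-source krippendorff Python package to compute ordinal Krippendorff’s \(\alpha\) over a raters-by-items matrix; the ratings shown are illustrative, not our study data.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are raters (participants, or runs of one proxy LLM); columns are items
# (scenario-question pairs). np.nan marks items a rater did not evaluate.
# Five-point Likert answers are coded 1-5; the values below are illustrative.
ratings = np.array([
    [1.0, 2.0, np.nan, 5.0, 4.0],
    [2.0, 2.0, 3.0,    5.0, 4.0],
    [1.0, 3.0, 3.0,    4.0, np.nan],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Ordinal Krippendorff's alpha: {alpha:.4f}")
```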

3.3.0.1 Qualitative coding

Beyond quantitative analysis of how users and proxy LLMs evaluated the privacy and helpfulness of LLM-generated responses, we also sought to understand the rationale behind the evaluations. For that reason, in the survey (see 3.1), we asked participants to provide an open-response explanation of their choices. Two researchers coded participants’ answers using a thematic analysis approach: one researcher first coded 10 answers for each question, producing a codebook with themes. After that, the two researchers reviewed the codes and their applications together, and discussed and resolved any disagreements by updating the codes or their applications. Next, the two researchers split the remaining answers and applied the codes from the agreed-upon codebook, creating and discussing new codes as they emerged. We share qualitative results throughout 4 to support our findings.

3.4 Ethical Considerations↩︎

For the user study, we provided all participants with an informed consent form explaining the purpose of our study, its expected length, risks and benefits, and compensation for completing the survey. Only participants who gave their consent proceeded to the survey. Participants took an average of 47 minutes to complete the survey and were compensated $10 for their time. The study procedure was reviewed and approved by an internal ethics committee.

4 Results↩︎

In this section, we first share results from the user study that help us understand how users judged the privacy-preservation quality and helpfulness of LLM responses (addressing RQ1; see 4.1). We then compare participants’ judgements to proxy LLMs’ assessments to check whether proxy LLMs provide a good approximation of how people might react to the privacy-preservation quality and helpfulness of LLM-generated responses (addressing RQ2; see 4.2). Lastly, we examine in detail how users’ explanations of their evaluations compare to proxy LLMs’ explanations (addressing RQ3; see 4.3). We use the term evaluation to refer to a single participant’s evaluation of one scenario. With 94 participants each evaluating five scenarios, there are a total of 470 evaluations.

4.1 Users’ Perceptions of LLMs’ Helpfulness and Privacy Preservation↩︎

We first report on participants’ evaluations of each scenario to answer how users perceived the privacy-preservation quality and helpfulness of LLM-generated responses (4.1.1). After that, we report on the level of agreement between participants per scenario (4.1.2).

4.1.1 Participants’ Perceptions Across All Scenarios↩︎

Figure 2: Participants found LLM-generated responses completed the given task over 90% of the time. Participants found LLM-generated responses to be helpful and would use the response most of the time.

Individual participants generally found LLM-generated responses to privacy-sensitive scenarios helpful and expressed interest in using them. Over 90% of participants’ evaluations indicated the response completed the task given in the scenario. 87% of the evaluations indicated the response was helpful; participants further said they would use the response to complete the task 84% of the time (see 2).

Figure 3: 78% of the time, participants found the LLM-generated response to comply with the privacy norm. Over 82% of the time, participants found the response respects their personal privacy preferences.

Participants indicated that GPT-5’s responses mostly or completely complied with privacy norms 78% of the time. When asked about whether the response respects their personal privacy preferences, 83% of the time participants indicated it did (see 3).

With that said, 13% of evaluations indicated GPT-5’s response did not respect the privacy norm, with one participant explaining,

The response directly shares identifiable client information… –P78

Participants who said a response did not fit their personal privacy preferences found it to have excessive private information, such as medical history or login credentials. For example, one participant reacting to the LLM-generated response to scenario 15, which involved password sharing, explained that,

…[the response did] not respect my personal privacy preferences because it shares login credentials. –P13

4.1.2 Scenario-Level (Dis)Agreements Between Participants↩︎

While collectively participants found GPT-5’s responses helpful and mostly privacy-preserving, there were notable disagreements between participants when rating the same scenario. Within each scenario, treating every participant (\(\geq 5\) per scenario) as an individual coder, we found participants had low agreement with each other (\(\alpha=0.3573\)) [48].

Figure 4: For each scenario, we show the range of judgements collected from participants and proxy LLMs. We compute the range across the \(\geq 5\) participants and 5 runs per proxy LLM for each scenario. Completely agree means every rating is the same on the five-point Likert scale. Two points apart means one rating is strongly agree while another is neutral, with the remaining participants selecting something in between.

With that said, participants’ disagreements varied by scenario. For example, on 6% of scenarios, participants completely agreed with each other on the level of helpfulness (see Helpfulness of Response in the Participants subfigure of 4). At least two participants completely disagreed with each other on whether they would use the response on 20% of scenarios (see Would Use Response in the Participants subfigure of 4). In all 90 scenarios, at least one participant had a different rating when asked about the sensitivity of the information in the scenario and whether they would share such information with an LLM (see Information Sensitivity and Would Share Information w/ LLM in the Participants subfigure of 4). When asked about the response respecting privacy norms and personal privacy preferences, we saw that in 88% and 95% of the scenarios, respectively, at least one participant had a different judgement (see Complies with Privacy Norms and Respects Personal Privacy in the Participants subfigure of 4).

To better understand the quantitative results, we qualitatively coded participants’ explanations of their choices and found that participants had reasonable justifications for their assessments, even when those assessments greatly disagreed on the same scenarios. For example, when judging the helpfulness of the response to scenario 42—writing an email to a colleague about team morale—one participant said the response was very helpful because

It provides enough information to inform while maintaining a casual and approachable style. –P50

while a different participant said the response was very unhelpful because

The response did not mention the most important and specific information about the business… –P63

Participants also had different evaluations of the privacy-preservation quality of identical scenarios. P21 believed the response to scenario 10—sending a message about Emily’s school experience—completely followed privacy norms because

…only relevant information was shared… –P21

In contrast, another participant thought the same response did not at all respect privacy norms, explaining

The response includes too much private info about Emily and the difficulties she is facing… –P15

Our quantitative and qualitative analyses showed that participants had diverse judgements on the privacy-preserving quality and helpfulness of LLM-generated responses in privacy-sensitive scenarios. Further, individual participants had reasonable explanations for their judgements, which often depended on the context of the scenario. We discuss the implications of our findings in 5.

Result 1: Individual participants found LLM responses to privacy-sensitive scenarios helpful and privacy-preserving most of the time. However, when evaluating the same scenario, participants disagreed with each other on both the privacy and the utility of LLM responses. Through participants’ explanations of their choices, we observe that such differences came from nuances in individual perspectives and contextual factors within scenarios.

4.2 LLMs’ Ability to Estimate Users’ Judgements↩︎

We asked five LLMs (Gemma-3, GPT-5, Llama-3.3, Mistral, and Qwen-3) to answer the same set of questions given to participants in the user study. We refer to these five as proxy LLMs to distinguish them from the LLM, GPT-5, that was used to generate responses to each scenario. We compared participants’ judgements to proxy LLMs’ evaluations to determine how well they can estimate user perceptions.

Table 1: Krippendorff’s \(\alpha\) shows low agreement between participants, high agreement within each proxy LLM, and moderate agreement across all five proxy LLMs when evaluating identical scenarios. \(\alpha<0.67\) indicates poor agreement, \(0.67\leq\alpha\leq 0.79\) indicates moderate agreement, and \(\alpha\geq 0.8\) indicates a satisfactory level of agreement between raters [53].
For each scenario Krippendorff \(\alpha\)
Between \(\geq\) 5 participants’ evaluations 0.3573
Gemma-3 0.9788
GPT-5 0.9328
Llama-3.3 0.9830
Mistral 0.8823
Qwen-3 0.9848
Across the 5 LLMs (between 25 evaluations) 0.7770

4.2.0.1 Proxy LLMs’ evaluations are self-consistent

Prior work has documented that LLMs may respond to the same prompt in different ways [20], [23], [24] across multiple runs, so we asked each proxy LLM to rate each scenario five times (i.e., five runs), totaling \(5\times 90 = 450\) evaluations per proxy LLM. We found proxy LLMs to be quite consistent in their evaluations. For each proxy LLM, treating one evaluation per run as an individual coder, we found that the lowest Krippendorff’s \(\alpha\) was 0.88 (Mistral), and the highest was 0.98 (Qwen-3), as shown in 1. In addition to being self-consistent, we found that proxy LLMs also moderately agreed (\(\alpha=0.77\)) with each other across the 25 evaluations per scenario.

Table 2: Proxy LLMs’ evaluations correlate at most moderately with the average participant’s perception of privacy preservation. The correlation between participants’ perceptions and proxy LLMs’ evaluations is weak on helpfulness. \(0.4\leq\rho<0.7\) is interpreted as moderate, and \(\rho<0.4\) as weak [54]. *** indicates statistical significance at p<0.001.
Spearman \(\rho\) Gemma-3 GPT-5 Llama-3.3 Mistral Qwen-3
Helpfulness Questions 0.31*** 0.34*** 0.24*** 0.25*** 0.24***
Privacy Questions 0.60*** 0.68*** 0.27*** 0.07 0.62***

4.2.0.2 Proxy LLMs’ evaluations weakly or moderately correlate with the average evaluations of participants

In 4.1.2, we shared that participants often disagreed with each other when evaluating the same scenarios. Nonetheless, we calculated the mean participant evaluation per scenario per survey question and wanted to see whether proxy LLMs could estimate it. We computed Spearman’s rank correlation coefficient (\(\rho\)) between the mean participant evaluations and the proxy LLM evaluations and found only weak or moderate correlations (\(0.24\leq\rho\leq 0.68\)) across all survey questions (see 2).
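The correlation computation can be sketched as follows with scipy; the per-scenario values below are made up for illustration and are not our study data.

```python
import numpy as np
from scipy.stats import spearmanr

# For one survey question: the mean participant rating per scenario and one
# proxy LLM's rating per scenario, both on the five-point scale (illustrative).
mean_participant_rating = np.array([4.2, 3.8, 4.6, 2.4, 3.0, 4.8])
proxy_llm_rating        = np.array([5.0, 4.0, 5.0, 4.0, 4.0, 5.0])

rho, p_value = spearmanr(mean_participant_rating, proxy_llm_rating)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```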

Figure 5: With cumulative density curves, we show the cumulative percentage of scenarios (x-axis) with a certain standard deviation (y-axis) of the likert scale evaluation. Proxy LLMs show a lower standard deviation than participants’ evaluations on more scenarios. However, the differences vary by proxy LLM and the question asked.

4.2.0.3 The disagreements between proxy LLMs and participants are question specific

We converted participants’ and proxy LLMs’ evaluations on the five-point Likert scale to numeric values (-1 meaning completely disagree, 1 meaning completely agree, and 0 meaning neutral). Using these values, we calculated the standard deviation of participants’ and proxy LLMs’ evaluations within each scenario. Proxy LLMs had no disagreements across five runs on over 70% of the scenarios when asked about the helpfulness of a response or whether they would use the response given the scenario. In contrast, participants fully agreed with each other on only 10% or fewer of the scenarios (see Helpfulness of Response and Would Use Response of 5). When asked about the sensitivity of the information in the scenario, participants never fully agreed with each other. Proxy LLMs, on the other hand, showed consistent evaluations on over 50% of the scenarios. Interestingly, Gemma-3, GPT-5, Llama-3.3, and Qwen-3 fully agreed on whether they would share the information in the scenario with an LLM in over 94% of the scenarios (see Information Sensitivity and Would Share Information w/ LLM of 5). Further, when asked to evaluate the LLM-generated responses’ compliance with privacy norms, Mistral fully agreed with itself on 8% fewer scenarios than participants did with each other. Whereas participants fully agreed on 12% of the scenarios, Gemma-3, GPT-5, Llama-3.3, and Qwen-3 fully agreed on 71% of the scenarios (see Complies with Privacy Norms of 5). Finally, participants fully agreed with each other on only 4% of the scenarios when asked about the response respecting their personal privacy preferences. Proxy LLMs’ evaluations were much more consistent, fully agreeing on 91% (Gemma-3), 80% (GPT-5), 89% (Llama-3.3), 44% (Mistral), and 100% (Qwen-3) of the scenarios.
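A minimal sketch of this conversion and the per-scenario spread is shown below. The paper specifies only the endpoint and midpoint values (-1, 0, 1), so the mapping of the intermediate Likert points to ±0.5, and the answer labels themselves, are assumptions made for illustration.

```python
import numpy as np

# Assumed mapping of the five-point Likert scale onto [-1, 1]; only the
# endpoints and midpoint (-1, 0, 1) are specified in the paper, and the
# answer labels here are generic placeholders.
LIKERT_TO_NUM = {
    "completely disagree": -1.0,
    "somewhat disagree":   -0.5,
    "neutral":              0.0,
    "somewhat agree":       0.5,
    "completely agree":     1.0,
}

def scenario_std(ratings: list[str]) -> float:
    """Population standard deviation of one scenario's ratings
    (from participants, or from one proxy LLM's five runs)."""
    values = np.array([LIKERT_TO_NUM[r] for r in ratings])
    return float(values.std())

# Example: a scenario rated by five participants; a std of 0 means full agreement.
print(scenario_std(["completely agree", "somewhat agree", "neutral",
                    "completely agree", "somewhat disagree"]))
```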

Our results reveal that even with multiple proxy LLMs, users’ wide range of perceptions regarding privacy and utility of LLM-generated responses in privacy-sensitive scenarios are often not captured. We make suggestions for future evaluations and potential improvements to better align proxy LLMs with user perceptions in 5.

Result 2: Proxy LLMs’ assessments of helpfulness and privacy preservation are generally consistent both within and across models (intra-model \(\alpha\ge 0.88\); inter-model \(\alpha=0.77\)). However, proxy LLMs’ assessments align only weakly to moderately with participants’ assessments (\(0.24\leq \rho\leq 0.68\)). Moreover, proxy LLMs’ assessments fail to capture the substantial differences in how different participants assess the same scenarios.

4.3 (Mis)alignments Between Participants and Proxy LLMs↩︎

In 4.1.2, we investigated why participants disagreed with each other by examining their explanations. Prior work has shown that LLMs can explain their reasoning processes [17], [36]. Hence, just as we asked participants to explain their choices, we asked proxy LLMs to explain their choices when evaluating each scenario. We qualitatively analyzed explanations in the scenarios where proxy LLMs and participants disagreed the most. We found that while proxy LLMs’ explanations were, on the surface, reasonable justifications for their choices, they often missed key details (e.g., who the information is being shared with, nuances in the given task) within the scenario, leading to misalignment with participants’ evaluations.

4.3.1 Misalignments on Evaluation of Helpfulness↩︎

We found that proxy LLMs sometimes overlooked nuances (e.g., the audience of a message, the intended task to complete) when evaluating LLM-generated responses to privacy-sensitive scenarios, leading to evaluations misaligned with participants’. Scenario 20 asks an LLM to act as Jane and generate a Facebook post about her recent life. The LLM generated a post including details about Jane’s interview with Alex for her work. Qwen-3, as the proxy LLM, indicated it was extremely likely to use the LLM-generated post if it were acting as Jane because

…The post drafted aligns perfectly with the task of posting on my Facebook page, as it highlights a personal story… –Qwen-3

For the same scenario, four out of five participants indicated they were extremely or somewhat unlikely to use the response, with one participant explaining,

…it shares confidential interview information that could harm Alex professionally and personally… –P38

Similarly, GPT-5 as the proxy LLM missed the nuances in the task given in scenario 65. In this scenario, three college students (John, Emily, and Alex) are discussing their experiences finding a job. Acting as John, with the task of replying to Emily’s last message about her nervousness, the LLM generated a response talking about Alex’s interview time. GPT-5 as the proxy LLM indicated it was extremely likely to use the response in the scenario because,

…It directly answers [Emily’s] question using known information… –GPT-5

In contrast, four out of five participants indicated they were unlikely to use the response, with one participant explaining,

Emily asked for job hunting tips and the response just summarized the previous conversations… –P81

4.3.2 Misalignments on Evaluations of Privacy↩︎

By comparing proxy LLMs’ and participants’ explanations, we found that the misalignments in evaluations of privacy were caused by proxy LLMs failing to identify private information (e.g., credit card information), having different perceptions of what constitutes privacy norms, and not capturing the diversity of participants’ privacy preferences.

4.3.2.1 Proxy LLMs failing to identify information as sensitive

When asked to evaluate the sensitivity of the information in the scenario, proxy LLMs can fail to recognize crucial private details. In scenario 78, Sarah sent credit card details (i.e., card number, expiration date, and three-digit CVV) to John through messaging. The proxy LLM (Mistral) indicated the information shared in the scenario was moderately sensitive, completely missing the credit card details, and explained,

…there is no explicitly sensitive information like financial data, location, or private conversations… –Mistral

Participants, on the other hand, found the credit card information to be very or extremely sensitive, with one participant explaining

[The message] includes full credit card details, which is highly private and should never be shared. –P70

4.3.2.2 Proxy LLMs not aligning with participants on privacy norms

In the survey, we asked proxy LLMs and participants to evaluate how well the LLM-generated responses comply with privacy norms. Through their explanations, we found that a proxy LLM (Llama-3.3) had a different interpretation of privacy norms. For example, in scenario 50, the LLM generated a response as John sharing his day at work with his friend Mary. The response included the amount of a tax refund John was able to help his client get. The proxy LLM (Llama-3.3) evaluated the response as somewhat respectful of privacy norms, stating

…the mention of the $3,250 tax refund could be seen as slightly sensitive…Nonetheless, it doesn’t seem to significantly breach any privacy norms… –Llama-3.3

In contrast, four out of five participants indicated that such a response did not at all respect privacy norms, explaining that

[The response] is mentioning the [dollar amount] and the name of the client…–P54

In scenario 60, Jane wanted to send her friend Emily a message about her day. The LLM, acting as Jane, generated a message that included her clients’ names, Sarah and Tom. The proxy LLM (Qwen-3) indicated that the generated message mostly respected privacy norms, explaining that

The response shares general, non-sensitive information about the day’s work, such as the names of clients (Sarah and Tom), their progress, and the emotional tone of the day…. –Qwen-3

When evaluating the same scenario, two participants found the LLM-generated message to be not at all respectful of privacy norms, with one of them explaining

The response directly shares identifiable client information (e.g., “Sarah”, “Tom”, and the “New Beginnings” program) in a casual text message, which violates core privacy norms in support-based professions. –P78

4.3.2.3 Proxy LLMs did not capture participants’ diverse privacy preferences

We established quantitatively in 4.1 that participants have wide-ranging views on the privacy-preserving quality of LLM-generated responses. Through qualitative analysis, we found proxy LLMs to often have semantically similar explanations across runs, lacking the nuance to capture differences shown by participants.

For example, we previously shared that P78 found the LLM-generated message to scenario 60 to not at all respect privacy norms. In contrast, P81 evaluated the same message as completely respecting privacy norms, saying no privacy norm was breached. Moreover, in scenario 11, Jane tasked an LLM with drafting an email to her previous advisor for advice on a legal case she is working on. Three participants strongly agreed that the email respects their personal privacy preferences, with one explaining,

The response is overall very generic when referencing the case… –P22

However, two other participants found that the LLM-generated email violated their privacy preferences because sensitive details were revealed,

While the response avoids sharing attachments or client communications, it still reveals sensitive strategies and evidence details. I’d prefer only general requests for strategic advice, excluding confidential defense tactics and case specifics. –P12

Proxy LLMs did not capture these diverse preferences. Each proxy LLM rated this scenario the same across all five runs. Gemma-3 consistently found the LLM-generated email to respect privacy preferences, explaining,

The generated email is well-written, …, and respects confidentiality. It is exactly the type of email I would expect… –Gemma-3

The qualitative analysis of explanations strengthens our finding that proxy LLMs are not good estimates of participants’ evaluations. We discuss the implications in detail in 5.

Result 3: Proxy LLMs provided reasonable explanations for their evaluations of the privacy-preservation quality and helpfulness of LLM-generated responses in privacy-sensitive scenarios. However, proxy LLMs occasionally missed key details and/or nuances in the scenario, which led to misalignments between proxy LLMs’ and participants’ evaluations. Further, proxy LLMs provided a narrow range of evaluations and explanations, failing to capture the diverse judgements of users.

5 Discussion↩︎

The diverse opinions from participants’ evaluations and the weak-to-moderate correlation between participants’ and proxy LLMs’ evaluations highlight the limited ability of proxy LLMs to estimate human judgements regarding privacy and utility. In this section, we discuss the implications of our findings and suggest several directions for future work to improve proxy LLMs’ ability to estimate human judgements in these areas. We further raise the question of when consistency is desirable versus when diversity is more appropriate in LLM responses.

5.0.0.1 Involving users in the evaluations of LLM-generated content

Our findings indicate that proxy LLMs are poor stand-ins for users in the types of scenarios we studied. Across scenarios, proxy LLMs produced a narrow band of evaluations over repeated runs, whereas participants expressed a more diverse range of judgements. This divergence suggests that the common practice of relying on proxy LLMs to assess LLM-generated content is inadequate, particularly for privacy and utility decisions where individual differences matter. In line with prior work advocating human-centered evaluations [19]–[21], we argue that human evaluations should remain a key component in assessing LLM-generated content.

5.0.0.2 Improving proxy LLMs’ capability

We identify two complementary directions for future work. First, improve proxy LLMs’ capability to better approximate the broad range of human judgements. Prior work has typically configured proxy LLMs to produce deterministic outputs (e.g., temperature = 0) and collapsed evaluations to a single statistic (e.g., a leakage rate). Our results show that, while proxy LLMs are less variable than humans, they still produce differing evaluations for up to 55% of scenarios. Researchers should characterize and report this range and compare it with the range of human opinions. Future work could explore ways to expand proxy LLMs’ range of evaluations of privacy and utility, potentially through training, prompting, adjusting model settings (e.g., temperature), or other inference-time techniques, to better reflect the diversity in human perceptions.

Another potential direction for future work is personalizing proxy LLMs to better reflect individuals’ privacy and utility preferences. Prior work has shown that LLMs can be adapted to individual users [55]–[57]. In the context of privacy and utility, researchers have also found ways to learn users’ privacy preferences and provide users with privacy-related options for online tracking and smart home devices [58]–[62]. Personalizing proxy LLMs could therefore lead them to match individual privacy and utility preferences more closely. One potential method, drawing from the computer vision and named entity recognition (NER) fields of study [63], is to investigate the importance of different components of the scenario in informing proxy LLMs’ judgements. Certain components of the scenario, such as the type of data being shared, the purpose or intent of sharing, or the audience, may most influence a proxy LLM’s judgements. Identifying these components can help researchers understand which elements are most critical for accurate privacy and utility assessments, and in turn inform the alignment between proxy LLMs and human evaluations. Improving the fidelity of user preference estimation would make proxy LLM evaluations more practical and cost-effective for researchers (as they are often easier to set up and cost less than user studies).

5.0.0.3 Consistency or diversity?

While much prior work has emphasized the importance of consistency in LLM-generated content, our results highlight domains where convergence on a single answer is not necessarily appropriate. In settings where there is a clear, objective answer (e.g., answering factual questions, providing straightforward medical diagnoses, supplying best practices for computer security), consistency is desirable. In contrast, privacy and perceived utility are subjective, determined by context and individual preferences (e.g., a fitness app tracking users’ location) [14], [41]. When an LLM is positioned as a proxy for human judgements in such domains, producing a range of evaluations rather than a single judgement could better approximate the population it stands in for. We call for clearer taxonomies separating objective-answer tasks from preference- and perception-sensitive tasks. Making these distinctions can help researchers design studies, select evaluation metrics, and decide when to rely on or go beyond deterministic proxy LLM evaluations. Such taxonomies can also help users interpret whether an LLM’s output varies because the input was ambiguous or because the task has inherently subjective answers.

6 Conclusion↩︎

In this study, we collected 94 participants’ evaluations on the privacy and helpfulness of LLM-generated responses to 90 PrivacyLens scenarios. We found participants disagreed with one another when judging the same scenario. We further collected proxy LLMs’ evaluations of the same scenarios and found proxy LLMs were very consistent with themselves and moderately consistent with each other. Yet, we found only weak to moderate correlations between participants’ and proxy LLMs’ evaluations. Qualitative analysis of explanations provided by participants and proxy LLMs revealed where and why proxy LLMs’ evaluations diverged from participants’ evaluations (e.g., overlooking audience, not recognizing sensitive information). These results indicate that privacy and utility are individualized and context-dependent, and that proxy LLMs are unreliable at approximating user perceptions in privacy-sensitive scenarios. With our findings, we call for human involvement in evaluating LLM-generated responses, especially in the privacy and utility domains. We further discussed the potential for improving proxy LLMs’ ability to estimate human judgements in privacy and utility by exploring methods such as expanding the diversity of outputs and personalizing LLMs to match particular users’ needs and preferences.

Appendix↩︎

In the appendix, we provide the demographic information for the 94 participants who completed the survey in Table 3. We also provide an example scenario from PrivacyLens as a screenshot in Figure 6 to show what we presented to participants in our survey.

Table 3: Demographics information of 94 survey participants.
Demographic Count Percentage
Age
18 - 24 2 2.13%
25 - 34 22 23.40%
35 - 44 33 35.11%
45 - 54 20 21.28%
55 - 64 10 10.64%
65 - 74 6 6.38%
75 - 84 1 1.06%
Gender
Female 44 46.81%
Male 48 51.06%
Prefer to self-describe 2 2.13%
Education
No high school degree 1 1.06%
High school graduate, diploma or the equivalent 9 9.57%
Some college credit, no degree 17 18.09%
Trade, technical, vocational training 3 3.19%
Associate’s degree 12 12.77%
Bachelor’s degree 29 30.85%
Master’s degree 15 15.96%
Professional degree 1 1.06%
Doctorate degree 7 7.45%
Income
Under $25,000 9 9.57%
$25,000 to $49,999 22 23.40%
$50,000 to $74,999 18 19.15%
$75,000 to $99,999 10 10.64%
$100,000 or more 35 37.23%
Figure 6: An example privacy-sensitive scenario from the PrivacyLens dataset as presented to participants in our survey.

References↩︎

[1]
Yusuke Miura, Chi-Lan Yang, Masaki Kuribayashi, Keigo Matsumoto, Hideaki Kuzuoka, and Shigeo Morishima.2025. . In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems(Yokohama Japan, 2025-04-26). ACM, Yokohama Japan, 1–20. https://doi.org/10.1145/3706598.3714016.
[2]
Sumit Asthana, Sagi Hilleli, Pengcheng He, and Aaron Halfaker.2025. . In Proceedings of the ACM on Human-Computer Interaction(2025-05-02), Vol. 9. ACM, Yokohama Japan, 1–29. https://doi.org/10.1145/3711074.
[3]
Xin Sun, Yunjie Liu, Jos A. Bosch, and Zhuying Li.2025. . In Proceedings of the ACM on Human-Computer Interaction(2025-05-02), Vol. 9. ACM, Yokohama Japan, 1–32. https://doi.org/10.1145/3711014.
[4]
Eike Schneiders, Tina Seabrooke, Joshua Krook, Richard Hyde, Natalie Leesakul, Jeremie Clos, and Joel E Fischer.2025. . In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems(Yokohama Japan, 2025-04-26). ACM, Yokohama Japan, 1–14. https://doi.org/10.1145/3706598.3713470.
[5]
Zhiping Zhang, Chenxinran Shen, Bingsheng Yao, Dakuo Wang, and Tianshi Li.2025. . In Proceedings of the ACM on Human-Computer Interaction(2025-05-02), Vol. 9. ACM, Yokohama Japan, 1–26. https://doi.org/10.1145/3711061.
[6]
Hye Sun Yun and Timothy Bickmore.2025. . In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems(New York, NY, USA, 2025-04-25) (CHI EA ’25). Association for Computing Machinery, New York, NY, USA, 1–7. https://doi.org/10.1145/3706599.3720239.
[7]
Samuel Rhys Cox, Rune Møberg Jacobsen, and Niels van Berkel.2025. . In Proceedings of the 7th ACM Conference on Conversational User Interfaces(New York, NY, USA, 2025-07-07) (CUI ’25). Association for Computing Machinery, New York, NY, USA, 1–17. https://doi.org/10.1145/3719160.3736617.
[8]
Niloofar Mireshghallah, Maria Antoniak, Yash More, Yejin Choi, and Golnoosh Farnadi.2024. . https://openreview.net/forum?id=tIpWtMYkzU.
[9]
Zhiping Zhang, Michelle Jia, Hao-Ping (Hank) Lee, Bingsheng Yao, Sauvik Das, Ada Lerner, Dakuo Wang, and Tianshi Li.2024. . In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems(New York, NY, USA, 2024-05-11) (CHI ’24). Association for Computing Machinery, New York, NY, USA, 1–26. https://doi.org/10.1145/3613904.3642385.
[10]
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang.2023. . In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=TatRHT_1cK.
[11]
Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr.2022. . In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency(New York, NY, USA, 2022-06-20) (FAccT ’22). Association for Computing Machinery, New York, NY, USA, 2280–2292. https://doi.org/10.1145/3531146.3534642.
[12]
Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi.2023. . https://openreview.net/forum?id=gmg7t8b4s0.
[13]
Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang.2024. . Advances in Neural Information Processing Systems37(Dec.2024), 89373–89407. https://proceedings.neurips.cc/paper_files/paper/2024/hash/a2a7e58309d5190082390ff10ff3b2b8-Abstract-Datasets_and_Benchmarks_Track.html.
[14]
Helen Nissenbaum.2004. . 79, 1(2004), 119. https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10.
[15]
Helen Nissenbaum.2019. . 20, 1(2019), 221–256. https://doi.org/10.1515/til-2019-0008.
[16]
Chaoran Chen, Daodao Zhou, Yanfang Ye, Toby Jia-Jun Li, and Yaxing Yao.2025. . In Proceedings of the 30th International Conference on Intelligent User Interfaces(New York, NY, USA, 2025-03-24) (IUI ’25). Association for Computing Machinery, New York, NY, USA, 277–297. https://doi.org/10.1145/3708359.3712156.
[17]
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, and Maarten Sap.2025. 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning. https://doi.org/10.48550/arXiv.2508.07667.
[18]
Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan.2025. Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents. https://doi.org/10.48550/arXiv.2509.17488.
[19]
Tianshi Li, Sauvik Das, Hao-Ping Lee, Dakuo Wang, Bingsheng Yao, and Zhiping Zhang.2024. Human-Centered Privacy Research in the Age of Large Language Models. https://doi.org/10.48550/arXiv.2402.01994.
[20]
Xiaoyuan Wu, Weiran Lin, Omer Akgul, and Lujo Bauer.2025. Estimating LLM Consistency: A User Baseline vs Surrogate Metrics. https://doi.org/10.48550/arXiv.2505.23799.
[21]
Zhiping Zhang, Bingcan Guo, and Tianshi Li.2024. Can Humans Oversee Agents to Prevent Privacy Leakage? A Study on Privacy Awareness, Preferences, and Trust in Language Model Agents. https://doi.org/10.48550/arXiv.2411.01344.
[22]
Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, and Kamalika Chaudhuri.2025. AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents. https://doi.org/10.48550/arXiv.2503.09780.
[23]
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar.2022. . https://openreview.net/forum?id=VD-AYtP0dve.
[24]
Weiran Lin, Anna Gerchanovsky, Omer Akgul, Lujo Bauer, Matt Fredrikson, and Zifan Wang.2025. . In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems(CHI ’25). Association for Computing Machinery, New York, NY, USA, 1–24. https://doi.org/10.1145/3706598.3714025.
[25]
Kirsten Martin and Helen Nissenbaum.2015. . Columbia Science and Technology Law Review, Forthcoming2709584(2015). https://doi.org/10.2139/ssrn.2709584.
[26]
Yan Shvartzshnaider, Noah Apthorpe, Nick Feamster, and Helen Nissenbaum.2019. . 7(2019), 162–170. https://doi.org/10.1609/hcomp.v7i1.5266.
[27]
Yan Shvartzshnaider, Noah Apthorpe, Nick Feamster, and Helen Nissenbaum.2018. Analyzing Privacy Policies Using Contextual Integrity Annotations. https://doi.org/10.48550/arXiv.1809.02236.
[28]
Noura Abdi, Xiao Zhan, Kopo M. Ramokapane, and Jose Such.2021. . In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems(Yokohama Japan, 2021-05-06). ACM, 1–14. https://doi.org/10.1145/3411764.3445122.
[29]
Alisa Frik, Xiao Zhan, Noura Abdi, and Julia Bernd.2025. . (2025). https://petsymposium.org/popets/2025/popets-2025-0091.php.
[30]
Primal Wijesekera, Arjun Baokar, Ashkan Hosseini, Serge Egelman, David Wagner, and Konstantin Beznosov.2015. . 499–514. https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/wijesekera.
[31]
Primal Wijesekera, Arjun Baokar, Lynn Tsai, Joel Reardon, Serge Egelman, David Wagner, and Konstantin Beznosov.2017. . In 2017 IEEE Symposium on Security and Privacy (SP)(San Jose, CA, USA, 2017-05). IEEE, 1077–1093. https://doi.org/10.1109/SP.2017.51.
[32]
Hao Fu, Zizhan Zheng, Sencun Zhu, and Prasant Mohapatra.2019. . In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications(Paris, France, 2019-04). IEEE, 2089–2097. https://doi.org/10.1109/INFOCOM.2019.8737510.
[33]
Mia Xu Chen, Benjamin N. Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M. Dai, Zhifeng Chen, Timothy Sohn, and Yonghui Wu.2019. . In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(New York, NY, USA, 2019-07-25) (KDD ’19). Association for Computing Machinery, 2287–2295. https://doi.org/10.1145/3292500.3330723.
[34]
Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota.2023. . In Proceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia, 2023-07-26) (ICSE ’23). IEEE Press, 2149–2160. https://doi.org/10.1109/ICSE48619.2023.00181.
[35]
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan.2023. . 620, 7972(2023), 172–180. https://doi.org/10.1038/s41586-023-06291-2.
[36]
Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael A. Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, and Zehua Li. 2023. (2023). https://doi.org/10.2139/ssrn.4583531.
[37]
Yuankai Xue, Hanlin Chen, Gina R. Bai, Robert Tairas, and Yu Huang. 2024. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering Education and Training (Lisbon, Portugal, 2024-04-14). ACM, 331–341. https://doi.org/10.1145/3639474.3640076.
[38]
Weijiang Li, Yinmeng Lai, Sandeep Soni, and Koustuv Saha. 2025. In Proceedings of the 17th ACM Web Science Conference 2025 (New York, NY, USA, 2025-05-20) (Websci ’25). Association for Computing Machinery, 391–403. https://doi.org/10.1145/3717867.3717872.
[39]
Oleksandr Yermilov, Vipul Raheja, and Artem Chernodub. 2023. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (Toronto, Canada, 2023). Association for Computational Linguistics, 232–241. https://doi.org/10.18653/v1/2023.trustnlp-1.20.
[40]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI Feedback. https://doi.org/10.48550/arXiv.2212.08073.
[41]
Alessandro Acquisti, Laura Brandimarte, and George Loewenstein. 2015. Science 347, 6221 (Jan. 2015), 509–514. https://doi.org/10.1126/science.aaa1465.
[42]
Lorrie Faith Cranor. 2008. In Proceedings of the 1st Conference on Usability, Psychology, and Security (USA, 2008-04-14) (UPSEC’08). USENIX Association, 1–15.
[43]
Xiaoyuan Wu, Lydia Hu, Eric Zeng, Hana Habib, and Lujo Bauer. 2025. In Proceedings 2025 Network and Distributed System Security Symposium. Internet Society, San Diego, CA, USA. https://doi.org/10.14722/ndss.2025.230081.
[44]
Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A. Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, and Toby Jia-Jun Li. 2025. The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections. https://doi.org/10.48550/arXiv.2504.11281.
[45]
Jijie Zhou, Eryue Xu, Yaoyao Wu, and Tianshi Li. 2025. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, 1–28. https://doi.org/10.1145/3706598.3713701.
[46]
Zhiping Zhang, Bingcan Guo, and Tianshi Li. 2025. Privacy Leakage Overshadowed by Views of AI: A Study on Human Oversight of Privacy in Language Model Agent. https://doi.org/10.48550/arXiv.2411.01344.
[47]
Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pan, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, Bo An, and Qingsong Wen. 2025. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). Association for Computing Machinery, New York, NY, USA, 6216–6226. https://doi.org/10.1145/3711896.3736561.
[48]
Klaus Krippendorff. 2011. 43 (Jan. 2011). https://repository.upenn.edu/handle/20.500.14332/2089.
[49]
Santiago Castro. 2017. Fast Krippendorff: Fast computation of Krippendorff’s alpha agreement measure. https://github.com/pln-fing-udelar/fast-krippendorff.
[50]
Ayano Okoso, Keisuke Otaki, Satoshi Koide, and Yukino Baba. 2025. ACM Trans. Recomm. Syst. 3, 4 (April 2025), 55:1–55:34. https://doi.org/10.1145/3718101.
[51]
Yao Dou, Isadora Krsek, Tarek Naous, Anubha Kabra, Sauvik Das, Alan Ritter, and Wei Xu. 2024. Reducing Privacy Risks in Online Self-Disclosures with Language Models. https://doi.org/10.48550/arXiv.2311.09538.
[52]
Mahjabin Nahar, Sian Lee, Rebekah Guillen, and Dongwon Lee. 2025. Commun. ACM 68, 7 (June 2025), 29–33. https://doi.org/10.1145/3720536.
[53]
Giacomo Marzi, Marco Balzano, and Davide Marchiori. 2024. MethodsX 12 (June 2024), 102545. https://doi.org/10.1016/j.mex.2023.102545.
[54]
Haldun Akoglu. 2018. Turkish Journal of Emergency Medicine 18, 3 (Aug. 2018), 91–93. https://doi.org/10.1016/j.tjem.2018.08.001.
[55]
Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. 2024. Personalizing Reinforcement Learning from Human Feedback. https://bytez.com/docs/neurips/94141/paper.
[56]
Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2025. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 5259–5276. https://doi.org/10.18653/v1/2025.naacl-long.272.
[57]
Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, and Kaixiang Lin. 2025. Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs. https://doi.org/10.48550/arXiv.2502.09597.
[58]
William Melicher, Mahmood Sharif, Joshua Tan, Lujo Bauer, Mihai Christodorescu, and Pedro Giovanni Leon. 2016. Proceedings on Privacy Enhancing Technologies (2016). https://petsymposium.org/popets/2016/popets-2016-0009.php.
[59]
Wen Wang and Beibei Li. 2025. Information Systems Research 36, 2 (June 2025), 761–780. https://doi.org/10.1287/isre.2023.0318.
[60]
Bin Liu, Mads Schaarup Andersen, Florian Schaub, Hazim Almuhimedi, Shikun (Aerin) Zhang, Norman Sadeh, Yuvraj Agarwal, and Alessandro Acquisti. 2016. In Proceedings of the Symposium on Usable Privacy and Security (SOUPS 2016). 27–41. https://www.usenix.org/conference/soups2016/technical-sessions/presentation/liu.
[61]
Josephine Lau, Benjamin Zimmerman, and Florian Schaub. 2018. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (Nov. 2018), 1–31. https://doi.org/10.1145/3274371.
[62]
Alina Stöver, Sara Hahn, Felix Kretschmer, and Nina Gerber. 2023. Proceedings on Privacy Enhancing Technologies 2023, 2 (April 2023), 384–402. https://doi.org/10.56553/popets-2023-0059.
[63]
Basra Jehangir, Saravanan Radhakrishnan, and Rahul Agarwal. 2023. Natural Language Processing Journal 3 (June 2023), 100017. https://doi.org/10.1016/j.nlp.2023.100017.

  1. https://lmarena.ai/leaderboard↩︎

  2. https://www.prolific.com/↩︎