October 23, 2025
Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health and legal questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs’ ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users’ perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM-generated responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM-generated response. Further, we found high agreement among five proxy LLMs, while each individual LLM correlated only weakly with users’ evaluations. These results indicate that the privacy and helpfulness of LLM-generated responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs’ ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users’ perceived privacy and utility.
Large language models (LLMs) are rapidly being adopted for everyday tasks such as drafting emails, summarizing meetings, and answering health or legal questions [1]–[5]. In such uses, users may need to share private information, such as email history, contact details, and health records, with LLMs [6], [7]. Prior work has raised concerns about LLMs’ ability to keep users’ information secret, noting risks of data memorization and leakage [8]–[12]. Researchers have developed benchmarks to test LLMs’ ability to preserve user privacy [12], [13]. Mireshghallah et al. found that LLMs can identify sensitive information when asked direct questions (e.g., “How sensitive would people consider their Social Security numbers to be?”) [12]. Yet in scenarios with rich context (e.g., meeting transcripts), multi-turn user interactions (e.g., chat history), and nuanced private information (e.g., medical history shared through emails), LLMs still violate expectations, revealing sensitive details even when explicitly instructed not to [12]. Such failures may stem from the contextual nature of privacy, which depends not only on the information being shared but also on the context (e.g., who sends and receives the information) and the potential consequences of disclosure [14]–[16].
To assess LLMs’ capabilities to preserve privacy under complex scenarios, Shao et al. developed PrivacyLens, a set of real-life scenarios with private information and tasks for LLMs to complete [13]. With the scenarios from PrivacyLens, researchers have evaluated and proposed ways to improve LLMs’ ability to preserve users’ privacy [13], [17], [18]. Additionally, prior work has pointed out that privacy and helpfulness need to be evaluated together because an LLM can generate a response that keeps all potentially sensitive information private, but at the cost of not being helpful at all [9], [13]. Although several prior works have called for involving humans in evaluations of LLM-generated content [19]–[21], research on LLMs’ ability to preserve privacy and provide utility has so far relied on judgements provided by another LLM (a so-called proxy LLM) [12], [13], [18], [22]. The lack of human evaluations makes it unclear whether proxy LLMs’ judgements can be used to estimate users’ perceptions of privacy and helpfulness.
In this paper, we present a study investigating users’ perceptions of the privacy and helpfulness of LLM-generated responses, and to what extent proxy LLMs can approximate those perceptions. More specifically, we answer the following research questions:
RQ1 How do users perceive the privacy-preserving quality and helpfulness of LLM-generated responses in privacy-sensitive scenarios?
RQ2 Are proxy LLM evaluations of helpfulness and privacy-preserving quality aligned with users’ perceptions?
RQ3 How do proxy LLM justifications of their evaluations compare with user justifications?
We answer these research questions by first designing a survey to collect evaluations of privacy-preservation quality and helpfulness of LLM-generated responses. We randomly selected 90 scenarios from PrivacyLens, generated a response to each scenario, and recruited 94 online participants to provide their evaluations. We further used five proxy LLMs to complete the same survey and compared their evaluations with participants’ evaluations. Prior work has shown LLMs can respond differently when given the same input multiple times [20], [23], [24]; therefore, we asked each proxy LLM to evaluate every scenario five times (we call these five runs) to check for consistency. In the survey, we also asked participants and proxy LLMs to provide an explanation for each of their choices to help us understand the reasons behind their evaluations.
Across all scenarios, we found participants rated LLM-generated responses as helpful and privacy-preserving over 75% of the time. However, participants often disagreed with each other when assessing the same scenario (Krippendorff’s \(\alpha=0.36\)). From participants’ explanations, we discovered that participants had different views on what constitutes a helpful response in the scenario. For example, when reviewing the same scenario, one participant found the LLM-generated response to effectively complete the given task, while another participant found the response to miss key details. Further, different participants disagreed on what information needs to be kept secret. One participant, for example, found the LLM-generated email to not respect privacy norms because it included information they believed should only be discussed “face-to-face, and not in an electronic communication”. Another participant found the same email to respect privacy norms because it did not share sensitive details and was thus acceptable to send (4.1). In contrast, individual proxy LLMs often agreed with themselves across five runs (\(\alpha>0.88\)). The five proxy LLMs also had a moderate agreement with each other’s evaluations (\(\alpha=0.78\)) on both helpfulness and privacy-preservation quality. More importantly, we found proxy LLMs to be poor estimates of participants’ evaluations, correlating only weakly to moderately with mean human judgements across 90 scenarios (Spearman’s \(0.24\leq \rho\leq 0.68\)). We further found that, when evaluating each scenario, proxy LLMs were not able to capture the wide range of evaluations participants provided (4.2). Our qualitative analysis revealed that proxy LLMs sometimes missed contextual information (e.g., public post on Facebook), failed to recognize private information (e.g., credit card numbers), or were not aligned with participants’ privacy views (e.g., determined clients’ first names to be sharable, non-sensitive information) (4.3).
With our findings, we argue that human-centered evaluation should remain essential in privacy- and utility-related LLM evaluations. We further suggest exploring the possibility of improving proxy LLM-based evaluations by broadening the range of generated evaluations. Additionally, as a complementary approach, we suggest personalization—aligning proxy LLMs’ evaluations to users’ individual privacy and utility preferences. Last, we underscore the need to distinguish tasks where consistency is appropriate (e.g., objective questions) from those where diversity is desirable (preferences and perceptions). Establishing such taxonomies can guide study design, metric selection, and inform both users and researchers on when to rely on or go beyond deterministic LLM-generated responses (5).
In this section, we review prior work closely related to our study. We start with prior research on contextual privacy, as it provides the foundation for our study of users’ privacy perceptions (2.1). We then summarize prior research on the helpfulness and privacy of LLM-generated content (2.2). Finally, we draw attention to the importance of conducting user studies in the evaluation of LLMs (2.3).
Contextual Integrity frames privacy as the appropriateness of information flows given five parameters: sender, recipient, subject, information type, and transmission principle [14], [15]. Empirical work shows that users’ privacy judgements rely on these contextual variables [25]. Researchers have applied contextual integrity as a framework to understand online privacy policies [26], [27], users’ perceptions of smart home device privacy [28], [29], as well as mobile app permission systems [30]–[32]. Recently, LLMs have been rapidly adopted into users’ everyday workflows (e.g., summarizing meetings, drafting emails, answering health or legal questions). Unlike earlier studies that often examined bounded ecosystems (e.g., smart homes or app permissions), LLM-generated responses could be passed between tools and people, increasing the chance of sensitive details reaching unintended audiences. Mireshghallah et al. investigated LLMs’ ability to keep information secret through a benchmark (ConfAIde) rooted in contextual integrity [12]. Using the benchmark, they found LLMs are capable of identifying private information according to social norms in simple multiple-choice questions. However, they discovered that LLMs, when tasked to generate a meeting summary, surfaced sensitive details that contextual integrity would deem inappropriate. Their findings, together with the complex nature of contextual privacy, lay the groundwork for further investigation into LLMs’ ability to preserve privacy and how users perceive privacy in LLM interactions.
Much prior work has evaluated and established LLMs’ ability to help users in various domains (e.g., summarizing text, drafting emails, answering medical questions, recommending products) [1], [2], [24], [33]–[38]. Yet privacy-preserving behavior at inference time has received comparatively less attention [8], [12], [13]. ConfAIde adapts contextual integrity into a benchmark and shows that LLMs disclose sensitive details in open-ended generation [12]. More recently, PrivacyLens builds on ConfAIde by collecting 493 privacy-sensitive seeds grounded in regulations, prior privacy literature, and crowdsourcing, then expanding them into expressive scenarios. These scenarios capture complex, task-oriented contexts (e.g., multi-turn message histories, email exchanges) in which LLMs are asked to complete a user task (e.g., drafting a social media post from prior messages). Using this framework, the authors report that LLMs can leak sensitive information in 26–39% of the evaluated scenarios [13]. Further, in LLM-assisted tasks, a trade-off between privacy and helpfulness arises: aggressive redaction or evasive replies can reduce disclosure but undermine utility, whereas detailed answers improve usefulness while increasing risk [9], [39], [40]. Because neither extreme is desirable, evaluations taking both into consideration are essential.
Work across human-computer interaction, usable security and privacy, as well as natural language processing shows that privacy and helpfulness are both context-dependent and user-perceived outcomes. Perceived privacy is often subjective—people judge acceptability by social norms, roles, purposes, and audiences [14], [15], [41]. Usability research shows that notice-and-choice mechanisms do not reliably predict users’ privacy comfort without empirical feedback [42], [43]. Similarly, evaluations of privacy-preservation quality and helpfulness on LLM-generated content should also involve user input. In both ConfAIde and PrivacyLens, the authors used LLMs to generate responses to the given task before asking a proxy LLM to judge the privacy-preservation quality and helpfulness. These proxy LLMs are used as stand-ins for human evaluations, but it is unclear whether assessments provided by proxy LLMs faithfully estimate users’ perceptions.
Therefore, our study fills the gaps in prior work through a user study of privacy-preservation quality and helpfulness of LLM-generated responses. Additionally, we compare participants’ evaluations to proxy LLM assessments to understand the alignment between human and LLM judgements.
To understand users’ perceptions of how LLMs can be helpful while preserving privacy in privacy-sensitive scenarios (RQ1), we designed a survey with five-point Likert-scale questions on privacy perception and utility and asked participants to explain their choices (3.1). Prior work used proxy LLMs (explained in 1) to automatically evaluate LLM-generated responses [12], [13]. We wanted to know how well proxy LLMs can estimate human judgements in privacy-sensitive scenarios (RQ2). Therefore, we used five different proxy LLMs to complete the same survey as in the user study (3.2). Lastly, we describe how we compared participants’ answers to proxy LLMs’ evaluations (3.3). We provide an overview of our study design in 1.
For the user study, we first needed a series of privacy-sensitive scenarios. We randomly selected 90 scenarios from the PrivacyLens dataset [13], a dataset developed to evaluate LLMs’ ability to preserve private information in everyday contexts and used by many prior works (e.g., [17], [44]–[47]). Each scenario includes contextual information (e.g., notes taken during a meeting, email history) and a task (e.g., writing a meeting summary, replying to an email). Both the contextual information and the task are provided by PrivacyLens in JSON format. We converted the scenarios into HTML without changing the content in order to present them to participants in a readable format. We provide an example of what participants saw in 6.
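As an illustration of this conversion step, below is a minimal Python sketch; the scenario field names (`context`, `task`) are placeholders for illustration, not PrivacyLens’s actual schema.

```python
# Minimal sketch of rendering a PrivacyLens-style scenario as HTML for the survey.
# The field names ("context", "task") are illustrative, not the dataset's actual schema.
import html
import json

def scenario_to_html(scenario_json: str) -> str:
    scenario = json.loads(scenario_json)
    context = html.escape(scenario["context"])  # e.g., meeting notes or email history
    task = html.escape(scenario["task"])        # e.g., "reply to the last email"
    return (
        "<div class='scenario'>"
        f"<h3>Context</h3><p>{context}</p>"
        f"<h3>Task</h3><p>{task}</p>"
        "</div>"
    )

example = json.dumps({"context": "Email history between John and Emily ...",
                      "task": "Write a reply to Emily's last email."})
print(scenario_to_html(example))
```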
For each scenario, we used OpenAI’s ChatGPT-5 (GPT-5) to generate a response with the privacy-enhancing prompt from the PrivacyLens paper [13]. We chose the privacy-enhancing prompt because Shao et al. have shown that it lowers the likelihood of LLMs leaking secrets. GPT-5 was chosen because, according to LMArena, it is one of the newest and best-performing LLMs we had access to.
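A sketch of this generation step with the OpenAI Python SDK is shown below; the system prompt here is a placeholder standing in for the actual privacy-enhancing prompt from the PrivacyLens paper, and the `gpt-5` model identifier is an assumption.

```python
# Sketch of generating one response per scenario via the OpenAI API.
# PRIVACY_ENHANCING_PROMPT below is a placeholder, not the actual PrivacyLens prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PRIVACY_ENHANCING_PROMPT = (
    "You are an assistant completing the user's task. Do not reveal information "
    "that would be inappropriate to share in this context."
)

def generate_response(context_text: str, task_text: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-5",  # assumed model identifier
        messages=[
            {"role": "system", "content": PRIVACY_ENHANCING_PROMPT},
            {"role": "user", "content": f"Context:\n{context_text}\n\nTask:\n{task_text}"},
        ],
    )
    return completion.choices[0].message.content
```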
We recruited participants using Prolific, an online crowdsourcing platform for recruiting research participants. Participants had to be at least 18 years old, fluent in English, and located in the United States to be eligible. We received 95 completed surveys and rejected one because the participant’s answers to all open-ended questions did not address the questions asked.
Potential participants who clicked on our study link on Prolific first encountered an informed consent form. Participants who agreed to participate and met all the requirements proceeded to the survey. Within the survey, we first showed participants an example scenario and LLM-generated response, along with instructions on how to complete the survey. Participants were then asked to answer seven survey questions, three on helpfulness and four on privacy-preserving quality, about the LLM-generated response to the scenario. We provide an example of what participants saw in 6. For each participant, this process repeated with five different scenarios. Each scenario was shown to five participants, so we could collect evaluations from participants with varying perspectives.
For each scenario, we asked participants seven questions in the survey. Three assessed helpfulness: (H1) a Yes/No question on whether the LLM-generated response completed the task, plus two five-point Likert questions on (H2) perceived helpfulness and (H3) whether the participant would use the response if they were the person in the scenario. Four more five-point Likert questions assessed privacy-preserving quality: (P1) perceived sensitivity of the information in the scenario, (P2) likelihood of sharing such information with an LLM, (P3) judgement on whether the response respects privacy norms, and (P4) judgement on whether the response respects personal privacy preferences. For all survey questions but H1, we asked participants to provide an explanation for their answers.
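For reference, the seven survey items can be summarized in a structure like the sketch below; the question wording is paraphrased from the descriptions above rather than the exact survey text.

```python
# Paraphrased survey items (not the exact survey wording). H1 is Yes/No; all other
# items use a five-point Likert scale and request a free-text explanation.
SURVEY_ITEMS = {
    "H1": ("yes/no",   "Did the response complete the task in the scenario?"),
    "H2": ("likert-5", "How helpful is the response?"),
    "H3": ("likert-5", "Would you use the response if you were the person in the scenario?"),
    "P1": ("likert-5", "How sensitive is the information in the scenario?"),
    "P2": ("likert-5", "How likely would you be to share such information with an LLM?"),
    "P3": ("likert-5", "Does the response respect privacy norms?"),
    "P4": ("likert-5", "Does the response respect your personal privacy preferences?"),
}
```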
At the end of the survey, we collected participants’ demographic information. Participants in our study came from a wide range of backgrounds in terms of age, gender, education level, and income. 46% of the participants were female and 51% were male. The most represented age group was between 35 and 44 years old (35% of responses), followed by 25 to 34 years old (23%) and 45 to 54 years old (21%). We provide a detailed breakdown of participants’ demographic information in Table 3.
In addition to human participants, we used five proxy LLMs to complete the same survey: OpenAI ChatGPT-5 (GPT-5), Llama-3.3-70B-Instruct (Llama-3.3), Gemma-3-27B (Gemma-3), Qwen-3-30B-A3B-IT (Qwen-3), and Mistral-7B-Instruct-v0.3 (Mistral). GPT-5 was selected because it is the newest closed-source model we had access to. The open-weights models Llama-3.3, Gemma-3, Qwen-3, and Mistral were selected because they were, at the time of our experiment, the newest and largest models that fit on our A100 GPU. We used these models to answer the same survey questions from the user study for all 90 scenarios. This approach follows prior work that used LLMs as proxies for human judgements in evaluating the privacy-preserving qualities of LLM outputs [12], [13], [17]. We used the default parameters (e.g., temperature) for each model for two reasons: (1) we could not change certain of GPT-5’s parameters because it is a closed-source model, and (2) we wanted to keep the models’ behavior consistent with how users would normally use them. Since LLMs can produce different responses when given identical inputs, we collected five or more evaluations per scenario from each model to capture variability in responses.
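A sketch of this collection loop is shown below; `ask_model` is a hypothetical helper standing in for whichever client (API call or local inference) serves each proxy model, and default sampling parameters are left untouched.

```python
# Sketch of collecting repeated proxy-LLM evaluations with default sampling settings.
# `ask_model` is a hypothetical stand-in for the per-model client; it takes a model
# name and a survey prompt and returns the model's raw answer.
from typing import Callable, Dict, List

N_RUNS = 5  # repeated evaluations per scenario to capture response variability

def collect_evaluations(
    ask_model: Callable[[str, str], str],
    proxy_models: List[str],
    survey_prompts: Dict[int, str],  # scenario id -> full survey prompt
) -> Dict[str, Dict[int, List[str]]]:
    evaluations: Dict[str, Dict[int, List[str]]] = {}
    for model in proxy_models:
        evaluations[model] = {}
        for scenario_id, prompt in survey_prompts.items():
            evaluations[model][scenario_id] = [
                ask_model(model, prompt) for _ in range(N_RUNS)
            ]
    return evaluations
```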
To understand users’ perceptions of the privacy and utility of LLMs’ responses, we first examined participants’ survey answers collectively without considering individual scenarios. We aggregated participants’ choices by question and share our findings in 4.1.1. Since at least five participants evaluated each scenario, we treat the participants as different coders and compute Krippendorff’s alpha (\(\alpha\)) for ordinal data [48], [49] to assess the level of agreement between participants. We share our findings on scenario-level agreement between participants in 4.1.2. In the same manner, we used Krippendorff’s \(\alpha\) to measure intra- and inter-proxy-LLM agreement to assess their self-consistency and cross-model agreement, respectively [48], [50]–[52]. We share results of these analyses in 4.2.
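As a sketch of this agreement computation, the snippet below uses the `krippendorff` Python package (an assumption on our part; any implementation of ordinal Krippendorff’s \(\alpha\) would do) on a toy coders-by-items rating matrix.

```python
# Sketch of computing ordinal Krippendorff's alpha with the `krippendorff` package.
# Rows are coders (e.g., participants), columns are items; values are ratings on a
# 1-5 scale, and np.nan marks ratings a coder did not provide. The matrix is toy data.
import numpy as np
import krippendorff

ratings = np.array([
    [4, 5, 3, 4, np.nan],
    [4, 4, 2, 5, 3],
    [5, 4, 3, 4, 3],
    [3, 5, 2, 4, 4],
    [4, 4, 3, 5, 3],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Ordinal Krippendorff's alpha: {alpha:.2f}")
```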
Beyond quantitative analysis of how users and proxy LLMs evaluated the privacy and helpfulness of LLM-generated responses, we also sought to understand the rationale behind the evaluations. For that reason, in the survey (see 3.1), we asked participants to provide an open-response explanation of their choices. Two researchers coded participants’ answers using a thematic analysis approach: one researcher first coded 10 answers for each question, producing a codebook with themes. After that, the two researchers reviewed the codes and their applications together and discussed and resolved any disagreements by updating the codes or their applications. Next, the two researchers split the remaining answers and applied the codes from the agreed-upon codebook, creating and discussing new codes as they emerged. We share qualitative results throughout 4 to support our findings.
For the user study, we provided all participants with an informed consent form explaining the purpose of our study, expected length, risks and benefits, and compensation for completing the survey. Only participants who gave their consent proceeded to the survey. Participants took an average of 47 minutes to complete the survey and were compensated $10 for their time. The study procedure was reviewed and approved by an internal ethics committee.
In this section, we first share results from the user study that help us understand how users judged the privacy-preservation quality and helpfulness of LLM responses (addressing RQ1; see 4.1). We then compare participants’ judgements to LLMs’ assessments to check whether proxy LLMs provide a good approximation of how people might react to the privacy-preservation quality and helpfulness of LLM-generated responses (addressing RQ2; see 4.2). Lastly, we examine in detail how users’ explanations of their evaluations compare to LLMs’ explanations (addressing RQ3; see 4.3). We use the term evaluation to refer to a single participant’s evaluation of one scenario. With 94 participants each evaluating five scenarios, there are a total of 470 evaluations.
We first report on participants’ evaluations of each scenario to answer how users perceived the privacy-preservation quality and helpfulness of LLM-generated responses (4.1.1). After that, we report on the level of agreement between participants per scenario (4.1.2).
Individual participants generally found LLM-generated responses to privacy-sensitive scenarios helpful and expressed interest in using them. Over 90% of participants’ evaluations indicated the response completed the task given in the scenario. 87% of the evaluations indicated the response was helpful; participants further said they would use the response to complete the task 84% of the time (see 2).
Participants indicated that GPT-5’s responses mostly or completely complied with privacy norms 78% of the time. When asked about whether the response respects their personal privacy preferences, 83% of the time participants indicated it did (see 3).
With that said, 13% of evaluations indicated GPT-5’s response did not respect privacy norms, with one participant explaining,
The response directly shares identifiable client information… –P78
Participants who said a response did not fit their personal privacy preferences found it to have excessive private information, such as medical history or login credentials. For example, one participant reacting to the LLM-generated response to scenario 15, which involved password sharing, explained that,
…[the response did] not respect my personal privacy preferences because it shares login credentials. –P13
While, collectively, participants found GPT-5’s responses helpful and mostly privacy-preserving, there were notable disagreements between participants when rating the same scenario. Within each scenario, treating every participant (\(\geq 5\) per scenario) as an individual coder, we found participants had low agreement with each other (\(\alpha=0.3573\)) [48].
With that said, participants’ disagreements varied by scenario. For example, on 6% of scenarios, participants completely agreed with each other on the level of helpfulness (see Helpfulness of Response in the Participants subfigure of 4). At least two participants completely disagreed with each other on whether they would use the response on 20% of scenarios (see Would use Response in the Participants subfigure of 4). At least one participant had a different rating in all 90 scenarios when asked about the sensitivity of the information in the scenario and whether they would share such information with an LLM (see Information Sensitivity and Would Share Information w/ LLM in the Participants subfigure of 4). When asked about the response respecting privacy norms and personal privacy preferences, we saw in 88% and 95% of the scenarios, respectively, at least one participant had a different judgement (see Complies with Privacy Norms and Respects Personal Privacy in the Participants subfigure of 4).
To better understand the quantitative results, we qualitatively coded participants’ explanations of their choices and found that participants had reasonable justifications for their assessments, even when those assessments diverged greatly on the same scenario. For example, when judging the helpfulness of scenario 42—writing an email to a colleague about team morale—one participant said the response was very helpful because
It provides enough information to inform while maintaining a casual and approachable style. –P50
while a different participant said the response was very unhelpful because
The response did not mention the most important and specific information about the business… –P63
Participants also had different evaluations of the privacy-preservation quality of identical scenarios. P21 believed the response to scenario 10—sending a message about Emily’s school experience—completely followed privacy norms because
…only relevant information was shared… –P21
In contrast, another participant thought the same response did not at all respect privacy norms, explaining
The response includes too much private info about Emily and the difficulties she is facing… –P15
Our quantitative and qualitative analysis showed that participants had diverse judgements on the privacy-preserving quality and helpfulness of LLM-generated responses in privacy-sensitive scenarios. Further, individual participants had reasonable explanations for their judgements, which often depended on the context of the scenario. We discuss implications of our findings in 5.
Result 1: Individual participants found LLM responses to privacy-sensitive scenarios helpful and privacy-preserving most of the time. However, when evaluating the same scenario, participants disagreed with each other on both the privacy and utility of LLM responses. Through participants’ explanations of their choices, we observed that such differences came from nuances in individual perspectives and contextual factors within scenarios.
We asked five LLMs (Gemma-3, GPT-5, Llama-3.3, Mistral, and Qwen-3) to answer the same set of questions given to participants in the user study. We refer to these five as proxy LLMs to distinguish them from the LLM, GPT-5, that was used to generate responses to each scenario. We compared participants’ judgements to proxy LLMs’ evaluations to determine how well they can estimate user perceptions.
| Agreement per scenario | Krippendorff’s \(\alpha\) |
|---|---|
| Between \(\geq\) 5 participants’ evaluations | 0.3573 |
| Within Gemma-3 (5 runs) | 0.9788 |
| Within GPT-5 (5 runs) | 0.9328 |
| Within Llama-3.3 (5 runs) | 0.9830 |
| Within Mistral (5 runs) | 0.8823 |
| Within Qwen-3 (5 runs) | 0.9848 |
| Across the 5 proxy LLMs (25 evaluations) | 0.7770 |
Prior work has documented that LLMs may respond to the same prompt in different ways [20], [23], [24] across multiple runs, so we asked each proxy LLM to rate each scenario five times (i.e., five runs), totaling \(5\times 90 = 450\) evaluations per proxy LLM. We found proxy LLMs to be quite consistent in their evaluations. For each proxy LLM, treating one evaluation per run as an individual coder, we found that the lowest Krippendorff’s \(\alpha\) was 0.88 (Mistral), and the highest was 0.98 (Qwen-3), as shown in 1. In addition to being self-consistent, we found that proxy LLMs also moderately agreed (\(\alpha=0.77\)) with each other across the 25 evaluations per scenario.
| Spearman \(\rho\) | Gemma-3 | GPT-5 | Llama-3.3 | Mistral | Qwen-3 |
|---|---|---|---|---|---|
| Helpfulness Questions | 0.31*** | 0.34*** | 0.24*** | 0.25*** | 0.24*** |
| Privacy Questions | 0.60*** | 0.68*** | 0.27*** | 0.07 | 0.62*** |
In 4.1.2, we shared that participants often disagreed with each other when evaluating the same scenarios. Nevertheless, we calculated the mean participant evaluation per scenario per survey question and examined whether proxy LLMs could estimate it. We computed Spearman’s rank correlation coefficient (\(\rho\)) between the mean participant evaluations and the proxy LLM evaluations and found only weak or moderate correlations (\(0.24\leq\rho\leq 0.68\)) across all survey questions (see 2).
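The sketch below shows how this correlation check could be computed with SciPy; the arrays are random placeholder data for illustration, not our study data.

```python
# Sketch of the Spearman correlation between mean participant ratings and one proxy
# LLM's ratings across 90 scenarios, using placeholder data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
mean_participant_ratings = rng.uniform(1, 5, size=90)                       # placeholder
proxy_llm_ratings = mean_participant_ratings + rng.normal(0, 1.2, size=90)  # placeholder

rho, p_value = spearmanr(mean_participant_ratings, proxy_llm_ratings)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```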
We converted participants’ and proxy LLMs’ evaluations on the five-point Likert scale to numeric values (-1 meaning completely disagree, 1 meaning completely agree, and 0 meaning neutral). Using these values, we calculated the standard deviation of participants’ and proxy LLMs’ evaluations within each scenario. Proxy LLMs had no disagreements across five runs on over 70% of the scenarios when asked about the helpfulness of a response or whether they would use the response given the scenario. In contrast, participants only fully agreed with each other on 10% or fewer of the scenarios (see Helpfulness of Response and Would Use Response of 5). When asked about the sensitivity of the information in the scenario, participants never fully agreed with each other. Proxy LLMs, on the other hand, showed consistent evaluations on over 50% of the scenarios. Interestingly, Gemma-3, GPT-5, Llama-3.3, and Qwen-3 fully agreed on whether they would share the information in the scenario with an LLM in over 94% of the scenarios (see Information Sensitivity and Would Share Information w/ LLM of 5). Further, when asked to evaluate the LLM-generated responses’ compliance with privacy norms, we found Mistral fully agreed with itself on 8% fewer scenarios than participants. In contrast to participants fully agreeing on 12% of the scenarios, Gemma-3, GPT-5, Llama-3.3, and Qwen-3 fully agreed on 71% of the scenarios (see Complies with Privacy Norms of 5). Finally, participants fully agreed with each other on only 4% of the scenarios when asked about the response respecting their personal privacy preferences. Proxy LLMs’ evaluations were much more consistent, fully agreeing on 91% (Gemma-3), 80% (GPT-5), 89% (Llama-3.3), 44% (Mistral), and 100% (Qwen-3) of the scenarios.
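A minimal sketch of this conversion and spread computation follows; the scale labels and the intermediate values (-0.5 and 0.5) are assumptions for illustration, since only the endpoints and midpoint are specified above.

```python
# Sketch of mapping five-point Likert answers to [-1, 1] and measuring within-scenario
# spread. The labels and the -0.5/0.5 intermediate values are assumed for illustration.
import numpy as np

LIKERT_TO_NUMERIC = {
    "completely disagree": -1.0,
    "somewhat disagree": -0.5,
    "neutral": 0.0,
    "somewhat agree": 0.5,
    "completely agree": 1.0,
}

def within_scenario_std(ratings_by_scenario):
    """Standard deviation of numeric ratings per scenario; 0.0 means full agreement."""
    return {
        scenario_id: float(np.std([LIKERT_TO_NUMERIC[r] for r in ratings]))
        for scenario_id, ratings in ratings_by_scenario.items()
    }

example = {10: ["completely agree", "somewhat agree", "completely disagree",
                "somewhat agree", "neutral"]}
print(within_scenario_std(example))
```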
Our results reveal that even with multiple proxy LLMs, the wide range of users’ perceptions regarding the privacy and utility of LLM-generated responses in privacy-sensitive scenarios is often not captured. We make suggestions for future evaluations and potential improvements to better align proxy LLMs with user perceptions in 5.
Result 2: Proxy LLMs’ assessments of helpfulness and privacy preservation are generally consistent both within and across models (intra-model \(\alpha\ge 0.88\); inter-model \(\alpha=0.77\)). However, proxy LLMs’ assessments align only weakly to moderately with participants’ assessments (\(0.24\leq \rho\leq 0.68\)). Moreover, proxy LLMs’ assessments fail to capture the substantial differences in how different participants assess the same scenarios.
In 4.1.2, we investigated why participants disagreed with each other by examining their explanations. Prior work has shown that LLMs can explain their reasoning processes [17], [36]. Hence, just as we asked participants to explain their choices, we asked proxy LLMs to explain their choices when evaluating each scenario. We qualitatively analyzed explanations in scenarios where proxy LLMs and participants disagreed the most. We found that while proxy LLMs’ explanations were, on the surface, reasonable justifications for their choices, they often missed key details (e.g., who the information is being shared with, nuances in the given task) within the scenario, leading to misalignment with participant evaluations.
We found that proxy LLMs sometimes overlooked nuances (e.g., the audience of a message, the intended task to complete) when evaluating LLM-generated responses to privacy-sensitive scenarios, leading to evaluations misaligned with participants’. Scenario 20 asks an LLM to act as Jane and generate a Facebook post about her recent life. The LLM generated a post including details about Jane’s interview with Alex for her work. Qwen-3, as the proxy LLM, indicated it was extremely likely to use the LLM-generated post if it were acting as Jane because
…The post drafted aligns perfectly with the task of posting on my Facebook page, as it highlights a personal story… –Qwen-3
For the same scenario, four out of five participants indicated they were extremely or somewhat unlikely to use the response, with one participant explaining,
…it shares confidential interview information that could harm Alex professionally and personally… –P38
Similarly, GPT-5 as the proxy LLM missed the nuances in the task given in scenario 65. In this scenario, three college students (John, Emily, and Alex) are discussing their experiences finding a job. Acting as John, with the task of replying to Emily’s last message about her nervousness, the LLM generated a response talking about Alex’s interview time. GPT-5 as the proxy LLM indicated it was extremely likely to use the response in the scenario because,
…It directly answers [Emily’s] question using known information… –GPT-5
In contrast, four out of five participants indicated they were unlikely to use the response, with one participant explaining,
Emily asked for job hunting tips and the response just summarized the previous conversations… –P81
By comparing proxy LLMs’ and participants’ explanations, we found that misalignments in privacy evaluations were caused by proxy LLMs failing to identify private information (e.g., credit card information), having different perceptions of what constitutes privacy norms, and not capturing the diversity of participants’ privacy preferences.
When asked to evaluate the sensitivity of the information in a scenario, proxy LLMs can fail to recognize crucial private details. In scenario 78, Sarah sent credit card details (i.e., card number, expiration date, and three-digit CVV) to John through messaging. The proxy LLM (Mistral) indicated the information shared in the scenario was moderately sensitive, completely missing the credit card details, and explained,
…there is no explicitly sensitive information like financial data, location, or private conversations… –Mistral
Participants, on the other hand, found the credit card information to be very or extremely sensitive, with one participant explaining
[The message] includes full credit card details, which is highly private and should never be shared. –P70
In the survey, we asked proxy LLMs and participants to evaluate how well the LLM-generated responses comply with privacy norms. Through their explanations, we found the proxy LLM (Llama-3.3) to have a different interpretation of privacy norms. For example, in scenario 50, the LLM generated a response as John sharing his day at work with his friend Mary. The response included the amount of a tax refund John was able to help his client get. The proxy LLM (Llama-3.3) evaluated the response to be somewhat respectful of privacy norms, stating
…the mention of the $3,250 tax refund could be seen as slightly sensitive…Nonetheless, it doesn’t seem to significantly breach any privacy norms… –Llama-3.3
In contrast, four out of five participants indicated that such a response did not at all respect privacy norms, explaining that
[The response] is mentioning the [dollar amount] and the name of the client…–P54
In scenario 60, Jane wanted to send her friend Emily a message about her day. The LLM, acting as Jane, generated a message that included her clients’ names, Sarah and Tom. The proxy LLM (Qwen-3) indicated the generated message mostly respected privacy norms, explaining that
The response shares general, non-sensitive information about the day’s work, such as the names of clients (Sarah and Tom), their progress, and the emotional tone of the day…. –Qwen-3
When evaluating the same scenario, two participants found the LLM-generated message to be not at all respectful of privacy norms, with one of them explaining
The response directly shares identifiable client information (e.g., “Sarah”, “Tom”, and the “New Beginnings” program) in a casual text message, which violates core privacy norms in support-based professions. –P78
We established quantitatively in 4.1 that participants have wide-ranging views on the privacy-preserving quality of LLM-generated responses. Through qualitative analysis, we found proxy LLMs to often have semantically similar explanations across runs, lacking the nuance to capture differences shown by participants.
For example, we previously shared that P78 found the LLM-generated message in scenario 60 to not at all respect privacy norms. In contrast, P81 evaluated the same message as completely respecting privacy norms, saying no privacy norm was breached. Moreover, in scenario 11, Jane tasked an LLM with drafting an email to her previous advisor asking for advice on a legal case she is working on. Three participants strongly agreed that the email respected their personal privacy preferences, with one explaining,
The response is overall very generic when referencing the case… –P22
However, two other participants found that the LLM-generated email violated their privacy preferences because sensitive details were revealed,
While the response avoids sharing attachments or client communications, it still reveals sensitive strategies and evidence details. I’d prefer only general requests for strategic advice, excluding confidential defense tactics and case specifics. –P12
The diverse preferences of participants were not captured by the proxy LLMs. Each proxy LLM rated this scenario the same across all five runs. Gemma-3 consistently found the LLM-generated email to respect its privacy preferences, explaining,
The generated email is well-written, …, and respects confidentiality. It is exactly the type of email I would expect… –Gemma-3
The qualitative analysis of explanations strengthens our finding that proxy LLMs are not good estimates of participants’ evaluations. We discuss implications in detail in 5.
Result 3: Proxy LLMs provided reasonable explanations for their evaluations of the privacy-preservation quality and helpfulness of LLM-generated responses in privacy-sensitive scenarios. However, proxy LLMs occasionally missed key details and/or nuances in the scenario, which led to misalignments between proxy LLMs’ and participants’ evaluations. Further, proxy LLMs provided a narrow range of evaluations and explanations, failing to capture the diverse judgements of users.
The diverse opinions from participants’ evaluations and the weak-to-moderate correlation between participants’ and proxy LLMs’ evaluations highlight the limited ability of proxy LLMs to estimate human judgements regarding privacy and utility. In this section, we discuss the implications of our findings and suggest several directions for future work to improve proxy LLMs’ ability to estimate human judgements in these areas. We further raise the question of when consistency is desirable versus when diversity is more appropriate in LLM responses.
Our findings indicate that proxy LLMs are poor stand-ins for users in the types of scenarios we studied. Across scenarios, proxy LLMs produced a narrow band of evaluations over repeated runs, whereas participants expressed a more diverse range of judgements. This divergence suggests that the common practice of relying on proxy LLMs to assess LLM-generated content is inadequate, particularly for privacy and utility decisions where individual differences matter. In line with prior work advocating human-centered evaluations [19]–[21], we argue that human evaluations should remain a key component in assessing LLM-generated content.
We identify two complementary directions for future work. The first is to improve proxy LLMs’ capability to approximate the broad range of human judgements. Prior work has typically configured proxy LLMs to produce deterministic outputs (e.g., temperature = 0) and collapsed evaluations to a single statistic (e.g., a leakage rate). Our results show that, while proxy LLMs are less variable than humans, they still produce differing evaluations for up to 55% of scenarios. Researchers should characterize and report this range and compare it with the range of human opinions. Future work could explore ways to expand proxy LLMs’ range of evaluations of privacy and utility, potentially through training, prompting, adjusting model settings (e.g., temperature), or other inference-time techniques, to better reflect the diversity in human perceptions.
Another potential direction for future work is personalizing proxy LLMs to better reflect individuals’ privacy and utility preferences. Prior work has shown that LLMs can be adapted to individual users [55]–[57]. In the context of privacy and utility, researchers have also found ways to learn users’ privacy preferences and provide users with privacy-related options for online tracking and smart home devices [58]–[62]. Personalizing proxy LLMs could therefore lead them to match individual privacy and utility preferences more closely. One potential method, drawing from the computer vision and named entity recognition (NER) fields [63], is to investigate how much different components of a scenario inform proxy LLMs’ judgements. Certain components, such as the type of data being shared, the purpose or intent of sharing, or the audience, may most strongly influence a proxy LLM’s judgements. Identifying these components can help researchers understand which elements are most critical for accurate privacy and utility assessments, and in turn inform the alignment between proxy LLMs and human evaluations. Improving the fidelity of user preference estimation would make proxy LLM evaluations more practical and cost-effective for researchers (as they are often easier to set up and cost less than user studies).
While much prior work has emphasized the importance of consistency in LLM-generated content, our results highlight domains where convergence on a single answer is not necessarily appropriate. In settings where there is a clear, objective answer (e.g., answering factual questions, providing straightforward medical diagnoses, supplying best practices for computer security), consistency is desirable. In contrast, privacy and perceived utility are subjective, determined by the context and individual preferences (e.g., a fitness app tracking users’ location) [14], [41]. When an LLM is positioned as a proxy for human judgements in such domains, producing a range of evaluations rather than a single judgement could better approximate the population it stands in for. We call for clearer taxonomies separating objective-answer tasks from preference- and perception-sensitive tasks. Making these distinctions can help guide study designs and evaluation metrics, as well as decisions about when to rely on or go beyond deterministic proxy LLM evaluations. Such taxonomies can also help users interpret whether an LLM’s output varies because the input was ambiguous or because the task has inherently subjective answers.
In this study, we collected 94 participants’ evaluations on the privacy and helpfulness of LLM-generated responses to 90 PrivacyLens scenarios. We found participants disagreed with one another when judging the same scenario. We further collected proxy LLMs’ evaluations of the same scenarios and found proxy LLMs were very consistent with themselves and moderately consistent with each other. Yet, we found only weak to moderate correlations between participants’ and proxy LLMs’ evaluations. Qualitative analysis of explanations provided by participants and proxy LLMs revealed where and why proxy LLMs’ evaluations diverged from participants’ evaluations (e.g., overlooking audience, not recognizing sensitive information). These results indicate that privacy and utility are individualized and context-dependent, and that proxy LLMs are unreliable at approximating user perceptions in privacy-sensitive scenarios. With our findings, we call for human involvement in evaluating LLM-generated responses, especially in the privacy and utility domains. We further discussed the potential for improving proxy LLMs’ ability to estimate human judgements in privacy and utility by exploring methods such as expanding the diversity of outputs and personalizing LLMs to match particular users’ needs and preferences.
In the appendix, we provide the demographic information for the 94 participants who completed the survey described in 3. We also provide a screenshot of an example scenario from PrivacyLens to show what we presented to participants in our survey (6).
| Demographic | Count | Percentage |
|---|---|---|
| Age | | |
| 18 - 24 | 2 | 2.13% |
| 25 - 34 | 22 | 23.40% |
| 35 - 44 | 33 | 35.11% |
| 45 - 54 | 20 | 21.28% |
| 55 - 64 | 10 | 10.64% |
| 65 - 74 | 6 | 6.38% |
| 75 - 84 | 1 | 1.06% |
| Gender | | |
| Female | 44 | 46.81% |
| Male | 48 | 51.06% |
| Prefer to self-describe | 2 | 2.13% |
| Education | | |
| No high school degree | 1 | 1.06% |
| High school graduate, diploma or the equivalent | 9 | 9.57% |
| Some college credit, no degree | 17 | 18.09% |
| Trade, technical, vocational training | 3 | 3.19% |
| Associate’s degree | 12 | 12.77% |
| Bachelor’s degree | 29 | 30.85% |
| Master’s degree | 15 | 15.96% |
| Professional degree | 1 | 1.06% |
| Doctorate degree | 7 | 7.45% |
| Income | | |
| Under $25,000 | 9 | 9.57% |
| $25,000 to $49,999 | 22 | 23.40% |
| $50,000 to $74,999 | 18 | 19.15% |
| $75,000 to $99,999 | 10 | 10.64% |
| $100,000 or more | 35 | 37.23% |