Generative AI as a Safety Net for Survey Question Refinement


Abstract

Writing survey questions that easily and accurately convey their intent to a variety of respondents is a demanding and high-stakes task. Despite the extensive literature on best practices, the number of considerations to keep in mind is vast and even small errors can render collected data unusable for its intended purpose. The process of drafting initial questions, checking for known sources of error, and developing solutions to those problems requires considerable time, expertise, and financial resources. Given the rising costs of survey implementation and the critical role that polls play in media, policymaking, and research, it is vital that we utilize all available tools to protect the integrity of survey data and the financial investments made to obtain it. Since the launch of ChatGPT in 2022, it and other generative AI platforms have been integrated into everyday processes and workflows, particularly those involving text revision. While many researchers have begun exploring how generative AI may assist with questionnaire design, we implemented a prompt experiment to systematically test what kind of feedback on survey questions an average ChatGPT user can expect. Results from our zero-shot prompt experiment, which randomized the version of ChatGPT and the persona given to the model, show that generative AI is a valuable tool today, even for an average AI user, and suggest that AI will play an increasingly prominent role in the evolution of survey development best practices as more precise tools are developed.

1 Introduction↩︎

Whether you are a social scientist measuring public opinion, a doctor assessing a patient’s health status, a government conducting a census, or a corporation trying to understand customer preferences, survey data can provide information related to your most pressing questions. This is because a survey is simply a structured questionnaire that “...consists of a set of standardized questions with a fixed scheme, which specifies the exact wording and order of the questions, for gathering information from respondents” [1]. This generic data collection format is a remarkably adaptable tool that generates data that, when paired with statistical methods, can produce powerful insights [2].

However, the creation of valid and reliable survey questionnaires that produce high-quality data is a laborious task that requires extensive time, financing, and expertise. This is partly because survey designers not only have to worry about foundational semantic difficulties of survey questions, i.e., problems affiliated with meaning, complexity, and concepts, but also task difficulties, i.e., problems affiliated with recall, reporting, readability, and recording [3]. These difficulties are often in direct competition; for example, a researcher must decide whether it is better to use a shorter question that is faster to read or a longer version that might provide more clarity. Fortunately, the widespread use of surveys has resulted in a vast array of rigorously tested tools, guidelines, and best practices for survey development that can be used to reveal problems that even the most experienced team or individual may not be able to identify on their own.

Although there is extensive research on how one should design a survey, and tools are available to facilitate the implementation of best practices, following all of those practices in reality is difficult. Tools like the Question Appraisal System (QAS), which helps identify potential problems by offering over 25 checks per item, are time consuming [4]; focus groups and cognitive interviews that dig into a respondent’s thought process when answering each question are expensive and require trained experts to administer them [5]; and the guidelines and recommendations for word choice, question length, question order, and response order are so numerous that even the most trained and careful individual or team can easily let things slip through the cracks [6].

One can easily imagine a multitude of scenarios where best practices cannot be followed practically, confidently, or holistically: a student working on their thesis, a rapid response team needing to gather data after a crisis, or an underfunded research group. Indeed, the state of the art in survey design follows the total survey error (TSE) framework, which calls for balancing data quality with the reality of resource constraints (time and budget) [7]. The TSE framework is broadly structured into two categories: measurement (how we measure) and representation (who/what we measure). When data quality is intentionally compromised within the TSE framework, concessions are usually made on the representation side because of sampling’s direct impact on total cost and timelines. However, resource constraints have an equally pressing, though less direct and often less intentional, impact on the measurement side. As the cost of collecting survey data rises and response rates fall [8], it is more crucial than ever that fielded questionnaires are of the highest quality.

Since the release of ChatGPT–3.5 in November of 2022, researchers have been exploring and testing how generative AI can improve and expand current best practices across all disciplines. Of the working and published studies regarding generative AI, most address its applicability to fields like education, law, and computer programming rather than research applications for social science [9], [10]. However, in recent years the prevalence of AI in social science has been growing, amid concerns of bias and other ethical issues [11].

When considering how to integrate AI into survey methodology, and recognizing the work currently being done, one sees two distinct research paths: 1) developing new tools that expand the array of best practices, opening doors that did not previously exist, and 2) using AI to make existing best practices more efficient, effective, and accessible. Scholars working in Path 1 are seeking to leverage AI to its greatest extent. However, much of that scholarship focuses on niche applications of survey design [12] and is restricted to digital survey modes [13], [14]. Furthermore, the tools resulting from this path will likely require access to extensive computing resources or costly proprietary software, limiting their accessibility. Simultaneously, other scholars have begun investigating Path 2 [15]. However, much of the work in this area is example-based or restricted to a specific case study [16].

Figure 1: Illustration depicting how we can conceptualize AI as a safety net for survey methodology best practices based on research. The appendix elaborates on highly effective best practices, such as cognitive interviews, writing rules, and expert panels, but resource constraints reduce the ability to use these tools well or even at all.

Our work falls into Path 2; however, we utilize a systematic, pre-registered experimental design to understand: “How can we use generative AI today to supplement the best practices we already have?”, “What kind of feedback can an average ChatGPT user expect to receive regarding their survey questions?”, and “How does choice of model and persona impact the output?”. Thus, as shown in Figure 1, we seek to learn how we can best use AI as a safety net when our best practices fail or are simply unavailable due to a lack of time, money, or expertise.

Our \(2\times3\) factorial experiment aims to understand what kind of feedback an average AI user could expect to get on their survey questions. The experiment consists of 6 unique treatments that randomize the version of GPT used (3.5 and 4.0) and the persona (none, survey design expert, linguist) given to the AI model in the prompt. The prompts were kept short and simple to reflect the type of prompts an average user might write. An example prompt reads:

You are a survey design expert who drafted a survey in English that will be administered to respondents from a single cultural background that speak English as their first language. For the following question, give me up to 5 features of the question that could cause the respondents to interpret the question differently. Do you own a car?

Each treatment was applied to 262 questions obtained from three sources: the Gallup Q12 Survey, the World Values Survey (WVS), and the Local Governance Performance Index survey (LGPI). The content of the responses from the AI model was qualitatively coded using a codebook that summarizes common survey question problems identified in the survey methodology literature.

Table 1: Qualitative Codes
Code Label
1 Vague term or phrase
2 Specialized knowledge
3 Syntax problem
4 Unfair presumption
5 Double barreled
6 Undefined/ill-defined reference period
7 Difficult recall or reference period
8 Complex estimation/computation
9 Sensitivity
10 Leading/biasing question
11 Answer set
NOTA None of the above
SysVar Systematic Variation (subset of NOTA)

Table 1 summarizes the codes, while the full codebook can be found in Appendix 6. Each treatment–question pair resulted in 5 statements whose content was individually coded. We then created a new set of indicator variables at the treatment–question level, one for each code, that indicate whether the code appears in any of the 5 statements.
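
To make this aggregation step concrete, below is a minimal R sketch, assuming a hypothetical long-format data frame `coded_statements` with columns `question`, `treatment`, `statement`, and `code`; the names and values are illustrative, not the replication code.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per coded statement
coded_statements <- tibble::tribble(
  ~question, ~treatment, ~statement, ~code,
  "Q001",    "M1_P1",    1L,         "1",
  "Q001",    "M1_P1",    2L,         "5",
  "Q001",    "M1_P1",    3L,         "NOTA"
)

# Collapse the 5 statements to one row per treatment-question pair, with an
# indicator for whether each code appears in any of the statements
code_indicators <- coded_statements %>%
  mutate(flag = 1L) %>%
  distinct(question, treatment, code, flag) %>%
  pivot_wider(names_from = code, values_from = flag,
              names_prefix = "code_", values_fill = 0L)
```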

2 Methodology↩︎

2.1 Experimental Design↩︎

We use a 2\(\times\)​3 factorial design resulting in 6 unique treatments. The two factors we randomize are the version of GPT (3.5 and 4.0) and the persona (none, Survey Design Expert, Linguist) given to the AI model.

New generative AI models and updated versions of existing models continue to be released at a rapid pace. While the commonly held assumption is that “new is better," we randomize the version of ChatGPT to systematically test if this holds across all aspects of the evaluation task. It also provides empirical evidence for the importance of re-evaluating the performance of AI-based tools and strategies as new versions and models are released.

Prompts given to LLMs enforce rules, automate processes, and ensure specific qualities and quantities of generated output; they also help align the output with the researcher’s specific needs [17]. Our control condition for persona provides a baseline of what the AI can generate with fewer restrictions. Then we specify the type of persona as (1) a survey design expert or (2) a linguist. Both types of expertise can inform survey design and might alter the types of problems flagged.

The general format of our instruction prompt is:

{Persona} drafted a survey in English that will be administered to respondents from a single cultural background that speak English as their first language. For the following question, give me up to 5 features of the question that could cause the respondents to interpret the question differently. {Question}

The prompt includes the phrase “a survey in English that will be administered to respondents from a single cultural background that speak English as their first language" to fix the cultural-linguistic context. This was an attempt to reduce the likelihood that the model provides comments on problems regarding language or cultural variation.

The decision to ask for 5 features was a judgment call intended to balance the likelihood of getting all possible feedback from the AI against the possibility of the output becoming too large, and to provide consistent structure to the format of the AI’s responses. The distribution of the number of codes per treatment-question pair suggests that 5 was a sufficient limit.

Lastly, our prompt asks the AI to identify “features of the question that could cause the respondents to interpret the question differently.” Instead of providing a codebook for a specific assessment tool, such as the QAS–99, we wanted the AI to evaluate the heart of the issue in survey design (do all respondents interpret the question the same way?) in hopes of obtaining the most comprehensive range of feedback one can expect from the models.
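
To illustrate how the six treatment prompts can be generated for each question, here is a minimal R sketch; the persona openings and the `questions` vector are placeholders rather than the exact replication materials.

```r
# Hypothetical persona openings for the three persona conditions
personas <- c(none     = "I",
              expert   = "You are a survey design expert who",
              linguist = "You are a linguist who")
models <- c("GPT-3.5", "GPT-4")

template <- paste(
  "%s drafted a survey in English that will be administered to respondents",
  "from a single cultural background that speak English as their first language.",
  "For the following question, give me up to 5 features of the question that",
  "could cause the respondents to interpret the question differently. %s"
)

questions <- c("Do you own a car?")  # placeholder for the 262 sourced questions

# Cross the two experimental factors with every question (6 prompts per question)
design <- expand.grid(model = models, persona = names(personas),
                      question = questions, stringsAsFactors = FALSE)
design$prompt <- sprintf(template, personas[design$persona], design$question)
```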

This design and the following hypotheses were pre-registered on OSF on November 23, 2023: https://osf.io/4q26k.

  • H1 (Main Effect of Model): We expect that version 4 of GPT is more likely to flag all codes than version 3.5.

  • H2 (Effect of persona compared to unspecified): Prompts that provide a specific persona (survey design expert or linguist) will be more likely to flag all codes than prompts in which no specific persona is given.

  • H3 (Effect of Linguist Persona on Specific Codes): Providing the model with the linguist persona will increase the likelihood of flagging Code 1: Vague term or phrase and Code 3: Syntax problem compared to when the persona is unspecified or assigned as survey design expert.

2.2 Power Analysis↩︎

To determine a sufficient sample size, we performed a power analysis via simulation using the powerSim function in the simr package [18] in R [19] to understand our power to detect main and interaction effects in our multilevel logistic model [20]. While we also plan to run ordinary multilevel models (MLM), the sample size required for a logistic MLM is greater, so we only run a power analysis for the logistic MLM [21], [22].

We considered sample sizes (number of questions) ranging from 200 to 350, direct effect sizes of 0.1, 0.41, and 0.7 (corresponding roughly to small, medium, and large effects), and interaction effects of 0.05, 0.1, and 0.2. Given this is the first study of its kind, we had no previous work on which we could base our choice of effect sizes. Because each effect will be tested for 11 codes, we also run the power analysis with a Bonferroni correction for multiple testing (\(\alpha\) = 0.05/11) and without (\(\alpha\) = 0.05). While in practice one would use the Holm correction, for simplicity we use the more conservative Bonferroni correction in the simulation.
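
For reference, a condensed sketch of this kind of simr simulation is shown below; the variance components and 0/1 coding of the model factor are assumptions for illustration, not the exact pre-registered settings.

```r
library(simr)  # simulation-based power analysis for (g)lmer models

set.seed(2023)
n_questions <- 262
design <- expand.grid(
  question = factor(seq_len(n_questions)),
  gpt4     = c(0, 1),                                   # model factor as 0/1
  persona  = factor(c("None", "Expert", "Linguist"),
                    levels = c("None", "Expert", "Linguist"))
)

# Assumed fixed effects (logit scale): intercept, medium model effect (0.41),
# and two persona effects; assumed question-level random-intercept variance
fixef_vals <- c(-1, 0.41, 0.2, 0.2)
sim_fit <- makeGlmer(y ~ gpt4 + persona + (1 | question),
                     family = "binomial", fixef = fixef_vals,
                     VarCorr = 0.5, data = design)

# Estimated power for the model effect, with alpha lowered to mimic a
# Bonferroni-style correction across the 11 codes
powerSim(sim_fit, test = fixed("gpt4", "z"), nsim = 100, alpha = 0.05 / 11)
```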

2.3 Sourcing Experimental Units↩︎

We sourced 262 questions from surveys developed by Gallup, the World Values Survey Association (WVS), and the Governance and Local Development Institute (GLD). Specific questions from Gallup, WVS, and the LGPI were selected based on their ability to be posed as “stand-alone” questions. Questions were excluded if they relied on answer set structure or survey logic, were not phrased as a question, or were redundant. Questions that were structured as fill in the blank (for example, “I am...with my government.”) were also excluded.

From Gallup, we selected the publicly available Q12 survey, containing 13 questions to measure employee engagement [23]. We selected questions from the 7th wave of the WVS, which has been translated and implemented in over 90 countries since the first wave in 1981 [24]. Finally, the 2019 version of the LGPI survey, conducted by GLD, was implemented in 3 countries and required six translations [25]. We believe these three sources provide a range of refinement and topics. We expect the AI to identify the most potential issues for the LGPI because this survey covers the largest range of topics with the fewest refinement iterations; GLD is a newer organization than Gallup and WVS, with fewer opportunities to refine errors in its questions. We anticipate fewer problems to be flagged by the AI for the WVS questions because the WVS has six prior waves of experience and covers fewer topics. Finally, the short, precise Q12 survey from Gallup is the most widely used and vetted and thus should have the fewest unresolved problems for the AI to identify.

2.4 Obtaining GPT Output↩︎

Each survey question in our sample serves as an experimental unit (EU) that independently receives all six treatments. We used an R script to create all experimental and formatting prompts. Procedure 2 describes the process of obtaining the AI output. The final step was implemented using a custom R script. All code is available in the Harvard Dataverse [26].

Figure 2: Steps to Generate Experimental Output

By starting a new chat before administering every prompt, we ensured independence of the AI’s responses and prevented it from learning across treatments. We utilized the web version of ChatGPT with default parameter values to mimic the experience that a typical researcher seeking assistance from ChatGPT might encounter.
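
The step of splitting each response into its individual statements can be sketched in a few lines of R; the sketch below assumes the model returns a numbered list, which will not hold for every response format, and the example response text is invented.

```r
# Hypothetical response text copied from a single chat
response <- paste0(
  "1. The term 'own' is ambiguous: leasing or financing may or may not count.\n",
  "2. 'Car' could be interpreted narrowly or broadly (e.g., vans, trucks).\n",
  "3. Household versus personal ownership is unclear."
)

# Split on newlines that precede an "N. " marker, then strip the markers,
# yielding (up to) 5 statements per question-treatment pair
split_statements <- function(text) {
  parts <- unlist(strsplit(text, "\\n(?=\\d+\\.\\s)", perl = TRUE))
  trimws(sub("^\\d+\\.\\s*", "", parts, perl = TRUE))
}

split_statements(response)
```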

2.5 Qualitative Coding↩︎

The study’s goal is to flag problems identified in the literature so that researchers can rewrite the question before pretesting and implementation. Thus, we need a coding scheme that focuses on fixable problems such as word choice, syntax, and ambiguity. The best survey question assessment tool is the Question Appraisal System, which offers over 25 checks to identify issues with a specific survey question [4]. We adopt a truncated version of this appraisal system with fewer codes, based on an array of sources [3], [27], [28]–[32]. We also adapt the coding scheme of Rothgeb et al., originally used to compare pre-testing techniques’ abilities to identify problems [33]. This scheme focuses on four types of problems: question content, question structure, retrieval from memory, and judgment/evaluation, resulting in 10 subcodes and 2 emergent codes. Unlike other schemes, it does not rely on respondent behavior/answers and provides the most specificity about word choice and question content. If a statement did not qualify as one of the 11 codes, it was coded as None of the Above.

2.6 Analysis↩︎

Given the factorial experimental design and the fact that every experimental unit receives each treatment, we use 2-level hierarchical regression models to test our pre–registered hypotheses. When the outcome is binary, we fit the following multilevel logistic model: \[\begin{align} \text{logit}\left(\Pr(y_{ij}=1)\right) = & \beta_0 + \beta_M\mathbf{1}_{GPT-4} + \beta_{T2}\mathbf{1}_{Survey Design Expert} \\ & + \beta_{T3}\mathbf{1}_{Linguist} + u_i \nonumber \end{align}\] where \(y_{ij}\) takes on the value 0 or 1, indicating whether a specific code was flagged for the output from question \(i\) after receiving treatment \(j\). The \(\mathbf{1}\)s represent indicator variables for the noted factor level and the \(\beta\)s are the corresponding coefficients. The term \(u_i\) is the random intercept associated with the question; its inclusion addresses the fact that each treatment was applied to every question.

We will also fit 2-level linear probability models for each binary outcome: \[\begin{align} y_{ij} = & \beta_0 + \beta_M\mathbf{1}_{GPT-4} + \beta_{T2}\mathbf{1}_{Survey Design Expert}\\ & + \beta_{T3}\mathbf{1}_{Linguist} + u_i + \varepsilon_{ij}, \nonumber \end{align}\] where \(\varepsilon_{ij}\) is the residual error. When the results are robust across the multilevel logistic and multilevel linear probability models, we present the multilevel linear probability models due to their ease of interpretation.

When assessing the effect of model and persona on the likelihood of generating content related to each of the 11 primary codes, we will assess statistical significance using \(p\)-values that have been adjusted using the Holm’s correction for multiple testing. All other regression models and statistical tests will assess significance at \(\alpha = 0.05\).
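
As a sketch of this analysis pipeline with lme4, one could proceed as below; the data here are simulated placeholders, and the coefficient name extracted at the end depends on factor coding.

```r
library(lme4)

# Simulated stand-in for the real data: one row per question-treatment pair,
# with indicator columns code_1, ..., code_11
set.seed(1)
dat <- expand.grid(question = factor(1:262),
                   model    = c("GPT-3.5", "GPT-4"),
                   persona  = c("None", "Expert", "Linguist"))
for (k in 1:11) dat[[paste0("code_", k)]] <- rbinom(nrow(dat), 1, 0.15)

# Fit a 2-level logistic model for one code; swapping glmer() for lmer()
# gives the linear probability version
fit_code <- function(code_col, data) {
  f <- reformulate(c("model", "persona", "(1 | question)"), response = code_col)
  glmer(f, data = data, family = binomial)
}
fits <- lapply(paste0("code_", 1:11), fit_code, data = dat)

# Holm adjustment of the p-values for the GPT-4 coefficient across the 11 codes
p_model <- sapply(fits, function(m) coef(summary(m))["modelGPT-4", "Pr(>|z|)"])
p.adjust(p_model, method = "holm")
```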

Additionally, we present a series of exploratory analyses, some of which were suggested as such in the pre–analysis plan and others that emerged after qualitative analysis. First, we analyze the effect of question source on the total number of codes and likelihood of flagging individual codes, using modified versions of the 2-level hierarchical models. We also explore if there is any structure or information contained in the placement of the flagged codes, i.e., are certain codes more likely to appear earlier or later in the 5 statements? We run a one-way ANOVA test to determine if all codes have the same average statement placement. Should we find a statistically significant result, we run a Tukey HSD test to determine which codes have similar and different average placement.
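
The placement analysis reduces to a standard one-way ANOVA followed by Tukey’s HSD; a minimal sketch with simulated placeholder data:

```r
# Simulated stand-in: one row per coded statement, recording the assigned code
# and the position (1-5) of the statement within the AI response
set.seed(42)
placements <- data.frame(
  code     = factor(sample(c(as.character(1:11), "SysVar", "NOTA"),
                           500, replace = TRUE)),
  position = sample(1:5, 500, replace = TRUE)
)

# Do the codes differ in average statement placement?
fit_aov <- aov(position ~ code, data = placements)
summary(fit_aov)

# If the ANOVA is significant, Tukey's HSD shows which pairs of codes differ
TukeyHSD(fit_aov)
```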

3 Results↩︎

We find partial support for all three preregistered hypotheses and no statistically significant evidence to the contrary. Approximately 81% of the individual statements contained content related to a single code, 3% related to two codes, and 15% provided no qualitatively codeable content. On average, across the 5 statements at the question–treatment level, there were 2.69 qualitative codes. Notably, Code 1 (vague term) was coded in 98% of question-treatment sets because the code was too broadly defined; results for this code therefore cannot be meaningfully interpreted, but we present them without interpretation for completeness. We investigate the effect of model and persona on the total number of codes flagged at the question–treatment level and the likelihood of flagging a specific code at the question–treatment level. Statistical significance of the regression results is determined based on a Holm multiple testing correction. All other statistical tests are assessed at \(\alpha = 0.05\).

3.1 Model Effect↩︎

GPT–3.5 and GPT–4.0 systematically produce different content, as indicated by the circle icons in Figure 3. Compared to the earlier model, GPT–3.5, GPT–4.0 produces, on average, 0.55 more codes (\(p<0.001\)) and is more likely to produce Code 3–Syntax problem (\(\beta = 0.08\)), Code 5–Double barreled (\(\beta = 0.06\)), Code 8–Complex estimation (\(\beta = 0.01\)), and, with the largest effect sizes of the study, Code 9–Sensitivity (\(\beta = 0.14\)) and Code 10–Leading/biasing (\(\beta = 0.15\)). Finally, GPT–4.0 is 17 percentage points (\(pp\)) less likely to produce NOTA (none of the above) statements. Notably, the more advanced GPT–4.0 model returns more content related to codes that require consideration of what the respondent may think or feel (e.g., sensitivity or leading) beyond what is directly stated in the question (e.g., missing recall period or syntax).

Figure 3: Marginal Effects Plot Showing the Impact of Each Experimental Attribute on the Likelihood of Each Code Topic Appearing in the AI Output. Statistical significance determined using p-values adjusted for multiple testing with Holm’s correction.

3.2 Persona Effect↩︎

As expected [17], prompting the AI with either persona yielded substantially different outputs than without a persona. The Survey Design Expert persona produces 0.22 (\(p<0.001\)) more codes on average, is 8\(pp\) less likely to produce NOTA (\(p< 0.01\)), and more frequently produces Code 5–Double barreled (\(\beta = 0.05, p<0.05\)) and Code 11–Answer set (\(\beta = 0.06, p<0.05\)) compared to no persona. Interestingly, these are two of the codes most uniquely associated with survey methodology in the codebook. Meanwhile, the Linguist persona is only more likely to produce Code 3–Syntax (\(\beta = 0.05\)), the code most closely related to the expertise of a linguist. These results highlight the importance of strategic persona selection when seeking feedback on survey questions.

3.3 Unpacking None of the Above↩︎

NOTA was the second most frequent code, applied to nearly 17% of all question-treatment pairs. NOTA-coded statements include a range of content from meaningless output to helpful but off-task feedback. We denoted one version of the latter that occurred at high frequency (41% of all NOTA), Systematic Variation (SysVar). In SysVar cases, instead of explaining why respondents might interpret the question differently, the AI explained why the respondents might answer differently. While this is distinctly off-task, the information can be viewed as suggestions for future analyses or control variables that should be included in the survey. For example,

...someone who works in a tech-related field might have a different perspective than someone in a field less influenced by modern technology. Furthermore, historical events, media coverage, or influential figures in the community can shape perceptions about the impact of science and technology.

From this output, one could consider the value of adding an employment sector question to their survey to control for an additional source of variation in their analysis. We find that this type of SysVar content was 5\(pp\) (\(p<0.05\)) less likely to be generated by GPT–4.0 compared to GPT–3.5 and by the Survey Design Expert compared to no persona.

Also outside the scope of the instructions, the AI offered feedback relevant to context (“...if the survey is part of a larger questionnaire about legal documentation or citizenship, respondents might interpret the question differently than if its part of a survey about child health or education”) and question order/priming (“Without the context of the entire survey, its possible that other questions in the survey address related or overlapping topics, which could affect how a respondent interprets this particular question. For instance, if there’s another question about transportation means or assets in general, it might cause some confusion or affect how respondents approach this question.”). Qualitatively reading output like these examples emphasizes the importance of reviewing content for novel material.

3.4 Validity Checks↩︎

We performed three validity checks to assess the quality of the AI output. First, we examined whether the AI produced codes as one would expect, given the nature of the question source. Second, we examined the structure of the output by performing a statement order analysis. Third, we compared the AI–generated codes to ones produced by a human.

3.4.1 Source of the Question↩︎

We find that ChatGPT often flagged issues in expected ways depending on the question’s source. We consider the Gallup Q12 survey questions on employee engagement the most refined source, given its administration to over 3.3 million workers across 100,000+ workplace teams [23]. The other two sources are academic rather than corporate, covering a wider range of topics and with fewer implementations than the Gallup Q12: three prior waves for the LGPI [34] and six for the WVS [35]. Pursuant to this logic, our results indicate that Gallup-sourced questions result in 0.31 fewer codes on average (\(p<0.01\)) at the question–treatment level compared to the LGPI. Conversely, we find that WVS-sourced questions, which have undergone twice as many rounds of implementation as the LGPI, yield 0.19 more codes (\(p<0.001\)).

When looking at the likelihood of specific codes occurring, we find no evidence that Gallup-sourced questions are more or less likely to return any of the 10 codes compared to the LGPI. Alternatively, we find that the WVS-sourced questions are more likely to return Code 2–Specialized knowledge (\(\beta = 0.07, p\le 0.01\)), a sensible finding given that the WVS has a number of questions regarding opinions on international organizations, which can require technical knowledge, particularly compared to the LGPI, which focuses on local experiences. WVS-sourced questions are also more likely to return Code 10–Leading/biasing (\(\beta = 0.15, p \le 0.001\)). This is likely driven by the structure of the WVS questions, which often make a declarative statement, such as “To what extent do you agree with the following statement: The only acceptable religion is my religion.”, and ask the respondent whether they agree or disagree. Since the question starts from a specific point of view, the AI often comments that the question may be leading. Finally, WVS questions are less likely to have Code 4–Unfair presumption (\(\beta = -0.13, p \le 0.01\)) compared to the LGPI. A review of the AI output shows that the Code 4 result is driven by the fact that many of the LGPI questions occur in batteries that use skip patterns. The AI would provide a comment such as “The phrasing assumes that everyone who is reading the question has not obtained a passport” because it did not know there was an earlier filtering question in the survey.

3.4.2 Structure of Output↩︎

A statement order analysis indicates that the AI generally produces statements coded as 5–Double barreled, 1–Vague, and 3–Syntax first, with average placements of 2.39, 2.50, and 2.59 respectively. Codes 7–Difficult recall and 9–Sensitivity are more likely to be placed later, with average placements of 3.98 and 4.17 respectively. The SysVar code, which indicates statements containing off-task content, was more likely to occur towards the end of the set of 5 statements, with an average placement of 4.24. This could suggest that the AI was reaching to fill the 5 slots as requested by the prompt, indicating the need to provide instructions to the AI for cases where no meaningful comments are available.

3.4.3 Comparison to Human Performance↩︎

We consider how the AI-generated codes compare to those of a human expert applying the codebook to the 262 questions. Figure 4 shows the variation in how often the human and AI made the same or different determination for each code–question pair. We see that the human and AI output agree (both mark the code as present or both as absent) the majority of the time, around 80% for all experimental conditions. We find that, for all experimental conditions, the human rarely coded something the AI did not, about 5% of instances. On the contrary, the AI much more often (around 15% of instances) coded something the human did not.
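
For one code, the four agreement categories in Figure 4 can be tabulated as in the sketch below; the human and AI indicators here are simulated placeholders.

```r
# Simulated indicators of whether the human and the AI flagged a given code
# for each of the 262 questions
set.seed(7)
human <- rbinom(262, 1, 0.30)
ai    <- rbinom(262, 1, 0.35)

agreement <- ifelse(human == 1 & ai == 1, "Both",
             ifelse(human == 1 & ai == 0, "Human Only",
             ifelse(human == 0 & ai == 1, "AI Only", "Neither")))

# Share of question-code pairs in each agreement category
prop.table(table(agreement))
```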

Looking at the differences between M1 (GPT–3.5) and M2 (GPT–4), we see that the percentage of codes missed by the AI (Human Only) decreases by about 1 percentage point, and we see a similar increase in the percentage of instances where both gave a code (Both). Although this is a promising finding regarding the improvement of the AI models, we also see that GPT–4 had a higher percentage of “AI Only” instances, where it found things the human did not, with a simultaneous reduction in instances where neither gave a code (Neither), suggesting possible spurious codes.

Figure 4: Comparison of Codes between Human Assessment and Experimental Output. In the legend, M1 = GPT–3.5, M2 = GPT–4, P1 = No Persona, P2 = Survey Design Expert, and P3 = Linguist.

Examining the variation by code, we see a higher percentage of “AI Only” instances for Code 4–Unfair presumption and Code 6–Undefined reference period. We also see an increase in “AI Only” instances by GPT–4 for Code 10–Leading/biasing. The highest percentage of positive agreement (Both) is for Code 4–Unfair presumption. Full results are available in Appendix 7.

4 Discussion↩︎

Given the plethora of preliminary findings, there is little doubt that AI-based tools and solutions will change how we design, implement, and analyze surveys.

Our study provides consistent evidence that corroborates the intuition that AI can be a powerful tool to support the development of survey questions. We find that, without any additional model training or parameter adjustments, one can receive feedback on their survey questions that aligns with best practices and recommendations found in the survey methodology literature. Our study also uncovers important sources of variation in the type of feedback received. We find that both choice of model and persona specification influence the content and focus of the feedback. Validity checks show that the AI models generally produce feedback as one would expect given the nature of the questions, provide structured output in a consistent manner, and provide feedback similar to that of a human expert.

While much of our evidence supports the belief that “newer models are better,” we also find evidence to the contrary. We see that both models can handle issues related to semantics; however, GPT–4.0 was better equipped to identify “task difficulties,” those related to recall, reporting, readability, and recording. The newer model, GPT–4, identified more problems for each question on average, was less likely to produce NOTA, and was more likely to produce nearly all codes, especially issues related to sensitivity and leading. However, our comparison to human output shows that it also tended to flag “problems” that a human did not find concerning.

Additionally, like model choice, we find that adequate prompt engineering sharpens the AI content to be more intuitive and relevant. For example, the Linguist persona is more likely to identify question issues related to syntax, and the Survey Design Expert is more likely to identify issues such as double-barreled questions and answer set problems.

This study also suggests promising new research avenues for how generative AI can support survey development. The content of NOTA statements shows that generative AI can likely provide a wider range of feedback for question batteries, answer sets, and survey logic, rather than issues related to a single survey question. Outside of the instructions, the AI produced statements related to priming, survey logic, and cognitive load if the survey were given verbally. Furthermore, the AI offered feedback related to the analysis of results but with fielding implications, e.g., potential control variables to account for systematic variation in responses.

4.1 Lessons Learned↩︎

Although we find the AI models to be remarkably helpful with survey question refinement, we conclude this paper with lessons we should carry with us as we integrate AI into our survey development workflows and see AI-based survey tools become more common. Firstly, it is critical that we continue to evaluate AI-based tools and procedures as new models and model versions become available. While we are inclined to believe that newer models will bring better performance, our results indicate that this is not guaranteed to be true across all facets of the survey development process. Similarly, we find strong evidence, consistent with research in other disciplines, that persona and other features of prompt design strongly impact the output content and format, underscoring the need for careful and intentional prompt design. Lastly, we must continually caution ourselves against using AI as a substitute for methodological expertise. We found that the AI models often flagged issues the human coder did not, demonstrating the potential for new or otherwise missed insights, but also for spurious observations that take precious time to sort through.

Our experiment shows that, even today, without any additional training, ChatGPT, and likely other generative AI models, can function as an effective safety net in survey development, catching problems that might otherwise leave survey data unusable due to a lack of time, financing, or expertise. At the same time, we find that the AI can produce a significant amount of irrelevant feedback and miss issues a human expert would be able to identify. Therefore, for the time being, generative AI is best viewed as an evolving complement to, not a replacement for, established best practices in survey development. By utilizing these tools within the total survey error framework, we can maximize AI’s current strengths while preparing for more revolutionary applications in the near future. By thinking of AI as a safety net and not a substitute, students, practitioners, and scholars can now apply an additional layer of quality control, catching errors that might otherwise slip past in time- or resource-constrained scenarios.

5 Sourcing Experimental Units↩︎

We sourced 262 questions from surveys developed by Gallup, the World Values Survey Association (WVS), and the Governance and Local Development Institute (GLD). Specific questions from Gallup, WVS, and the LGPI were selected based on their ability to be posed as “stand-alone” questions. Questions were excluded if they relied on answer set structure or survey logic, were not phrased as a question, or were redundant. Questions that were structured as fill in the blank (for example, “I am...with my government.”) were also excluded.

These sources were partially selected to explore whether the AI can handle questions with a range of expected quality. From Gallup, we selected the publicly available Q12 survey containing 13 questions to measure employee engagement [23]. We selected questions from the 7th wave of the WVS. This survey has been translated and implemented in over 90 countries since the first wave in 1981 [24]. Finally, the 2019 version of the LGPI survey conducted by GLD was implemented in 3 countries, requiring six translations [25]. We believe these three sources provide a range of refinement and topics. Of our externally sourced questions, we expect the AI to identify the most potential issues for the LGPI because this survey covers the largest range of topics with the fewest iterations; GLD is a newer organization than Gallup and WVS, with less capacity and experience to resolve translation issues within its questions. We anticipate fewer problems to be flagged by the AI for the WVS questions because the WVS has six prior waves of experience with international surveys, has been translated into more languages than the LGPI, and covers fewer topics. Finally, the short, precise Q12 survey from Gallup is the most widely used and vetted of the three and thus should have the fewest unresolved problems for the AI to identify.

6 Qualitative Coding of GPT Output↩︎

Qualitative coding was done in two stages. In the first round, the output from the questionable questions was coded during codebook refinement, and the authors knew the source though not the treatment or question. We used an abductive approach, which sits at the intersection of inductive and deductive approaches, to analyze the output from the AI [36]. We begin with a theory-driven framework to see if the AI can identify what the literature highlights as important, while leaving room for emergent patterns. For the first 100 statements, the authors coded together to decide on a code and settle on the proper line of reasoning. Then, in two batches of 100, the authors coded independently and settled discrepancies while updating the codebook. With the final 300 questionable question statements, the authors coded independently and passed an inter-coder reliability check.

For the second round, the output from the other sources was randomized and anonymized at the statement level in an attempt to blind the qualitative coders to the source, treatment, and question. The authors coded the first 100 together to ensure the codebook’s applicability to output from different question sources. Approximately a thousand statements were coded by two trained qualitative coders, though due to low inter-coder reliability all codes were double-checked by the authors. Once this was clarified, the authors coded the remainder of the output independently, highlighting and deciding codes for material that was unclear or novel.
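
As an illustration of the kind of reliability check used, Cohen’s kappa for two coders can be computed with the irr package; the coder vectors below are simulated placeholders, not our actual coding data.

```r
library(irr)  # provides kappa2() for agreement between two raters

# Simulated codes assigned by two coders to the same 300 statements
set.seed(3)
labels  <- c(as.character(1:11), "NOTA")
coder_a <- sample(labels, 300, replace = TRUE)
coder_b <- coder_a
flip    <- sample(300, 45)                 # introduce some disagreement
coder_b[flip] <- sample(labels, 45, replace = TRUE)

# Unweighted Cohen's kappa across the two coders
kappa2(data.frame(coder_a, coder_b))
```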

Table 2: Complete Qualitative Codebook
Problem Type Code Label Description Problem might be solved by: Example(s):
Question Content 1 Vague term or phrase When a phrase or term is unclear, respondents don’t know what you mean. Respondents don’t know which aspect of the term to answer about, respondents don’t know what you’re talking about Definition or quantifiers, Telling participants where to look or think Kids, car, own, etc.
2 Specialized knowledge When a term or phrase requires more specialized information, terminology. Quality of response is affected by respondent’s knowledge of the topic Additional uniform, technical information, or the respondent being an expert already OR Filtering out people without the specialized knowledge PCP as angeldust or TIA as microstroke
Question Structure 3 Syntax problem Structural problems like question too long, complex, or awkward, grammar problems Rewrite the question If any at all, verbs not matching the time frame
4 Unfair Presumption The questions assumes to a detriment prior knowledge or conditions about the respondent Filter or acknowledge exclusion Assumes respondent knowledge, ability to get pregnant
5 Double Barrelled Several questions or multiple subjects One subject or multiple questions Women and children, courts and police
Affiliated with time 6 Undefined or ill-defined reference period Question doesn’t sufficiently define or narrow the time period the respondent should consider Providing specific time frame Recent, week, prior
7 Difficult recall or reference period Time frame accurately specified but is too specific or far in the past to be easy to answer. this is also highly specific or confusing reference period. Difficult to recall the specific item or topic of interest in the question (eg. too many options to pick from) Provide memory cues (Simple orienting memory info) Election of 2001 (Bush v. Al Gore election); What did you have for breakfast 10 days ago v. Where were you for 9/11?
Judgment/ Evaluation 8 Complex estimation/computation Questions with sizable calculations (math) or combination of lots of information Defining criteria or orienting to values Grams of butter over a week, perceived speed of rate increase
9 Potentially Sensitive/Emotionally charged or controversial Question makes people uncomfortable, taboo, or offended, triggers emotion Not asking the question Obamacare, formality issue
10 Leading/Biasing question The wording of the question causes people to answer differently than their true opinion. The wording of the question makes certain answer choices or types of answer more likely than they would be if the question was worded differently or used different words. Rephrasing the question; Randomizing part of the question or randomizing the answer set (if also answer set issue) Excellent listed first making people more likely to choose excellent. List of answer is so long, people don’t read all options and choose items earlier in the list. Multiple topics are presented in the question and respondents might lock on to the first topic, ignoring the rest. Question is about a controversial topic and people will feel compelled to answer a certain way so they are not judged, e.g. Would you like to help starving children?
Emergent codes 11 Answer set If the question had some answers within it and the AI flags problems affiliated Modifying answer set No neutral option, order of answers, insufficient answers
Internal Code: Systematic Variation Specific types/demographics of people will consistently answer certain ways Not a problem for question but of interest for analysis
None of the above Problems that AI flags inappropriately (like cultural or linguistic variation) OR problems affiliated with the survey’s logic, format, mode, or context, this includes if you need a “treatment paragraph” Not a problem for questions, solved by survey or “treatment paragraph” Cultures are different and may answer differently, it’s difficult to ask questions on the phone, context of why you’re asking about teachers or why you should care
Table 3: Rothgeb et al. (2007) adapted coding scheme. An asterisk indicates a control question for the experiment. Codes marked with “+” indicate emergent codes identified by project researchers.
Type Description Example
Question Content 1 Vague/Undefined Topic/Term Do you own a car?*
2 Specialized Knowledge (solvable by additional information) Do you think that adults should be able to use PCP without any legal penalty?*
Question Structure 3 Question too long, Complex or Awkward, Syntax Problem Before you got married, how long did you live in Maryland after you graduated from college?*
4 Unfair Presumption What is the best thing about driving a convertible?
5 Several Questions or Multiple Subjects Do you think women and children should be given the first available flu shots?*
6 Leading/Biasing Question+
7 Unclear/Vague Question Scope+
8 Undefined Reference Period+ How long have you lived in College Park?*
Retrieval from Memory 9 Difficult Recall/Reference Period or Lack of Memory Cues Which candidate did you vote for in the presidential election of 2004?
Judgment/ Evaluation 10 Complex Estimation Compared to a year ago, do you feel the prices of most things you buy are going up faster than they did then, going up as fast, going up slower, or not going up at all?*
11 Potentially sensitive, Emotionally Charged, or Controversial
Internal Codes I1 Answer Set Problem+
I2 Systematic Variation in Responses+

Typical combinations:

  • Assumption of prior (specialized) knowledge: 2 and 4 b/c unfair presumption

  • Especially problematic words like value are often 1 and 11 because it’s a scope problem money v. emotions, but even if provided a scope still vague for monetary value

  • 9 and 10 often combined when the emotional trigger leads people to ONE direction in the answer (for example triggering pride)

Additional notes

  • For most questions, we do not include any answer sets. This means that if it says “no answer set provided or unclear how respondents should answer” this would be “none of the above” because it’s beyond the scope of the project. But for some questions, we do have some answers like “agree or disagree with the following…” and if this is the case and the AI mentions it, it’s most likely 12 (meaning that you can assume if it looks like an answer set it is 12). It may also be other things for example (The questions structure, starting with the positive term excellent, might subtly encourage respondents to think more positively about the performance of the police and courts. By starting with the most positive option, it may create a priming effect, leading respondents to consider the positive aspects first.)

  • Cultural differences or variation are typically none of the above, but read closely if it mentions normative, moral, religious, personal issues because that’s within the scope and is coded 9 for sensitivity concerns.

  • When unsure, think about how to fix the problem.

  • If multiple codes and unsure, try to break it down by sentence.

  • Err on the side of more codes, but if you code more than 1 you have to find explicit proof in the output.

Changes during codebook refinement

  • Deleted all codes related to response selections, interview difficulties, and survey flow because it’s beyond the scope of this study.

  • Changed the terminology from “complex” to “technical.”

  • Combine vague topic and undefined/vague term because they are very similar and the “solution” for the researcher is the same, i.e. to specify the topic.

  • Combine question too long and complex or awkward syntax because they are very similar and the “solution” for the researcher is the same, i.e. rewrite the question.

  • Narrowed/redefined “erroneous” to mean more of a presupposed condition. (ex. What’s the best thing about driving a convertible? Presupposes the subject has access to a convertible and enjoys driving it.) This problem might not exist if given the whole survey flow, but is especially relevant given the individual, independent question.

  • “Long” recall/reference period is changed to difficult because there is something different about “what day was your child born” and “what did you have 2 weeks ago today for breakfast?”

7 Full Results↩︎

7.1 Power Analysis↩︎

Figure 5: Power analysis results for detecting a Model effect of sizes 0.1, 0.41, and 0.7 (without multiple testing correction)
Figure 6: Power analysis results for detecting a Model effect of sizes 0.1, 0.41, and 0.7 (with multiple testing correction)
Figure 7: Power analysis results for detecting a Persona effect of sizes 0.1, 0.41, and 0.7 (without multiple testing correction)
Figure 8: Power analysis results for detecting a Persona effect of sizes 0.1, 0.41, and 0.7 (with multiple testing correction)
Figure 9: Power analysis results for detecting an interaction effect between Model and Persona of sizes 0.05, 0.1, and 0.2 (without multiple testing correction)
Figure 10: Power analysis results for detecting an interaction effect between Model and Persona of sizes 0.05, 0.1, and 0.2 (with multiple testing correction)

7.2 Experimental Analysis↩︎

Table 4: Multilevel Regression Results for Number of Codes Per Treatment-Question Pair
Number of Codes
(Intercept) \(2.34^{***}\)
\((0.05)\)
ModelM2 \(0.55^{***}\)
\((0.04)\)
PersonaP2 \(0.22^{***}\)
\((0.05)\)
PersonaP3 \(-0.03\)
\((0.05)\)
AIC \(4117.60\)
BIC \(4149.76\)
Log Likelihood \(-2052.80\)
Num. obs. \(1571\)
Num. groups: Qs \(262\)
Var: Question (Int.) \(0.18\)
Var: Residual \(0.67\)
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
Table 5: Regression Results with Holm’s Multiple Testing Correction on Model Effect Only
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10 Code 11
(Intercept) \(0.98^{***}\) \(0.03^{**}\) \(0.05^{**}\) \(0.45^{***}\) \(0.09^{***}\) \(0.40^{***}\) \(0.03^{**}\) \(0.00\) \(0.07^{***}\) \(0.06^{***}\) \(0.18^{***}\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.03)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
ModelM2 \(0.01\) \(0.01\) \(0.08^{***}\) \(0.02\) \(0.06^{***}\) \(0.04\) \(0.02\) \(0.01^{*}\) \(0.14^{***}\) \(0.15^{***}\) \(0.03\)
\((0.00)\) \((0.01)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.01)\) \((0.02)\) \((0.02)\)
PersonaP2 \(-0.00\) \(0.02^{*}\) \(0.03\) \(0.03\) \(0.05^{**}\) \(0.02\) \(0.00\) \(-0.00\) \(-0.02\) \(0.04\) \(0.06^{**}\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
PersonaP3 \(0.00\) \(0.01\) \(0.05^{**}\) \(-0.01\) \(-0.01\) \(-0.04\) \(-0.00\) \(0.00\) \(0.00\) \(-0.01\) \(-0.02\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
AIC -2662.01 -752.65 862.20 2136.58 867.48 1786.64 -869.76 -3803.92 895.13 1017.60 1425.46
BIC -2629.86 -720.49 894.36 2168.74 899.64 1818.79 -837.61 -3771.77 927.28 1049.76 1457.62
Log Likelihood 1337.01 382.32 -425.10 -1062.29 -427.74 -887.32 440.88 1907.96 -441.56 -502.80 -706.73
Num. obs. 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571
Num. groups: Qs 262 262 262 262 262 262 262 262 262 262 262
Var: Question (Int.) 0.00 0.02 0.02 0.06 0.03 0.11 0.01 0.00 0.02 0.02 0.05
Var: Residual 0.01 0.03 0.09 0.19 0.08 0.13 0.03 0.01 0.09 0.09 0.11
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
Table 6: Regression Results with Holm’s Multiple Testing Correction on Persona Effect (P2 = Survey Design Expert) Only
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10 Code 11
(Intercept) \(0.98^{***}\) \(0.03^{**}\) \(0.05^{**}\) \(0.45^{***}\) \(0.09^{***}\) \(0.40^{***}\) \(0.03^{**}\) \(0.00\) \(0.07^{***}\) \(0.06^{***}\) \(0.18^{***}\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.03)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
ModelM2 \(0.01\) \(0.01\) \(0.08^{***}\) \(0.02\) \(0.06^{***}\) \(0.04^{*}\) \(0.02^{*}\) \(0.01^{**}\) \(0.14^{***}\) \(0.15^{***}\) \(0.03\)
\((0.00)\) \((0.01)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.01)\) \((0.02)\) \((0.02)\)
PersonaP2 \(-0.00\) \(0.02\) \(0.03\) \(0.03\) \(0.05^{*}\) \(0.02\) \(0.00\) \(-0.00\) \(-0.02\) \(0.04\) \(0.06^{*}\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
PersonaP3 \(0.00\) \(0.01\) \(0.05^{**}\) \(-0.01\) \(-0.01\) \(-0.04\) \(-0.00\) \(0.00\) \(0.00\) \(-0.01\) \(-0.02\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
AIC -2662.01 -752.65 862.20 2136.58 867.48 1786.64 -869.76 -3803.92 895.13 1017.60 1425.46
BIC -2629.86 -720.49 894.36 2168.74 899.64 1818.79 -837.61 -3771.77 927.28 1049.76 1457.62
Log Likelihood 1337.01 382.32 -425.10 -1062.29 -427.74 -887.32 440.88 1907.96 -441.56 -502.80 -706.73
Num. obs. 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571
Num. groups: Qs 262 262 262 262 262 262 262 262 262 262 262
Var: Question (Int.) 0.00 0.02 0.02 0.06 0.03 0.11 0.01 0.00 0.02 0.02 0.05
Var: Residual 0.01 0.03 0.09 0.19 0.08 0.13 0.03 0.01 0.09 0.09 0.11
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
Table 7: Regression Results with Holm’s Multiple Testing Correction on Persona Effect (P3 = Linguist) Only
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10 Code 11
(Intercept) \(0.98^{***}\) \(0.03^{**}\) \(0.05^{**}\) \(0.45^{***}\) \(0.09^{***}\) \(0.40^{***}\) \(0.03^{**}\) \(0.00\) \(0.07^{***}\) \(0.06^{***}\) \(0.18^{***}\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.03)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
ModelM2 \(0.01\) \(0.01\) \(0.08^{***}\) \(0.02\) \(0.06^{***}\) \(0.04^{*}\) \(0.02^{*}\) \(0.01^{**}\) \(0.14^{***}\) \(0.15^{***}\) \(0.03\)
\((0.00)\) \((0.01)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.01)\) \((0.02)\) \((0.02)\)
PersonaP2 \(-0.00\) \(0.02^{*}\) \(0.03\) \(0.03\) \(0.05^{**}\) \(0.02\) \(0.00\) \(-0.00\) \(-0.02\) \(0.04\) \(0.06^{**}\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
PersonaP3 \(0.00\) \(0.01\) \(0.05^{*}\) \(-0.01\) \(-0.01\) \(-0.04\) \(-0.00\) \(0.00\) \(0.00\) \(-0.01\) \(-0.02\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\) \((0.02)\)
AIC -2662.01 -752.65 862.20 2136.58 867.48 1786.64 -869.76 -3803.92 895.13 1017.60 1425.46
BIC -2629.86 -720.49 894.36 2168.74 899.64 1818.79 -837.61 -3771.77 927.28 1049.76 1457.62
Log Likelihood 1337.01 382.32 -425.10 -1062.29 -427.74 -887.32 440.88 1907.96 -441.56 -502.80 -706.73
Num. obs. 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571
Num. groups: Qs 262 262 262 262 262 262 262 262 262 262 262
Var: Question (Int.) 0.00 0.02 0.02 0.06 0.03 0.11 0.01 0.00 0.02 0.02 0.05
Var: Residual 0.01 0.03 0.09 0.19 0.08 0.13 0.03 0.01 0.09 0.09 0.11
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)

7.3 Robustness Checks↩︎

This section shows the results when a logistic multilevel model is used to model the likelihood of each code.

Table 8: Logistic Regression Results with Holm’s Multiple Testing Correction on Model Effect Only
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10 Code 11
(Intercept) \(9.23^{***}\) \(-9.29^{***}\) \(-3.57^{***}\) \(-0.29^{*}\) \(-3.60^{***}\) \(-0.71^{***}\) \(-8.06^{***}\) \(-31.99\) \(-3.61^{***}\) \(-3.60^{***}\) \(-2.30^{***}\)
\((1.27)\) \((0.90)\) \((0.27)\) \((0.14)\) \((0.31)\) \((0.19)\) \((0.92)\) \((76.01)\) \((0.28)\) \((0.27)\) \((0.22)\)
ModelM2 \(1.28\) \(0.42\) \(0.94^{***}\) \(0.09\) \(0.75^{***}\) \(0.28\) \(0.92\) \(27.54\) \(1.68^{***}\) \(1.70^{***}\) \(0.22\)
\((0.64)\) \((0.35)\) \((0.18)\) \((0.12)\) \((0.18)\) \((0.13)\) \((0.35)\) \((76.01)\) \((0.20)\) \((0.19)\) \((0.15)\)
PersonaP2 \(-0.00\) \(1.06^{*}\) \(0.44^{*}\) \(0.14\) \(0.61^{**}\) \(0.12\) \(0.00\) \(-1.11\) \(-0.31\) \(0.38\) \(0.53^{**}\)
\((0.73)\) \((0.43)\) \((0.22)\) \((0.14)\) \((0.21)\) \((0.16)\) \((0.41)\) \((1.16)\) \((0.22)\) \((0.20)\) \((0.18)\)
PersonaP3 \(0.28\) \(0.39\) \(0.62^{**}\) \(-0.05\) \(-0.13\) \(-0.30\) \(-0.17\) \(0.30\) \(0.02\) \(-0.16\) \(-0.16\)
\((0.75)\) \((0.45)\) \((0.22)\) \((0.14)\) \((0.23)\) \((0.16)\) \((0.42)\) \((0.77)\) \((0.21)\) \((0.21)\) \((0.19)\)
AIC 154.37 410.39 1068.72 2021.76 1061.31 1721.41 414.36 97.28 1065.69 1130.48 1420.38
BIC 181.17 437.18 1095.52 2048.56 1088.11 1748.20 441.16 124.08 1092.49 1157.27 1447.17
Log Likelihood -72.19 -200.19 -529.36 -1005.88 -525.66 -855.70 -202.18 -43.64 -527.85 -560.24 -705.19
Num. obs. 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571
Num. groups: Qs 262 262 262 262 262 262 262 262 262 262 262
Var: Question (Int.) 49.30 45.48 2.12 1.65 3.74 5.03 26.33 0.00 2.57 2.27 3.20
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
Table 9: Logistic Regression Results with Holm’s Multiple Testing Correction on Persona Effect (P2 = Survey Design Expert) Only
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10 Code 11
(Intercept) \(9.23^{***}\) \(-9.29^{***}\) \(-3.57^{***}\) \(-0.29^{*}\) \(-3.60^{***}\) \(-0.71^{***}\) \(-8.06^{***}\) \(-31.99\) \(-3.61^{***}\) \(-3.60^{***}\) \(-2.30^{***}\)
\((1.27)\) \((0.90)\) \((0.27)\) \((0.14)\) \((0.31)\) \((0.19)\) \((0.92)\) \((76.01)\) \((0.28)\) \((0.27)\) \((0.22)\)
ModelM2 \(1.28^{*}\) \(0.42\) \(0.94^{***}\) \(0.09\) \(0.75^{***}\) \(0.28^{*}\) \(0.92^{**}\) \(27.54\) \(1.68^{***}\) \(1.70^{***}\) \(0.22\)
\((0.64)\) \((0.35)\) \((0.18)\) \((0.12)\) \((0.18)\) \((0.13)\) \((0.35)\) \((76.01)\) \((0.20)\) \((0.19)\) \((0.15)\)
PersonaP2 \(-0.00\) \(1.06\) \(0.44\) \(0.14\) \(0.61^{*}\) \(0.12\) \(0.00\) \(-1.11\) \(-0.31\) \(0.38\) \(0.53^{*}\)
\((0.73)\) \((0.43)\) \((0.22)\) \((0.14)\) \((0.21)\) \((0.16)\) \((0.41)\) \((1.16)\) \((0.22)\) \((0.20)\) \((0.18)\)
PersonaP3 \(0.28\) \(0.39\) \(0.62^{**}\) \(-0.05\) \(-0.13\) \(-0.30\) \(-0.17\) \(0.30\) \(0.02\) \(-0.16\) \(-0.16\)
\((0.75)\) \((0.45)\) \((0.22)\) \((0.14)\) \((0.23)\) \((0.16)\) \((0.42)\) \((0.77)\) \((0.21)\) \((0.21)\) \((0.19)\)
AIC 154.37 410.39 1068.72 2021.76 1061.31 1721.41 414.36 97.28 1065.69 1130.48 1420.38
BIC 181.17 437.18 1095.52 2048.56 1088.11 1748.20 441.16 124.08 1092.49 1157.27 1447.17
Log Likelihood -72.19 -200.19 -529.36 -1005.88 -525.66 -855.70 -202.18 -43.64 -527.85 -560.24 -705.19
Num. obs. 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571
Num. groups: Qs 262 262 262 262 262 262 262 262 262 262 262
Var: Question (Int.) 49.30 45.48 2.12 1.65 3.74 5.03 26.33 0.00 2.57 2.27 3.20
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
Table 10: Logistic Regression Results with Holm’s Multiple Testing Correction on Persona Effect (P3 = Linguist) Only
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10 Code 11
(Intercept) \(9.23^{***}\) \(-9.29^{***}\) \(-3.57^{***}\) \(-0.29^{*}\) \(-3.60^{***}\) \(-0.71^{***}\) \(-8.06^{***}\) \(-31.99\) \(-3.61^{***}\) \(-3.60^{***}\) \(-2.30^{***}\)
\((1.27)\) \((0.90)\) \((0.27)\) \((0.14)\) \((0.31)\) \((0.19)\) \((0.92)\) \((76.01)\) \((0.28)\) \((0.27)\) \((0.22)\)
ModelM2 \(1.28^{*}\) \(0.42\) \(0.94^{***}\) \(0.09\) \(0.75^{***}\) \(0.28^{*}\) \(0.92^{**}\) \(27.54\) \(1.68^{***}\) \(1.70^{***}\) \(0.22\)
\((0.64)\) \((0.35)\) \((0.18)\) \((0.12)\) \((0.18)\) \((0.13)\) \((0.35)\) \((76.01)\) \((0.20)\) \((0.19)\) \((0.15)\)
PersonaP2 \(-0.00\) \(1.06^{*}\) \(0.44^{*}\) \(0.14\) \(0.61^{**}\) \(0.12\) \(0.00\) \(-1.11\) \(-0.31\) \(0.38\) \(0.53^{**}\)
\((0.73)\) \((0.43)\) \((0.22)\) \((0.14)\) \((0.21)\) \((0.16)\) \((0.41)\) \((1.16)\) \((0.22)\) \((0.20)\) \((0.18)\)
PersonaP3 \(0.28\) \(0.39\) \(0.62^{*}\) \(-0.05\) \(-0.13\) \(-0.30\) \(-0.17\) \(0.30\) \(0.02\) \(-0.16\) \(-0.16\)
\((0.75)\) \((0.45)\) \((0.22)\) \((0.14)\) \((0.23)\) \((0.16)\) \((0.42)\) \((0.77)\) \((0.21)\) \((0.21)\) \((0.19)\)
AIC 154.37 410.39 1068.72 2021.76 1061.31 1721.41 414.36 97.28 1065.69 1130.48 1420.38
BIC 181.17 437.18 1095.52 2048.56 1088.11 1748.20 441.16 124.08 1092.49 1157.27 1447.17
Log Likelihood -72.19 -200.19 -529.36 -1005.88 -525.66 -855.70 -202.18 -43.64 -527.85 -560.24 -705.19
Num. obs. 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571 1571
Num. groups: Qs 262 262 262 262 262 262 262 262 262 262 262
Var: Question (Int.) 49.30 45.48 2.12 1.65 3.74 5.03 26.33 0.00 2.57 2.27 3.20
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)

7.4 Statement Order Analysis↩︎

Table 11: Average Statement Placement by Code
Code Average Statement Placement
Code 1 2.503495
Code 2 3.162500
Code 3 2.591549
Code 4 3.647551
Code 5 2.387387
Code 6 2.809917
Code 7 3.982759
Code 8 3.111111
Code 9 4.16682
Code 10 3.749004
Code 11 3.643060
SysVar 4.243108
NOTA 3.767263
Table 12: One-Way ANOVA Test for Differences in Average Statement Placement across Codes
Effect \(\hat{\eta}^2_G\) 90% CI \(F\) \(\mathit{df}\) \(\mathit{df}_{\mathrm{res}}\) \(p\)
Code .197 [.183, .208] 165.95 12 8140 \(<\) .001
Pairwise Comparisons of Average Code Placement with Confidence Intervals and Adjusted \(p\)-Values
Difference Lower CI Upper CI Adjusted p-Value
1.2455092 0.9723676 1.5186509 0.0000000
1.1395647 0.9064987 1.3726308 0.0000000
0.6590052 0.1850065 1.1330040 0.0003053
0.0880545 -0.2071261 0.3832352 0.9989028
1.1440560 0.9845168 1.3035952 0.0000000
-0.1161074 -0.4055510 0.1733362 0.9833451
0.3064226 0.1370953 0.4757499 0.0000002
1.4792639 0.9240813 2.0344464 0.0000000
0.6076164 -0.7932429 2.0084756 0.9686212
1.6531873 1.3606012 1.9457733 0.0000000
1.2637687 1.0996543 1.4278831 0.0000000
1.7396130 1.5192395 1.9599866 0.0000000
-0.1059445 -0.4525398 0.2406508 0.9986055
-0.5865040 -1.1254690 -0.0475390 0.0189930
-1.1574547 -1.5485303 -0.7663790 0.0000000
-0.1014532 -0.4035480 0.2006416 0.9965661
-1.3616166 -1.7483803 -0.9748529 0.0000000
-0.9390866 -1.2464631 -0.6317102 0.0000000
0.2337546 -0.3778304 0.8453396 0.9889652
-0.6378929 -2.0620470 0.7862613 0.9597872
0.4076780 0.0185571 0.7967990 0.0300627
0.0182594 -0.2862764 0.3227953 1.0000000
0.4941038 0.1559130 0.8322946 0.0000985
-0.4805595 -1.0003640 0.0392450 0.1038954
-1.0515102 -1.4157266 -0.6872938 0.0000000
0.0044913 -0.2619195 0.2709021 1.0000000
-1.2556721 -1.6152546 -0.8960896 0.0000000
-0.8331421 -1.1055273 -0.5607569 0.0000000
0.3396991 -0.2550694 0.9344677 0.8001042
-0.5319484 -1.9489623 0.8850656 0.9905658
0.5136225 0.1515058 0.8757393 0.0001962
0.1242039 -0.1449716 0.3933795 0.9497620
0.6000483 0.2933131 0.9067834 0.0000000
-0.5709507 -1.1214128 -0.0204886 0.0337155
0.4850508 -0.0062024 0.9763040 0.0568701
-0.7751126 -1.3225197 -0.2277055 0.0002039
-0.3525826 -0.8471013 0.1419360 0.4709051
0.8202586 0.0963079 1.5442093 0.0111135
-0.0513889 -1.5272898 1.4245120 1.0000000
0.9941820 0.4451069 1.5432572 0.0000002
0.6047634 0.1120054 1.0975215 0.0032882
1.0806078 0.5663691 1.5948465 0.0000000
1.0560015 0.7338423 1.3781607 0.0000000
-0.2041619 -0.6067925 0.1984687 0.9028244
0.2183681 -0.1087490 0.5454851 0.5822618
1.3912093 0.7694686 2.0129500 0.0000000
0.5195618 -0.9089830 1.9481066 0.9928737
1.5651327 1.1602372 1.9700283 0.0000000
1.1757141 0.8512649 1.5001634 0.0000000
1.6515585 1.2953306 2.0077864 0.0000000
-1.2601634 -1.5770743 -0.9432525 0.0000000
-0.8376334 -1.0505340 -0.6247328 0.0000000
0.3352078 -0.2347769 0.9051926 0.7661595
-0.5364397 -1.9432309 0.8703515 0.9891860
0.5091313 0.1893477 0.8289148 0.0000104
0.1197127 -0.0890658 0.3284911 0.7957456
0.5955570 0.3401760 0.8509379 0.0000000
0.4225300 0.1005804 0.7444795 0.0009907
1.5953712 0.9763337 2.2144088 0.0000000
0.7237237 -0.7036467 2.1510941 0.9028714
1.7692946 1.3685623 2.1700269 0.0000000
1.3798760 1.0606374 1.6991147 0.0000000
1.8557204 1.5042317 2.2072090 0.0000000
1.1728413 0.6000397 1.7456428 0.0000000
0.3011938 -1.1067411 1.7091286 0.9999658
1.3467647 1.0219870 1.6715423 0.0000000
0.9573461 0.7409958 1.1736963 0.0000000
1.4331904 1.1715831 1.6947978 0.0000000
-0.8716475 -2.3755878 0.6322928 0.7835397
0.1739234 -0.4465897 0.7944365 0.9994116
-0.2154952 -0.7867774 0.3557870 0.9901596
0.2603491 -0.3295613 0.8502596 0.9640905
1.0455709 -0.3824400 2.4735819 0.4252259
0.6561523 -0.7511651 2.0634697 0.9457837
1.1319967 -0.2829850 2.5469784 0.2799217
-0.3894186 -0.7115091 -0.0673281 0.0042024
0.0864257 -0.2676551 0.4405066 0.9998592
0.4758443 0.2175805 0.7341082 0.0000001
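The statement order analysis above can be sketched in base R as a one-way ANOVA of placement on code, an eta-squared effect size (which coincides with generalized eta-squared in a one-way between-groups design), and pairwise comparisons with adjusted p-values. The data frame and column names below (placements, placement, code) are assumptions, and Tukey's HSD is shown as one common adjustment choice rather than necessarily the authors' exact procedure.

```r
# Minimal sketch, assuming a data frame `placements` with a numeric column
# `placement` (position of the statement in the AI response) and a factor
# column `code` (the 11 codes plus SysVar and NOTA).
fit <- aov(placement ~ code, data = placements)
summary(fit)  # F test for differences in average placement across codes (Table 12)

# Eta-squared for the code effect; in a one-way between-groups design this
# equals generalized eta-squared.
ss <- summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)

# Pairwise differences in average placement with confidence intervals and
# adjusted p-values (Tukey's HSD as one option)
TukeyHSD(fit)
```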

7.5 Question Source Analysis↩︎

Table 13: Regression Results of Question Source Analysis
Number of Codes
(Intercept) \(2.30^{***}\)
\((0.05)\)
ModelM2 \(0.55^{***}\)
\((0.04)\)
PersonaP2 \(0.22^{***}\)
\((0.05)\)
PersonaP3 \(-0.03\)
\((0.05)\)
SourceWVS \(0.19^{**}\)
\((0.07)\)
SourceGallup \(-0.31^{*}\)
\((0.15)\)
AIC \(4114.94\)
BIC \(4157.82\)
Log Likelihood \(-2049.47\)
Num. obs. \(1571\)
Num. groups: Qs \(262\)
Var: Question (Int.) \(0.17\)
Var: Residual \(0.67\)
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
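A minimal sketch of the specification behind Table 13, again under hypothetical names rather than the authors' script: a linear multilevel model of the number of codes assigned per response, with fixed effects for model, persona, and question source and a random intercept for question.

```r
# Minimal sketch, assuming `dat` contains a numeric column n_codes (codes
# assigned per response), factors model, persona, and source (reference level
# plus WVS and Gallup, as in Table 13), and a question_id grouping variable.
library(lme4)

src_fit <- lmer(n_codes ~ model + persona + source + (1 | question_id), data = dat)
summary(src_fit)  # refitting with lmerTest::lmer() would add p-values
```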
Table 14: Question Source Regression Results with Holm’s Multiple Testing Correction on SourceWVS
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10
(Intercept) \(0.98^{***}\) \(0.01\) \(0.07^{***}\) \(0.50^{***}\) \(0.10^{***}\) \(0.43^{***}\) \(0.02^{*}\) \(-0.00\) \(0.06^{**}\) \(0.02\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.03)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\)
ModelM2 \(0.01\) \(0.01\) \(0.08^{***}\) \(0.01\) \(0.06^{***}\) \(0.04^{*}\) \(0.02^{*}\) \(0.01^{**}\) \(0.14^{***}\) \(0.15^{***}\)
\((0.00)\) \((0.01)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.01)\) \((0.02)\)
PersonaP2 \(-0.00\) \(0.02^{*}\) \(0.03\) \(0.03\) \(0.05^{**}\) \(0.02\) \(0.00\) \(-0.00\) \(-0.02\) \(0.04\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\)
PersonaP3 \(0.00\) \(0.01\) \(0.05^{**}\) \(-0.01\) \(-0.01\) \(-0.04\) \(-0.00\) \(0.00\) \(0.00\) \(-0.01\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\)
SourceWVS \(-0.00\) \(0.07^{**}\) \(-0.03\) \(-0.13^{**}\) \(-0.03\) \(-0.12\) \(0.01\) \(0.00\) \(0.04\) \(0.15^{***}\)
\((0.01)\) \((0.02)\) \((0.02)\) \((0.04)\) \((0.03)\) \((0.05)\) \((0.02)\) \((0.00)\) \((0.03)\) \((0.02)\)
SourceGallup \(0.01\) \(-0.01\) \(-0.08\) \(-0.24^{**}\) \(-0.03\) \(0.14\) \(0.02\) \(-0.00\) \(-0.03\) \(-0.02\)
\((0.02)\) \((0.04)\) \((0.05)\) \((0.09)\) \((0.06)\) \((0.10)\) \((0.03)\) \((0.01)\) \((0.06)\) \((0.05)\)
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)
Table 15: Question Source Regression Results with Holm’s Multiple Testing Correction on SourceGallup
Code 1 Code 2 Code 3 Code 4 Code 5 Code 6 Code 7 Code 8 Code 9 Code 10
(Intercept) \(0.98^{***}\) \(0.01\) \(0.07^{***}\) \(0.50^{***}\) \(0.10^{***}\) \(0.43^{***}\) \(0.02^{*}\) \(-0.00\) \(0.06^{**}\) \(0.02\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.03)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\)
ModelM2 \(0.01\) \(0.01\) \(0.08^{***}\) \(0.01\) \(0.06^{***}\) \(0.04^{*}\) \(0.02^{*}\) \(0.01^{**}\) \(0.14^{***}\) \(0.15^{***}\)
\((0.00)\) \((0.01)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.01)\) \((0.02)\)
PersonaP2 \(-0.00\) \(0.02^{*}\) \(0.03\) \(0.03\) \(0.05^{**}\) \(0.02\) \(0.00\) \(-0.00\) \(-0.02\) \(0.04\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\)
PersonaP3 \(0.00\) \(0.01\) \(0.05^{**}\) \(-0.01\) \(-0.01\) \(-0.04\) \(-0.00\) \(0.00\) \(0.00\) \(-0.01\)
\((0.01)\) \((0.01)\) \((0.02)\) \((0.03)\) \((0.02)\) \((0.02)\) \((0.01)\) \((0.00)\) \((0.02)\) \((0.02)\)
SourceWVS \(-0.00\) \(0.07^{***}\) \(-0.03\) \(-0.13^{***}\) \(-0.03\) \(-0.12^{*}\) \(0.01\) \(0.00\) \(0.04\) \(0.15^{***}\)
\((0.01)\) \((0.02)\) \((0.02)\) \((0.04)\) \((0.03)\) \((0.05)\) \((0.02)\) \((0.00)\) \((0.03)\) \((0.02)\)
SourceGallup \(0.01\) \(-0.01\) \(-0.08\) \(-0.24\) \(-0.03\) \(0.14\) \(0.02\) \(-0.00\) \(-0.03\) \(-0.02\)
\((0.02)\) \((0.04)\) \((0.05)\) \((0.09)\) \((0.06)\) \((0.10)\) \((0.03)\) \((0.01)\) \((0.06)\) \((0.05)\)
\(^{***}p<0.001\); \(^{**}p<0.01\); \(^{*}p<0.05\)

7.6 Comparison to Human Expert↩︎

Figure 11: Comparison of Human and AI Use of Each Code

References↩︎

[1]
A. K. L. Cheung, “Structured questionnaires,” in Encyclopedia of quality of life and well-being research, A. C. Michalos, Ed. Dordrecht: Springer Netherlands, 2014, pp. 6399–6402.
[2]
S. L. B. Payne, The art of asking questions: Studies in public opinion, 3. Princeton University Press, 1951.
[3]
S. Presser and J. Blair, “Survey pretesting: Do different methods produce different results?” Sociological Methodology, vol. 24, pp. 73–104, 1994.
[4]
G. B. Willis and J. T. Lessler, “Question appraisal system QAS-99,” National Cancer Institute, 1999.
[5]
F. J. Fowler, Improving Survey Questions: Design and Evaluation. SAGE, 1995.
[6]
J. M. Converse and S. Presser, Survey Questions: Handcrafting the Standardized Questionnaire. SAGE, 1986.
[7]
P. P. Biemer, “Total survey error: Design, implementation, and evaluation,” Public opinion quarterly, vol. 74, no. 5, pp. 817–848, 2010.
[8]
J. Wagner, H. Guyer, and C. Evanchek, “Using Time Series Models to Understand Survey Costs,” Journal of Survey Statistics and Methodology, vol. 9, no. 5, pp. 943–960, Nov. 2020, doi: 10.1093/jssam/smaa024.
[9]
Y. Liu et al., “Summary of ChatGPT-related research and perspective towards the future of large language models,” Meta-Radiology, vol. 1, p. 100017, 2023.
[10]
T. Wu et al., “A brief overview of ChatGPT: The history, status quo and potential future development,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023.
[11]
C. A. Bail, “Can Generative AI improve social science?” Proceedings of the National Academy of Sciences, vol. 121, no. 21, p. e2314021121, May 2024, doi: 10.1073/pnas.2314021121.
[12]
C. Paduraru, R. Cristea, and A. Stefanescu, “Adaptive questionnaire design using AI agents for people profiling.” in ICAART (3), 2024, pp. 633–640.
[13]
ESRA 2025 Conference, “ESRA 2025 program.” Online: https://www.europeansurveyresearch.org/conf2025/prog.php, 2025.
[14]
P. Sturgis, T. S. Robinson, L. Fung, and C. Roberts, “SOCbot: Using large language models to measure and classify occupations in surveys.” OSF, 2025.
[15]
F. Olivos and M. Liu, “ChatGPTest: Opportunities and cautionary tales of utilizing AI for questionnaire pretesting,” Field Methods, vol. 0, no. 0, p. 1525822X241280574, doi: 10.1177/1525822X241280574.
[16]
U. A. Fichtner et al., “Exploring the potential of large language models for integration into an academic statistical consulting service–the EXPOLS study protocol,” PLoS ONE, vol. 19, no. 12, p. e0308375, 2024.
[17]
J. White et al., “A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT,” arXiv:2302.11382 [cs], Feb. 2023, doi: 10.48550/arXiv.2302.11382.
[18]
P. Green and C. J. MacLeod, “simr: An R package for power analysis of generalised linear mixed models by simulation,” Methods in Ecology and Evolution, vol. 7, no. 4, pp. 493–498, 2016, doi: 10.1111/2041-210X.12504.
[19]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023.
[20]
M. G. Arend and T. Schäfer, “Statistical power in two-level models: A tutorial based on Monte Carlo simulation,” Psychological Methods, vol. 24, no. 1, p. 1, 2019.
[21]
A. Ali et al., “Sample size issues in multilevel logistic regression models,” PLOS ONE, vol. 14, no. 11, p. e0225427, Nov. 2019, doi: 10.1371/journal.pone.0225427.
[22]
R. Moineddin, F. I. Matheson, and R. H. Glazier, “A simulation study of sample size for multilevel logistic regression models,” BMC Medical Research Methodology, vol. 7, p. 34, Jul. 2007, doi: 10.1186/1471-2288-7-34.
[23]
Gallup, “Who we are.” Available at https://www.gallup.com/corporate/212381/who-we-are.aspx Accessed: 2023/08/11, 2023.
[24]
World Values Survey Association, “Home > What we do > Questionnaire development.” Available at https://www.worldvaluessurvey.org/who we areContents.jsp Accessed: 2023/08/11, 2020.
[25]
Governance and Local Development Institute, “About GLD.” Available at https://gld.gu.se/en/about-gld/ Accessed: 2023/08/11, 2025.
[26]
E. Metheney and L. Yehle, “Replication Data for: AI Shows Immense Promise to Revolutionize How We Design Surveys,” Harvard Dataverse, 2025, doi: 10.7910/DVN/AN3U9F.
[27]
F. Fowler, “Coding the behavior of interviewers and respondents to evaluate survey questions,” in Question evaluation methods: Contributing to the science of data quality, J. Madans, K. Miller, A. Maitland, and G. B. Willis, Eds. John Wiley & Sons, 2011, pp. 7–21.
[28]
F. Bais et al., “Can Survey Item Characteristics Relevant to Measurement Error Be Coded Reliably? A Case Study on 11 Dutch General Population Surveys,” Sociological Methods & Research, vol. 48, no. 2, pp. 263–295, May 2019, doi: 10.1177/0049124117729692.
[29]
F. J. Fowler Jr, “The problem with survey research.” SAGE Publications Sage CA: Los Angeles, CA, 2014.
[30]
A. Holbrook, Y. I. Cho, and T. Johnson, “The impact of question and respondent characteristics on comprehension and mapping difficulties,” Public Opinion Quarterly, vol. 70, no. 4, pp. 565–595, 2006.
[31]
Y. P. Ongena and W. Dijkstra, “Methods of behavior coding of survey interviews,” Journal of official statistics, vol. 22, no. 3, pp. 1–34, 2006.
[32]
J. Van der Zouwen, W. Dijkstra, and J. H. Smit, “Studying respondent-interviewer interaction: The relationship between interviewing style, interviewer behavior, and response behavior,” in Measurement errors in surveys, P. P. Biemer, R. M. Groves, L. E. Lyberg, N. A. Mathiowetz, and S. Sudman, Eds. Wiley Online Library, 2004, pp. 419–437.
[33]
J. Rothgeb, G. Willis, and B. Forsyth, “Questionnaire pretesting methods: Do different techniques and different organizations produce similar results?” BMS: Bulletin of Sociological Methodology / Bulletin de Methodologie Sociologique, vol. 96, no. 96, pp. 5–31, 2007.
[34]
Governance and Local Development Institute, “Local governance process indicators (LGPI).” Available at https://gld.gu.se/en/research/local-governance-process-indicators-lgpi-0 Accessed: 2025/08/27, 2025.
[35]
World Values Survey Association, “Home - who we are.” Available at https://www.worldvaluessurvey.org/who we areContents.jsp Accessed: 2023/08/11, 2023.
[36]
M. Alvesson, J. Sandberg, and K. Einola, “Reflexive design in qualitative research,” The SAGE handbook of qualitative research design, vol. 2, pp. 23–40, 2022.

  1. Researcher/Statistician/Head of Data, Governance and Local Development Institute, University of Gothenburg↩︎

  2. PhD Student, University of Gothenburg↩︎

  3. We gratefully acknowledge FORMAS (2016-00228, PI: Ellen Lust) and The Swedish Research Council (2016-01687, PI: Ellen Lust, and E0003801, PI: Pam Fredman) for financial support; Megan Baxter for data collection support; Emmanuel Kwabena Asare and Kiara Castaman for qualitative coding support; Rose Shaber-Twedt for editing support; Ellen Lust for insights and feedback; and the MiMac Conference for early encouragement and feedback.↩︎

  4. These were the available models during data collection from September to December 2023.↩︎

  5. Additional details can be found in Appendix 5.↩︎

  6. The current memory feature of ChatGPT was not available when the data was collected from September to December 2023.↩︎

  7. The hypotheses and analysis plan were submitted on OSF on November 23, 2023 (https://osf.io/4q26k).↩︎

  8. See the tables in Appendix 7 for full results.↩︎

  9. The coding was completed by author Erica Ann Metheney.↩︎

  10. Excluding Code 1 (Vague Term), which was coded almost 100% of the time by the AI models.↩︎

  11. The coders achieved above 80% agreement for all codes and above 70% full agreement on statements.↩︎