Is ChatGPT detrimental to innovation? A field experiment among university students

Hassan, Mazen; Amin, Engi; Mansour, Sarah; Kelani, Zeyad

doi:10.1108/REPS-08-2024-0045

Purpose

This paper investigates the potential collateral effects of A.I. innovations, specifically ChatGPT, on three key variables: innovation, readiness to exert effort and risk behavior.

Design/methodology/approach

A field experiment was conducted involving nearly 100 senior university students at a public university in Egypt, at a time when ChatGPT was not yet legally operational there. Over a one-month period, participants submitted three graded essay assignments. The treatment group utilized ChatGPT to write the essays, while the control group completed the assignments without such assistance. After submission, both groups participated in a lab-based innovation game, a risk game and a real effort task to measure their respective innovation, risk aversion and effort exertion.

Findings

The results reveal that students who used ChatGPT demonstrated significantly lower levels of innovation (ChatGPT usage is associated with a decrease in innovation scores by approximately 0.6–0.72 standard deviation points, at the 95% confidence level) and risk aversion (individuals in the ChatGPT group are more likely to become risk lovers, at the 90% confidence level) compared to the Non-ChatGPT group. Although the reduction in effort exerted by the ChatGPT group was not statistically significant, the overall trends suggest a potential decrease in effort related to the use of A.I. applications.

Research limitations/implications

On avenues for future research, although field experiments will always have the advantage of high ecological validity, testing the effect of AITGs on behavior could also benefit from the controlled environment of lab experiments. In such designs, spill-over worries would be minimal and internal validity would be high. To address the issue of external validity, however, new experimental designs could test the generalizability of the findings: longitudinal studies that trace the effect of technology across time, or studies that expand the participant pool across multiple institutions varying in type (public/private), age of students (schools/universities), academic background and majors.

Practical implications

On practical and policy implications – and in line with economic theories on innovation and economic growth and development (Schumpeter, 1942; Romer, 1990; Acemoglu and Robinson, 2013) – our humble findings point to an urgent need to augment existing education with concrete, innovation-based practices. These could include embedding design thinking and problem-based learning modules directly into curricula, as well as choice architecture that alters people’s behavior without restricting options. In addition, designing student innovation competitions and startup incubators that reward novelty and impact would incentivize the value of innovation.

Originality/value

This study is among the first to empirically test the impact of ChatGPT on innovation, effort, and risk behavior in a real-world academic setting. It provides preliminary evidence of the potential negative effects of A.I. applications on these variables, offering valuable insights for further research into the broader implications of A.I. on human behavior.

Introduction

Even before the rise of ChatGPT and other artificial intelligence text generators (AITGs), there has been a global debate – that is not short of controversy – on the risks and benefits of automation and artificial intelligence (Pasquale, 2015), with some calls for regulation (Wachter and Mittelstadt, 2019), or at least a pause on developing the technology until further studies assess its impacts. The ascent of ChatGPT has certainly intensified this debate, with worries about the new technology from scientists, educators, and even students (see, for example, Farhi et al., 2023). Indeed, recent survey evidence has shown that two-thirds of students use ChatGPT in their studies (von Garrel and Mayer, 2023). On the use of ChatGPT-like LLMs for assisted learning, Pan and Ni (2024) found that the proportion of users was significantly higher among males and junior-level students. In this context, this paper seeks to test the effect of the continuous usage of ChatGPT on innovation, effort levels, and risk behavior among university students.

Our assumption is straightforward and intriguing: that repeated reliance on technology to come up with answers to questions and to solve problems (just as ChatGPT does) creates a social norm of dependence on technology to innovate, and – if used over a sustained period of time – crowds out the human innovation drive. To explore this question, we conducted a pre-registered field experiment with nearly 100 senior university students at a public university in Egypt, where we tested the effect of using ChatGPT for a month in doing assignments on three dependent variables usually assumed by the literature to be A.I.-collaterals: innovation, risk behavior, and readiness to exert effort. The field experiment lasted approximately four weeks and involved participants submitting three graded essay assignments during that period. In the treatment (ChatGPT) group, students were asked to write essays using ChatGPT, whereas in the control (Non-ChatGPT) group, such an option was neither mentioned nor allowed (the experiment was conducted before ChatGPT was legally operational in Egypt). One week after all assignments were submitted, the two groups were asked to participate in a lab-based innovation game, a risk game, and a real effort task to measure their respective innovation, risk aversion, and effort exertion. Our findings are nuanced: the ChatGPT group was significantly less innovative (measured by how frequently they changed the sales strategies; significant at the 95% confidence level) and less risk averse (at the 90% confidence level). The ChatGPT group also exerted less effort (measured by how frequently they recorded their strategies for reference over the rounds), although this result was not statistically significant.

This paper is divided into five sections that detail our theory and design. The following section outlines our theory, whereas section three presents the experimental design. Section four presents the findings, and section five concludes with a discussion of the study’s possible limitations and avenues for future research.

Theory

Technological advances have become a significant factor influencing human behavior and, consequently, the social norms that emerge from – and later guide – that behavior. The invention of personal computers and laptops has made working from home beyond office hours a behavior that employers expect. The rise of mobile phones has made people almost always reachable. The rise of social media has affected the attention span of its users (Firth et al., 2020), leading to increased social comparison and a decline in social interaction (Fischetti, 2016). This is hardly surprising, given the penetration of technology into almost every aspect of our everyday life, coupled with our increasing reliance on it to make our lives easier – and, in parallel, also different.

The premise of this paper is that technologies have spill-over effects across domains and are capable of affecting behavior in areas, and at times, beyond those in which they are initially used. For example, it is now almost a fact that smartphones make us less attentive to other tasks while using them (Altmann et al., 2014). They also negatively affect our recall accuracy and behavioral control (Chen et al., 2016), making their effects persist – long after we stop using them – although of course, such effects depend on the duration and intensity of usage.

We contend that a similar spill-over effect occurs concerning one of the newest – yet at the same time one of the fastest-growing – technologies: ChatGPT. We argue that the continuous use of ChatGPT would also leave its imprint on socio-economic behavior in domains where ChatGPT is not primarily used. We focus on three behavior types central to economic growth: innovation, effort, and risk behavior.

Starting with innovation, it is indeed a value that is crucial to economic activity, productivity, and even social relations (e.g. Amabile, 2019; Bruhn and McKenzie, 2009). Innovation, however, is a mentally demanding task. It requires individuals to continually think about a specific task (and how to improve it) through an iterative process of asking questions and attempting to answer them. A central dimension of innovation, therefore, involves problem-solving (Amabile, 2019; Bieser, 2022; Sternberg and Lubart, 1999), whereby knowledge is produced via experimentation (Arrow, 1969) and where early failure is an integral part of the process (Manso, 2011). ChatGPT – and other AITGs – however, predominantly involve asking an algorithm a series of questions and waiting for the chatbot to generate answers. Repeatedly outsourcing problem-solving to ChatGPT may reduce mental effort and, over time, undermine human-driven innovation (Kosmyna et al., 2025; Stadler et al., 2024; Gerlich, 2025; Zhang et al., 2024). Taking the mental exercise out of problem-solving is likely to lead to counter-innovation attitudes. Our first hypothesis, therefore, is:

H1.

The continuous use of ChatGPT over a sustained period of time would drive down innovative behavior.

We acknowledge that A.I. can also facilitate innovation through various mechanisms. These might include freeing up the time of humans otherwise consumed in monotonous tasks (Baska, 2018) that are not structured to lead to innovation (although some studies do show that we might not be making use of this time in innovative tasks, just in more screen time instead (Ortiz-Ospina and Roser, 2023)). Another mechanism could be that A.I. helps to process a large body of data to identify patterns that might unearth a problem and suggest a solution (see Bieser, 2022). A third mechanism could be by serving as a collaborative tool that helps individuals generate, refine, and expand ideas. Especially in education, AI could support creative learning by stimulating curiosity and by enabling students to experiment across disciplines (Yang et al., 2025; Shen et al., 2023; Holmes, 2023). We argue, however, that these mechanisms are (a) more about incremental innovation (solving problems with a focus on one variable to be optimized), rather than the type of ground-breaking innovation that involves a radical shake-up of the status quo for the better, and (b) that they largely talk about machine-driven innovation (either partly or entirely). In this paper, however, we are interested in studying human-driven innovation and particularly whether the potential mechanisms outlined above would – over time – crowd out the innovation drive by humans that has been so central to human progress (Bloom et al., 2020).

A second socio-economic attitude that could be affected by AITGs is risk behavior. While a reasonable level of risk is necessary for economic activity to thrive (Khalid et al., 2024), excessively low risk aversion could lead to reckless behavior and endanger economic enterprises.

AI’s risk effects are, however, context-dependent. In unregulated settings, AI may amplify instability, whereas in structured environments, such as finance, it can constrain risk and promote consistency (D’Acunto et al., 2019). We thus argue that the continuous usage of ChatGPT could increase risk tolerance via two mechanisms. First, there is the moral hazard context that this specific type of automation creates. The costless experimentation of asking a chatbot as many questions as one likes until one receives an answer (that one likes) is likely to prime users with a sense of insurance (Winter, 2000) that allows for endless experimentation at virtually no or low cost, thereby decreasing their risk aversion. A second mechanism by which repeated usage of ChatGPT could affect risk behavior is that, as previous research has shown, when humans interact with machines and algorithms, they tend to apply a different set of values: mainly getting less emotional and less concerned with social rules of conduct (for a review, see Han et al., 2025). Such decreased pro-sociality has been shown in trust games (Schniter et al., 2020), ultimatum games, as well as in dictator and public goods games (Melo et al., 2016). It is these social values, however, that often restrict reckless behavior and make an individual stop short of taking major risks (for more experimental studies on risk behavior see Charness et al., 2018; Anderson and Mellor, 2009; Holmes, 2023). Our second hypothesis, therefore, reads as follows:

H2.

The continuous use of ChatGPT over a sustained period of time would decrease risk aversion.

Effort is a likely collateral of many technological advances. Because most technologies mainly aim at making a machine, software, or an algorithm replace some aspect of human effort (e.g. sending a voice message instead of typing it, voice or face recognition instead of pressing buttons), increased reliance on technologies is likely to decrease our perception of the amount of effort required from us, leading to complacency (Zhang et al., 2024; Ahmad et al., 2023). Relevant research has shown how technology has made users less active (Woessner et al., 2021). Such effects are not confined to people with prior low skills: Galletta et al. (2005) have shown that using a spell-check program makes even individuals with high writing skills miss spelling errors later on. In health care domains, it was also shown that the capacity of professional staff to read mammograms went down (making them miss cancers) if they used computer-aided detection systems (Alberdi et al., 2004). Our third hypothesis, therefore, reads as follows:

H3.

The continuous use of ChatGPT over a sustained period of time would decrease the level of effort one exerts to complete a task.

Material and methods

We conducted a pre-registered field experiment in which 94 senior students from an Egyptian public university participated ^[1]. We chose a university as our experimental context for two reasons. First, universities and students are typically expected to be the primary engines for innovation by developing novel ideas, debating significant questions, and having the time to experiment with multiple solutions. Indeed, most breakthroughs have usually originated from university campuses (Lawton-Smith, 2006). Second, the potential negative effects of ChatGPT on university students were among the first concerns raised when the technology made its debut. Just as artificial intelligence text generators (AITGs) started to make headlines, many feared that students would use the chatbot to do assignments, write theses, and hence change – or harm – their learning trajectory (for a discussion, see Lo, 2023). This concern deepened after several studies pointed to ChatGPT’s reasonable performance in handling academic tasks, such as taking exams and generating academic abstracts (Friederichs et al., 2023). A university campus, therefore, seemed to be an ideal venue for a field experiment on the probable effects of ChatGPT on innovation.

We designed our experiment to address the most basic concerns of educational institutions about ChatGPT: whether students seeking assistance with the technology would become less innovative over time, thereby contributing to the higher ecological validity of our design. Our participants were undergraduate students with the same major (social sciences) and minor (social science computing) at the same faculty at the Egyptian public university. Participants were enrolled in two different courses: Social Network Analysis and Data Mining. We used enrollment in either of these two parallel courses as our unit of randomization to assign participants to either the treatment or the control group. These specific courses were selected due to their equivalence in difficulty, prerequisites, coursework demands, grading structures, and instructors’ experience.

Although randomizing at the course level (i.e. cluster randomization) can introduce concerns about group comparability due to potential self-selection (Raudenbush, 1997), we attempted to mitigate this risk through careful course selection and post-hoc balance checks (Bloom, 2005). All participants were at the same academic level (senior-level students), had previously taken the same core courses and assessments regarding research methods and writing, and had the same computational skills. Furthermore, given that ChatGPT was not easily accessible in Egypt either legally or logistically at the time of the study, the likelihood of systematic prior exposure was assumed to be minimal. To verify balance in terms of socio-economic and demographic characteristics, we conducted a series of statistical tests. Specifically, we used independent samples t-tests for continuous variables (age and financial status) and chi-square tests for categorical variables (gender, religion, and residence). All tests failed to reject the null hypothesis of equality across groups (p-values > 0.05), indicating that the treatment and control groups were statistically similar. In the next section, we present analyses showing that both groups did not significantly differ in crucial background variables ^[2].

While alternative designs such as stratified or within-course randomization can improve statistical balance, especially in smaller samples (Bruhn and McKenzie, 2009), pursuing such an approach would have increased the risk of interference and treatment contamination in our context. Because the treatment involved the active use of ChatGPT—while the control group was restricted from using it—randomizing students within the same course could have led to information leakage, peer influence, tool sharing, and non-compliance. These risks are especially salient in educational environments, as students often collaborate or informally exchange information, potentially jeopardizing the treatment integrity (Vanderweele et al., 2013).

The field component of the experiment was administered as follows. Students in both courses (i.e. the treatment and the control) were asked to submit three graded essay assignments over the course of one month during the term in April 2023. The difference between the groups was that members of the treatment group (enrolled in the Social Network Analysis course) were asked to write the essays using ChatGPT, whereas members of the control group (enrolled in the Data Mining course) were not given the option to use ChatGPT.

The timing of our experiment allowed us to capitalize on a brief window of opportunity when access to ChatGPT was still unavailable in Egypt. It was only in November 2023 that Egyptians could create an account and access ChatGPT with local cell phone numbers and Egyptian IP addresses. Before November 2023, Egyptians could access ChatGPT via a VPN and a foreign phone number, but this step was largely burdensome and, therefore, quite rare at the time. For the treatment group, we created ChatGPT accounts for them using foreign cell phone numbers. This context allowed for a clean field experiment that minimizes the risk of contamination – where participants in the control group might also use ChatGPT – and ensured more reliable results.

One week after all assignments were submitted, the two groups were asked to take part in a lab-based innovation game (Ederer and Manso, 2013) and a risk game (Crosetto and Filippin, 2013). In the innovation game, participants had to innovate – over seven rounds – to increase the sales of a hypothetical lemonade stand by deciding on the values of six parameters (after having received advice from the previous stand manager):

Location of the stand (school, business, stadium).
Lemonade color (green, pink).
Sugar content (scale from 0 to 10).
Lemon concentration (scale from 0 to 10).
Price (scale from 0 to 10).
An advertising message or slogan with a maximum of 20 words.

After each round, participants were shown the profit they made using the parameter values they had chosen and were asked whether they wanted to change or keep their parameter choices for the next round. Payoffs were calculated according to the optimal parameters set by Ederer and Manso (2013), and any deviation from the optimal parameter values (unknown to the participants) was penalized. We added a further innovation task to the design that asked participants to write an advertising message or slogan. This additional task was not included in the payoff structure but was used to further assess innovation, as shown below. After the conclusion of the sessions, four external annotators were tasked with rating the advertising messages on a scale of creativity from 0 to 10. The annotators were unaware of the research question or the experimental design to ensure coding impartiality.

To measure effort levels, participants were given a sheet of outlined paper at the beginning of each session to record their strategies for reference throughout the rounds. Recording previous choices should have been essential for supposedly efficient participants to avoid repeating their choices in future rounds (given the many decisions they had to make in each round). Filling in the cells was voluntary, and participants were not informed that their decisions and profits would be recorded and measured. Instead, they were told that the sheet was for them if they wanted to track their choices and profits. These sheets were collected after the session concluded. We used the number of cells filled out in the sheets as our measure of participants’ effort levels.

To measure risk behavior, participants completed a dynamic version of Crosetto and Filippin’s (2013) “Bomb Risk Elicitation Task.” The task is a game that presents participants with 100 closed “boxes,” 99 of which contain a small monetary reward (the amount in each of these 99 boxes is the same). However, one of the 100 boxes contains a bomb that would wipe out all the monetary gains achieved thus far if it were opened. Once the game starts, a box is collected every second until the participant clicks “stop” or all 100 boxes have been collected. The underlying assumption is that risk-tolerant individuals will collect more boxes while risk-averse individuals will collect fewer (for previous papers that applied the same game in Egypt, see Brooke and Hassan, 2022). See the Online Supplementary Material for the experimental script.
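The incentive structure of the task can be made concrete with a short calculation. The sketch below is illustrative only: it assumes a hypothetical reward of one unit per box (the amount is not reported here) and a bomb placed uniformly at random among the 100 boxes, so that a participant who stops after k boxes keeps the k rewards with probability (100 - k)/100.

```python
# Expected earnings in a 100-box bomb task (illustrative sketch).
# REWARD_PER_BOX is a hypothetical value; the text does not report the amount.
N_BOXES = 100
REWARD_PER_BOX = 1.0


def expected_earnings(k: int) -> float:
    """Expected payoff from collecting k boxes when one box hides the bomb."""
    p_safe = (N_BOXES - k) / N_BOXES  # bomb is not among the k collected boxes
    return k * REWARD_PER_BOX * p_safe


# Stopping point that maximizes expected earnings (the risk-neutral benchmark).
best_k = max(range(N_BOXES + 1), key=expected_earnings)
print(best_k, expected_earnings(best_k))
```

Under these assumptions, the benchmark gives a reference point for reading the measure: stopping earlier than the benchmark points toward risk aversion, and stopping later points toward risk tolerance.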

Results

Table 1 shows a summary of the characteristics of the participants in the control and treatment groups. A total of 94 participants were initially recruited for the experiment. However, one participant was discarded due to missing information during payouts. The majority of the participants were female – which is common in social science departments in Egypt (e.g. Amin et al., 2018; Haas et al., 2021; AbdelNabi et al., 2022; Hassan et al., 2023). The average age was 21, and approximately 92% of the participants were Muslim. Balance tests conducted to compare the demographic composition of the control and treatment groups revealed no randomization failures (t-test for age and financial status, chi-square tests for gender, religion, and residence; p-value > 0.05). The experimental script and screenshots can be found in the Online Supplementary Material.

Table 1

Subject demographics

	Non-ChatGPT	ChatGPT	Total
Gender:
Male (%)	11 (25.5%)	8 (16%)	20.5%
Female (%)	32 (74.5%)	42 (84%)	79.5%
Age (SD)	21.1 (1)	21 (0.74)	21 (0.9)
Household Income (SD)	2.42 (0.62)	2.54 (0.61)	2.5 (0.6)
Residence:
Urban (%)	43 (100%)	48 (96%)	91 (97.8%)
Rural (%)	0 (0%)	2 (4%)	2 (2.2%)
Religion:
Muslim (%)	37 (86%)	49 (98%)	86 (92.5%)
Christian (%)	6 (14%)	1 (2%)	7 (7.5%)
Observations	43 (100%)	50 (100%)	93 (100%)
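As an illustration of the kind of balance check described above, the sketch below runs a Pearson chi-square test on the gender counts reported in Table 1. It is a minimal standard-library version without a continuity correction, which may differ from the settings of the software the authors used, so the statistic should be read as indicative rather than as a reproduction of their result.

```python
import math

# Gender counts from Table 1: rows are groups, columns are (male, female).
observed = [[11, 32],   # Non-ChatGPT
            [8, 42]]    # ChatGPT


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi_square_2x2(observed)
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")
```

A p-value above 0.05 here means the test does not reject equal gender composition across the two groups.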

In the following, we investigate whether the use of ChatGPT for four weeks before participation in the lab-based games influenced (1) innovative behavior in the lemonade stand game, (2) the level of effort calculated from the sheets used to record strategies, and (3) risk behavior inferred from the bomb risk elicitation task. Data from all sessions were compiled and cleaned before analysis. Initial outlier detection was conducted using the Interquartile Range (IQR) and Median Absolute Deviation (MAD) methods. These non-parametric approaches are more robust to non-normal distributions than z-score-based methods, which assume normality, especially in small-to-moderate sample sizes (Leys et al., 2013). Two anomalous cases were identified and excluded because both methods flagged them ^[3].
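The two-method screening rule can be sketched as follows. The cut-offs used here (1.5 times the IQR and 2.5 scaled MADs) are common defaults and are assumptions, since the thresholds applied in the study are not stated; the data are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Indices of values outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < low or v > high}


def mad_outliers(values, threshold=2.5):
    """Indices of values more than `threshold` scaled MADs from the median."""
    med = statistics.median(values)
    # 1.4826 makes the MAD comparable to a standard deviation under normality.
    mad = 1.4826 * statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - med) / mad > threshold}


# Hypothetical scores; only cases flagged by BOTH methods are excluded.
scores = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 6.0]
flagged = sorted(iqr_outliers(scores) & mad_outliers(scores))
print(flagged)
```

Requiring agreement between the two methods is the conservative choice: a case is dropped only when it is extreme relative to both the quartiles and the median.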

First: innovative behavior based on the lemonade stand game

To measure the level of innovativeness from the lemonade stand game, we began by examining whether there was a significant treatment effect on any of the three continuous decision parameters (sugar content, lemon concentration, and price) that participants had to choose in each round. We focused on these three parameters because, following Ederer and Manso (2013), we measured innovative behavior by the subject-specific standard deviation of choices. Although this is an exploratory first check – because, theoretically, we do not expect one parameter to be more conducive to innovation than the others – it is still an essential first step to examine the parameters separately. Figure 1 illustrates a significant difference in innovative behavior (measured by subject-specific standard deviation) in the decision regarding lemon concentration (Mann-Whitney p-value = 0.036). However, there is no significant difference in sugar content or price when comparing the treatment to the control (Mann-Whitney p-value > 0.1).
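For readers unfamiliar with the test used in these comparisons, the following is a minimal sketch of a two-sided Mann-Whitney U test. It relies on the normal approximation without tie or continuity corrections, which is an assumption of this sketch; statistical packages often use exact or corrected versions, so their p-values can differ, particularly in small samples. The input values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    # Assign average ranks to tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Normal approximation without tie or continuity correction.
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Hypothetical subject-specific standard deviations for two small groups.
treated = [0.4, 0.5, 0.6, 0.7]
control = [0.8, 0.9, 1.0, 1.2]
u_stat, p = mann_whitney_u(treated, control)
print(u_stat, round(p, 3))
```

Because the test compares ranks rather than raw values, it does not require the standard deviations to be normally distributed.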

Figure 1

Three vertically stacked error-bar graphs compare the subject-specific standard deviations of the ChatGPT and Non-ChatGPT groups; each vertical axis ranges from 0.00 to 2.00. Panel (a), lemon concentration: the ChatGPT group has a mean of approximately 0.59 (error range 0.47 to 0.71) and the Non-ChatGPT group a higher mean of approximately 0.97 (error range 0.69 to 1.26); a Mann–Whitney U test indicates a statistically significant difference (p < 0.05). Panel (b), sugar concentration: the ChatGPT group has a mean of approximately 0.95 (error range 0.73 to 1.18) and the Non-ChatGPT group a mean of approximately 1.15 (error range 0.92 to 1.37); no statistically significant difference (p > 0.1). Panel (c), price: the ChatGPT group has a mean of approximately 0.91 (error range 0.72 to 1.11) and the Non-ChatGPT group a mean of approximately 1.07 (error range 0.85 to 1.28); no statistically significant difference (p > 0.1). All numerical values are approximate.

Average innovation scores for continuous parameters. Notes: Bars represent mean values; error bars denote 95% confidence intervals. Mann–Whitney U test indicated a statistically significant difference between groups regarding lemon concentration (p-value < 0.05)

Next, we measure innovativeness by examining the propensity of participants to explore different combinations of the three parameters (sugar content, lemon concentration, and price), thereby capturing the degree of novelty in the sales strategy. The innovation metric was calculated as the average, subject-specific standard deviation of strategy choices for the three continuous variables (sugar content, lemon content, and price). Figure 2 shows the average level of innovativeness among the ChatGPT and Non-ChatGPT groups. The Non-ChatGPT group exhibited a significantly higher level of innovativeness compared to the ChatGPT group (Mann-Whitney U test, p-value = 0.037), confirming our hypothesis H1.
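The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the sample standard deviation and the unweighted mean across the three parameters are assumptions (the text does not specify either), and the p-value uses a plain normal approximation without tie or continuity corrections, so it will not exactly match the output of statistical packages.

```python
import math
from statistics import mean, stdev


def subject_innovation(rounds):
    """Mean of the per-parameter standard deviations for one subject.

    rounds: list of (sugar, lemon, price) tuples, one tuple per round.
    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    per_parameter_sd = [stdev(values) for values in zip(*rounds)]
    return mean(per_parameter_sd)


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value).

    Uses midranks for tied values and the normal approximation
    (no tie correction, no continuity correction).
    """
    pooled = sorted(list(x) + list(y))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(ranks[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mu) / sigma
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical choices for one subject over three rounds: (sugar, lemon, price).
example_rounds = [(5.0, 3.0, 7.5), (5.5, 3.0, 8.0), (6.0, 4.0, 8.0)]
print(round(subject_innovation(example_rounds), 3))
```

A subject who never changes strategy receives a score of zero under this definition, which is why the measure captures exploration rather than the quality of the chosen strategy.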

Figure 2

The figure shows a single error-bar graph comparing two groups labeled “ChatGPT” and “Non-ChatGPT” on the horizontal axis. The vertical axis is labeled “Subject-Specific Standard Deviations” and ranges from 0.0 to 4.0 in increments of 0.5. The ChatGPT group has a mean subject-specific standard deviation of approximately 2.48, with an error range from about 2.12 to 2.83. The Non-ChatGPT group has a higher mean of approximately 3.19, with an error range from about 2.64 to 3.75. All numerical values are approximate.

Average aggregate level of innovativeness for ChatGPT and Non-ChatGPT groups. Notes: Bars represent mean values; error bars denote 95% confidence intervals. Mann–Whitney U test indicated a statistically significant difference between groups (p-value < 0.05)

Table 2 presents the results of regression analyses on two models with the level of innovativeness as the dependent variable after controlling for potential confounders. The coefficient for ChatGPT is negative and statistically significant across the two models (at the 95% confidence level in model 1 and the 90% confidence level in model 2), indicating that the level of innovation for participants in the ChatGPT group is significantly lower than the Non-ChatGPT group, after controlling for other variables. The negative values suggest that ChatGPT usage is associated with a decrease in innovation scores by approximately 0.6–0.72 standard deviation points, depending on the model. Additionally, living in an urban area seems to have a positive and significant effect on innovation in the second model. Moreover, household income is negatively associated with innovation and is statistically significant, albeit at the 0.1 level, suggesting a moderate confidence level that higher income may reduce innovation scores. This observation could be because participants with higher incomes might not have valued the need to innovate to increase their payoffs as much as those on lower incomes.

Table 2

Regression analysis for level of innovativeness in lemonade stand game

	Dependent variable: Exploration of sales strategies
	(1)	(2)
ChatGPT	−0.719**	−0.604*
	(0.326)	(0.328)
Female		−0.320
		(0.462)
Age		−0.030
		(0.173)
Urban		0.673**
		(0.322)
Financials		−0.510*
		(0.273)
Constant	3.167***	4.081
	(0.271)	(3.529)
Observations	91	91

Note(s): Results are clustered by Subject ID, ***p < 0.01, **p < 0.05, *p < 0.1

The second innovation test we conducted involved coding the advertising messages that participants were asked to write in each round along an innovation scale ranging from 1 to 10. This task yielded 265 unique messages (as some participants chose not to change their messages in some rounds). Four coders coded the results after the experiment. We calculated intraclass correlation coefficients (ICCs) among the four annotators to assess the reliability of coding. While the individual-level agreement was fair (ICC(2,1) = 0.228), the average agreement across raters (ICC(2,k) = 0.506) reflected moderate reliability. Such levels are common in research assessing creativity, where subjective evaluations often yield only fair-to-moderate consistency (Hennessey and Amabile, 2010). To mitigate the limitation of only fair individual-level agreement, we relied on the average creativity score across raters, which provides a relatively more stable and representative outcome measure.
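For readers unfamiliar with the two coefficients, the sketch below computes ICC(2,1) and ICC(2,k) from a complete messages-by-raters matrix using the standard two-way random-effects, absolute-agreement mean squares. It is an illustration under stated assumptions rather than the authors' procedure: the function name and the ratings are hypothetical, and the code assumes every rater scored every message (no missing values).

```python
from statistics import mean


def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete ratings matrix.

    ratings: list of rows, one row per message, one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(col) for col in zip(*ratings)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Hypothetical scores: five messages rated by four coders on a 1-10 scale.
example = [
    [4, 5, 3, 6],
    [7, 6, 5, 8],
    [2, 4, 3, 3],
    [5, 5, 6, 7],
    [3, 2, 4, 5],
]
icc_single, icc_average = icc2(example)
print(round(icc_single, 3), round(icc_average, 3))
```

The averaged coefficient is never smaller in magnitude than the single-rater one when agreement is positive, which is the rationale for analysing the mean score across raters.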

Figure 3 shows no significant difference in the creativity of these messages between the treatment and control groups (Mann-Whitney U test, p-value > 0.05), which does not support H1. We also ran other tests tracing innovation that produced insignificant results. Additional results can be found in the Online Supplementary Material.

Figure 3

The figure shows a single error-bar graph comparing two groups labeled “ChatGPT” and “Non-ChatGPT” on the horizontal axis. The vertical axis is labeled “Message Creativity” and ranges from 3 to 6 in increments of 1. The ChatGPT group has a mean creativity score of approximately 4.19, with an error range from about 3.92 to 4.45. The Non-ChatGPT group has a mean of approximately 4.09, with an error range from about 3.75 to 4.39. The error bars for the two groups overlap, indicating similar levels of message creativity. All numerical values are approximate.

Creativity scores of advertising messages between ChatGPT and Non-ChatGPT groups. Notes: Bars represent mean values; error bars denote 95% confidence intervals. Mann–Whitney U test indicated no significant difference between groups (p-value > 0.05)

Second: risk behavior from the bomb risk elicitation task

To calculate risk behaviors from the Bomb Risk Elicitation Task, the propensity for risk-taking among participants was determined by the number of boxes they opened before pressing the stop button. A larger number of boxes opened reflects a higher inclination towards risk-taking, as the probability of encountering a bomb that would destroy all earnings increases. A participant was considered risk-loving if s/he opened more than one-third of the boxes and risk-averse otherwise. Figure 4 illustrates the proportion of risk lovers in the Non-ChatGPT group and the ChatGPT group. The proportion of risk lovers in the ChatGPT group was higher than that of the Non-ChatGPT group, as suggested by hypothesis H2. Moreover, this observed difference is statistically significant at the 90% confidence level (Mann-Whitney U test, p-value = 0.079).
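The classification rule and a group comparison of the resulting proportions can be sketched as follows. This is a hedged illustration only: the function names, the total number of boxes and the counts in the example table are hypothetical, and the two-sided Fisher exact test is implemented directly from the hypergeometric distribution as one reasonable way to compare two proportions.

```python
from math import comb


def is_risk_lover(boxes_opened, total_boxes):
    """Apply the one-third rule: risk-loving if more than 1/3 of boxes were opened."""
    return boxes_opened > total_boxes / 3


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: rows are groups, columns are (risk lovers, others).
p_value = fisher_exact_two_sided(12, 8, 7, 13)
print(round(p_value, 3))
```

Because the outcome is binary once the one-third threshold is applied, a test on proportions such as this one is a natural companion to the rank-based test reported in the text.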

Figure 4

The figure shows a single error-bar graph comparing two groups labeled “ChatGPT” and “Non-ChatGPT” on the horizontal axis. The vertical axis is labeled “Proportion of Risk Lover” and ranges from 0 to 1.0 in increments of 0.25. The ChatGPT group has an average proportion of risk-loving participants of approximately 0.61, with an error range from about 0.46 to 0.75. The Non-ChatGPT group has a lower average proportion of approximately 0.42, with an error range from about 0.27 to 0.57. All numerical values are approximate.

Proportion of risk lovers among ChatGPT and Non-ChatGPT groups. Notes: Bars represent mean values; error bars denote 95% confidence intervals. Fisher exact test indicated no significant difference at 0.05 significance level (p-value = 0.09)

Table 3 presents regression analyses investigating the likelihood of risk-loving behavior. Across both models, the coefficients for the ChatGPT group are positive and statistically significant at the 90% confidence level, indicating that individuals in the ChatGPT group are more likely to be risk lovers than those in the Non-ChatGPT group (which is in line with hypothesis H2). In the second model, the coefficient for being female is negative and statistically significant at the 0.1 level, indicating a lower likelihood of being a risk-loving person.

Table 3

Regression analysis for risk loving behavior in the bomb risk elicitation task

	Dependent variable: Risk lover
	(1)	(2)
ChatGPT	0.751*	0.818*
	(0.427)	(0.449)
Female		−1.135*
		(0.631)
Age		−0.002
		(0.252)
Urban		0.236
		(1.484)
Household Income		0.639
		(0.400)
Constant	−0.329	−0.608
	(0.309)	(5.386)
Observations	91	91
Akaike Inf. Crit.	126.909	130.316

Note(s): ***p < 0.01, **p < 0.05, *p < 0.1
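To make the coefficients in Table 3 easier to interpret, the following sketch converts the model (1) estimates into predicted probabilities and an odds ratio. It assumes that Table 3 reports logit coefficients; the text does not name the link function, so this is an inference from the reported Akaike criterion and from the close match between the resulting probabilities and the group proportions shown in Figure 4. Variable names are hypothetical.

```python
import math


def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients from Table 3, model (1). The logit link is an assumption.
CONSTANT = -0.329
CHATGPT = 0.751

p_non_chatgpt = logistic(CONSTANT)          # predicted share of risk lovers, control
p_chatgpt = logistic(CONSTANT + CHATGPT)    # predicted share of risk lovers, treatment
odds_ratio = math.exp(CHATGPT)              # odds of risk-loving, treatment vs control

print(round(p_non_chatgpt, 2), round(p_chatgpt, 2), round(odds_ratio, 2))
# prints: 0.42 0.6 2.12
```

Under that assumption, the implied proportions of roughly 0.42 and 0.60 line up with the approximate values of 0.42 and 0.61 in Figure 4, and the odds of being classified as risk-loving are about twice as high in the ChatGPT group.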

Third: level of effort as elicited from the sheet for recording strategies

To assess the level of effort participants exerted (measured by their recording of the sales strategies employed at each of the seven rounds), we calculated the number of cells filled out by each participant at each period (minimum = 0 and maximum = 8). As mentioned above, filling in the cells was voluntary, and participants were not informed that their decisions and profits would be recorded and measured ^[4].

Figure 5 displays the average effort exerted by the ChatGPT and Non-ChatGPT groups. The results show that the average effort exerted by participants in the Non-ChatGPT group is slightly higher than that exerted by the ChatGPT group, which aligns with our hypothesis H3. However, this difference is statistically insignificant (Mann-Whitney U test, p-value > 0.05).

Figure 5

The figure shows a single error-bar graph comparing two groups labeled “ChatGPT” and “Non-ChatGPT” on the horizontal axis. The vertical axis is labeled “Effort” and ranges from 4 to 6 in increments of 1. The ChatGPT group has a mean effort score of approximately 5.28, with an error range from about 4.92 to 5.61. The Non-ChatGPT group has a slightly higher mean of approximately 5.45, with an error range from about 5.11 to 5.82. The error bars for the two groups overlap, indicating similar levels of effort. All numerical values are approximate.

Average level of effort for ChatGPT and Non-ChatGPT groups. Notes: Bars represent mean values; error bars denote 95% confidence intervals. Mann–Whitney U test indicated no significant difference between groups (p-value > 0.05)

Table 4 presents the results of regression analyses on three models with the level of effort as the dependent variable. Across all models, the coefficient for the ChatGPT treatment is negative (as expected by the hypothesis), suggesting that participants using ChatGPT exerted slightly less effort than the Non-ChatGPT group. However, these results are not statistically significant in any of the models. The coefficient for the period is negative and statistically significant in models 2 and 3 at the 0.01 significance level, implying that as the game progressed, participants tended to exert less effort, possibly due to fatigue or diminishing interest. In model 3, residing in an urban area is associated with a significantly higher level of effort, suggesting that urban participants were more engaged or able to exert more effort in the game.

Table 4

Regression analysis for level of effort exerted in lemonade stand game

	Dependent variable: Effort
	(1)	(2)	(3)
ChatGPT	−0.184	−0.184	−0.035
	(0.615)	(0.615)	(0.611)
Female			−0.206
			(0.852)
Age			−0.411
			(0.392)
Urban			4.285***
			(0.681)
Household Income			−0.187
			(0.546)
Period		−0.216***	−0.216***
		(0.041)	(0.041)
Constant	5.449***	6.313***	11.106
	(0.449)	(0.432)	(7.963)
Observations	637	637	637

Note(s): Results are clustered by Subject ID, ***p < 0.01, **p < 0.05, *p < 0.1

Putting together the details of the full picture, our results appear nuanced. Exposure to ChatGPT over a few weeks, while completing assignments, was shown to significantly decrease innovation, as measured by our lemonade stand game. Moreover, the use of ChatGPT had a positive impact on risk-taking behavior compared to the control, and this effect was statistically significant. On the other hand, ChatGPT also had a negative impact on the level of effort exerted in a task, albeit not statistically significant. However, the fact that one of our two statistically significant findings is at the 90% confidence level highlights the need for further testing.

Discussion and conclusion

The fast spread of ChatGPT has changed the nature of human-computer interaction, providing users with sophisticated conversational interfaces that respond intelligently to various prompts and queries. In this paper, we examined the effect of continuous reliance on such a technological breakthrough, when doing assignments, on undergraduate students’ innovative behavior, risk attitudes, and readiness to exert effort. We found that ChatGPT use was associated with lower innovation and greater risk-taking, although in some cases significance levels were only marginal (significant at the 90% confidence level), while the effect on effort was not statistically significant. These nuances suggest that our results should be interpreted as preliminary evidence, warranting replication with larger samples.

Our findings on the impact of ChatGPT use on innovation, risk behavior, and effort exertion among university students provide valuable insights when considered within the broader context of studies on technology’s role in education, economics, and behavioral research. The observed decrease in innovative behavior is statistically significant at the 95% confidence level and aligns with recent studies that highlight the potential negative effects of excessive digital tool reliance on creativity and independent problem-solving skills. For example, Agarwal et al. (2023) emphasized that the reliance on AI-driven decision-making could crowd out human innovation, ultimately reducing individuals’ capacity for creative problem-solving.

Our results also suggest that the use of ChatGPT was associated with greater risk tolerance among students. Recent experimental studies by Sabour et al. (2025) found that A.I. interaction can decrease emotional and social inhibitions, thus increasing risk tolerance among users. These findings also align with the work of Salatino et al. (2025), who demonstrated that individuals exhibit heightened risk-taking behaviors when operating in digitally mediated, impersonal environments, such as those facilitated by A.I. systems. However, this effect reached significance only at the 90% confidence level. As such, it should also be regarded as tentative evidence.

Regarding effort exertion, our findings, although statistically insignificant, complement recent research indicating that prolonged reliance on assistive technologies may lead to complacency and reduced effort. The literature review done by Balalle (2024) demonstrates that digital tools designed to reduce effort could inadvertently encourage minimal engagement, negatively impacting deeper learning outcomes. Similarly, Nikolova et al. (2024) document that extensive automation reduces intrinsic motivation, effort, and productivity in workplace contexts. Hence, our findings support and extend recent literature, highlighting concerns about A.I. tools in the educational realm.

We also use this section to discuss some caveats, possible limitations of our study, and avenues for future research. On caveats, despite the paper’s focus and its pre-registered hypothesis on possible negative effects of ChatGPT, it is still important to highlight that A.I. applications and AITGs will likely have significant advantages. For example, removing social cues from the decision-making process has led to decisions that are less discriminatory than human decisions (e.g. Hoffman et al., 2018). A.I. also has considerable potential in dealing with data that is too large for humans to handle effectively. Depending on usage techniques, even ChatGPT itself could contribute to innovation development if it succeeds, by providing a vast amount of valuable information, in fostering a culture of curiosity and continuous exploration (see Romero-Rodríguez et al., 2023) among its users. This study, therefore, does not deny the multiple advantages that A.I. and AITGs can generate.

Regarding limitations, the findings of our study should be interpreted in light of potential design restrictions. First, our sample size was constrained by the number of students enrolled in the two university courses available for the experiment. Although post-hoc power analyses indicate that the study was adequately powered to detect moderate-to-large effects on innovation and risk-taking, it was underpowered to detect smaller effects, particularly for the effort outcome. Second, to avoid contamination and unintended peer spill-over between treatment and control groups, we randomized at the course level rather than at the individual level. While this helped prevent spill-over effects and protected internal validity, it limited our ability to implement finer-grained randomization strategies. Future studies could thus implement additional safeguards, such as stricter enrollment screening that covers all course subjects in which students are enrolled. Third, we did not include academic performance metrics (e.g. GPA) as covariates or subgrouping variables to preserve participant anonymity, as collecting such data would have required linking responses to identifiable student records. Fourth, our second innovation measure, based on creativity scores of advertising messages, showed only moderate inter-rater reliability (ICC(2,k) = 0.506). While this level of agreement is typical for creativity assessments due to their subjective nature (Hennessey and Amabile, 2010), it nonetheless limits confidence in the robustness of this outcome. Future studies could enhance measurement reliability by employing larger and more diverse panels of raters, adopting more standardized coding rubrics, or complementing human evaluations with sentiment or computational text analysis methods (Amin et al., 2025). Fifth, our measure of effort should be interpreted as an indirect proxy for task engagement. While this method is common in experimental research practices (Charness et al., 2018; Ederer and Manso, 2013), it only captures one observable dimension of effort and may underestimate unrecorded or cognitive forms of effort. Future research could address this limitation by combining such indirect markers with complementary measures, including time-on-task or validated self-report effort scales (Bönte et al., 2017; Kool et al., 2010). Finally, given that this is, to the best of our knowledge, one of the first experimental studies to assess the behavioral impact of AITGs in an educational context, there are no existing benchmarks for effect sizes. Therefore, further testing of our preliminary findings, whilst improving on these possible limitations, is essential.

On avenues for future research, although field experiments will always have the advantage of high ecological validity, testing the effect of AITGs on behavior could also benefit from the controlled environment of lab experiments. In such designs, spill-over worries would be minimal and internal validity would be high. To address the issue of external validity, however, new experimental designs could test the generalizability of the findings: longitudinal studies that trace the effect of the technology over time, and studies that expand the participant pool across multiple institutions that vary in type (public/private), age of students (schools/universities), academic background and major. Future work could also refine the measures used here: effort was captured through an indirect proxy that may underestimate cognitive engagement, and innovation was partly assessed using subjective creativity ratings with only moderate inter-rater reliability. Replicating these findings with larger and more diverse samples, validated effort measures (e.g. time-on-task or effort scales), and multi-dimensional assessments of innovation would help strengthen the robustness of findings.

It is also important to note that we only tracked one dimension of innovation: innovation related to problem-solving. There are, however, many other dimensions of innovation. One such dimension, for example, is the active reflection that individuals occasionally engage in (Bieser, 2022). Such aspects of creativity and innovation require a free mindset that does not explicitly think about a specific problem but wanders around and accumulates ideas that could be used later (Dyer et al., 2011).

On practical and policy implications – and in line with economic theories on innovation and economic growth and development (Schniter et al., 2020; Romero-Rodríguez et al., 2023; Acemoglu and Robinson, 2013) – our modest findings point to an urgent need to augment existing education with concrete, innovation-based practices. These could include embedding design thinking and problem-based learning modules directly into curricula, as well as choice architecture that alters people’s behavior without restricting options. In addition, designing student innovation competitions and startup incubators that reward novelty and impact would incentivize the value of innovation. Generally speaking, innovation thrives when students are encouraged to experiment, fail safely, and connect ideas across disciplines (for a systematic literature review, see Peláez-Sánchez et al., 2024). This requires the education system to reward creativity, not memorization or recall. To be more specific, the learning environment should be changed to embrace “safe-to-fail” norms, adopt “try and reflect” learning models and apply “open-ended” models. As for curriculum design, more emphasis should be given to cross-disciplinary projects and real-world problems. Regarding assessment, more weight should be given to student collaboration, autonomy and interaction (Yang et al., 2025; Holmes, 2023). That said, for the previously mentioned initiatives to be implemented efficiently, teachers should be offered professional development on inquiry-based learning, pedagogy and coaching, as well as on human oversight to ensure the quality and accuracy of AI-generated content (Su and Yang, 2023; Wang and Fan, 2025). Moreover, and as confirmed by focus group discussions in Hassan et al. (2023), students feel they lack access to role models. Therefore, entrepreneurs should also be invited to give seminars and workshops at schools and universities and, if possible, for the public. These ideas align with the views of endogenous growth economists, who advocate for government and private sector institutions to nurture innovation initiatives and offer incentives for individuals and businesses to be more creative, such as R&D funding and intellectual property rights.

Finally, it should be made clear that AI, and particularly AITGs, present both significant opportunities and notable risks (Shen et al., 2023). To ensure that the benefits outweigh the harms, clear regulation and international standards are essential. These should promote transparency, accountability, equity and ethical use while preventing misuse and uneven power dynamics between countries, corporations and individuals (Zhou et al., 2024).

This work was sponsored by the Economic Research Forum (ERF) and has benefited from both financial and intellectual support. The content and recommendations do not necessarily reflect ERF’s views. Sarah Mansour thanks the Center for Interdisciplinary Research “ZiF” at Bielefeld University, Germany, where an earlier version of this paper was presented. The authors would like to thank Dr. Heba Medhat Zaki and Dr. Mayada Aref for giving us access to their courses to apply both the field and lab interventions on their students. The authors would also like to thank the IT team at the experimental lab of the Faculty of Economics and Political Science, Cairo University and Nourhan Abdelhamid Elsheikh for their dedication throughout the experimental sessions. We thank Mohamed Nagdy, Malak Ezzat, Reem Zakaria and Mariam ElKashef for assistance with the coding task.

Notes

1. The pre-registration process included information on the main research question, key hypotheses of the study, a description of dependent variables and how they will be measured, the number of conditions subjects will be assigned to and the type of analyses to be conducted. An anonymized copy of the pre-registration, created by the authors to use during peer-review, can be found in the Online Supplementary Material.

2. A detailed post-hoc power analysis for the three main outcome variables is provided in the Online Supplementary Material: https://osf.io/2pr3j/?view_only=e62f25ff509f4292b0966d08555abb64

3. Upon closer inspection, the two cases started at optimal profit strategies and did not change any of their strategies during any of the experimental periods, suggesting non-compliance that poses a threat to internal validity. Analyses were run with and without the two cases; the direction of effects was consistent, but standard errors were inflated.

4. A screenshot of the sheet given to the subject at the beginning of the session, if he/she wanted to record his/her choices and profits over the rounds, can be found in the Online Supplementary Material.

The supplementary material for this article can be found online: https://osf.io/2pr3j/?view_only=e62f25ff509f4292b0966d08555abb64

References

AbdelNabi, M., Wanas, K. and Mansour, S. (2022), “How can tax compliance be incentivized? An experimental examination of voice and empathy”, Review of Economics and Political Science, Vol. 7 No. 2, pp. 87-107, March 2022, doi: https://doi.org/10.1108/REPS-05-2021-0053.

Agarwal, N., Moehring, A., Rajpurkar, P. and Salz, T. (2023), “Combining human expertise with artificial intelligence: experimental evidence from radiology”, NBER Working Paper 31422, doi: https://doi.org/10.3386/w31422.

Ahmad, S.F., Han, H., Alam, M.M., Rehmat, M.K., Irshad, M., Arraño-Muñoz, M. and Ariza-Montes, A. (2023), “Impact of artificial intelligence on human loss in decision making, laziness and safety in education”, Humanities and Social Sciences Communications, Vol. 10 No. 1, 311, doi: https://doi.org/10.1057/s41599-023-01787-8.

Alberdi, E., Povyakalo, A., Strigini, L. and Ayton, P. (2004), “Effects of incorrect computer-aided detection (CAD) output on human decision-making in mammography”, Academic Radiology, Vol. 11 No. 8, pp. 909-918, doi: https://doi.org/10.1016/j.acra.2004.05.012.

Altmann, E.M., Trafton, J.G. and Hambrick, D.Z. (2014), “Momentary interruptions can derail the train of thought”, Journal of Experimental Psychology: General, Vol. 143 No. 1, pp. 215-226, doi: https://doi.org/10.1037/a0030986.

Amabile, T.M. (2019), “Creativity, artificial intelligence, and a world of surprises”, Academy of Management Discoveries, Vol. 6 No. 3, pp. 351-354, doi: https://doi.org/10.5465/amd.2019.0075.

Amin, E., Abouelela, M. and Soliman, A. (2018), “The role of heterogeneity and the dynamics of voluntary contributions to public goods: an experimental and agent-based simulation analysis”, The Journal of Artificial Societies and Social Simulation, Vol. 21 No. 1, 3, doi: https://doi.org/10.18564/jasss.3585.

Amin, E., Hassan, M. and Mansour, S. (2025), “Producing a sentiment lexicon for colloquial Egyptian on social media: methodology and findings”, British Journal of Middle Eastern Studies, pp. 1-18, doi: https://doi.org/10.1080/13530194.2025.2524817.

Anderson, L.R. and Mellor, J.M. (2009), “Are risk preferences stable? Comparing an experimental measure with a validated survey-based measure”, Journal of Risk and Uncertainty, Vol. 39 No. 2, pp. 137-160, doi: https://doi.org/10.1007/s11166-009-9075-z.

Arrow, K.J. (1969), “Classificatory notes on the production and transmission of technological knowledge”, The American Economic Review, Vol. 59 No. 2, pp. 29-35.

Balalle, H. (2024), “Exploring student engagement in technology-based education in relation to gamification, online/distance learning, and other factors: a systematic literature review”, Social Sciences and Humanities Open, Vol. 9, 100870, doi: https://doi.org/10.1016/j.ssaho.2024.100870.

Baska, M. (2018), “Artificial intelligence could give workers back two weeks a year”, People Management, available at: https://www.peoplemanagement.co.uk/article/1744734/artificial-intelligence-give-workers-back-two-weeks-year

Bieser, J. (2022), “Creative through A.I. – how artificial intelligence can support the development of new ideas”, Gottlieb Duttweiler Institute Research Paper No. Forthcoming, doi: https://doi.org/10.59986/CCHA2271.

Bloom, H.S. (2005), “Randomizing groups to evaluate place-based programs”, in Learning More from Social Experiments: Evolving Analytic Approaches, Russell Sage Foundation, New York, NY, US.

Bloom, N., Jones, C.I., Van Reenen, J. and Webb, M. (2020), “Are ideas getting harder to find?”, The American Economic Review, Vol. 110 No. 4, pp. 1104-1144, doi: https://doi.org/10.1257/aer.20180338.

Bönte, W., Lombardo, S. and Urbig, D. (2017), “Economics meets psychology: experimental and self-reported measures of individual competitiveness”, Personality and Individual Differences, Vol. 116, pp. 179-185, doi: https://doi.org/10.1016/j.paid.2017.04.036.

Brooke, S. and Hassan, M. (2022), “Does learning about protest abroad inform individuals’ attitudes about protest at home? Experimental evidence from Egypt”, Government and Opposition, Vol. 57 No. 3, pp. 428-445, doi: https://doi.org/10.1017/gov.2021.16.

Bruhn, M. and McKenzie, D. (2009), “In pursuit of balance: randomization in practice in development field experiments”, American Economic Journal: Applied Economics, Vol. 1 No. 4, pp. 200-232, doi: https://doi.org/10.1257/app.1.4.200.

Charness, G., Gneezy, U. and Henderson, A. (2018), “Experimental methods: measuring effort in economics experiments”, Journal of Economic Behavior and Organization, Vol. 149, pp. 74-87, doi: https://doi.org/10.1016/j.jebo.2018.02.024.

Chen, J., Liang, Y., Mai, C., Zhong, X. and Qu, C. (2016), “General deficit in inhibitory control of excessive smartphone users: evidence from an event-related potential study”, Frontiers in Psychology, Vol. 7, p. 511, doi: https://doi.org/10.3389/fpsyg.2016.00511.

Crosetto, P. and Filippin, A. (2013), “The ‘bomb’ risk elicitation task”, Journal of Risk and Uncertainty, Vol. 47 No. 1, pp. 31-65, doi: https://doi.org/10.1007/s11166-013-9170-z.

D’Acunto, F., Prabhala, N. and Rossi, A.G. (2019), “The promises and pitfalls of robo-advising”, Review of Financial Studies, Vol. 32 No. 5, pp. 1983-2020, doi: https://doi.org/10.1093/rfs/hhz014.

Dyer, J., Gregersen, H. and Christensen, C. (2011), The Innovator’s DNA: Mastering the Five Skills of Disruptive Innovators, Harvard Business Review Press, Boston.

Ederer, F. and Manso, G. (2013), “Is pay for performance detrimental to innovation?”, Management Science, Vol. 59 No. 7, pp. 1496-1513, doi: https://doi.org/10.1287/mnsc.1120.1683.

Farhi, F., Jeljeli, R., Aburezeq, I., Dweikat, F.F., Al-shami, S.A. and Slamene, R. (2023), “Analyzing the students’ views, concerns, and perceived ethics about chat GPT usage”, Computers and Education: Artificial Intelligence, Vol. 5, 100180, doi: https://doi.org/10.1016/j.caeai.2023.100180

.

Google Scholar

Firth

,

J.A.

,

Torous

,

J.

and

Firth

,

J.

(

2020

), “

Exploring the impact of internet use on memory and attention processes

”,

International Journal of Environmental Research and Public Health

, Vol.

17

No.

24

, 9481, doi:

https://doi.org/10.3390/ijerph17249481

.

Google Scholar

Fischetti

,

M.

(

2016

), “

Social technologies are making us less social

”,

Scientific American

,

available at:

https://www.scientificamerican.com/article/social-technologies-are-making-us-less-social/

Google Scholar

Friederichs

,

H.

,

Friederichs

,

W.J.

and

März

,

M.

(

2023

), “

ChatGPT in medical school: how successful is AI in progress testing?

”,

Medical Education Online

, Vol.

28

No.

1

, 2220920, doi:

https://doi.org/10.1080/10872981.2023.2220920

.

Google Scholar

Galletta

,

D.F.

,

Durcikova

,

A.

,

Everard

,

A.

and

Jones

,

B.M.

(

2005

), “

Does spell-checking software need a warning label?

”,

Communications of the ACM

, Vol.

48

No.

7

, pp.

82

-

86

, doi:

https://doi.org/10.1145/1070838.1070841

.

Google Scholar

Gerlich

,

M.

(

2025

), “

AI tools in society: impacts on cognitive offloading and the future of critical thinking

”,

Societies

, Vol.

15

No.

1

, p.

6

, doi:

https://doi.org/10.3390/soc15010006

.

Google Scholar

Haas

,

N.

,

Hassan

,

M.

,

Mansour

,

S.

and

Morton

,

R.B.

(

2021

), “

Polarizing information and support for reform

”,

Journal of Economic Behavior and Organization

, Vol.

185

, pp.

883

-

901

, doi:

https://doi.org/10.1016/j.jebo.2020.10.013

.

Google Scholar

Han

,

Z.

,

Song

,

G.

,

Zhang

,

Y.

and

Li

,

B.

(

2025

), “

Trust the machine or trust yourself: how AI usage reshapes employee self-efficacy and willingness to take risks

”,

Behavioral Sciences

, Vol.

15

No.

8

, 1046, doi:

https://doi.org/10.3390/bs15081046

.

Google Scholar

Hassan

,

M.

,

Amin

,

E.

,

Mansour

,

S.

and

Voigt

,

S.

(

2023

), “

Incentivizing cooperation against a norm of defection: experimental Evidence from Egypt

”,

Journal of Behavioral and Experimental Economics

, Vol.

107

, 102121, doi:

https://doi.org/10.1016/j.socec.2023.102121

.

Google Scholar

Hennessey

,

B.A.

and

Amabile

,

T.M.

(

2010

), “

Creativity

”,

Annual Review of Psychology

, Vol.

61

No.

1

, pp.

569

-

598

,

Annual Reviews, Hennessey, Beth A.: Department of Psychology, Wellesley College, Wellesley, MA, US, 02481, bhenness@wellesley.edu

, doi:

https://doi.org/10.1146/annurev.psych.093008.100416

.

Google Scholar

PubMed

Hoffman

,

M.

,

Kahn

,

L.B.

and

Li

,

D.

(

2018

), “

Discretion in hiring

”,

Quarterly Journal of Economics

, Vol.

133

No.

2

, pp.

765

-

800

, doi:

https://doi.org/10.1093/qje/qjx042

.

Google Scholar

Holmes

,

W.

(

2023

), “

The unintended consequences of artificial intelligence and education (Education International Research report)

”,

Education International

,

available at:

https://www.ei-ie.org/en/item/28115:the-unintended-consequences-of-artificial-intelligence-and-education

Google Scholar

Khalid

,

J.

,

Chuanmin

,

M.

,

Altaf

,

F.

,

Shafqat

,

M.M.

,

Khan

,

S.K.

and

Ashraf

,

M.U.

(

2024

), “

AI-driven risk management and sustainable decision-making: role of perceived environmental responsibility

”,

Sustainability

, Vol.

16

No.

16

, 6799, doi:

https://doi.org/10.3390/su16166799

.

Google Scholar

Kool

,

W.

,

McGuire

,

J.T.

,

Rosen

,

Z.B.

and

Botvinick

,

M.M.

(

2010

), “

Decision making and the avoidance of cognitive demand

”,

Journal of Experimental Psychology: General

, Vol.

139

No.

4

, pp.

665

-

682

, doi:

https://doi.org/10.1037/a0020198

.

Google Scholar

PubMed

Kosmyna

,

N.

,

Hauptmann

,

E.

,

Tong Yuan

,

Y.

,

Situ

,

J.

,

Liao

,

X.-H.

,

Vivian Beresnitzky

,

A.

,

Braunstein

,

I.

and

Maes

,

P.

(

2025

), “

Your brain on ChatGPT: accumulation of cognitive debt when using an AI assistant for essay writing task

”,

arXiv preprint arXiv:2506.08872 4

.

Google Scholar

Lawton-Smith

,

H.

(

2006

),

Universities, Innovation and the Economy

, (1st) ed.,

Routledge

,

London

, doi:

https://doi.org/10.4324/9780203358054

.

Google Scholar

Leys

,

C.

,

Ley

,

C.

,

Klein

,

O.

,

Bernard

,

P.

and

Licata

,

L.

(

2013

), “

Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median

”,

Journal of Experimental Social Psychology

, Vol.

49

No.

4

, pp.

764

-

766

, doi:

https://doi.org/10.1016/j.jesp.2013.03.013

.

Google Scholar

Lo

,

C.K.

(

2023

), “

What is the impact of ChatGPT on education? A rapid review of the literature

”,

Education Sciences

, Vol.

13

No.

4

, p.

410

, doi:

https://doi.org/10.3390/educsci13040410

.

Google Scholar

Manso

,

G.

(

2011

), “

Motivating innovation

”,

The Journal of Finance

, Vol.

66

No.

5

, pp.

1823

-

1860

, doi:

https://doi.org/10.1111/j.1540-6261.2011.01688.x

.

Google Scholar

Melo

,

C.D.

,

Marsella

,

S.

and

Gratch

,

J.

(

2016

), “

People do not feel guilty about exploiting machines

”,

ACM Transactions on Computer-Human Interaction

, Vol.

23

No.

2

, pp.

1

-

17

, doi:

https://doi.org/10.1145/2890495

.

Google Scholar

Nikolova

,

M.

,

Cnossen

,

F.

and

Nikolaev

,

B.

(

2024

), “

Robots, meaning, and self-determination

”,

Research Policy

, Vol.

64

, 104987, doi:

https://doi.org/10.1016/j.respol.2024.104987

.

Google Scholar

Ortiz-Ospina

,

E.

(

2024

), “

Loneliness and social connections

”,

Our World in Data

,

available at:

https://ourworldindata.org/social-connections-and-loneliness

Google Scholar

Pan

,

G.

and

Ni

,

J.

(

2024

), “

A cross sectional investigation of ChatGPT-like large language models application among medical students in China

”,

BMC Medical Education

, Vol.

24

No.

1

, p.

908

, doi:

https://doi.org/10.1186/s12909-024-05871-8

.

Google Scholar

PubMed

Pasquale

,

F.

(

2015

),

The Black Box Society: The Secret Algorithms that Control Money and Information

,

Harvard University Press

,

Cambridge, MA

,

available at:

http://www.jstor.org/stable/j.ctt13x0hch

Google Scholar

Peláez-Sánchez

,

I.C.

,

Velarde-Camaqui

,

D.

and

Glasserman-Morales

,

L.D.

(

2024

), “

The impact of large language models on higher education: exploring the connection between AI and Education 4.0

”,

Frontiers in Education

, Vol.

9

, 1392091, doi:

https://doi.org/10.3389/feduc.2024.1392091

.

Google Scholar

Raudenbush

,

S.W.

(

1997

), “

Statistical analysis and optimal design for cluster randomized trials

”,

Psychological Methods

, Vol.

US

No.

2

, pp.

173

-

185

, doi:

https://doi.org/10.1037/1082-989X.2.2.173

.

Google Scholar

Romero-Rodríguez

,

J.-M.

,

Ramírez-Montoya

,

M.-S.

,

Buenestado-Fernández

,

M.

and

Lara-Lara

,

F.

(

2023

), “

Use of ChatGPT at university as a tool for complex thinking: students’ perceived usefulness

”,

Journal of New Approaches in Educational Research

, Vol.

12

No.

2

, pp.

323

-

339

, doi:

https://doi.org/10.7821/naer.2023.7.1458

.

Google Scholar

Sabour

,

S.

,

Liu

,

J.M.

,

Liu

,

S.

,

Yao

,

C.Z.

,

Cui

,

S.

,

Zhang

,

X.

,

Zhang

,

W.

,

Cao

,

Y.

,

Bhat

,

A.

,

Guan

,

J.

,

Wu

,

W.

,

Mihalcea

,

R.

,

Wang

,

H.

,

Althoff

,

T.

,

Lee

,

T.M.C.

and

Huang

,

M.

(

2025

), “

Human decision-making is susceptible to AI-driven manipulation

”,

(arXiv:2502.07663)

,

arXiv

, doi:

https://doi.org/10.48550/arXiv.2502.07663

.

Google Scholar

Salatino

,

A.

,

Prével

,

A.

,

Caspar

,

E.

and

Lo Bue

,

S.

(

2025

), “

‘Fire! Do not fire!’: investigating the effects of autonomous systems on agency and moral decision-making

”,

Acta Psychologica

, Vol.

260

, 105350, doi:

https://doi.org/10.1016/j.actpsy.2025.105350

.

Google Scholar

Schniter

,

E.

,

Shields

,

T.W.

and

Sznycer

,

D.

(

2020

), “

Trust in humans and robots: economically similar but emotionally different

”,

Journal of Economic Psychology

, Vol.

78

, 102253, doi:

https://doi.org/10.1016/j.joep.2020.102253

.

Google Scholar

Shen

,

Y.

,

Heacock

,

L.

,

Elias

,

J.

,

Hentel

,

K.D.

,

Reig

,

B.

,

Shih

,

G.

and

Moy

,

L.

(

2023

), “

ChatGPT and other large language models are double-edged swords

”,

Radiology

, Vol.

307

No.

2

, e230163, doi:

https://doi.org/10.1148/radiol.230163

.

Google Scholar

Stadler

,

M.

,

Bannert

,

M.

and

Sailer

,

M.

(

2024

), “

Cognitive ease at a cost: LLMs reduce mental effort but compromise depth in student scientific inquiry

”,

Computers in Human Behavior

, Vol.

160

, 108386, doi:

https://doi.org/10.1016/j.chb.2024.108386

.

Google Scholar

Sternberg

,

R.J.

and

Lubart

,

T.I.

(

1999

), “The concept of creativity: prospects and paradigms”, in

Handbook of Creativity

, Vol.

1

, pp.

3

-

15

, doi:

https://doi.org/10.1017/cbo9780511807916.003

.

Google Scholar

Su

,

J.

and

Yang

,

W.

(

2023

), “

Unlocking the power of ChatGPT: a framework for applying generative AI in education

”,

ECNU Review of Education

, Vol.

6

No.

3

, pp.

355

-

366

, doi:

https://doi.org/10.1177/20965311231168423

.

Google Scholar

Vanderweele

,

T.J.

,

Hong

,

G.

,

Jones

,

S.M.

and

Brown

,

J.L.

(

2013

), “

Mediation and spill-over effects in group-randomized trials: a case study of the 4Rs educational intervention

”,

Journal of the American Statistical Association

, Vol.

108

No.

502

, pp.

469

-

482

, doi:

https://doi.org/10.1080/01621459.2013.779832

.

Google Scholar

PubMed

von Garrel

,

J.

and

Mayer

,

J.

(

2023

), “

Artificial Intelligence in studies—use of ChatGPT and AI-based tools among students in Germany

”,

Humanities and Social Sciences Communications

, Vol.

10

No.

1

, p.

799

, doi:

https://doi.org/10.1057/s41599-023-02304-7

.

Google Scholar

Wachter

,

S.

and

Mittelstadt

,

B.

(

2019

), “

A right to reasonable inferences: re-thinking data protection law in the age of big data and AI

”,

Columbia Business Law Review

, p.

494

.

Google Scholar

Wang

,

J.

and

Fan

,

W.

(

2025

), “

The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis

”,

Humanities and Social Sciences Communications

, Vol.

12

No.

1

, p.

621

, doi:

https://doi.org/10.1057/s41599-025-04787-y

.

Google Scholar

Winter

,

R.A.

(

2000

), “Optimal insurance under moral hazard”, in

Handbook of Insurance

,

Springer

, pp.

155

-

183

.

Google Scholar

Crossref

Woessner

,

M.N.

,

Tacey

,

A.

,

Levinger-Limor

,

A.

,

Parker

,

A.G.

and

Levinger

,

I.

(

2021

), “

The evolution of technology and physical inactivity: the good, the bad, and the way forward

”,

Frontiers in Public Health

, Vol.

9

, 655491, doi:

https://doi.org/10.3389/fpubh.2021.655491

.

Google Scholar

Yang

,

T.C.

,

Hsu

,

Y.C.

and

Wu

,

J.Y.

(

2025

), “

The effectiveness of ChatGPT in assisting high school students in programming learning: evidence from a quasi-experimental research

”,

Interactive Learning Environments

, Vol.

33

No.

6

, pp.

1

-

18

, doi:

https://doi.org/10.1080/10494820.2025.2450659

.

Google Scholar

Zhang

,

S.

,

Zhao

,

X.

,

Zhou

,

T.

and

Kim

,

J.H.

(

2024

), “

Do you have AI dependency? The roles of academic self efficacy, academic stress, and performance expectations on problematic AI usage behavior

”,

International Journal of Educational Technology in Higher Education

, Vol.

21

No.

1

, 34, doi:

https://doi.org/10.1186/s41239-024-00467-0

.

Google Scholar

Zhou

,

J.

,

Muller

,

H.

,

Holzinger

,

A.

and

Chen

,

F.

(

2024

), “

Ethical ChatGPT: concerns, challenges, and commandments

”,

Electronics

, Vol.

13

No.

17

, 3417, doi:

https://doi.org/10.3390/electronics13173417

.

Google Scholar

© 2025 Mazen Hassan, Engi Amin, Sarah Mansour and Zeyad Kelani

Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at https://creativecommons.org/licenses/by/4.0/legalcode.
