To unravel the drivers of service consumers’ parasocial relationships with artificial intelligence-enabled voice assistants (VAs), this study examines how VA frequency- and time-related paralinguistic features affect parasocial attraction of VAs. The authors zoom in on the interrelations between consumers’ social perceptions by exploring how parasocial attraction drives perceived anthropomorphism and trust in VAs.
In an online experiment, a VA displayed high or low voice intonation and high or low speech rate. Self-reported data of 580 Prolific participants regarding their perceptions of parasocial attraction, anthropomorphism and trust were collected and subjected to partial least squares path modeling.
The results show a moderating role of VA speech rate on the effect of voice intonation on parasocial attraction, such that voice intonation increases VA parasocial attraction when speech rate is high. In turn, parasocial attraction drives trust in a VA, both directly and indirectly via perceived anthropomorphism.
The study outcomes can help designers and service managers design and infuse VAs in service frontlines within smart service systems, in ways that promise to enhance customer experiences and make services more inclusive.
By addressing the interplay between VAs’ frequency- and time-related paralinguistic features, this study offers new insights into the effects on consumers’ parasocial relationships with VAs and subsequent social perceptions. Such insights can benefit continued research into smart service systems.
Propelled by recent breakthroughs in natural language processing and speech recognition, voice assistants (VAs) powered by artificial intelligence (AI) have burst onto the scene, especially in smart service systems, and have transformed how consumers and employees experience every service encounter (Beverungen et al., 2019; Grewal et al., 2022). Voice-driven interactions enable seamless, hands-free control of smart devices, reflecting the extraordinary promise of such intuitive interfaces (de Barcelos Silva et al., 2020). An estimated 8.4 billion VA devices are in use today, double the number in 2020 (Laricchia, 2024). These transformations also extend beyond in-home smart speakers. Smartphones represent the primary gateways for voice interactions, followed by smart speakers, smart TVs and cars. In turn, the economic impacts of VAs are substantial, with a 2024 global market valued at approximately $50 billion and expected to reach almost $148 billion by 2030 (Ispiryan, 2025).
In response, service firms seek more ways to integrate VAs into their service provision, such as to facilitate ordering at Amazon (Amazon, 2024a) or book travel via Tripadvisor (Tripadvisor, 2024). In addition, ChatGPT incorporates voice features (OpenAI, 2023) and YouTube is rolling out voice replies, such that creators can reply to comments with audio clips (Hutchinson, 2025). In their introduction of such novel uses, leading VA providers, such as Amazon Alexa and Apple Siri, have started offering the option to modify VA’s voice features, e.g. speech rate (Amazon, 2024b; Apple, 2024). From an inclusivity standpoint, expanding personalization of such accessibility features (Mende et al., 2024; Stead et al., 2022) could make these services more inclusive for vulnerable consumers. Indeed, elderly, disabled or low (digitally) literate service consumers could greatly benefit from using VAs to simplify overwhelming portals and processes, by guiding them or performing certain tasks directly (Abdolrahmani et al., 2018; Zoorob et al., 2022). Thus, exploring other voice features could inform the expansion of VA functionalities to make services more accessible for vulnerable groups.
Recent research also underscores the need to study paralinguistic features, such as communication style and speech rate, to strengthen relational outcomes with VAs (van Pinxteren et al., 2020). A primary and distinctive aspect of VAs is their voice, as the main mode of communication in consumer interactions (Grewal et al., 2022; Klaus and Zaichkowsky, 2020). Voice carries paralinguistic characteristics, such as vocal intonation and speech rate, and these prominent vocal cues shape people’s cognitive and behavioral responses (Baird et al., 2017; de Waele et al., 2019; Rodero, 2015). That is, both frequency- and time-related paralinguistic features (Hildebrand et al., 2020) substantively affect people’s social perception of an interlocutor, irrespective of whether that entity is human (Anikin et al., 2018; Imhof, 2010; Smith et al., 1975), a physical robot (Niculescu et al., 2013) or a VA (Efthymiou and Hildebrand, 2020; Wei et al., 2023). Different vocal features interact in this process (Black, 1961; Guyer et al., 2019), though precisely how this interplay unfolds in the context of human-to-VA interactions in smart service systems has yet to be studied.
For example, a critical question for service scholarship is whether and how this interplay of vocal features might drive trust in smart service systems [1]. Prior research predicts how this long-term, relational outcome (Guenzi et al., 2016) might stem from perceptions of convenience (Malodia et al., 2023), warmth or competence (Dandotiya et al., 2024), but Mele and Russo-Spena (2024) also suggest accounting for relationships and interconnectedness to understand consumer interaction with VAs. When they develop pseudo-social relationships with VAs (Guha et al., 2023; Hernandez-Ortega et al., 2022), people often perceive the non-human entity as human-like (Hernandez-Ortega and Ferreira, 2021; Whang and Im, 2021). In turn, such relationships tend to be associated with higher levels of satisfaction and usage (Han and Yang, 2018) and more positive attitudes and behaviors toward both the non-human entity and the company deploying it (Chung and Cho, 2017; Marinova et al., 2017; McLean et al., 2021; Park and Lennon, 2004). However, we lack clear insights into the role of VA parasocial attraction [2], a key manifestation in parasocial relationships (Ashe and McCutcheon, 2001; Kang et al., 2024; Marikyan et al., 2022).
Therefore, we investigate specifically if and how frequency-related and time-related paralinguistic features, in the form of voice intonation and speech rate, shape consumers' parasocial attraction toward VAs. Such parasocial attraction might arguably evoke social perceptions of anthropomorphism [3], which then may foster consumer trust. We apply social agency theory as the primary theoretical lens in our attempts to identify the mechanisms by which vocal cues contribute to the formation of human-like perceptions of VAs and thereby enhance users’ willingness to trust these artificial entities. Taken together, this study makes several important theoretical and managerial contributions.
First, by drawing on social agency theory – which stipulates that when humans communicate with computers, social conversation schemas can be activated by social cues embedded in computer-generated communication (Atkinson et al., 2005; Mayer et al., 2003) – we identify parasocial attraction as an important driver of VA trust in smart service systems, which has relevant implications for relationship dynamics and interconnectedness research.
Second, we move beyond functional design aspects (Blut et al., 2021) and shed new light on the interplay of a VA’s frequency- and time-related voice features. As we establish, users are drawn to VAs whose characteristics align with the personal ideals they aspire to embody (Klohnen and Luo, 2003; Wetzel and Insko, 1982); that is, our findings emphasize the importance of this type of perceived alignment. Furthermore, we specify how various paralinguistic cues interact to create a vocal balance (Patterson, 1973; Rodero et al., 2022). Examining how these cues work together, rather than in isolation, uncovers deeper insights into human-technology interaction dynamics within smart service environments.
Third, as a contribution to human–computer interaction research, we also shift the focus from the most appropriate designs for anthropomorphic computer features toward the psychological mechanisms that determine users’ perceptions of those features. Rather than depicting how vocal cues elicit anthropomorphism and preference, we investigate the role of anthropomorphism, as a psychological construct involving the attribution of human-like qualities to non-human entities (Becker-Olsen and Hill, 2006), in prompting trust during human–computer interactions. Fourth, the research insights offer concrete guidelines for managers of firms that design VAs and service managers interested in deploying VAs or optimizing their use in (inclusive) smart service systems.
Conceptual background and hypotheses development
Organizational frontlines represent the boundaries between the service organization, its customers and other stakeholders (Singh et al., 2017) and facilitate service encounters in which consumers interact with a concrete service interface, which integrates service elements such as human actors, the physical environment, service processes and technology (Larivière et al., 2017; Patrício et al., 2011). The increasing prevalence of the latter element creates frontlines infused with smart technologies (de Keyser et al., 2019; Schultz and Gorlas, 2023; van Doorn et al., 2017), which can provide important benefits to both service providers (e.g. controlling and optimizing product operations, generating novel data streams) and service consumers (e.g. creating value-in-use from using smart technologies; Beverungen et al., 2019; Hottat et al., 2023). At minimum, these service frontlines serve as a resource enabling value creation in the exchange process between service consumers and providers (Akaka and Vargo, 2014). More advanced technologies, however, can even function as autonomous actors in the value creation process (de Keyser et al., 2019). For example, a VA is capable of booking an appointment or virtual care consultation without external intervention.
Existing smart service system configurations are mainly characterized by connectivity, which links actors in the service system (Henkens et al., 2021); automation, which enables them to take over tasks from human actors (Mele et al., 2022); and/or dynamic learning and adapting abilities (Mele and Russo-Spena, 2024). As smart service systems reshape traditional service communication (Guha et al., 2023; Mahr and Huh, 2022) across the entire customer journey (Gonçalves et al., 2020; Grewal et al., 2022), VAs come to the forefront, prompting promising benefits but also significant levels of distrust, particularly in contexts that feature sensitive information (Huang et al., 2024; Malodia et al., 2023).
In the shift from human-to-human service encounters to human-to-VA encounters (Larivière et al., 2017), the importance of communicative behaviors is evident (van Pinxteren et al., 2020). Human–computer interaction studies postulate that computers can function as social actors and human users assign human traits to computers (Nass et al., 1994). Therefore, fundamental social principles, as rooted in social psychology, can arise in these service interactions as well (Nass et al., 1995). Accordingly, we adopt social agency theory (Atkinson et al., 2005; Mayer et al., 2003) as the theoretical foundation for this research; it predicts that social conversation schemas get activated by social cues, including those embedded in computer-generated communication. In particular, the voices provided by a VA are rich in social cues (Edwards et al., 2019; Huang et al., 2024), so human users can interpret an interaction with a VA as a social conversation with another social agent. This interpretation, in turn, triggers social rules and associated human-to-human communication schemas, which users apply to form social perceptions of VAs. As complements to social agency theory, we derive insights from communication, social psychology and human–computer interaction literature in the next sections, to establish our conceptual model (Figure 1), define a VA’s frequency- and time-related paralinguistic features and predict their effects on consumers’ parasocial attraction and ultimately trust and related social perception outcomes.
The diagram shows two section headings at the top labeled “VA paralinguistic features” on the left and “Consumers’ social perceptions” on the right, with the right section enclosed by a dashed rectangular boundary. Under “VA paralinguistic features”, a rectangular box labeled “Intonation” connects with a straight rightward arrow to an oval labeled “Parasocial attraction” inside the dashed boundary. A rectangular box labeled “Speech rate” is positioned above this arrow and links downward with a vertical arrow to the horizontal arrow between “Intonation” and “Parasocial attraction”. Inside the dashed boundary, the oval labeled “Parasocial attraction” connects with a diagonal upward rightward arrow to an oval labeled “Anthropomorphism”, which connects with a diagonal downward rightward arrow to an oval labeled “Trust”. The oval labeled “Parasocial attraction” also connects with a rightward arrow directly to “Trust”.
Figure 1. Conceptual framework. Source: Authors’ own work
VA frequency- and time-related paralinguistics as drivers of consumers’ parasocial attraction
Any voice, whether human or synthesized, carries paralinguistic features to which a listener can ascribe attributes and that thus guide their subsequent thoughts, attitudes and behaviors (Baird et al., 2017). These non-linguistic voice characteristics (Abercrombie, 1968; Crystal, 1974) drive social perceptions (Anikin and Persson, 2017; Apple et al., 1979; Guyer et al., 2019; Scherer et al., 1973), even in verbal human-VA communication (e.g. Cohn et al., 2021; Moussalli and Cardoso, 2020). A common categorization of vocal paralinguistic features includes four broad categories: frequency (i.e. intonation), time (i.e. speech rate), amplitude (i.e. loudness of speech) and spectral (i.e. voice instability) features (Hildebrand et al., 2020; Schuller et al., 2013). Research into paralinguistics asserts that frequency and time dimensions are most effective for communicating emotional meaning (e.g. Scherer, 1974). Similarly, when users form perceptions of a VA, we anticipate that its frequency- and time-related vocal features are more relevant than amplitude or spectral features. Users have direct control over a VA’s amplitude (e.g. Amazon, 2024c), and due to their inherent lack of biological mechanisms (Teixeira et al., 2013) and preprogrammed nature (Guha et al., 2023), VAs produce consistent and stable vocal output. In other words, voice instability is not present in VA communication. Furthermore, prior research on parasocial attraction (Avelino et al., 2020; Chen and Park, 2021; Mariani et al., 2023) has outlined its similarities with interpersonal social attraction (Cialdini, 2009). The effects of vocal features on (social) attractiveness are thus well established across communication (e.g. Burgoon et al., 1990), social psychology (e.g. Robbins and DeNisi, 1994) and human-computer interaction (e.g. Bartneck et al., 2009a; Wagner et al., 2019) literature domains.
VA voice intonation
Voice intonation, or variation of pitch (Hildebrand et al., 2020), drives social perceptions of (social) attraction (Kühne et al., 2020; Niculescu et al., 2013). Average pitch values are around 210 Hz and 120 Hz for female and male voices, respectively (Niculescu et al., 2013). These auditory social cues (Feine et al., 2019) can signal meaning (Bevacqua et al., 2010) and activate socially acquired vocal expression schemata that help listeners infer underlying emotions (Patel and Scherer, 2013). In human-to-human communication schemas, lower pitch variation implies a lack of emotion. Expressing emotion through intonation constitutes an important voice function, such that people tend to favor greater VA pitch variation (Kühne et al., 2020). In turn, we predict that voice intonation should exert a positive effect on VA parasocial attraction in human-to-VA service interactions within a smart service system.
VA speech rate
Speech rate is determined by the number of words a speaker uses in a given timeframe (Hildebrand et al., 2020); it generally equals 150–200 words per minute in normal speech (Ketrow, 1990). It likely predicts users’ social perceptions of VAs (Cohn et al., 2021; Guha et al., 2023), especially their (social) attractiveness (Xie et al., 2023). Speech rate is another auditory social cue (Feine et al., 2019) that should activate human-to-human communication schemas, which users then apply to decode the meanings conveyed by the VA (Bevacqua et al., 2010; Patel and Scherer, 2013). A slower speech tempo tends to evoke negative perceptions, by signaling boredom, sadness and disgust; a high tempo implies enthusiasm and happiness (Scherer et al., 1973). Such perceptions then inform the (social) attractiveness of the speaker (Street Jr. et al., 1983), including VAs (Choi et al., 2020; Dowding et al., 2024). Thus, we predict that speech rate positively influences the parasocial attractiveness of a VA in smart service encounters.
Interplay of VA voice intonation and speech rate
The separate effects of voice intonation and speech rate are relevant, yet communication entails more than one cue at a time (e.g. Black, 1961; Bond and Feldstein, 1982). Relatively less research considers the interplay of different paralinguistic features (Rodero et al., 2022), though a few studies indicate that voice intonation and speech rate exhibit synergistic effects, such that their combined effect exceeds the sum of their separate effects (de Waele et al., 2019; Guyer et al., 2019; Rodero, 2015). For example, a more dynamic intonation or faster speech rate could attract more attention and improve perceptions in isolation (Gnisci and Pace, 2014; Jackob et al., 2011; Rodero, 2020), but their complex interaction suggests that an optimal combination is needed to enhance parasocial attraction (e.g. Rodero et al., 2022), especially with VAs (e.g. van Pinxteren et al., 2020). According to communication (Megehee et al., 2003) and social psychology (Chattopadhyay et al., 2003; Moore et al., 1986) research, a rapid speech rate tends to decrease the effect of verbal content but increase the effects of peripheral cues, such as pitch. Vocal profiles that score high on expressivity, versus apathy or monotony, also positively shape social perceptions of speakers (Guyer et al., 2019; Rodero, 2020). Because voice intonation and speech rate effectively express emotional meaning (Scherer, 1974), high levels of both could result in an expressive balance (Patterson, 1973; Rodero et al., 2022); these coordinated paralinguistic cues might achieve what we refer to as a “high expressivity” balance.
In addition, people are attracted to others who resemble their ideal self-identity, which comprises traits that they aspire to acquire, improve or express (Klohnen and Luo, 2003; Wetzel and Insko, 1982). If a user proudly possesses a high expressivity vocal profile, a VA that exhibits similar features provides reinforcing input; for users who lack but wish they had this profile, such a VA might attract increased attention, by exhibiting dissimilarity to aspects they do not like about themselves (Klohnen and Mendelsohn, 1998). Therefore, the effect of VA voice intonation on parasocial attraction in smart service encounters should be moderated by speech rate, and this interplay should exhibit coordinated effects:
Speech rate moderates the positive impact of voice intonation on consumers’ parasocial attraction to a VA, such that the impact of voice intonation is greater when the speech rate is high.
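The moderation hypothesis above can be written out as a simple interaction model. This is only an illustrative sketch in regression form, not the partial least squares specification the study estimates; the symbols $I$ (intonation, coded 0 = low, 1 = high), $S$ (speech rate, coded 0 = low, 1 = high), $\mathit{PA}$ (parasocial attraction) and the coefficients $\beta$ are assumed notation introduced here.

```latex
\begin{align}
\mathit{PA} &= \beta_0 + \beta_1 I + \beta_2 S + \beta_3 (I \times S) + \varepsilon \\
\frac{\partial\, \mathit{PA}}{\partial I} &= \beta_1 + \beta_3 S
\end{align}
```

Under this coding, the effect of intonation is $\beta_1$ when speech rate is low and $\beta_1 + \beta_3$ when it is high, so the hypothesis corresponds to $\beta_3 > 0$.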
Consumers’ parasocial attraction to and social perceptions of a VA
Social perceptions of VAs, in which consumers attribute human characteristics to the computer interface (Lea and Spears, 1992), can take many forms. Human–computer interaction literature often focuses on perceptions related to parasocial attraction, anthropomorphism and trust (Bartneck et al., 2009b; Braun et al., 2019; Kühne et al., 2020; Lawson-Guidigbe et al., 2023; Salem et al., 2013; Wagner et al., 2019), though generally without considering their interrelations or the context of human-to-VA service interactions within smart service systems. We thus turn to social psychology literature to inform such explorations (Epley et al., 2007; Waytz et al., 2014), including the predicted role of parasocial attraction in driving social perceptions. Social cognition research highlights the importance of social factors for informing social perceptions of others (e.g. Rutherford and Kuhlmeier, 2013), including non-human interlocutors (Pan et al., 2018). When people experience illusory reciprocal human interaction with non-human entities, the parasocial interaction is controlled by the person who imagines it (Horton and Wohl, 1956). The interactions can even grow into artificial friendships, or parasocial relationships, including with VAs (Stern et al., 2007; Whang and Im, 2021), which then might drive key outcomes (Gkinko and Elbanna, 2023; Han and Yang, 2018).
Trust
Reflecting our focus on parasocial relationships, we prioritize trust as a core relational outcome (Blut et al., 2021; Guenzi et al., 2016). First, a lack of trust constitutes a persistent barrier to VA adoption in smart service systems, especially those involving sensitive data (Huang et al., 2024; Malodia et al., 2023), but its presence can encourage consumers to use VAs in the first place (Jain et al., 2022; Moussawi et al., 2021; Siau and Wang, 2018). Second, trust reflects the long-term nature of ongoing relationships (Doney and Cannon, 1997). Third, it is an established proxy for social proximity (Akhavan and Mariotti, 2023), in that it enables people to form imaginary parasocial relationships with non-human entities, which are perceived as real relationships and instrumental to their subjective social experience (Yuan et al., 2016). By interacting with a VA, consumers can build trust (Huang et al., 2024), beyond traditional conceptualizations of technology trust (Chi et al., 2021). The human-like interaction and intelligence provided by VAs broaden the scope of trust to include not only functional (i.e. fulfilling consumer needs) but also social aspects (i.e. following norms related to integrity and benevolence) (Connelly et al., 2018; Hu and Lu, 2021; Huang et al., 2024; Elkins and Derrick, 2013).
The notion that (para)social attraction promotes trust is well-documented in various contexts, including entrepreneurial networks (Ferguson et al., 2016), service employees (Kim, 2019) and VAs (Chen and Park, 2021; Siddike and Kohda, 2019). Compared to other forms of attraction (e.g. task or physical), social attraction offers the strongest predictor of trust in AI-powered entities (Qin et al., 2023). In line with the notion, derived from social agency theory, that VAs can function as a social partner (Mayer et al., 2003), we propose interpreting interactions of users and VAs according to a relationship-building perspective (Huang et al., 2024). If social agency can be established by high expressivity VA vocal profiles and users treat the VA as a social actor, it should activate human-to-human communication schemas (Atkinson et al., 2005), including emotion-as-information schema (Ashtar et al., 2024; Huang et al., 2024), according to which positive attitudes resulting from social attraction with an interlocutor arouse positive beliefs, including trust (Edwards and Cable, 2009). Social attraction also shapes perceptions of communication quality (Beattie et al., 2020; Nasirian et al., 2017), which fosters a trusting sense that the VA understands consumers’ needs (Chen and Park, 2021). Therefore, we propose the following hypothesis:
Consumers’ parasocial attraction to a VA has a positive effect on perceived trust in a VA.
Anthropomorphism
Many human–computer interaction studies examine the (social) attractiveness of more or less anthropomorphic computer design features (e.g. Qiu and Benbasat, 2009; Roesler et al., 2021). Rather than investigating how combinations of vocal cues influence anthropomorphism though, we instead explore a potential mediating role of perceived anthropomorphism in the predicted effect of parasocial attraction on perceived trust, by integrating several research perspectives. First, people interact differently with others who share similar characteristics, so parasocial attraction might increase users’ tendencies to anthropomorphize non-human interlocutors (Epley et al., 2007; Tajfel and Turner, 1986). Second, humans possess basic social needs and have an inherent drive to form social connections. If they can fulfill such needs through parasocial attraction to a VA, they likely apply social scripts that make such interactions more familiar (Blut et al., 2021), and in doing so, they become more likely to anthropomorphize the VA (Chen and Park, 2021; Edwards et al., 2019; Waytz and Epley, 2012; Whang and Im, 2021). Third, humans’ innate motivation to understand their environment may encourage heightened parasocial attraction to drive VA anthropomorphism (Chen and Park, 2021; Waytz et al., 2010), in that it serves as a compensatory mechanism for enhanced understanding. In line with social agency theory (Mayer et al., 2003; Atkinson et al., 2005), these predictions suggest that the human schemas (Han and Yang, 2018; Xu, 2020) triggered in verbal interactions with VAs result in a positive effect of consumers’ parasocial attraction on their perceived anthropomorphism of a VA.
When consumers perceive a non-human entity as human-like, it becomes more similar to themselves (Go and Sundar, 2019), reducing their social anxiety and evoking a sense of communicating with a real human instead (Hernandez-Ortega and Ferreira, 2021; Yuan et al., 2022). Across various research contexts, pertaining to AI in general (Glikson and Woolley, 2020; Kaplan et al., 2023; Pentina et al., 2023; Troshani et al., 2021), robots (Blut et al., 2021; Natarajan and Gombolay, 2020; van Pinxteren et al., 2019; Wünderlich et al., 2024) and VAs (Chen and Park, 2021; Liu et al., 2024; Rheu et al., 2021; Weitz et al., 2021), anthropomorphism has been shown to drive perceived trust. Anthropomorphizing VAs makes the interactions more personal and engaging (Epley et al., 2008), offers a sense of control over the environment by making the interlocutor appear more predictable (White, 1959) and adheres to social norm expectations (de Visser et al., 2016). Identifying these entities as social actors (Nass et al., 1994) triggers positive impressions (Aggarwal and McGill, 2007; van Doorn et al., 2017) that directly feed into consumers’ available knowledge and evidence to determine whether the VA is reliable to deliver intended promises (Komiak and Benbasat, 2006; Nasirian et al., 2017). Such expectations tend to encourage greater trust, including in VAs in smart service encounters (Huang et al., 2024; Malodia et al., 2023):
Consumers’ perceived anthropomorphism of a VA mediates the effect of parasocial attraction on perceived trust, such that (a) consumers’ parasocial attraction has a positive effect on perceived anthropomorphism, and (b) their perceived anthropomorphism has a positive effect on consumers’ perceived trust in the VA.
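The mediation structure in this hypothesis, together with the direct path from parasocial attraction to trust, can be sketched with two linear equations. The notation is assumed for illustration only ($\mathit{PA}$ = parasocial attraction, $\mathit{ANT}$ = perceived anthropomorphism, $\mathit{TR}$ = perceived trust, with path coefficients $a$, $b$ and $c'$); it is a generic mediation sketch, not the study's estimated path model.

```latex
\begin{align}
\mathit{ANT} &= a\,\mathit{PA} + \varepsilon_1 \\
\mathit{TR} &= c'\,\mathit{PA} + b\,\mathit{ANT} + \varepsilon_2 \\
\text{indirect effect} &= a\,b, \qquad \text{total effect} = c' + a\,b
\end{align}
```

In these terms, part (a) of the hypothesis corresponds to $a > 0$, part (b) to $b > 0$, and the direct effect of parasocial attraction on trust to $c' > 0$.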
Method
To achieve the research objectives, we conducted an online experimental study, seeking to assess how VA paralinguistic time- and frequency-related features (voice intonation and speech rate) affect consumers’ parasocial attraction, along with subsequent social perception outcomes.
Design, sample and procedure
The conceptual model was tested with a 2 (low vs. high voice intonation) × 2 (low vs. high speech rate) between-subjects factorial experimental design. In line with existing service research (e.g. Larivière et al., 2024; Leiño Calleja et al., 2023), we recruited participants via Prolific. A total of 586 North American adults were paid (US) $9.37/hour for their participation and randomly assigned to one of the four conditions. We excluded one participant who failed the attention check, one participant who did not meet the minimum required level of English listening skills and four participants who completed the survey within 120 s – as the average reading speed is around 175 words per minute (Ketrow, 1990) and the voice clip alone lasted 32–35 s [4]. These exclusions left a final sample of 580 participants (53.1% female; 50.3% ≥ 40 years; Mage = 42.20 years; 98.8% native or advanced English listening skills; 88.1% prior experience with VAs).
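The exclusion rules described above can be expressed as a small filter. The sketch below is illustrative only: the `Response` record, its field names and the synthetic data are hypothetical, and treating "within 120 s" as "120 s or less" is an assumption.

```python
from dataclasses import dataclass

MIN_SECONDS = 120  # assumed cutoff: completions of 120 s or less are dropped


@dataclass
class Response:
    # Hypothetical fields, named here for illustration only.
    passed_attention_check: bool
    listening_skills_ok: bool
    completion_seconds: float


def is_valid(r: Response) -> bool:
    """Apply the three exclusion criteria described in the text."""
    return (
        r.passed_attention_check
        and r.listening_skills_ok
        and r.completion_seconds > MIN_SECONDS
    )


# Synthetic records that mirror the reported counts (not the study's data).
retained = [Response(True, True, 400.0) for _ in range(580)]
dropped = (
    [Response(False, True, 400.0)]      # failed the attention check
    + [Response(True, False, 400.0)]    # insufficient English listening skills
    + [Response(True, True, 90.0) for _ in range(4)]  # completed too quickly
)
recruited = retained + dropped
sample = [r for r in recruited if is_valid(r)]
```

With these synthetic records, 586 recruited responses reduce to a final sample of 580, matching the arithmetic reported in the text.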
Participants started by completing an attention check to ensure adequate functionality of sound transmission, in which they had to indicate the word which was verbally expressed in an audio fragment. After being informed that they were about to listen to a voice clip of a VA, participants were exposed to the focal audio fragment. The fragment depicts “Allison” – a fictional female-voiced VA, as most VAs on the market are configured with a female voice by default (Blakemore, 2024). To reinforce the notion that participants are listening to a computer-generated voice and not a human voice, “Allison” explains that she is a VA and briefly discusses capabilities and the potential of VAs. This content was selected as it reflects informational interactions common in everyday use of VAs. Participants could listen to the fragment as many times as needed. In turn, participants completed a set of questions measuring parasocial attraction, perceived anthropomorphism, trust and controls related to AI anxiety, AI privacy concerns, tech-savviness, gender, age, education, level of English listening skills and prior experience with VAs.
The AI anxiety, AI privacy concerns and tech-savviness control measures reflect stable predispositions and traits that could alter the primary mechanisms. First, AI anxiety, defined as fear or unease toward AI technologies/systems (Wang and Wang, 2022), could serve as an inhibiting factor for individuals in terms of engaging with AI systems (Wang et al., 2024). Second, “the perceived threat to an individual’s privacy due to the increased level of information that technology gathers on individuals beyond the individual’s knowledge and sometimes control” (McLean and Osei-Frimpong, 2019, p. 30), referred to as AI privacy concerns, was included as this predisposition has been shown to blur attitudes toward AI systems (Feng et al., 2017). Third, tech-savviness, i.e. an individual’s familiarity and affinity with technology (Ng, 2012), is controlled for as cues are used differently by experts than novices to form perceptions (Guha et al., 2023).
Experimental stimuli, pretest and measurement
The audio fragment, which lasts 32–35 s depending on the experimental condition, was created using IBM Watson’s text-to-speech converter (https://www.ibm.com/products/text-to-speech); the resulting recording served as input for our experimental stimuli. The speech rate and voice intonation of the audio fragment were manipulated using the software package “Praat”, version 6.1.40 (https://www.fon.hum.uva.nl/praat/). The specific parameters per condition are outlined in Table 1.
Parameters audio manipulations
| Parameter | LI-LS | LI-HS | HI-LS | HI-HS |
|---|---|---|---|---|
| Duration (in s) | 35.69 | 32.19 | 35.69 | 32.19 |
| Words per minute | 161.39 | 178.93 | 161.39 | 178.93 |
| f0 mean (Hz) | 161.27 | 161.23 | 165.39 | 165.56 |
| f0 standard deviation (Hz) | 17.20 | 17.23 | 32.60 | 32.57 |
| f0 relative standard deviation | 10.66% | 10.69% | 19.71% | 19.67% |
| f0 range (Hz) | 120.4 | 141.8 | 184.7 | 200.3 |
Note(s): HS/LS = high/low speed; HI/LI = high/low intonation; f0 = fundamental frequency of voice
Source(s): Authors’ own work
The variability of pitch (i.e. voice intonation) was manipulated using the coefficient of variation (CV), or relative standard deviation (RSD), defined as the ratio of the standard deviation to the mean (Morgan and Rastatter, 1986). In our study, the RSDs of the low and high intonation conditions differ by approximately 9 percentage points, in line with the eeriness boundaries of speech rate. Figure 2 depicts the f0 (i.e. fundamental voice frequency) variability for the low intonation conditions and the high intonation conditions.
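As a rough cross-check of the RSD definition above, the following minimal sketch recomputes the relative standard deviations from the f0 means and standard deviations listed in Table 1. The variable and function names are illustrative assumptions, and small deviations from the tabulated percentages can arise because the table values are themselves rounded.

```python
# f0 mean (Hz) and f0 standard deviation (Hz) per condition, copied from Table 1.
F0_STATS = {
    "LI-LS": (161.27, 17.20),
    "LI-HS": (161.23, 17.23),
    "HI-LS": (165.39, 32.60),
    "HI-HS": (165.56, 32.57),
}


def relative_sd(mean_hz: float, sd_hz: float) -> float:
    """Relative standard deviation (SD divided by mean), as a percentage."""
    return 100.0 * sd_hz / mean_hz


# RSD per condition, recomputed from the rounded table inputs.
rsd = {cond: relative_sd(mean, sd) for cond, (mean, sd) in F0_STATS.items()}

# Gap between high and low intonation at the same speech rate (percentage points).
gap_low_speed = rsd["HI-LS"] - rsd["LI-LS"]
gap_high_speed = rsd["HI-HS"] - rsd["LI-HS"]

for cond, value in rsd.items():
    print(f"{cond}: RSD = {value:.2f}%")
print(f"Intonation gap (low speed):  {gap_low_speed:.2f} points")
print(f"Intonation gap (high speed): {gap_high_speed:.2f} points")
```

Under these inputs, both gaps come out close to the difference of roughly 9 percentage points reported in the text.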
Figure 2. Low voice intonation versus high voice intonation: f0 variability. Two side-by-side line charts plot pitch (Hz; vertical axis from 75 to 300) against time (s). The left panel spans 0 to 32.19 s, with pitch values ranging from roughly 75 to 218 Hz; the right panel spans 0 to 35.7 s, with pitch values ranging from roughly 101 to 284 Hz. Note: All numerical data values are approximated. Source: Authors’ own work
The relative difference between the low and high speech rate conditions, and thus in duration, was approximately 10%, with 161.39 and 178.93 words per minute, respectively [5]. These rates are clearly distinct from one another without causing the excessive eeriness that arises when approaching extreme values of <150 and >200 words per minute (Ketrow, 1990; Street Jr. et al., 1982).
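The speech rate figures can be sanity-checked with simple arithmetic. The sketch below uses only the durations and words-per-minute values from Table 1; the implied word count of the script is an inference from those numbers rather than a figure reported by the authors, and all names are illustrative.

```python
# Duration (s) and speech rate (words per minute) per speed condition, from Table 1.
DURATION_S = {"low": 35.69, "high": 32.19}
WPM = {"low": 161.39, "high": 178.93}

# Eeriness boundaries cited in the text (words per minute).
LOWER_BOUND_WPM = 150
UPPER_BOUND_WPM = 200


def implied_word_count(wpm: float, duration_s: float) -> float:
    """Number of words implied by a speech rate and a clip duration."""
    return wpm * duration_s / 60.0


words_low = implied_word_count(WPM["low"], DURATION_S["low"])
words_high = implied_word_count(WPM["high"], DURATION_S["high"])

# Relative change when moving from the low to the high speech rate condition.
duration_reduction = 1.0 - DURATION_S["high"] / DURATION_S["low"]
rate_increase = WPM["high"] / WPM["low"] - 1.0

# Both rates should stay strictly inside the eeriness boundaries.
within_bounds = all(LOWER_BOUND_WPM < rate < UPPER_BOUND_WPM for rate in WPM.values())

print(f"Implied words: {words_low:.1f} (low), {words_high:.1f} (high)")
print(f"Duration reduction: {duration_reduction:.1%}; rate increase: {rate_increase:.1%}")
print(f"Within eeriness boundaries: {within_bounds}")
```

Both conditions imply the same script length, which is consistent with a manipulation that changes only the tempo of an otherwise identical recording.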
To confirm the effectiveness of the speech rate and voice intonation manipulations within the boundaries of eeriness, a pretest was conducted with 40 North American adults, recruited from MTurk and randomly assigned to either the high speed/high intonation or the low speed/low intonation condition. Participants were instructed to rate the speech rate and the voice intonation on a 10-point semantic differential scale ranging from “very slow” to “very fast”, and “very monotone” to “very energetic”, respectively. The results of an independent samples t-test showed that participants reported significantly higher perceived intonation in the high intonation condition (M_intonation-high = 4.85, SD = 2.25) compared to the low intonation condition (M_intonation-low = 2.55, SD = 1.28; t(38) = 3.97; p < 0.001). They also noted significantly faster speech rate perceptions in the high (M_speechrate-high = 5.75, SD = 1.65) versus low (M_speechrate-low = 4.50, SD = 1.32; t(38) = 2.65; p = 0.012) speech rate condition. Thus, the manipulations appear to work as intended [6].
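The reported manipulation-check statistics can be approximately reproduced from the summary values alone. The sketch below computes a pooled-variance independent samples t-statistic; it assumes an equal split of 20 participants per condition, which is consistent with the reported 38 degrees of freedom but is not stated explicitly in the text. The function and variable names are illustrative.

```python
import math


def pooled_t(mean_a: float, sd_a: float, n_a: int,
             mean_b: float, sd_b: float, n_b: int) -> tuple[float, int]:
    """Student's t-statistic and degrees of freedom for two independent groups."""
    df = n_a + n_b - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Assumption: 40 pretest participants split evenly across the two conditions.
N_PER_GROUP = 20

# Perceived intonation: high versus low intonation condition.
t_intonation, df_intonation = pooled_t(4.85, 2.25, N_PER_GROUP, 2.55, 1.28, N_PER_GROUP)

# Perceived speech rate: high versus low speech rate condition.
t_rate, df_rate = pooled_t(5.75, 1.65, N_PER_GROUP, 4.50, 1.32, N_PER_GROUP)

print(f"Intonation check:  t({df_intonation}) = {t_intonation:.2f}")
print(f"Speech rate check: t({df_rate}) = {t_rate:.2f}")
```

With equal group sizes, the pooled-variance and Welch statistics coincide, so the choice between them does not affect the t-values here.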
In the main experiment, measurement instruments from extant literature were employed to measure the constructs in our conceptual model (see Table 2). In particular, the measures for parasocial attraction (three items) were adopted from McLean and Osei-Frimpong (2019), perceived anthropomorphism (four items) was based on Bartneck et al. (2009a) and perceived trust (twelve items) was taken from Elkins and Derrick (2013). The control measures on AI anxiety (three items), AI privacy concerns (four items) and tech-savviness (two items) were, respectively, based on work from Wang and Wang (2022), and adopted from McLean and Osei-Frimpong (2019) and Ng (2012). Parasocial attraction and the control variables AI anxiety, AI privacy concerns and tech-savviness were measured using seven-point Likert scales, ranging from “strongly disagree” to “strongly agree”. The remaining constructs relied on five-point semantic differential scales. As noted, we also measured gender, age, education, level of English listening skills and prior experience with VAs.
Results
In line with existing service research (e.g. Choi et al., 2024; Fritze et al., 2020), the proposed conceptual model is assessed with partial least squares structural equation modeling (PLS-SEM), an iterative combination of principal component analysis and ordinary least squares path analysis (Chin, 1998), using the software package SmartPLS 4.0 (Ringle et al., 2024). To generate robust standard errors and t-statistics, the bootstrapping procedure used 10,000 resamples (Hair et al., 2016).
Evaluation of measurement model
To evaluate the measurement model, we examine its internal reliability, convergent and discriminant validity (see Tables 2 and 3; Hair et al., 2016). First, the composite reliability values for all multi-item constructs – including control variables – ranged from 0.94 to 0.96, exceeding the recommended threshold value of 0.70 (Hair et al., 2011). Second, in support of acceptable convergent validity, all average variance extracted (AVE) values exceed 0.50 (Fornell and Larcker, 1981). Third, discriminant validity was established as the square root of the AVE exceeds the inter-construct correlations for all multi-item constructs (Fornell and Larcker, 1981). In addition, the highest heterotrait-monotrait (HTMT) value is 0.752, which is below the suggested threshold of 0.85 (Henseler et al., 2015; Voorhees et al., 2016).
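To make the reliability and validity criteria concrete, the following sketch recomputes composite reliability (CR), AVE and the Fornell-Larcker comparison for two constructs from their reported standardized loadings. It uses the standard textbook formulas for standardized loadings; whether these match the software's internal computation in every detail is an assumption, and the names are illustrative.

```python
import math

# Standardized loadings as reported for two of the multi-item constructs.
LOADINGS = {
    "Parasocial attraction": [0.932, 0.937, 0.955],
    "Anthropomorphism": [0.900, 0.906, 0.869, 0.929],
}

# Reported correlation between the two constructs.
CORRELATION = 0.700


def average_variance_extracted(loadings: list[float]) -> float:
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings: list[float]) -> float:
    """CR: squared sum of loadings over itself plus the summed error variances."""
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1.0 - l ** 2 for l in loadings)
    return squared_sum / (squared_sum + error_variance)


ave_values = {name: average_variance_extracted(l) for name, l in LOADINGS.items()}
cr_values = {name: composite_reliability(l) for name, l in LOADINGS.items()}

# Fornell-Larcker criterion: sqrt(AVE) of each construct must exceed
# its correlation with the other construct.
fornell_larcker_ok = all(math.sqrt(ave) > abs(CORRELATION) for ave in ave_values.values())

for name in LOADINGS:
    print(f"{name}: CR = {cr_values[name]:.3f}, AVE = {ave_values[name]:.3f}, "
          f"sqrt(AVE) = {math.sqrt(ave_values[name]):.3f}")
print(f"Fornell-Larcker criterion met: {fornell_larcker_ok}")
```

The same functions can be applied to the loadings of the remaining constructs to check the 0.70 (CR) and 0.50 (AVE) thresholds.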
Factor loadings, composite reliability and average variance extracted of the constructs and their items
| Components and manifest variables | Loading (t-value) |
|---|---|
| Parasocial attraction | CR: 0.959, AVE: 0.886 |
| I think Allison could be a friend of mine | 0.932 (113.66)*** |
| I had a good time with Allison | 0.937 (148.66)*** |
| I would like to spend more time with Allison | 0.955 (186.07)*** |
| Anthropomorphism | CR: 0.945, AVE: 0.812 |
| Please rate your impression of Allison: Fake – Natural | 0.900 (99.95)*** |
| Please rate your impression of Allison: Machinelike – Humanlike | 0.906 (89.19)*** |
| Please rate your impression of Allison: Unconscious – Conscious | 0.869 (67.15)*** |
| Please rate your impression of Allison: Artificial – Lifelike | 0.929 (120.49)*** |
| Trust | CR: 0.951, AVE: 0.620 |
| Please rate your impression of Allison: Undependable – Dependable | 0.783 (32.00)*** |
| Please rate your impression of Allison: Dishonest – Honest | 0.774 (39.43)*** |
| Please rate your impression of Allison: Unreliable – Reliable | 0.824 (51.31)*** |
| Please rate your impression of Allison: Unknowledgeable – Knowledgeable | 0.802 (42.22)*** |
| Please rate your impression of Allison: Unqualified – Qualified | 0.829 (49.75)*** |
| Please rate your impression of Allison: Unskilled – Skilled | 0.795 (38.21)*** |
| Please rate your impression of Allison: Uninformed – Informed | 0.808 (44.01)*** |
| Please rate your impression of Allison: Incompetent – Competent | 0.822 (51.40)*** |
| Please rate your impression of Allison: Unfriendly – Friendly | 0.686 (31.91)*** |
| Please rate your impression of Allison: Uncheerful – Cheerful | 0.767 (49.17)*** |
| Please rate your impression of Allison: Unkind – Kind | 0.773 (47.10)*** |
| Please rate your impression of Allison: Unpleasant – Pleasant | 0.774 (48.68)*** |
| AI anxiety | CR: 0.941, AVE: 0.842 |
| I find AI techniques/products (e.g. voice-controlled intelligent personal assistants) scary | 0.951 (22.19)*** |
| I find AI techniques/products (e.g. voice-controlled intelligent personal assistants) intimidating | 0.856 (16.10)*** |
| I do not know why, but AI techniques/products (e.g. voice-controlled intelligent personal assistants) scare me | 0.944 (24.68)*** |
| AI privacy concerns | CR: 0.943, AVE: 0.804 |
| I have my doubts about the confidentiality of my interactions with voice-controlled intelligent personal assistants | 0.884 (76.94)*** |
| I am concerned to perform a financial transaction via voice-controlled intelligent personal assistants | 0.852 (51.13)*** |
| I am concerned that my personal details stored with voice-controlled intelligent personal assistants could be stolen | 0.922 (86.76)*** |
| I am concerned that voice-controlled intelligent personal assistants collect too much information about me | 0.927 (113.48)*** |
| Tech savviness | CR: 0.938, AVE: 0.883 |
| I am constantly being sought after by people for advice on new digital technology | 0.936 (88.00)*** |
| I am typically one of the first to use new digital technology when it appears | 0.943 (92.10)*** |
Note(s): CR: composite reliability; AVE: average variance extracted; ***denotes p < 0.001
Source(s): Authors’ own work
Correlations and square root of the average variance extracted
| Multi-item construct | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1. Parasocial attraction | 0.941 | | | | | |
| 2. Anthropomorphism | 0.700 | 0.901 | | | | |
| 3. Trust | 0.626 | 0.631 | 0.787 | | | |
| 4. AI anxiety | −0.074 | −0.021 | −0.168 | 0.918 | | |
| 5. AI privacy concerns | −0.360 | −0.390 | −0.336 | 0.347 | 0.897 | |
| 6. Tech savviness | 0.293 | 0.170 | 0.135 | −0.154 | −0.151 | 0.939 |
Note(s): Values down the diagonal are the square roots of the AVE; all others are correlation coefficients
Source(s): Authors’ own work
Evaluation of structural model
Prior to evaluating the structural model and the hypothesized paths, we assess the overall fit of the model. As illustrated in Figure 3, the R² values for all inner latent constructs range from 0.248 to 0.593, representing medium to large values (Chin, 1998). That is, the R² values are 0.248, 0.507 and 0.593 for parasocial attraction, perceived trust and anthropomorphism, respectively.
Figure 3. Structural model results. Among the VA paralinguistic features, speech rate moderates the path from intonation to parasocial attraction (0.295*; R² = 0.248). Among consumers’ social perceptions, parasocial attraction affects anthropomorphism (0.597***; R² = 0.593) and trust (0.368***), and anthropomorphism affects trust (0.324***; R² = 0.507). Notes: ***denotes p < 0.001, **denotes p < 0.01, *denotes p < 0.05. Source: Authors’ own work
Testing the proposed hypotheses [7], the results indicate that H2, H3a and H3b are statistically significant at p < 0.001, and H1 at p < 0.05. Specifically, VA voice intonation (β = 0.039, p = 0.708) and speech rate (β = −0.085, p = 0.396) do not drive parasocial attraction of a VA separately – rather these effects are qualified by a significant two-way interaction effect (β = 0.295, p = 0.042) [8]. As illustrated in Figure 4, parasocial attraction is virtually equal for a VA with high voice intonation (M = 2.84) and with low voice intonation (M = 2.77), when speech rate is low. Conversely, when speech rate is high, VA parasocial attraction is higher when voice intonation is high (M = 3.20) than for VAs with low voice intonation (M = 2.67). Taken together, these results provide support for H1.
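The pattern behind this interaction can be illustrated with the four reported cell means. The sketch below computes the simple effect of voice intonation at each speech rate and their difference (a difference-in-differences on the raw scale). It is a descriptive illustration only: it does not reproduce the standardized PLS coefficient of 0.295, and the names are illustrative.

```python
# Reported mean parasocial attraction per cell, keyed by (intonation, speech rate).
CELL_MEANS = {
    ("low", "low"): 2.77,
    ("low", "high"): 2.67,
    ("high", "low"): 2.84,
    ("high", "high"): 3.20,
}


def intonation_effect(speech_rate: str) -> float:
    """Simple effect of high versus low intonation at a given speech rate."""
    return CELL_MEANS[("high", speech_rate)] - CELL_MEANS[("low", speech_rate)]


effect_at_low_rate = intonation_effect("low")
effect_at_high_rate = intonation_effect("high")

# Interaction contrast: how much the intonation effect changes with speech rate.
interaction_contrast = effect_at_high_rate - effect_at_low_rate

print(f"Intonation effect at low speech rate:  {effect_at_low_rate:.2f}")
print(f"Intonation effect at high speech rate: {effect_at_high_rate:.2f}")
print(f"Interaction contrast (raw scale):      {interaction_contrast:.2f}")
```

The near-zero effect at the low speech rate and the larger effect at the high speech rate mirror the moderation described in the text.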
Figure 4. Two-way interaction effect of VA voice intonation and speech rate on parasocial attraction. Bar chart of parasocial attraction to a VA (vertical axis, 0 to 4) by voice intonation (low, high) and speech rate (low, high). Means for low voice intonation: 2.77 (low speech rate) and 2.67 (high speech rate); means for high voice intonation: 2.84 (low speech rate) and 3.20 (high speech rate). One bracket, labeled 0.002**, runs from the high speech rate bar under low voice intonation to the high speech rate bar under high voice intonation; a second bracket, labeled 0.037*, spans the two bars under high voice intonation. Notes: 95% confidence interval error bars; ***denotes p < 0.001, **denotes p < 0.01, *denotes p < 0.05; only significant differences are highlighted. Source: Authors’ own work
Furthermore, parasocial attraction of a VA exerts a positive effect on perceived trust (β = 0.368, p < 0.001), in support of H2 [9]. Beyond this direct effect, heightened parasocial attraction drives trust indirectly (indirect effect: β = 0.194, p < 0.001), where parasocial attraction positively influences perceived anthropomorphism (β = 0.597, p < 0.001) [10] and perceived anthropomorphism influences trust (β = 0.324, p < 0.001), in line with H3a and H3b.
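The mediation logic can be sketched with the reported path coefficients: the indirect effect is the product of the two constituent paths, and the total effect is the sum of the direct and indirect effects. The sketch below uses the rounded coefficients from the text, so the product differs slightly in the third decimal from the reported 0.194; the proportion mediated is an illustrative derived quantity that is not reported in the study.

```python
# Reported standardized path coefficients.
PATH_ATTRACTION_TO_ANTHROPOMORPHISM = 0.597  # H3a
PATH_ANTHROPOMORPHISM_TO_TRUST = 0.324       # H3b
PATH_ATTRACTION_TO_TRUST = 0.368             # H2 (direct effect)

# Indirect effect: product of the two paths running through the mediator.
indirect_effect = PATH_ATTRACTION_TO_ANTHROPOMORPHISM * PATH_ANTHROPOMORPHISM_TO_TRUST

# Total effect: direct path plus the indirect path via anthropomorphism.
total_effect = PATH_ATTRACTION_TO_TRUST + indirect_effect

# Share of the total effect that runs through anthropomorphism (derived, not reported).
proportion_mediated = indirect_effect / total_effect

print(f"Indirect effect:     {indirect_effect:.3f}")
print(f"Total effect:        {total_effect:.3f}")
print(f"Proportion mediated: {proportion_mediated:.1%}")
```

Because both the direct and the indirect effect are positive, the pattern corresponds to complementary partial mediation.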
Discussion
Using social agency theory as a dominant theoretical lens, the reported experimental study examined how VAs’ frequency- and time-related features shape consumers’ parasocial attraction to a VA, along with subsequent social perception outcomes. The empirical evidence presented in Figures 3 and 4 highlights several key findings.
The results provide empirical support for a moderating role of VA speech rate on the effect of voice intonation on consumers’ parasocial attraction to a VA. When the speech rate is low, parasocial attraction is roughly equal for low and high voice intonation; when it is high, however, parasocial attraction is higher for a VA with high voice intonation. In addition, it was found that consumers’ parasocial attraction directly drives trust in verbal interactions with a VA. Moreover, we found that perceived anthropomorphism mediates this effect, such that parasocial attraction to a VA positively influences perceived anthropomorphism, which in turn enhances perceived trust in a VA.
Theoretical implications
The current study makes several theoretical contributions. First, by applying social agency theory to clarify the role of parasocial attraction in consumers’ trust formation in smart service encounters, this study moves beyond previous research that prioritizes perceptions of convenience (Malodia et al., 2023), warmth and competence (Dandotiya et al., 2024) as drivers of trust, by devoting particular attention to key manifestations of parasocial relationships (Aw et al., 2022; Kang et al., 2024). In so doing, we respond to Mele and Russo-Spena’s (2024) assertion that adopting a relational ontology approach to smart service systems can offer a more comprehensive perspective that accounts for the significance of relationships and interconnectedness. In detail, we examine and establish how consumers’ parasocial attraction to a VA affects their trust formation, thereby offering new insights into the drivers of trust in voice-based smart service systems.
Second, our reliance on social agency theory as a theoretical foundation also underpins the contributions we offer regarding how a VA’s voice-related features interact to affect parasocial attraction. Rather than functional design aspects, like reliability and responsiveness (Dandotiya et al., 2024), our findings reiterate the importance of behavioral design aspects (Blut et al., 2021) for enhancing the overall customer experience in smart service systems. In accordance with predictions that the interplay of verbal behaviors determines the effectiveness of VAs’ communicative behaviors in this context (van Pinxteren et al., 2020), and particularly the interplay of frequency- and time-related paralinguistic features (de Waele et al., 2019; Rodero et al., 2022), we empirically demonstrate how VA voice intonation and speech rate – key vocal behavior characteristics – shape consumers’ parasocial attraction to a VA. Such novel insights into the complex nature and interplay of paralinguistic features (de Waele et al., 2019; Guyer et al., 2019; Rodero, 2015; van Pinxteren et al., 2020; Wetzels et al., 2023) help deepen our understanding of the relationship dynamics among VA voice features and parasocial attraction in smart service systems.
Third, this study captures how consumers form trust in smart service systems, namely, through perceived anthropomorphism of a VA. Many factors can influence trust in VAs, such as warmth, competence (Dandotiya et al., 2024) and status seeking (Malodia et al., 2023), but the proposed and empirically supported mediation model, which incorporates anthropomorphism, offers additional insights into how perceptions of anthropomorphism that arise in smart service encounters influence trust in smart service systems.
Managerial implications
Successfully infusing VAs into smart service systems at service frontlines promises notable benefits for both service consumers and providers (Beverungen et al., 2019; de Keyser et al., 2019; Mahr and Huh, 2022). In particular, parasocial relationships with VAs (Marinova et al., 2017; Mele and Russo-Spena, 2024) can encourage users’ trust, even within service contexts where trust tends to be hard to establish. Concretely, VA designers and service managers can use the findings of this study to enhance their smart service interactions and unlock mutual stakeholder benefits.
Our findings challenge current practices that tend to focus on a subset of vocal cues in isolation. For example, Amazon’s Alexa offers users the option to modify speech rate, along with an adaptive listening feature. That is, it provides consumers the option to take more time before the VA responds (Amazon, 2024b). Similarly, Apple’s Siri allows users to adjust the VA’s speaking rate and pause time (Apple, 2024). However, to facilitate parasocial human–VA relationships, designers and service managers must account for the interplay between individual vocal cues in VA communicative behaviors. Enabling service consumers to modify multiple VA vocal cues (Cheng, 2023), such as its voice intonation and speech rate, could make VAs more socially attractive. To help achieve a high expressivity balance, service consumers could be presented with vertically stacked sliders that are linked by default, so that adjusting one slider automatically moves the other proportionally, while still allowing each slider to be adjusted independently. Designs of this nature could nudge consumers to maintain the balance.
The heightened parasocial attraction that likely results from allowing service consumers to tailor VAs within smart service systems, in turn, should carve new pathways for developing parasocial relationships with a VA, by facilitating human-like perceptions and the formation of trust. Vulnerable consumers could especially benefit from such targeted adjustments and expanded functionalities. These inclusive design principles could turn commercial VAs into assistive technologies that could effectively complement an individual’s skills (Masina et al., 2020), without fears of feeling exposed or losing autonomy and dignity associated with the typical assistive technology (Yusif et al., 2016). For instance, navigating portals for healthcare services, filing applications for government aid or managing finances can be overwhelming for elderly, disabled or low (digitally) literate service consumers (Abdolrahmani et al., 2018; Zoorob et al., 2022). Paradoxically, such groups could benefit most from using such systems. Reducing distrust by allowing these consumers to adjust a VA’s voice features to their liking enables such smart service systems to simplify processes and guide vulnerable consumers step-by-step or perform (part of) these tasks directly. Unlocking a smart service system’s unique potential for personalization of accessibility features (Mende et al., 2024; Stead et al., 2022) offers clear paths forward to make services more inclusive for vulnerable consumers.
Limitations and future research directions
Although the present study offers insights into the role of VAs’ paralinguistic features in shaping consumers’ parasocial relationships with VAs and subsequent social perceptions, it also has some limitations. First, despite the carefully designed experiment in this study, and Prolific’s superior data quality in comparison with university subject pools (Peer et al., 2017) and other online crowdsourcing platforms (Douglas et al., 2023), service consumers are inherently more immersed in real-life service interactions (Leiño Calleja et al., 2023). Continued service research could, therefore, gather field data instead to further corroborate and extend the findings presented herein.
Second, while the formation of perceptions purely based on voice is driven mainly by paralinguistic features, and less so by linguistic content (Baird et al., 2017), it is possible that the linguistic content of the audio fragment (i.e. capabilities and potential of VAs) used in the present study has exerted some influence on the formation of social perceptions of the VA. Future research could use more neutral linguistic content that still reinforces that participants are listening to a computer-generated voice, while limiting potential effects of discussing judgmental VA-related topics.
Third, this paper examines the interplay between VA voice intonation and speech rate. Building further on extant literature in communication (e.g. Burgoon et al., 1990), social psychology (e.g. Robbins and DeNisi, 1994) and human–computer interaction (e.g. Bartneck et al., 2009a; Wagner et al., 2019), future studies could further explore relevant non-vocal characteristics related to the service consumer (e.g. consumer gender; Chang et al., 2018), the context (e.g. type of service interaction or touch point; Hottat et al., 2023) and/or verbal cues (e.g. communication strategy; de Waele et al., 2019) in this particular setting.
Fourth, based on existing human–computer interaction research, this paper zooms in on consumers’ social perception outcomes related to anthropomorphism and VA trust, along with their interrelations. Using our findings as a blueprint, continued research could apply a similar logic and explore additional social perceptions of VAs, e.g. perceived agency, rapport, (emotional) intelligence, animacy, social presence and sociability (Appel et al., 2012; Blut et al., 2021; Gao et al., 2010). These concepts are well-documented in human–computer interaction literature, and drawing on fundamental service and marketing principles could help to uncover their interrelationships. Such efforts would be particularly beneficial when moving beyond one-sided communication settings; that is, we encourage future research to further explore the paralinguistic properties of two-sided human–VA interactions.
Fifth, in a related extension, whereas the present study’s main focus is on the social perception outcomes of parasocial attraction to a VA, researchers in this field might explore other parasocial outcomes (e.g. parasocial interaction, parasocial attachment, parasocial identification; Giles, 2002; Rubin et al., 1987; Shan et al., 2020; Stever, 2017), or even performance-related outcomes (e.g. service performance, financial performance; Henkel et al., 2020; Marti et al., 2024) of these constructs. Such research efforts would deepen our understanding of the antecedents and outcomes of parasocial VA mechanisms and of the underlying communicative processes in smart service systems at technology-infused service frontlines. These insights are critical for enhancing service encounters and experiences, across the customer journey, for both consumers and employees in this context.
Sixth, consumers increasingly engage with smart services for highly sensitive tasks, such as checking financial account balances (Rao, 2017), conducting monetary transactions (Rao, 2016), applying for official government documents such as passports (Parsons, 2019), and managing their personal health care appointments, laboratory results or virtual consultations (One Medical, 2023). Despite the technological sophistication and convenience of these services, they continue to evoke substantial consumer distrust (Klaus and Zaichkowsky, 2020) and skepticism, often rooted in concerns about data privacy, security breaches, algorithmic opacity and perceived impersonality. Considering the nuanced nature of consumer trust and its unique drivers across service contexts, we call for empirical research that investigates task sensitivity as a potential moderating variable.
Notes
That is, an individual’s assessment of how much an interlocutor can be trusted (Elkins and Derrick, 2013) and a crucial prerequisite to sustain an interpersonal relationship (O'Connor and Barclay, 2017), as well as parasocial relationships (Hudders and Lou, 2023).
Defined as the ability of an individual or entity to stimulate social interaction (Preece, 2001), and in the context of VAs as the extent to which individuals perceive a VA as a socially attractive communication partner (Lee et al., 2006).
Referring to the act of attributing human-like characteristics to non-human entities such as a computer, or robot (Becker-Olsen and Hill, 2006; Epley et al., 2007), which represents a cornerstone in connecting emotionally with an interlocutor (Blut et al., 2021) and facilitates social comparisons (Festinger, 1954). These processes are critical for developing and maintaining parasocial relationships with non-human entities (Giles, 2002; Klimmt et al., 2013).
As participants were given the option to listen to the fragment as many times as needed, we only excluded participants based on the lower bound of completion duration.
Categorizing speech rate as low or high remains a subjective process, with diverging categorizations across contexts (e.g. Pimsleur et al., 1977; Rodero, 2020; Tauroza and Allison, 1990). In response, we instead rely on the outer boundaries beyond which speech rate causes excessive eeriness (Ketrow, 1990) and vary speech rates within this window.
In line with the pre-test, the results of the manipulation checks in the main study demonstrate that participants reported significantly higher perceived intonation in the high intonation conditions (Mintonationhigh = 4.36, SD = 2.39) than in the low intonation conditions (Mintonationlow = 3.15, SD = 2.16; t(578) = 6.37; p < 0.001). In addition, significantly higher levels of perceived speech rate were found in the high speech rate conditions (Mspeechratehigh = 5.66, SD = 1.50) than in the low speech rate conditions (Mspeechratelow = 4.71, SD = 1.47; t(578) = 7.70; p < 0.001).
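As a rough plausibility check, the reported t-values can be approximated from the summary statistics alone. The sketch below is illustrative only and is not the authors' analysis script: it assumes a pooled-variance independent-samples t-test and an even split of the 580 participants (290 per level of each factor), which the text does not state. The helper name `pooled_t` and the constant `N_PER_GROUP` are hypothetical.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's t for two independent groups, computed from summary statistics."""
    df = n_a + n_b - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    # Standard error of the difference between the two group means.
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df


# ASSUMPTION: equal cell sizes; the note reports only the total df (578).
N_PER_GROUP = 290

# Perceived intonation: high vs. low intonation conditions.
t_intonation, df = pooled_t(4.36, 2.39, N_PER_GROUP, 3.15, 2.16, N_PER_GROUP)

# Perceived speech rate: high vs. low speech rate conditions.
t_rate, _ = pooled_t(5.66, 1.50, N_PER_GROUP, 4.71, 1.47, N_PER_GROUP)

print(f"intonation:  t({df}) = {t_intonation:.2f}")
print(f"speech rate: t({df}) = {t_rate:.2f}")
```

Under these assumptions, both statistics land close to the reported values; small deviations would be expected from rounding of the published means and standard deviations or from unequal cell sizes.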
Additional analyses revealed that, except for the main effect of VA voice intonation on perceived anthropomorphism (β = 0.262, p = 0.001), no statistically significant effects of VA voice intonation and/or speech rate on perceived anthropomorphism and trust were found.
Of the included controls, AI anxiety (β = 0.081, p = 0.043), AI privacy concerns (β = −0.351, p < 0.001), education (β = 0.184, p = 0.018; baseline: secondary education), level of English listening skills (β = −0.368, p = 0.002; baseline: non-native) and tech-savviness (β = 0.270, p < 0.001) reached statistical significance. For education, the category “None of the above” (N = 7), and for gender, the categories “Non-binary” (N = 6) and “Prefer not to say” (N = 5) were excluded pairwise due to low sample sizes.
From the set of included control variables, the effects of AI anxiety (β = −0.130, p = 0.001) and gender (β = 0.184, p = 0.002; baseline: male) were statistically significant.
Among the included controls, AI privacy concerns (β = −0.221, p < 0.001), age (β = 0.203, p < 0.001) and gender (β = 0.130, p = 0.022; baseline: male) were found to be statistically significant.

