This paper aims to provide a comprehensive understanding of the evolving landscape of Privacy Preservation (PP) techniques within the context of Web 3.0, where Social Media (SM) is integrated with the Internet of Things (IoT). It explores the challenges and opportunities inherent in safeguarding user privacy amidst the convergence of these technologies, with a focus on examining the efficacy of existing PP methods and identifying areas for further research and development.
This study adopts a structured approach, beginning with a detailed overview of SM, IoT and the data generated by IoT-integrated SM networks, emphasising their significance in contemporary applications. It proceeds to explore privacy concerns specific to these networks, followed by an exhaustive analysis of PP techniques, including Differential Privacy (DP), Federated Learning (FL) and blockchain. In addition, this study examines prevalent cyber threats and vulnerabilities related to PP in IoT-integrated SM networks, providing insights into emerging challenges and the need for robust security measures.
The analysis reveals the critical importance of PP in striking a delicate balance between harnessing the benefits of enhanced connectivity and data insights while ensuring the protection of user privacy. DP, FL and blockchain emerge as prominent techniques for PP, each offering unique advantages and limitations in preserving user privacy within IoT-integrated SM networks. Moreover, this study underscores the dynamic nature of the threat landscape, necessitating continual adaptation and innovation in cybersecurity practices to mitigate evolving risks effectively.
This paper contributes to the existing literature by offering a comprehensive and structured overview of PP techniques tailored specifically to the context of IoT-integrated SM networks. By synthesising insights from diverse sources and providing a detailed understanding of the challenges and opportunities in PP, it advances the ongoing discussion on privacy in the digital age. Furthermore, by outlining future research directions, this paper encourages further inquiry and innovation in safeguarding user privacy within IoT-integrated environments.
1. Introduction
In today’s digital era, social media (SM) has become an integral part of our daily lives, shaping the way we communicate, share experiences and engage with the world (Salim et al., 2022b). The present nature of SM platforms encourages individuals to openly express diverse facets of their lives, ranging from personal activities and preferences to opinions on various subjects. Beyond personal use, the industrial sector leverages the extensive reach of SM to advertise products and services and to cultivate customer relationships. Simultaneously, the emergence of the Internet of Things (IoT) has revolutionised the connectivity and intelligence of everyday devices, facilitating a seamless integration of the physical and digital contexts.
The convergence of SM and IoT is not coincidental; it is driven by the striking similarities between user interactions on platforms like Facebook and Twitter, and the operational dynamics of IoT technologies (Ortner et al., 2018). As noted by Atzori et al. (2014), IoT technologies are often more relatable and understandable when framed in the context of everyday SM interactions. Building on this concept, the integration of Smart IoT devices into social networks is proposed, giving rise to IoT-integrated SM networks.
The integration of SM and IoT technologies yields transformative benefits, significantly extending the capabilities of both domains. It unlocks new dimensions of data collection and analysis, enabling the amalgamation of SM data with insights from a myriad of IoT-connected devices (Salim et al., 2020). This integration provides access to a rich, comprehensive data set, offering a deeper and more nuanced understanding of user behaviour and preferences. The SM 3.0 data set exemplifies this transformative potential, showcasing the powerful synergy between SM and IoT (Salim et al., 2022b).
This interdependent relationship between SM and IoT is reshaping the way we collect, analyse and apply data, opening up new avenues for innovation, personalisation and growth. As we explore the potential of IoT-integrated SM networks, we stand to unlock novel insights and applications that will drive the future of digital interaction and data-driven decision-making.
As IoT-integrated SM networks continue to emerge and expand, a range of challenges and concerns related to Privacy Preservation (PP) comes to the forefront (Roopa et al., 2019; Jia et al., 2020). Balancing the benefits of increased connectivity and enriched data insights with the critical need to protect user privacy becomes ever more essential in navigating this evolving digital landscape (Geng et al., 2020). PP presents a significant challenge in both IoT and SM, as these domains play a pivotal role in addressing privacy issues and safeguarding data against potential threats or unauthorised access (Keshk et al., 2021). Given that IoT-integrated SM networks generate vast and varied data from numerous devices and users, it becomes crucial to implement effective PP measures to protect users’ sensitive information from unauthorised exposure (He et al., 2017).
One notable example is the case of smart home security systems that have been exploited to gather personal information without the users’ consent. Attackers could potentially access real-time video feeds or alarm logs to gain insights into users’ activities and schedules. Another example involves fitness trackers integrated with SM platforms, where personal health data, including exercise routines and health conditions, could be harvested and misused if not adequately protected.
In such scenarios, PP measures are critical to ensuring that user data is not only protected from unauthorised access but also remains secure when used for ML applications. By implementing advanced PP techniques, data can be anonymised or encrypted, making it difficult for attackers to extract meaningful information from it. This approach not only enhances user privacy but also allows data consumers to use the data for ML applications without compromising individual privacy.
As illustrated in Figure 1, effective PP measures involve transforming the original data into a well-preserved version, which can then be used to enhance ML-based applications while minimising the risk of privacy breaches.
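As a rough illustration of such a transformation, the following Python sketch pseudonymises a user identifier with a keyed hash, coarsens location and drops direct identifiers before a record is released. The field names (`user_id`, `email`, `lat`, `lon`, `heart_rate`), the key and the one-decimal rounding are assumptions made for this example and are not drawn from any surveyed system. Pseudonymisation of this kind reduces, but does not eliminate, the risk of re-identification.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; a real system would store it securely.
SECRET_KEY = b"example-key-not-for-production"

# Fields treated as direct identifiers in this sketch (an assumption).
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonymise(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def preserve(record: dict) -> dict:
    """Return a privacy-preserved copy of one record."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if field == "user_id":
            out[field] = pseudonymise(value)
        elif field in ("lat", "lon"):
            # Coarsen location to 0.1 degree (roughly 11 km of latitude).
            out[field] = round(value, 1)
        else:
            out[field] = value  # keep remaining attributes for analysis
    return out


original = {"user_id": "alice", "email": "alice@example.org",
            "lat": -33.86882, "lon": 151.20929, "heart_rate": 72}
preserved = preserve(original)
```

The preserved record keeps the attribute an ML application might need (`heart_rate`) while the identifier is no longer directly readable and the location is less precise.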
This paper aims to conduct a comprehensive survey of IoT-integrated SM networks and their potential implications, with a particular focus on PP. More specifically, it reviews and discusses the concepts and previous research presented in the domains of IoT-integrated SM networks and PP. This survey also aims to assess the efficacy of PP techniques when used in distributed environments, as well as the extent to which privacy can be preserved without compromising the utility of the data.
1.1 Motivation
The integration of SM and IoT technologies presents a paradigm shift in how individuals interact with their environment and each other. This convergence offers immense opportunities for innovation, from enhancing user experiences to revolutionising business practices. However, amid these advancements, the paramount importance of PP cannot be overstated.
The motivation behind this survey stems from the critical need to address the escalating concerns surrounding privacy in IoT-integrated SM networks. As these networks emerge, they collect vast amounts of personal data from users and devices, raising significant privacy challenges. Individuals are increasingly wary of potential data breaches, unauthorised access and exploitation of their personal information. Moreover, regulatory bodies are tightening their scrutiny on data handling practices, necessitating robust privacy measures.
By conducting a comprehensive survey of IoT-integrated SM networks and their privacy implications, this paper aims to shed light on the existing landscape of PP techniques and challenges. Understanding the current state-of-the-art methodologies and identifying gaps in PP will pave the way for the development of more effective and resilient privacy solutions. Moreover, by exploring the intersection of SM, IoT and PP, this survey seeks to facilitate further research and innovation in this emerging field. More specifically, the motivation behind this survey lies in its endeavour to provide insights into the evolving landscape of IoT-integrated SM networks, with a particular emphasis on PP. By examining existing techniques, challenges and future directions, this paper aims to contribute to the ongoing discourse on privacy in the digital age and foster advancements in safeguarding user privacy in IoT-integrated environments.
1.2 Comparison with existing surveys
An extensive survey literature covers a wide range of complementary topics related to PP in IoT and SM networks (Seliem et al., 2018; Chanal and Kakkasageri, 2020; Jain et al., 2021; Saura et al., 2022; Ahmadvand et al., 2023; Nicolazzo et al., 2024). These surveys provide valuable insights into various aspects of privacy, security and data management in the context of emerging technologies. For instance, Seliem et al. (2018) explored PP techniques in IoT environments, discussing methods for protecting sensitive data generated by IoT devices. Similarly, Jain et al. (2021) discussed security and privacy issues in online social networks, examining threats such as data breaches and identity theft and proposing solutions to mitigate these risks. In addition, Keshk et al. (2021) investigated PP in Cyber-Physical Systems (CPSs), focusing on the unique challenges posed by integrating physical and digital elements. Moreover, Nicolazzo et al. (2024) explored privacy in blockchain-based Federated Learning (FL) systems, highlighting the role of decentralised technologies in ensuring data privacy and security.
As shown in Table 1, each of these surveys offers a comprehensive analysis of the challenges and solutions within their respective domains, providing valuable insights into emerging privacy concerns. However, our survey extends this body of literature by focusing specifically on the convergence of IoT and SM networks. We provide an in-depth examination of the challenges and strategies for PP in this context, considering the unique characteristics and complexities of integrating IoT devices with SM platforms.
Table 1. Summary of related SM and IoT-PP surveys
| Reference | Focus | IoT | SM | IoT-integrated SM | Threats | Comments |
|---|---|---|---|---|---|---|
| Abi Sen et al. (2018) | PP in IoT | ✓ | X | X | X | Discusses the difference between privacy and security and outlines various techniques used to meet privacy needs within the IoT context |
| Seliem et al. (2018) | Towards privacy-preserving IoT environments | ✓ | X | X | X | Reviews existing research and proposed solutions addressing rising privacy concerns from various perspectives, including identification, tracking, monitoring and profiling |
| Chanal and Kakkasageri (2020) | Security and privacy in IoT | ✓ | X | X | X | Discusses main challenges such as confidentiality, integrity, authentication and availability for IoT briefly |
| Jia et al. (2020) | Location PP in social internet of vehicles (SIoV) | ✓ | ✓ | ✓ | X | Analyses the location privacy protection technology used in the field of SIoV |
| Jain et al. (2021) | Online social networks security and privacy | X | ✓ | X | ✓ | Reviews different security and privacy threats and existing solutions that can provide security to social network users |
| Keshk et al. (2021) | Privacy-preserving of CPSs | ✓ | X | X | ✓ | Reviews the current PP techniques for protecting CPSs and their data from cyberattacks |
| Saura et al. (2022) | Security and privacy issues of social networks | X | ✓ | X | ✓ | Analyses the main risks related to security and privacy of social networks based information systems in Industry 4.0 |
| Torre et al. (2023) | PP techniques for IoT devices | ✓ | X | X | X | Analyses and discusses the privacy-preservation techniques for IoT devices published in five different academic venues |
| Ahmadvand et al. (2023) | Privacy and security in software defined networking (SDN)-based IoT | ✓ | X | X | ✓ | Analyses security and privacy issues and solutions for SDN-based IoT systems |
| Nicolazzo et al. (2024) | Privacy in blockchain-based FL systems | – | – | – | ✓ | Explores the research efforts carried out by the scientific community to define privacy solutions in scenarios adopting blockchain-enabled FL |
| Our survey | Privacy preservation of Internet of Things–integrated social media network | ✓ | ✓ | ✓ | ✓ | Thoroughly examines the context of SM, IoT and their convergence in IoT-integrated SM networks, focusing on the challenges and strategies of PP. It outlines the challenges in managing privacy in such distributed data-rich environments and discusses various PP techniques and their applicability. Furthermore, it reviews attacks in the context of IoT-integrated SM networks while providing an overview of the associated threat models |
To ensure the relevance and rigour of the works cited, we followed a well-defined selection process. Inclusion criteria required that studies be published in reputable peer-reviewed journals or presented at respected conferences within the past seven years. We specifically included research that addressed PP techniques in the context of Web 3.0, SM and the IoT. Priority was given to works that provided empirical evidence, reviewed effective PP methods or discussed significant cyber threats in IoT-integrated SM networks. Exclusion criteria were applied to studies that were not peer-reviewed, published more than seven years ago or did not directly relate to the integration of PP techniques with IoT and SM. We also excluded theoretical studies without practical relevance or those that fell outside the scope of our focus on IoT-integrated SM networks. This careful selection process ensures that our survey includes only the most relevant and impactful research, providing a thorough and accurate reflection of the current state of the field.
While existing surveys often focus on specific aspects of either SM or IoT, this paper bridges the gap by addressing the integration of both domains. By exploring how SM and IoT technologies intersect and influence each other, this survey offers a holistic understanding of the privacy challenges and opportunities inherent in IoT-integrated SM networks.
Moreover, this survey goes beyond merely summarising existing techniques by providing detailed descriptions and evaluations of PP methods tailored to IoT-integrated SM networks. By examining techniques such as Differential Privacy (DP), FL and blockchain in this context, it offers insights into their applicability and effectiveness in safeguarding user privacy.
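To make the idea behind DP concrete, the sketch below releases a noisy count using the Laplace mechanism, which adds noise scaled to the query's sensitivity divided by the privacy budget. The step-count data, the threshold and the value of `epsilon` are hypothetical, and the sampler is a plain floating-point illustration rather than a hardened implementation.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count under epsilon-DP using the Laplace mechanism.

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
steps = [4200, 11800, 9500, 15300, 7600]  # hypothetical daily step counts
noisy = private_count(steps, lambda s: s > 9000, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives stronger privacy but a noisier answer, which is the privacy and utility trade-off that recurs throughout this survey.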
In addition, this survey discusses cybersecurity threats and vulnerabilities specific to IoT-integrated SM networks, highlighting the evolving threat landscape and the need for robust security measures. By addressing both PP techniques and cybersecurity concerns, this paper provides a comprehensive overview of the challenges and opportunities in securing user data in IoT-integrated environments.
Furthermore, this survey offers a forward-looking perspective by discussing future research directions and emerging trends in the field of PP in IoT-integrated SM networks. By identifying areas for further exploration and development, it contributes to advancing the state-of-the-art in PP and cybersecurity. In summary, while existing surveys provide valuable insights into various aspects of privacy and security in emerging technologies, this survey stands out for its comprehensive coverage, in-depth analysis and forward-looking perspective on PP in IoT-integrated SM networks.
1.3 Original survey contributions and structure
This survey paper offers a comprehensive understanding of PP in IoT-integrated SM networks. The key contributions and the structure of the survey are as follows:
Comprehensive overview: The survey begins with a concise overview of SM, IoT and the data generated by IoT-integrated SM networks, highlighting their significance in real-world applications. This section lays the foundation for understanding the interconnectedness of SM and IoT and the implications for PP (subsections 2.1, 2.2 and 2.3 respectively).
Privacy concerns and techniques: The survey explores examples of privacy concerns arising from IoT-integrated SM networks and examines efforts made to define PP techniques, including discussions on privacy metrics. By exploring the landscape of privacy challenges, this section sets the stage for a deeper analysis of PP techniques (subsection 2.4).
Description of PP techniques: Detailed descriptions of various PP techniques are provided, with a particular focus on DP, FL and blockchain. These techniques are evaluated in the context of IoT-integrated SM networks, shedding light on their applicability and limitations in preserving user privacy (Section 3).
Cyberattacks and threats: An examination of cyberattacks and vulnerabilities related to PP in IoT-integrated SM networks is presented, highlighting the potential risks and challenges faced in safeguarding user privacy. This section offers insights into emerging threats and the need for robust security measures (Section 4).
Discussion of challenges and future directions: A brief discussion of the existing challenges in PP techniques applied to SM and IoT domains is provided, underscoring the significant contributions of this survey. Additionally, future research directions are outlined to address the evolving landscape of PP in IoT-integrated environments (Section 5).
Furthermore, the paper concludes with a summary of key findings and insights learned from the survey, emphasising the importance of ongoing research and innovation in safeguarding user privacy in IoT-integrated SM networks (Section 6). By following this structured approach, the survey offers a comprehensive analysis of the current state-of-the-art in PP techniques for IoT-integrated SM networks, identifying areas for further exploration and development.
2. Overview of Internet of Things–integrated social media and privacy preservation
This section provides an overview of essential concepts, beginning with defining SM, IoT and the integration of IoT with SM. In addition, we introduce the concept of PP and elaborate on its metrics, including data privacy and utility level, along with considerations of complexity.
2.1 Social media
SM is a form of informal communication. Initially, SM platforms were referred to as “micro-blogging” sites. The term “micro-blogging” was developed to differentiate SM from the dominant form of online communication at the time, blogging (Salim et al., 2022b). SM has eliminated the formality associated with writing a well-thought-out blog article, allowing micro-ideas to bloom and individuals to communicate much more readily than in the past (Toivonen et al., 2019; Bercovici, 2010). When SM was first developed, the principal objective was to provide a mechanism for individuals with little to no knowledge of computing or composition skills to communicate directly with others. SM makes it simple for users to pass notes to others without knowing much about programming, coding, basic syntax or sentence structure (Toivonen et al., 2019).
While there are conflicting claims as to who originated the term SM, it is believed to have first emerged in the early 1990s in line with the development of Web-based communication technologies that promoted online communication (Bercovici, 2010). However, it took until the mid-2000s for SM to become even more entrenched in public and scientific knowledge. Since then, SM has been popularly considered as a particular class of internet-based services (Ortner et al., 2018). However, exactly what is meant by SM is debatable, because it is highly difficult to provide a single definition that incorporates all of the tools and activities related to SM, particularly because SM is not characterised by any particular context, structure, content, user or provider.
Given the potential for wider definitions of SM, most researchers who use the term refer to a particular collection of online platforms that have arisen during the past two decades, such as blogs, micro-blogging and social networking sites (Bercovici, 2010). The emergence of such platforms and the corresponding applications is frequently regarded as the development of Web 2.0, which refers to the appearance of a group of Web-based applications that allowed all users to write and contribute by publishing content that could be accessed by other users. In reality, some researchers prefer the term “Web 2.0” over “SM” as it serves as an umbrella term for a larger set of online applications that allow expression and communication. The wider term of “Web 2.0” also implies other aspects of the technology, including the integration of the physical and digital worlds (Blackstock et al., 2011).
Since its inception in 1990 (Ortner et al., 2018), SM has made significant strides as the world’s largest data resource. SM has transitioned through several stages of development during its lifecycle, from SM 1.0 to SM 2.0, in line with the evolution of the World Wide Web, from Web 1.0 to Web 4.0. According to the literature, Web 1.0 is characterised as a web of information connections, Web 2.0 as a web of people connections, Web 3.0 as a web of knowledge connections and Web 4.0 as a web of intelligence connections (Choudhury, 2014). According to Berners-Lee (1992), the first implementation of the Web, representing Web 1.0, might be regarded as the read-only Web. In other words, the early Web simply allowed users to search for and view content. Early applications of the shopping cart, usually used by e-commerce websites, fell within the Web 1.0 category, with the goal of offering items to potential customers without any active dialogue or information exchange between clients and producers. Web 2.0 emerged in response to this lack of active user participation in web applications, ushering in the “Read-Write-Publish” era (Ortner et al., 2018; Choudhury, 2014).
By allowing users to create content and interact with one another, Web 2.0 has had a huge influence on the Web ecosystem. During this time, web services’ users were introduced to new concepts such as blogs, SM and video streaming. As such, SM 1.0 advocated simplifying communication to connect with the people we desired to engage with. This has contributed to the formation of communities. Early online communities were platform-dependent, which meant that some groups exclusively used particular platforms to host their communities since that was where the bulk of individuals with similar interests were. SM 2.0 was the first to integrate SM networks (i.e. comment systems and sharing buttons) directly into the Web, allowing for content-based sharing rather than communication-based sharing. For the first time, SM 2.0 enabled individuals to exchange objects of interest, making it much easier to debate the media itself rather than needing to explain it (Ortner et al., 2018). Following the pattern of continual innovation, the Web is slowly but steadily shifting to a more data-centric phase in the context of Web 3.0. The timeline of the major SM platforms from 1990 to 2024 is depicted in Figure 2.
SM is often considered the social tether of modern times and provides ways to connect individuals from all over the world with their friends and family, strangers, celebrities, companies and news and entertainment services. Online SM platforms are used to meet new people, learn new skills, connect and reconnect with acquaintances, find new jobs, disseminate policy and make public announcements (Zhang et al., 2018). SM data affords a massive record of humanity’s daily thoughts, emotions and activities at a resolution previously unimaginable. According to SM statistics for 2022 (SMPerth, 2022), there are 4.20 billion active SM users globally, with an average of 2 million joining per day. Because of the presence of relationships between users and the ability to exchange content with them, SM platforms are ideal for staying in touch with family, friends and co-workers.
Recently, big data researchers, such as those in Lytras and Visvizi (2020); Gella et al. (2018); and Gupta et al. (2018), have recognised that they can use SM data to make accurate predictions about the users’ preferences through the analysis of their behaviours and have used this information to create both authorised and illegitimate advertising and influence campaigns (Anandhan et al., 2018). To achieve highly targeted and relevant recommendations for content and advertising, Web services, such as e-commerce applications, frequently depend on the enormous volumes and variety of data SM platforms hold about individuals, especially users’ profile data including locations, interests, lifestyle, personality traits and IoT data on social platforms (Arachchige et al., 2020; Ju et al., 2019). Public policymakers, for example, analyse SM data to obtain demographic information that may be used to inform strategic decisions. The use of SM at the governmental level has both advantages and disadvantages: SM is regarded as less accurate than paid polling services, but it may be accessible when these services are not, is significantly more timely and offers benefits of scale that traditional polling services cannot match.
The availability of SM data sets is vital in all disciplines that study social users, such as research on PP (Ram Mohan Rao et al., 2018), as well as scientific research, market research, data analytics, measuring users’ influence and behaviour, advertising and recommendation systems (Ram Mohan Rao et al., 2018; Mendes and Vilela, 2017). The majority of research being conducted in these fields focuses on analysing SM data and using it as a foundation for further exploration. Therefore, a rising number of SM data sets are being used in the literature, with more being made publicly available on a regular basis (Naseem et al., 2021; Morris et al., 2020). Since the general public’s adoption of SM platforms like Facebook, LinkedIn, Twitter, Instagram, TikTok, YouTube and social IoT, these platforms have been extensively investigated and there is a greater demand for ethically acquired and ground-truth data sets (Ram Mohan Rao et al., 2018).
Facebook is a general-purpose SM platform where individuals may list their favourite hobbies, literature and movies (Lipschultz, 2020), whereas LinkedIn is a professional network where individuals can declare areas of interest associated with their professional life to network and find employment opportunities. Twitter has been described as “micro-blogging” due to its use of brief messages and photos. Instagram and TikTok are specially developed for the purpose of sharing and re-sharing photos and short videos. YouTube is a video hosting site, but because of the commenting and sharing features, it is sometimes referred to as an SM network in its own right (Lipschultz, 2020). Users’ interactions in SM are frequently built on reciprocation; for example, on Facebook, a user must accept a connection request before two users are linked. Twitter, by contrast, bases its users’ connections on following or being followed, with no requirement for reciprocation. Users are permitted to follow another user without being followed in return.
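The difference between reciprocal and one-way connections can be expressed as undirected versus directed edges in a graph. The sketch below is a toy model written for this illustration only; the class and method names are hypothetical and do not correspond to any platform's API.

```python
from collections import defaultdict


class SocialGraph:
    """Toy model of reciprocal (friend) and one-way (follow) connections."""

    def __init__(self):
        self.pending = set()               # (requester, target) friend requests
        self.friends = defaultdict(set)    # undirected: stored in both directions
        self.following = defaultdict(set)  # directed: follower -> followees

    def request_friend(self, requester, target):
        self.pending.add((requester, target))

    def accept_friend(self, target, requester):
        """The link is created only once the target accepts the request."""
        if (requester, target) not in self.pending:
            raise ValueError("no pending request to accept")
        self.pending.remove((requester, target))
        self.friends[requester].add(target)
        self.friends[target].add(requester)

    def follow(self, follower, followee):
        """No acceptance step: the edge exists in one direction only."""
        self.following[follower].add(followee)

    def is_friend(self, a, b):
        return b in self.friends[a]

    def is_following(self, a, b):
        return b in self.following[a]


g = SocialGraph()
g.request_friend("ana", "ben")
g.accept_friend("ben", "ana")  # reciprocal link, visible from both sides
g.follow("ana", "cho")         # one-way link, cho does not follow ana
```

The distinction matters for privacy analysis because a directed follow edge can expose a user's content to an audience the user never approved, whereas a reciprocal edge requires consent from both parties.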
Despite the fact that SM data is the primary motivator for organisations that develop and maintain these decision-making systems, SM data availability and privacy remain concerns in this space (Di Minin et al., 2021; Areekijseree et al., 2018; Siddula et al., 2018). In early studies on SM-related big data, researchers gathered SM data from individuals through questionnaires, interviews and surveys (Wang, 2014). These practices have numerous drawbacks: they are labour-intensive, difficult to scale and normally carried out in local groups, limiting their scope for study. In this regard, the development of online SM platforms has resulted in a substantial change in SM data research (Lytras and Visvizi, 2020), as it has considerably increased the availability and quantification of SM data. More specifically, many SM data sets have been acquired automatically using programs or scripts (Gella et al., 2018; Ding et al., 2016; Takac and Zabovsky, 2012), whereas complex SM platforms, such as Facebook and Twitter, offer data-collecting Application Programming Interfaces (APIs) (Areekijseree et al., 2018). As a consequence, the major obstacles to acquiring SM data turned out to be more related to processing power, storage capacity, data access rate and users’ privacy concerns (Wang, 2014).
Because SM platforms are often massive in terms of their user populations, amounts of user-generated information and update velocities that continue to rise fast, crawling them necessitates a robust system with massive storage capacity and processing power (Areekijseree et al., 2018; Gupta et al., 2018). Furthermore, in some instances, complex SM platforms limit data access rates and information availability. There are also licences for the acquired data, which complicate the concept of open and comprehensive data sets (Areekijseree et al., 2018). Due to these issues, collecting comprehensive SM data sets is sometimes difficult (Siddula et al., 2018). To address this, some studies, such as those in Takac and Zabovsky (2012); Areekijseree et al. (2018); Siddula et al. (2018), used various SM crawling methods to acquire users’ profile data from a big SM platform for analytical purposes, whilst others, such as Ding et al. (2016), collected samples of users’ interaction data through non-rate-limited APIs. However, the quality of the resulting data remains a challenge and the degree to which the acquired data represent the original data set remains unclear (Mouris et al., 2018).
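The effect of access-rate limits on data collection can be sketched with a simple fixed-window throttle around a paginated API. Everything here is hypothetical: `fetch_page` stands in for a platform call that returns a page of items and a cursor for the next page, and the limits are example values rather than those of any real service.

```python
import time


def crawl(fetch_page, max_calls, window_seconds,
          clock=time.monotonic, sleep=time.sleep):
    """Collect all pages from a cursor-based API under a fixed-window limit.

    fetch_page(cursor) is a hypothetical callable returning
    (items, next_cursor), with next_cursor set to None on the last page.
    """
    items, cursor = [], None
    window_start, calls = clock(), 0
    while True:
        if calls >= max_calls:
            # Budget for this window is spent: wait until the window ends.
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start, calls = clock(), 0
        page, cursor = fetch_page(cursor)
        calls += 1
        items.extend(page)
        if cursor is None:
            return items


# In-memory stand-in for a platform API, used only to exercise the sketch.
PAGES = {None: (["post-1", "post-2"], "p2"), "p2": (["post-3"], None)}
collected = crawl(lambda cursor: PAGES[cursor], max_calls=10, window_seconds=60)
```

With a low `max_calls`, most of the crawl time is spent waiting, which illustrates why rate limits make it hard to gather samples large enough to represent a whole platform.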
The failure to take into account the quality and potential bias of obtained data for a single stand-alone SM platform decreases the efficacy and validity of the findings provided. Because an individual can be a participant in many social networks at the same time, a composite social network in which the user can demonstrate varied behaviours while revealing certain shared underlying concerns and preferences might be better structured. These approaches have also highlighted the ethical concerns that such data collection methods entail. While many research studies (Ding et al., 2016; Takac and Zabovsky, 2012) have been undertaken to build SM data sets, the construction of realistic SM data sets that incorporate recent data characteristics settings is yet unexplored. More significantly, some data sets ignore IoT-related data, whereas others eliminate any new features. In some circumstances, the generating environment was unrealistic, whereas in others, the PP’s procedures were missing.
In reality, many SM users are members of many social networks at the same time. However, due to the limitations of each platform’s capabilities, each one only presents a partial representation of a user based on the data collected and the goal of the SM platform. As a result, combining data from diverse SM platforms is critical for improving user modelling and establishing accurate decision-making systems (Gupta et al., 2018). Unlike typical network-embedding data sets, where an entire structure is either a single platform or each platform included is a homogeneous network, there is a need to focus on multiple social platforms. Wang et al. (2019c), for example, designed an improved model for learning multi-view user representation using knowledge collected from numerous social networks to predict users’ behaviours, which may improve social advertising, preference predictions and service suggestions. However, appropriately combining knowledge in this context is difficult because it relies not only on integrating disparate data sources but also on objective applications for enhancing recommendation engines or assessing other models such as PP ones.
SM data holds significant potential as a tool for evaluating privacy models aimed at safeguarding user data (Ferrag et al., 2017). As noted by Cai et al. (2016), three key SM data sets are often used to explore PP issues. The first, the SNAP data set, includes user connections along with node attributes like gender, birthday, employer and location, with each attribute represented by a binary value indicating presence or absence (Leskovec and Krevl, 2014). The other two data sets, from Caltech and MIT, capture user associations at the California Institute of Technology and the Massachusetts Institute of Technology in 2005 and include details such as role (student or faculty), gender, graduation year and academic major (Cai et al., 2016; He et al., 2017). Although these data sets are valuable for evaluating privacy methods, they remain limited and single-platform focused (Mouris et al., 2018).
Beyond the limited availability of SM data for many data-based applications and for evaluating PP techniques, two further critical issues are discussed in this paper. First, there is a scarcity of heterogeneous data sources (i.e. data sets) that combine IoT with SM. There are also only a limited number of data sets with ground truth against which the reliability of PP can be demonstrated. As a result, there is a genuine need for a realistic data set that includes SM and IoT data to assess the trustworthiness of novel PP mechanisms.
Second, such data must be robustly and adequately safeguarded against privacy breaches, such as deanonymisation through inference attacks (Zhang et al., 2018; Seliem et al., 2018), particularly in distributed environments. Using sophisticated, targeted hacking techniques, attackers may be able to exploit and manipulate users’ data (Yang et al., 2018; Mohbey et al., 2020). These techniques include phishing and spear-phishing campaigns, fraud, opinion influencing, deanonymisation and private data leaks.
Using such hacking techniques, sensitive information about users and their relationships might be improperly disclosed in SM, compromising the users’ privacy; consequently, protecting this information has become challenging (Zhang et al., 2018). Social networks are distinctive in this regard because they are closely tied to users’ real-world lives, combine large amounts of private data and are widely interconnected.
2.2 Internet of Things data
The IoT is the most recent internet transition, incorporating billions of smart devices, smartphones, wearable electronics and other internet-based sensors that interact through the internet, leveraging their functionality and data to offer novel smart facilities and services that benefit our community (Greengard, 2021). By 2025, internet connections are expected to be embedded in almost every applicable object, increasing the number of devices linked to the internet (Shafique et al., 2020). According to Cisco, there will be 500 billion internet-connected devices by 2030. Accordingly, as per Gartner’s IT Hype cycle (Domínguez Morales et al., 2023), the Internet Technologies (IT) services for the IoT ecosystem will offer a US$58bn potential in 2025, up 34% from 2020. Similarly, according to a recent Statista prediction (Vailshery, 2021), the IoT and its ecosystem would be worth $1.7tn by 2025, and the global installed base of IoT-linked devices is predicted to reach 30.9 billion units, up from 16.4 billion in 2022, as represented in Figure 3. IoT is enabling a paradigm shift towards a genuinely linked society in which ordinary things become networked, capable of communicating directly with one another, and able to provide smart services collectively. Thus, this potentially leads to a more efficient and functional environment.
The genesis of the IoT may be traced back to the 1980s with the concept of ubiquitous computing, which aimed to integrate technology into everyday life (Shafique et al., 2020). The entire concept of IoT centres on the keyword “smartness”, which is defined as the ability to learn and apply information autonomously (Ahmed et al., 2016). As a result, the IoT applies to devices or sensors that are smart, uniquely accessible based on their communication protocols, extensible and autonomous and even have embedded security and privacy (Shafique et al., 2020). Shafique et al. (2020) described IoT in terms of three visions: Internet Oriented, which mainly focuses on object connection, Things Oriented, which focuses on generic objects, and Knowledge Oriented, which focuses on how to organise, collect and maintain knowledge. Currently, the IoT is applied at both the individual and professional levels. Individually, the IoT plays a critical role in enriching lifestyle through smart homes, e-health and smart learning. Smart supply chains and transportation, logistics and remote monitoring are examples of IoT applications for professionals.
There is no broadly agreed-upon architecture for IoT. Various researchers have proposed various architectures (Mashal et al., 2015; Sethi and Sarangi, 2017; Khan et al., 2012). The most fundamental architecture is the three-layer architecture, as shown in Figure 4. It was mainly used in this domain while it was still in its infancy. The basis of that IoT architecture is made up of three layers, namely, perception, network and application layers (Greengard, 2021; Shafique et al., 2020; Yaqoob et al., 2017; Mashal et al., 2015).
The perception layer: this is the physical layer, which contains sensors for perceiving and acquiring data about the surroundings. It detects certain physical factors or recognises other smart devices in the environment.
The network layer: this is responsible for establishing connections with other smart objects, network devices and servers. Sensor data is also sent and processed using its features.
The application layer: this is responsible for providing the user with application-specific services. It outlines a number of applications for the IoT, including smart homes, smart cities and smart health.
The three-layer architecture encapsulates the core concept of the IoT; however, it is insufficient for IoT research, which frequently concentrates on finer details of the IoT. As a result, many additional layered architectures have been presented in the literature (Ning and Wang, 2011; Sethi and Sarangi, 2017; Khan et al., 2012). One example is the five-layer architecture (Khan et al., 2012), which adds the processing and business layers. Perception, transport, processing, application and business are the five layers that make up the architecture. The perception and application layers serve the same purpose as in the three-layer architecture. The remaining three layers are described below.
The transport layer: this transmits sensor data from the perception layer to the processing layer through networks such as wireless connections.
The processing layer: this is often referred to as the middleware layer. This layer receives, stores, analyses and processes massive volumes of data from the transport layer. It is capable of managing and providing a wide range of services to the lower layers. It makes use of a variety of technologies, including databases, cloud computing and big data processing systems.
The business layer: this oversees the entire IoT system, including applications, growth and income models.
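The data flow through the five layers can be illustrated with a minimal Python sketch. It is not drawn from Khan et al. (2012); the function names, the sensor readings and the usage report are hypothetical and serve only to show which layer is responsible for which step.

```python
from statistics import mean

def perception_layer():
    """Perception: sensors acquire raw readings (hard-coded, made-up values)."""
    return [
        {"sensor": "thermo-1", "celsius": 21.0},
        {"sensor": "thermo-2", "celsius": 22.0},
        {"sensor": "thermo-3", "celsius": 23.0},
    ]

def transport_layer(readings):
    """Transport: carry readings to the processing layer (a plain copy here)."""
    return [dict(r) for r in readings]

def processing_layer(readings, store):
    """Processing (middleware): store the readings, then analyse them."""
    store.extend(readings)
    return {
        "count": len(readings),
        "mean_celsius": mean(r["celsius"] for r in readings),
    }

def application_layer(summary):
    """Application: turn the analysis into a user-facing service message."""
    return f"Average temperature: {summary['mean_celsius']:.1f} C"

def business_layer(store):
    """Business: oversee the system, e.g. track usage for an income model."""
    return {"readings_stored": len(store)}

store = []  # stands in for the middleware database
summary = processing_layer(transport_layer(perception_layer()), store)
message = application_layer(summary)
report = business_layer(store)
print(message)
print(report)
```

In a real deployment each function would be a separate subsystem (sensor firmware, a wireless network, a cloud back end and so on); the sketch only fixes the direction in which data moves between the layers.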
Another architecture, developed by Ning and Wang (2011), is modelled on the layers of processing in the human brain. It is motivated by human intelligence and the ability to think, feel, recall, make decisions and react to one’s surroundings. It is made up of three parts. The first is the human brain, which corresponds to the data centre’s processing and data management unit. The second is the spinal cord, which corresponds to a distributed network of data processing nodes and smart gateways. The third is the network of nerves, which corresponds to the networking components and sensors.
It is generally believed that different architectural requirements exist for present and potential IoT applications, such as scalability, flexibility, interoperability, Quality of Service (QoS) support and security (Yaqoob et al., 2017). The capacity to manage connectivity among a large number of network devices without causing performance issues or bottlenecks is referred to as scalability. Flexibility refers to the supply of services in such a way that a particular system may be flexibly configured to improve the performance of certain applications. Interoperability facilitates connectivity and operation among diverse networks. Furthermore, as various types of applications such as low data rates and delay-sensitive applications are involved, QoS is one of the most essential architectural needs for IoT. One of the significant concerns of IoT is security because if a user’s data is compromised even once, it might negatively affect the user’s confidence in IoT. As such, while developing an IoT infrastructure, security must be prioritised.
There are two main types of IoT placement/design: centralised and decentralised (Greengard, 2021; Yaqoob et al., 2017). The majority of IoT systems are integrated with cloud architectures in which a central hub is used to deliver a number of back-end services to smart devices. In this design, smart devices function as consumers, while a central hub serves as a centralised node. The IoT platform’s key centralised features include event processing, event reporting and genuine analytics. On the contrary, in some contexts, independent communication between smart devices is necessary under the IoT paradigm without the requirement for a central hub. This design is a decentralised IoT platform. Peer-to-peer messaging, decentralised auditing and decentralised file-sharing are some examples of decentralised IoT platforms (Yaqoob et al., 2017).
Smart transportation, smart home, smart health care, smart grid, smart lighting and intelligent building are just a few examples of key IoT applications (Yaqoob et al., 2017). These applications help people in many aspects of their lives. By suggesting an alternate route, the smart transportation system lessens overcrowding. Furthermore, the prospective study of smart transportation data contributes to the reduction of road accidents. Residents of smart houses may control appliances remotely. Diseases may be identified early using smart health-care applications, saving lives. Smart meters are used to assess energy consumption levels in a smart grid setting, and readings are automatically relayed to the grid. Low-cost sensors and wireless communication may be incorporated into lights enabling smart lighting. Another prominent IoT use is intelligent building, in which the building is enabled by Information and Communication Technology. In essence, IoT applications make individuals’ life easier (Greengard, 2021).
The emerging emphasis on internet transition is encouraging more enterprises to enact IoT-enabled initiatives. While such initiatives enable enterprises to improve consumer experiences, build new revenue channels, or seek new partner ecosystems, collecting the data necessary to achieve these advantages might be challenging (ur Rehman et al., 2019). The massive volume of data generated by these devices, the variety of data obtained and the velocity with which data is produced present their own set of issues for such enterprises in terms of processing power, storage capacity and analytics (Yaqoob et al., 2017). The adoption of IoT has been exponential across all industries, but each industry has its own set of unique challenges. To reap the benefits of IoT, many enterprises feed the massive data produced by IoT devices into their Machine Learning (ML) models and employ prescriptive and predictive analytics to make more informed, and ultimately better, decisions (ur Rehman et al., 2019). However, in many applications, IoT data is considered sensitive and must be kept private and safe (Shafique et al., 2020; Ahmed et al., 2016; Ding et al., 2019; Stoyanova et al., 2020). Examples of sensitive IoT data include medical data gathered by biomedical sensors, location data provided by mobile phones and energy consumption data received by smart meters. The disclosure of such information might open the door to criminal activity, as well as cause significant harm or even death. As a result, IoT poses a significant challenge with regard to privacy, security and confidence, which are regarded as being among the remaining major impediments to the development of IoT applications (Shafique et al., 2020; Ding et al., 2019; Stoyanova et al., 2020).
As the IoT has raised concerns about data privacy and security, massive IoT data sets are required for analysing network flows, discriminating between normal and abnormal traffic and identifying malicious behaviour (Shafique et al., 2020; ur Rehman et al., 2019; Stoyanova et al., 2020). Specifically, the development of a genuine IoT data set is seen as a fundamental requirement for PP development (Mendes and Vilela, 2017; Batra et al., 2020; Salim et al., 2022b). Research on IoT smart device data sets falls largely into three main categories. The first category includes studies that make use of specially designed laboratories with pre-existing sensors and actuators across several platforms (Intille et al., 2005). In many contexts, they are designed with use cases in mind, such as simulating home, commercial, or industrial systems. This category of studies frequently focuses on the human interaction features and research problems that are prominent in comprehending smart environments, and centres on how individuals engage with a myriad of these interconnected systems. The second category of IoT smart device data set development concerns individuals and the Personal Area Network (PAN). Such data sets may be used to monitor and track one’s health and well-being (O’Brien et al., 2017; Alemdar et al., 2013).
The third category of research is in simulated data sets, which allow for the simulation of enormous numbers of interconnected devices using tested data. This approach has both advantages and disadvantages; the scale and simplicity of development are significant. However, simulation has limitations and may not precisely reflect the implementation of actual devices (Salim et al., 2022b). In this regard, the literature has provided a number of data sets to help researchers in simulating IoT devices and developing IoT data sets, including O’Brien et al. (2017); Alemdar et al. (2013); Koroniotis et al. (2019); Al-Hadhrami and Hussain (2020); and Salim et al. (2022b). Although certain data sets remain private for different reasons, including privacy concerns and a lack of PP mechanisms, others have been publicly available. The claimed goal of these studies and the data sets related to them was to obtain proof of malicious activity. Although this is significant, additional efforts are needed to make social IoT data smart enough to give long-term classification and prediction of users’ preferences as well as behaviours through IoT device monitoring.
Overall, the IoT may be the wave of the future, and finding ways to incorporate it into our daily lives is a critical issue to address (Jia et al., 2021; Sethi and Sarangi, 2017). Even though IoT provides a variety of major and substantial services to the world, numerous issues might obstruct its development. According to several studies (Sethi and Sarangi, 2017; Khan et al., 2012; Farhan et al., 2018), these obstacles include scalability concerns, data quantities, data interpretation, interoperability, fault tolerance, power supply, wireless communication, privacy and security, among others. However, security and privacy are now the most urgent concerns that need to be addressed, because they are seen as a necessary complement to IoT development that must be embedded in the IoT architecture (Seliem et al., 2018; Al-Hadhrami and Hussain, 2020; Nicolazzo et al., 2020; Jia et al., 2021). Furthermore, by linking the physical and virtual worlds, the IoT aims to expand existing human-to-human communications to human-to-things and things-to-things communication in the context of social IoT. This aim may be achieved by using SM, which provides a point of entry for users or applications to engage with Web-based objects.
2.3 Internet of Things-integrated social media networks (social media 3.0)
SM has shaped the world by enabling anybody, including the bulk of the world’s population, to engage with family and friends, exchange messages with celebrities and share real-time ideas, places, videos and images (Salim et al., 2021). The rapid growth of the internet, which has come a long way since the dial-up connections and AltaVista search engine of the past, has contributed to SM’s success (Fuchs, 2021). Indeed, this evolution has culminated in the transition from the information era to the SM era, and now to the era of the recently announced SM 3.0. In essence, SM 3.0 refers to the extension of current and developing SM platforms like Facebook, Twitter and TikTok, as well as the incorporation of new computing paradigms like the IoT, which results in a huge volume of IoT-integrated SM networks data (Salim et al., 2022b). The merging of these platforms as SM 3.0 offers users better integration, interactivity and seamless movement between physical venues.
SM 3.0 has the potential to significantly change how we engage with mobile devices, internet platforms and the world around us (Salim et al., 2022b). This integration with end users generates large-scale and heterogeneous data sources, allowing for a massive record of humanity’s daily thoughts, feelings and activities at previously unbelievable resolution. Recently, big data research, as in Takac and Zabovsky (2012); Ram Mohan Rao et al. (2018); ur Rehman et al. (2019); and Salim et al. (2022b), showed that they can leverage IoT-integrated SM networks, in other words, SM 3.0, data to generate accurate predictions about users’ preferences based on behavioural analysis, and they have exploited this information to construct both legitimate and illegitimate advertising and influence platforms.
According to Miranda et al. (2015), there is still a lot of potential for improvement in the integration of IoT technology into individuals’ lives. It is critical to develop this integration to make Smart devices accessible (Blackstock et al., 2011). As a framework for integrating Smart IoT devices into individuals’ social networks, we offer IoT-integrated SM networks. Every day, millions of genuine individuals use SM. As a result, adopting them as a channel of connectivity provides an environment that has been thoroughly tested to ensure complete availability and integrity (Siddula et al., 2018). So, we suggest that SM platforms be used not just to connect users but also to link things with network users, allowing them to operate smart devices or get sensor data notifications. As a result, the IoT expands owing to the addition of individuals, whereas the SM expands due to the addition of IoT devices. As we will explain in this paper, IoT-integrated SM networks provide us with comprehensive user data by combining SM data with data from other devices previously associated with SM users.
Such connectivity of the SM with IoT would result in big data that might be of use for many data-based applications (Ju et al., 2019; Wang, 2014; Lytras and Visvizi, 2020; Salim et al., 2022b). Although exchanging high-quality SM and IoT data is a necessity for many applications, especially knowledge-based preference prediction applications (Lytras and Visvizi, 2020), data in its raw form is regularly restricted as it may contain highly sensitive information about individuals. As a solution to this issue, there have been several proposals and schemes that aim to empower the usage of such valuable data whilst simultaneously preserving privacy. Such tools, processes and techniques are the basis for the field of PP. According to Garfinkel et al. (2015), PP is “techniques and methods for releasing data in a more hostile environment, so that the public data remains practically useful while individual privacy is safeguarded”.
Figure 1 simplifies the central concept of merging SM and IoT data to optimise future data-based applications. In this illustration, PP techniques are applied to the original data from IoT-integrated SM networks, resulting in a well-preserved version of the data. This preserved data can then be used by data consumers to enhance their ML-based applications, whereas attackers are less likely to obtain specific answers to their queries.
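As one concrete illustration of how a PP technique can blur the answers an attacker obtains, the sketch below applies the Laplace mechanism from Differential Privacy (DP) to a counting query. It is a minimal sketch under stated assumptions, not the pipeline depicted in Figure 1: the records are made up, and the function names and the privacy budget `epsilon` are chosen purely for illustration.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer a counting query with epsilon-DP noise.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Toy, made-up records combining SM activity with a wearable step counter.
records = [
    {"user": "u1", "steps": 9200, "posts": 3},
    {"user": "u2", "steps": 4100, "posts": 0},
    {"user": "u3", "steps": 12050, "posts": 7},
    {"user": "u4", "steps": 8800, "posts": 1},
    {"user": "u5", "steps": 7600, "posts": 2},
]

rng = random.Random(0)  # seeded only to make the example repeatable
noisy = private_count(records, lambda r: r["steps"] > 8000, epsilon=0.5, rng=rng)
print(f"noisy count of active users: {noisy:.2f}")
```

A data consumer still learns that roughly three users are active, whereas an attacker cannot tell from a single answer whether any particular user is in the data. A smaller `epsilon` gives stronger privacy at the cost of noisier answers.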
2.4 Privacy preservation and Internet of Things–integrated social networks
Sharing and publishing SM and IoT data has a long history of producing data-driven innovations. Conventional data sharing refers to information transactions between a data provider and a data consumer. Electronic Data Interchange (EDI), which was developed in the late 1970s and is still in use today, is a successful implementation of such electronic information transactions between parties with a focus on the business side (Kelly, 2020). EDI benefits numerous enterprises that create, ship, buy, sell or provide care, ranging from retailers and manufacturers to logistics companies, airlines, health-care providers, insurers and others. Despite its long history, EDI is finding new applications today, enabling supply chain automation and digital transformation and even serving as a crucial component of workflow and business process automation (Yunitarini et al., 2018). Currently, the terms “data sharing” and “data publishing” relate not just to the conventional one-to-one model but also to broader models that include several data providers and data consumers.
Globally, the information technology revolution, as well as effective digital data exchange by governments, businesses and individuals, has opened up tremendous opportunities for knowledge-based decision-making and established a domain that supports large-scale data analysis (Mendes and Vilela, 2017). Furthermore, motivated by aggregated benefits, or by applications and services that require certain data to be available, there is a necessity for exchanging/sharing data among multiple parties. In California, for example, licenced hospitals are obliged to disclose demographic data on every patient released from their institution (Carlisle et al., 2007).
In June 2004, the President’s Information Technology Advisory Committee (PITAC) in the USA issued a draft report titled Revolutionising Health Care Through Information Technology (Thompson and Brailer, 2004). One of its main goals was to create a statewide system of electronic medical records that promotes the sharing of medical information through computer-assisted clinical decision support. The USA is not alone in this regard. Many governments throughout the world are concerned about the secure transmission of data through a connected health-care system. According to the Medical Journal of Australia (Makeham and Ryan, 2022), Australia is widely regarded as a world leader in the provision of privately managed health records, which are technologies that enable individuals to access, manage and share their health information in a private, secure and confidential setting. In February 2019, a My Health Record (MHR) was produced for all Australians unless they decided not to have one. As predicted, this opt-out participation strategy resulted in a vast proportion of Australians now having access to MHR, with 90.1% of Australians engaging in the system at the time of record creation. While a national electronic summary record has been accessible since July 2012 (Makeham and Ryan, 2022), the transition to an opt-out system with the continuous option to permanently remove their record at any time gives Australians a vital choice about how they engage with their personal health information.
Data publishing is common across various sectors. For example, Netflix released a data set containing the movie ratings of about 500,000 subscribers to improve movie recommendations. While data technology has enhanced our lives, raw data often contains sensitive information that could compromise privacy if published. There is growing concern about how organisations may misuse this data, as real-world examples show how challenging it is to balance data sharing with privacy protection.
One incident was reported in August 2006 (Barbaro et al., 2006), when America Online (AOL) released a data set of 20 million Web searches on its website. To preserve consumer privacy, the released data were meant to be anonymised and randomised. Despite this precaution, New York Times journalists showed that individuals in the data set could be re-identified, and a significant quantity of information about them acquired, using relatively little public information. Given the growing privacy concerns, data publishing has faced a new challenge (Mendes and Vilela, 2017). Because data consumers have proved so capable of extracting knowledge concealed inside massive data sets, it is critical to develop ways of limiting that capability to protect data participants’ privacy (Mendes and Vilela, 2017).
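The general mechanism behind such re-identification is a linkage attack, in which pseudonymised records are joined with public information on shared attributes. The sketch below is a generic illustration and not a reconstruction of the method used in the AOL case; all records, attribute names and the `link` function are invented for the example.

```python
def link(released, public, keys):
    """Match pseudonymised rows to named people on shared attributes.

    A row counts as re-identified only when exactly one public
    record agrees with it on every attribute in `keys`.
    """
    index = {}
    for person in public:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:
            matches[row["pseudonym"]] = candidates[0]
    return matches

# Made-up "anonymised" release: names are replaced by pseudonyms,
# but coarse attributes remain alongside the sensitive field.
released = [
    {"pseudonym": "user-001", "town": "Northfield", "age_band": "60-69", "query": "knee pain"},
    {"pseudonym": "user-002", "town": "Eastport", "age_band": "20-29", "query": "loan rates"},
    {"pseudonym": "user-003", "town": "Eastport", "age_band": "30-39", "query": "dog breeds"},
]

# Made-up public information an attacker could plausibly collect.
public = [
    {"name": "Alice", "town": "Northfield", "age_band": "60-69"},
    {"name": "Bob", "town": "Eastport", "age_band": "20-29"},
    {"name": "Carol", "town": "Eastport", "age_band": "20-29"},
]

matches = link(released, public, keys=("town", "age_band"))
print(matches)
```

Only `user-001` is re-identified here, because two public records share the attributes of `user-002` and none match `user-003`. This is the intuition that PP techniques build on: a record is safer when many individuals are consistent with it.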
As a result, PP practices have become critical for privacy-aware data publishing. Some of these practices are significantly based on regulations and policies that restrict the types of publishable data, as well as protocols for the usage and capacity of sensitive information. The limitations of these practices are that they either alter data excessively or entail an illogically high degree of confidence in a variety of data publishing circumstances.
Furthermore, regulations and policies cannot restrict attackers who do not obey the rules, and standards for sensitive information use and capacity cannot guarantee that sensitive information will not be irresponsibly shared and end up in the hands of malicious parties. So, developing techniques and approaches for publishing data in a hostile environment such that the released data remains beneficial while individual privacy is protected, becomes a big concern. This effort is known as PP, and it may be seen as a specific response to augment privacy protection rules. Although research in this field is still in its early stages, now is a decent time to discuss the hypotheses and appealing properties of PP, illuminate the differences and necessities that distinguish PP from other related issues and deliberately abbreviate and assess various approaches to dealing with PP. This literature review is primarily intended to accomplish these objectives.
2.4.1 Privacy preservation.
Access to high-quality data is imperative for knowledge-based decision-making; nonetheless, data in its original form often cannot be released because it may contain highly sensitive information about individuals. As a solution to this issue, strategies and tools that enable the publishing of valuable data while preserving data privacy have been proposed. Such strategies and tools are collectively known as PP. As per Garfinkel et al. (2015), PP can be defined as “methods and tools for publishing data in a more hostile environment, so that the published data remain practically useful while individual privacy is preserved”.
Recently, PP has evolved as an extremely dynamic study field. This area of study considers how information or patterns may be extracted from a large data warehouse while adhering to authority or regulatory privacy restrictions. These restrictions are frequently related to the individuals who participate in the data set. Individuals are concerned about the massive volumes of data collected about them and how this data will be used, whereas data consumers aim to infer new bits of knowledge (insight) that will allow them to improve decision-making and acquire a better understanding of the market’s demands. PP intends to settle these conflicting issues.
2.4.2 Privacy preservation metrics.
There is no wide-ranging consensus on the standards that privacy metrics must meet (Mendes and Vilela, 2017; Langheinrich, 2018). From a mathematical perspective, a metric is a function that defines a distance between each pair of elements of a set and must satisfy four conditions, namely, non-negativity, symmetry, the identity of indiscernibles and the triangle inequality, to qualify as a metric (Herrmann, 2007). Several of the metrics discussed in this study are not metrics in the mathematical sense because they do not satisfy all four conditions. Nevertheless, to stay consistent with the literature (e.g. Wagner and Eckhoff, 2018; Rebollo-Monedero et al., 2013), we will treat all measurements that describe the degree of privacy in some way as privacy metrics.
Many authors have presented privacy metrics standards and guidelines. For example, Wagner and Eckhoff (2018) required that privacy metrics be intelligible to a scientifically oriented public, that they be orthogonal to utility metrics and that they set limits on how well the attacker may identify individuals. According to Rebollo-Monedero et al. (2013), privacy metrics should be based on probabilities and have well-defined and comprehensible endpoints. They claim that a privacy metric should measure privacy based on the percentage of individuals who cannot be identified by an attacker and how uniformly distributed the attacker’s assumptions are.
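The four conditions can be checked mechanically on a finite sample of points, as in the following Python sketch. The function `is_metric` and the two candidate distance functions are illustrative assumptions; passing the check on a sample does not prove that a function is a metric in general, whereas a single failure is enough to show that it is not.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the four metric conditions for d on a finite set of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                    # non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:      # symmetry
            return False
        if (d(x, y) <= tol) != (x == y):      # identity of indiscernibles
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

points = [0.0, 1.0, 2.5, 4.0]

def abs_diff(a, b):
    """Absolute difference: a metric on the real numbers."""
    return abs(a - b)

def sq_diff(a, b):
    """Squared difference: violates the triangle inequality."""
    return (a - b) ** 2

print(is_metric(points, abs_diff))  # True
print(is_metric(points, sq_diff))   # False
```

The squared difference fails because the distance from 0 to 2.5 is 6.25, which exceeds the 1 + 2.25 obtained by going through 1. This is the sense in which a useful privacy measurement may still fall outside the mathematical definition.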
Raghavan and Raghavan (2015), on the other hand, required that privacy metrics reflect how difficult it is for an attacker to succeed, that they do not rely on variables that cannot be determined or predicted, and that they reflect the resources required for successful privacy attacks rather than relying on probabilities. According to Zhao and Wagner (2018), privacy metrics must include three components of the attacker’s success: accuracy, uncertainty and correctness.
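The probability-based criteria attributed above to Rebollo-Monedero et al. (2013) can be made concrete with a small sketch. The two functions below are one possible reading, not the authors' own definitions: the attacker's belief about each individual is assumed to be a probability distribution over candidate identities, uniformity is measured by normalised Shannon entropy, and the 0.5 identification threshold is an arbitrary illustrative choice.

```python
import math

def normalised_entropy(posterior):
    """Uniformity of the attacker's belief: 1.0 is uniform, 0.0 is certain."""
    n = len(posterior)
    if n < 2:
        return 0.0
    h = sum(-p * math.log2(p) for p in posterior if p > 0)
    return h / math.log2(n)

def unidentified_fraction(posteriors, threshold=0.5):
    """Share of individuals for whom no candidate reaches the threshold."""
    hidden = sum(1 for post in posteriors if max(post) < threshold)
    return hidden / len(posteriors)

# Made-up attacker beliefs about three individuals, four candidates each.
posteriors = [
    [0.25, 0.25, 0.25, 0.25],  # attacker has learned nothing
    [0.97, 0.01, 0.01, 0.01],  # attacker is almost certain
    [0.40, 0.30, 0.20, 0.10],  # attacker has a weak guess
]

for post in posteriors:
    print(round(normalised_entropy(post), 3))
print(unidentified_fraction(posteriors))
```

Both quantities have clear endpoints, from 0 (no privacy) to 1 (the attacker is no better than guessing), which is the property the cited guideline asks for.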
In practice, quantifying privacy is challenging because there is no universal standard definition of privacy (Mendes and Vilela, 2017). Despite this, several metrics have been presented in the context of PP, and they may be classified into three major categories based on the aspect they measure. Privacy level metrics measure how well the information is protected from disclosure; data utility metrics evaluate the loss of data quality; and complexity metrics measure the efficiency and scalability of the various strategies.
2.4.2.1 Data privacy level.
According to Xu et al. (2014), data privacy is mostly defined as the level of difficulty an attacker encounters in reconstructing or predicting the original or sensitive information from the preserved data; it is also the quality or state of being secluded from the presence or view of others. When all publicly available data is merged, privacy level metrics are used to judge the level of isolation of the original data. As recently stated by Tan et al. (2021), the primary goal of PP strategies is to provide a certain level of privacy while enhancing utility. The level of privacy indicates how secure the data is from potential attacks.
2.4.2.2 Data utility level.
Data utility, and specifically the evaluation of data quality, has received considerable attention in both research and practice. In theory, using PP techniques frequently degrades data quality. Data utility metrics (loss metrics) seek to assess this loss of utility by comparing the results of a function on the original data with its results on the privacy-preserved data. When evaluating data utility, three significant parameters are commonly estimated (Mendes and Vilela, 2017): accuracy, which evaluates how close the transformed data is to the original data; completeness, which evaluates the loss of individual data in the preserved data set; and consistency, which quantifies the loss of correlation in the preserved data.
2.4.2.3 Complexity.
When evaluating the complexity of PP techniques, the efficiency and scalability of these are often considered (Langheinrich, 2018). These factors are relevant across all techniques, so only a brief overview is provided here. Efficiency can be assessed using metrics for resource usage, such as time and space. CPU time or computational cost can estimate the time, whereas space metrics gauge the memory required to run the technique. In distributed settings, it may also be useful to measure the communication cost, which can be determined by time, the number of exchanged messages or data transfer capacity consumption. Generally, both time and space are approximated as functions of the input.
Scalability relates to how effectively a technique performs as data volumes grow. Because data are always expanding, this is a crucial part of any PP approach. Increasing the number of data sources in distributed environments can significantly increase the volume of communication. As a result, PP techniques must be built to be scalable. A technique’s scalability may be tested empirically by running experiments with growing data volumes and evaluating the loss of efficiency. The efficiency loss across these tests may then be used to quantify scalability, because a more scalable system exhibits lower efficiency losses than a less scalable system when subjected to the same load.
3. Internet of Things–integrated social media privacy preservation techniques
A wide variety of PP techniques for guaranteeing data privacy have been proposed since 1948, when privacy was recognised as a right in the Universal Declaration of Human Rights (Mendes and Vilela, 2017). There is no definitive way of classifying these techniques because, in the literature, the same technique may be assigned to different groups. This survey classifies PP techniques by the preservation mechanism they use, namely reconstruction techniques, heuristic techniques, and cryptographic and learning techniques, as depicted in Figure 5.
The following subsections will outline each of these techniques with their merits and demerits based on how they safeguard data.
3.1 Reconstruction-based techniques
To preserve privacy during data collection, raw data is perturbed or randomised before transmission to the data collector, who then reconstructs the distributions at an aggregate level for ML-based data analysis and inference (Zhang et al., 2019). This approach assumes that the data collector is untrustworthy (Mendes and Vilela, 2017). To avoid data leakage, the original data is therefore never saved and is used only in the transformation process. Consequently, reconstruction-based techniques (i.e. perturbation-based approaches) must be applied to each captured data point individually.
Reconstruction-based techniques set aside the original data and begin by perturbing what is termed the “knowledge base”. The idea is that, despite perturbation, a data collector can reconstruct the original distribution and develop an aggregated model from these reconstructed distributions (Andruszkiewicz, 2016). This allows for ML-based analysis without direct access to the original data.
Perturbation techniques typically involve adding noise with a known statistical distribution to the original data, allowing for the reconstruction of the data distribution, though not the individual data points. For example, if X represents the original data, Y denotes the noise distribution and Z represents the perturbed data, then:
$$Z = X + Y$$
The data collector can recover the distribution of the original data using:
$$X = Z - Y$$
However, accurate reconstruction depends on the variance of Y and the number of samples n. High variance or insufficient samples can hinder the accurate reconstruction of X.
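The additive scheme can be illustrated with a minimal sketch. It assumes zero-mean Gaussian noise whose standard deviation is known to the collector, and it recovers only the mean and variance of X (not individual values); the data, the function names and the parameter values are hypothetical.

```python
import random
import statistics


def perturb(values, noise_std, rng):
    """Return Z = X + Y, where Y is zero-mean Gaussian noise with known std."""
    return [x + rng.gauss(0.0, noise_std) for x in values]


def reconstruct_moments(perturbed, noise_std):
    """Estimate the mean and variance of X from Z and the known noise variance.

    E[Z] = E[X] because the noise has zero mean, and
    Var[Z] = Var[X] + Var[Y] because X and Y are independent.
    """
    mean_x = statistics.fmean(perturbed)
    var_x = max(statistics.pvariance(perturbed) - noise_std ** 2, 0.0)
    return mean_x, var_x


rng = random.Random(0)
original = [rng.uniform(20, 60) for _ in range(20000)]  # hypothetical ages
perturbed = perturb(original, noise_std=10.0, rng=rng)
est_mean, est_var = reconstruct_moments(perturbed, noise_std=10.0)
print("estimated mean/variance:", round(est_mean, 2), round(est_var, 2))
print("true mean/variance:     ", round(statistics.fmean(original), 2),
      round(statistics.pvariance(original), 2))
```

Re-running the sketch with a larger `noise_std` or far fewer samples makes the estimates drift away from the true values, which is the dependence on the variance of Y and on n described above.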
In addition to additive randomisation, multiplicative randomisation involves multiplying the original data by random values or applying random projections or rotations. For instance, the original attribute value is multiplied by a random value derived from a specific distribution (Andruszkiewicz, 2016). Rotation perturbation, which multiplies attribute vectors by a rotation matrix, can also be used but may be vulnerable if an adversary approximates the rotation matrix.
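As a small illustration of rotation perturbation, the sketch below rotates two-dimensional attribute vectors by a secret angle. It is an assumption-laden toy (two attributes, a single hypothetical angle `theta`), intended only to show that a rotation changes every attribute value while leaving pairwise distances intact, which is why distance-based analysis can still run on the perturbed data.

```python
import math


def rotate(point, theta):
    """Multiply a 2-D attribute vector by the rotation matrix for angle theta."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


records = [(23.0, 1.70), (35.0, 1.82), (51.0, 1.65)]  # hypothetical (age, height)
theta = 1.1  # secret rotation angle, chosen arbitrarily for illustration
rotated = [rotate(p, theta) for p in records]

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        print(round(distance(records[i], records[j]), 6),
              round(distance(rotated[i], rotated[j]), 6))
```

The same property explains the weakness noted above: an adversary who can approximate the rotation matrix can invert it and recover the original vectors.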
Data swapping involves exchanging sensitive attributes between different records to prevent linking records to identities. Synthetic data generation involves creating a statistical model based on the original data and generating synthetic values from this model (Abay et al., 2018).
Despite their utility, reconstruction-based techniques face limitations. Privacy concerns arise if high variance in the noise distribution or insufficient sample size impair the accuracy of reconstructing the original data (Chamikara et al., 2018; Zhang et al., 2019). In addition, perturbation may lead to a loss of data utility. While clustering and classification algorithms can perform adequately with perturbed data, other algorithms might suffer from diminished utility (Ray, 2019). Moreover, techniques like efficient data stream perturbation can become time-consuming as the number of sensor features grows (Dou et al., 2019). Addressing these limitations is crucial for enhancing the effectiveness of these techniques in IoT-integrated SM networks.
3.2 Heuristic-based techniques
In PP, data providers may seek to release anonymised data publicly. This allows for third-party data analysis without revealing the identity of the data participants (or records’ owner). In such circumstances, PP can be accomplished by anonymising (Murthy et al., 2019) the records before their release by employing a heuristic-based technique (Andruszkiewicz, 2016). A heuristic-based technique is one designed to solve a problem quickly when conventional techniques are too slow, or to find a near-optimum solution when classic techniques fail to find an exact solution (PatelShreya, 2018).
The aim of a heuristic-based technique is to deliver a satisfactory solution to the problem at hand in a very short time. This solution may not be the best of all potential solutions to the problem, but it is roughly comparable to the exact solution. It is still beneficial because locating a satisfactory (if not optimal) solution does not take an inordinate amount of time. Heuristic-based PP techniques process the records in a group-based manner (PatelShreya, 2018). They safeguard the data by anonymising it, making it very difficult for an adversary to determine who owns which data. Thus, the data is altered by various anonymisation techniques while retaining sufficient utility to be safely released to third parties (Murthy et al., 2019).
In practice, the naive and intuitive approach to protecting anonymity and privacy is to remove attributes that explicitly identify participants (e.g. name and national identification number). However, such a straightforward approach fails to resist anonymity attacks (Murthy et al., 2019) that use external data or background knowledge, as the record’s owner may still be identified through Quasi-Identifiers (QIDs) and sensitive attributes. A QID is a non-sensitive attribute (e.g. age, gender and Zip code) that does not explicitly identify the record’s owner but can be combined with data from other public sources to de-anonymise the owner of a record, which is known as a linkage attack (Murthy et al., 2019). As per Aggarwal (2015), almost all anonymisation techniques (Murthy et al., 2019) operate on QIDs and ignore sensitive attributes because it is wrongly assumed that sensitive attributes pose no threat. The author claimed that it is reasonable to assume that an attacker knows background information about the target, and so concluded that techniques that do consider sensitive attributes provide a stronger privacy guarantee.
The anonymisation of data in a data set may be attained by realising heuristic-based PP models. These models address the difficulty of protecting the identity of the records’ owner by employing one or more of the following data sanitising mechanisms: generalisation, perturbation, suppression and anatomisation (Murthy et al., 2019).
Generalisation: the substitution of a value with a more generic one. It allows numerical data to be represented using intervals; for example, an age of 23 may be stated as the interval [20–30]. Categorical data, however, requires the creation of a hierarchy. A relevant example of a hierarchy is using “student” as a parent value to represent all categories of students within a profession attribute.
Perturbation: the substitution of original data with synthetic values that carry equivalent statistical information. Data perturbation is exemplified by the randomisation approaches given in subsection 3.1.
Suppression: the elimination of some attribute values to avoid the disclosure of information. This mechanism may be executed column-wise, removing all values of an attribute, or row-wise, removing a record.
Anatomisation: the separation of QIDs and sensitive attributes into two different tables, making it much harder to relate QIDs to sensitive attributes. In this mechanism, the original values stay unchanged.
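Two of these mechanisms, generalisation and suppression, can be sketched in a few lines. The records, attribute names and the age threshold below are hypothetical, and the interval width of 10 simply mirrors the age example above; perturbation and anatomisation are not covered by this sketch.

```python
def generalise_age(age, width=10):
    """Generalisation: replace an exact age with the interval containing it."""
    low = (age // width) * width
    return f"[{low}-{low + width}]"


def suppress_attributes(records, attributes):
    """Column-wise suppression: drop the named attributes from every record."""
    return [{k: v for k, v in r.items() if k not in attributes} for r in records]


def suppress_records(records, predicate):
    """Row-wise suppression: drop every record for which predicate is true."""
    return [r for r in records if not predicate(r)]


records = [
    {"name": "Alice", "age": 23, "zip": "47677", "condition": "flu"},
    {"name": "Bob", "age": 27, "zip": "47602", "condition": "asthma"},
    {"name": "Carol", "age": 91, "zip": "47905", "condition": "flu"},
]

released = suppress_attributes(records, {"name"})              # remove explicit identifier
released = suppress_records(released, lambda r: r["age"] > 90)  # remove an outlier record
released = [{**r, "age": generalise_age(r["age"])} for r in released]
print(released)
```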
In view of these mechanisms, a set of heuristic-based PP models has been introduced as follows.
3.2.1 k-Anonymity model.
k-Anonymity, defined as “a notion of privacy perceived in the context of relational data”, has received a lot of attention during the past decade (Aggarwal, 2015). In essence, k-anonymity heuristic-based PP models provide a “blend into the crowd” approach to privacy. The key concept in this model is k-anonymity: if each data record’s identifiable attributes (QIDs) are indistinguishable from those of at least k − 1 other entries, the data set is said to be k-anonymous. In other words, an attacker using a k-anonymised data set would be unable to determine the identity of a single record’s owner because k − 1 other comparable records exist. The set of k records is referred to as the equivalence class (Mendes and Vilela, 2017). Furthermore, the value k may be used as a privacy metric: the higher the value of k, the more difficult it is to de-anonymise records. Theoretically, the chance of de-anonymising a record in an equivalence class is 1/k. However, increasing k reduces the utility of the data because higher generality is required.
The k-anonymity model has been extensively studied in the context of SM as well as IoT data publishing (Arava and Lingamgunta, 2020; Mahanan et al., 2020; Khan et al., 2020), where data providers strive to ensure that an attacker is unlikely to link data to a specific data participant. As a result, several algorithms have been developed in this context, with the vast majority based on generalisation and suppression mechanisms (Murthy et al., 2019; Aggarwal, 2015; Arava and Lingamgunta, 2020; Mahanan et al., 2020; Su et al., 2019; Khan et al., 2020). This privacy model was one of the first to be used for anonymisation, and it served as the basis for more complicated models (Mendes and Vilela, 2017). The simplicity of its definition and the large number of available algorithms are two of the advantages of the k-anonymity privacy model (Aggarwal, 2015). Nonetheless, there are two fundamental issues with this privacy model, particularly when background knowledge is accessible to the attacker.
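A minimal sketch of how the k of a released table could be measured is given below. It groups records by their QID values and reports the size of the smallest equivalence class; the table, the attribute names and the choice of QIDs are hypothetical.

```python
from collections import Counter


def equivalence_classes(records, qids):
    """Count how many records share each combination of QID values."""
    return Counter(tuple(r[q] for q in qids) for r in records)


def k_anonymity(records, qids):
    """Return the largest k for which the records are k-anonymous."""
    classes = equivalence_classes(records, qids)
    return min(classes.values()) if classes else 0


released = [
    {"age": "[20-30]", "zip": "476**", "condition": "flu"},
    {"age": "[20-30]", "zip": "476**", "condition": "asthma"},
    {"age": "[30-40]", "zip": "479**", "condition": "flu"},
    {"age": "[30-40]", "zip": "479**", "condition": "flu"},
    {"age": "[30-40]", "zip": "479**", "condition": "diabetes"},
]

k = k_anonymity(released, ["age", "zip"])
worst_case_reidentification = 1 / k  # chance of picking the right record in the smallest class
print(k, worst_case_reidentification)
```

The smallest class determines k because an attacker will target the group in which the victim is hardest to hide.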
The assumptions underlying the k-anonymity model have been recently challenged, with high-profile data breaches like those of Netflix and AOL highlighting the model’s limitations in preserving anonymity (Su et al., 2019). A primary drawback of k-anonymity is its vulnerability to attribute disclosure, as it does not prioritise protecting sensitive attributes when creating a k-anonymised data set. This can result in equivalence classes where all members share identical sensitive attribute values, inadvertently exposing information about all individuals in that group. To address this weakness, researchers, including Khan et al. (2020), have proposed improvements to the model. Furthermore, Machanavajjhala et al. (2007) identified two significant attacks that exploit k-anonymity: the homogeneity attack and the background knowledge attack. In response, they introduced the l-diversity model, a new privacy approach designed to mitigate these specific vulnerabilities.
Homogeneity attack: If all of the sensitive attribute values within a group of k records are identical, the sensitive attribute of every member of that group can be predicted exactly, even though the data is k-anonymised.
Background knowledge attack: In this attack, the attacker uses known associations between one or more QID attributes and the sensitive attribute to narrow down the possible values of the sensitive attribute.
3.2.2 l-Diversity model.
While k-anonymity is an interesting choice for protecting a record’s identity, it may not be practical in most circumstances for limiting the disclosure of sensitive information. Along these lines, the l-diversity model (Machanavajjhala et al., 2007) was introduced, which focuses not only on preserving the minimal size k of the equivalence class but also on maintaining the diversity of the sensitive attributes. According to Machanavajjhala et al. (2007), an equivalence class is considered to have l-diversity if there are at least l well-represented values for the sensitive attribute. The term “well-represented” is not concrete and can be interpreted in a variety of ways in this model.
Accordingly, there are different interpretations of the l-diversity principle, each with its own definition (Parameshwarappa et al., 2021; Siddula et al., 2019). One of the most basic interpretations considers sensitive attributes to be “well-represented” if there are at least l distinct values in an equivalence class, a concept known as distinct l-diversity (Parameshwarappa et al., 2021). A stronger definition is entropy l-diversity: if the entropy of an equivalence class’s sensitive attribute value distribution is at least log(l), the class is said to have entropy l-diversity (Siddula et al., 2019).
Similar to k in the k-anonymity model, l operates as a measure of privacy in both the entropy and distinct l-diversity interpretations (Siddula et al., 2019). Increasing this value increases the range of sensitive attribute values in each equivalence class, lowering the probability of sensitive attribute disclosure. However, larger generalisations and a greater number of suppression procedures are required on the original data, resulting in a lower level of utility (Nicolazzo et al., 2020). Although the l-diversity model takes into account the diversity of sensitive attribute values within equivalence classes, it does not take into account their distribution. This may result in data breaches if the sensitive values are distributed in a skewed manner, which is often the case. As a result, numerous studies have integrated this privacy model with others to provide strict privacy assurances. Yin et al. (2017), for example, suggest combining k-anonymity with l-diversity to prevent skewed sensitive attribute distribution, hence preventing attribute disclosure attacks.
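Both interpretations can be checked with a short sketch. The functions below report the largest l a table satisfies under the distinct and the entropy definitions; the natural logarithm is assumed for log(l), and the records and attribute names are hypothetical. The second equivalence class is deliberately homogeneous to show how a k-anonymous table can still fail l-diversity.

```python
import math
from collections import Counter, defaultdict


def sensitive_by_class(records, qids, sensitive):
    """Group sensitive values by equivalence class (identical QID values)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qids)].append(r[sensitive])
    return groups


def distinct_l(records, qids, sensitive):
    """Largest l such that every class holds at least l distinct sensitive values."""
    groups = sensitive_by_class(records, qids, sensitive)
    return min(len(set(values)) for values in groups.values())


def entropy_l(records, qids, sensitive):
    """Largest l such that every class has entropy >= log(l), i.e. exp(min entropy)."""
    def entropy(values):
        n = len(values)
        return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

    groups = sensitive_by_class(records, qids, sensitive)
    return math.exp(min(entropy(values) for values in groups.values()))


released = [
    {"age": "[20-30]", "condition": "flu"},
    {"age": "[20-30]", "condition": "asthma"},
    {"age": "[30-40]", "condition": "flu"},
    {"age": "[30-40]", "condition": "flu"},
    {"age": "[30-40]", "condition": "flu"},
]

print(distinct_l(released, ["age"], "condition"))
print(entropy_l(released, ["age"], "condition"))
```

Here the table is 2-anonymous on `age`, yet the [30-40] class reveals the condition of every member, so both measures fall to 1.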
3.2.3 t-Closeness model.
Li et al. (2007) proposed the t-closeness heuristic-based PP model as a further enhancement of l-diversity, with the explicit goal of addressing the drawback of treating all values of a given attribute in the same manner regardless of their distribution in the original data. As is commonly the case with real data, attribute values may be significantly skewed, making it difficult to produce appropriate l-diverse representations. In addition, the attacker may use prior knowledge of the global distribution to extract sensitive information from the data. Furthermore, not all attribute values are equally sensitive (Nicolazzo et al., 2020). The t-closeness model requires the sensitive value distribution in each equivalence class to be “close” to the corresponding distribution in the original data, where “close” is upper bounded by the threshold t. That is, the distance between a sensitive attribute’s distribution in the original data and the same attribute’s distribution in any equivalence class is less than or equal to t.
The t-closeness concept has several interpretations depending on the distance function used to quantify closeness (Li et al., 2007). The variational distance, the Kullback–Leibler (KL) distance and the Earth Mover’s distance (EMD) are the three most commonly used functions (Mendes and Vilela, 2017). The t-closeness model tends to be more effective than many other PP techniques for numeric attributes. Consequently, many researchers propose techniques leveraging t-closeness for data preservation in the IoT. Nicolazzo et al. (2020), for example, offered a PP approach to mitigate feature disclosure in several IoT scenarios (e.g. a scenario in which items might be grouped in networks that communicate with one another). They used k-anonymity and t-closeness to create a unified overview of IoT data and its features. Their use of k-anonymity and t-closeness rendered the resulting groups more secure in terms of privacy: not only was information disclosure prevented, but so was feature disclosure.
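As a hedged illustration, the sketch below measures the smallest t that a table satisfies when closeness is quantified with the variational distance (half the sum of absolute differences between two distributions). The KL distance or EMD could be substituted for `variational_distance`; the records and attribute names are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Empirical distribution of a list of categorical values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def variational_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def t_closeness(records, qids, sensitive):
    """Smallest t such that every class is within distance t of the overall distribution."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qids)].append(r[sensitive])
    return max(variational_distance(distribution(v), overall) for v in groups.values())


released = [
    {"age": "[20-30]", "condition": "flu"},
    {"age": "[20-30]", "condition": "flu"},
    {"age": "[20-30]", "condition": "flu"},
    {"age": "[30-40]", "condition": "flu"},
    {"age": "[30-40]", "condition": "cancer"},
    {"age": "[30-40]", "condition": "cancer"},
]

t = t_closeness(released, ["age"], "condition")
print(round(t, 4))
```

A data publisher would compare the returned value against the chosen threshold t and generalise further if any class exceeds it.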
3.2.4 Personalised privacy.
The three previously stated heuristic-based PP models protect privacy by applying a universal and uniform level of privacy to all data and attributes. However, not all data providers or individuals are equally concerned with their privacy. For instance, data providers may have entirely different limits on the privacy of their data than individuals do. This raises the challenge of treating the records in each data set differently. From a technical perspective, this implies that the value of k for anonymisation is not fixed but may vary depending on who owns the data. Hence, the concept of personalised privacy was introduced by Xiao and Tao (2006), in which individuals can specify the level of privacy for their sensitive information.
The goal of this technique is to maintain the highest level of utility while respecting personal privacy preferences. It is performed by creating a taxonomy tree using generalisation and allowing data owners to define a guarding node. The personalised privacy model presumes that an individual may select a guarding node in the domain generalisation hierarchy to define the level of privacy that they can accept (Wu et al., 2020). However, such a technique may be difficult to implement in practice because approaching the record owner is not always a suitable or practical option, and, because record owners do not have access to the distribution of sensitive attributes, an individual may over-protect the data by selecting broader nodes than necessary (Mendes and Vilela, 2017). In addition, an individual’s privacy may be compromised in such a model if an attacker is able to infer any sensitive information from the taxonomy subtree of the guarding node with a breach probability greater than a particular threshold. This model has the advantage of allowing for the direct protection of individuals’ sensitive values, as opposed to other privacy models, which are vulnerable to many types of attacks (Bhattacharjee et al., 2021).
3.2.5 Differential privacy.
For distributed ML-based applications, DP constitutes a robust criterion for guaranteeing privacy (Jia et al., 2021). The core of DP, which differentiates it from other privacy models, is the notion of a quantified privacy breach level called privacy loss, denoted by ϵ (Kang et al., 2020). This notion of privacy loss makes it possible to gauge the threat involved in publicly disclosing participants’ data. The higher the value of ϵ, the weaker the protection the algorithm gives to a participant’s data privacy (Wei et al., 2021).
Definition 1: [(ϵ,δ)-DP (Dwork et al., 2016)]. A randomised function F satisfies (ϵ,δ)-DP if, for any two neighbouring data sets $D_i$ and $D_i'$ differing on at most one element, and for every set of possible query results S:
$$\Pr[F(D_i) \in S] \le e^{\epsilon} \Pr[F(D_i') \in S] + \delta$$
We write $F: D_i \to S$ to refer to the sanitised execution of a query F, and we refer to the public, sanitised results as S. If S are the sanitised query results that are published, then S is the only indication an adversary has about $D_i$. Here ϵ is a small, positive number that governs the trade-off between utility and privacy levels. Setting a lower ϵ will provide more privacy at the cost of more noise. The original definition of ϵ-DP does not include the additive term δ. We present the version of Dwork et al. (2016), which allows for the possibility that ϵ-DP is broken with probability δ. The difference in results between $D_i$ and its neighbour $D_i'$ is the difference that the perturbation noise must conceal so that the published results do not indicate whether $D_i$ or $D_i'$ is the original data set. The sensitivity of query F is the upper bound of this difference.
Definition 2: Given a function F, the global sensitivity is defined as:
$$\Delta F = \max_{D_i, D_i'} \lVert F(D_i) - F(D_i') \rVert$$
over all pairs of neighbouring data sets $D_i$, $D_i'$. The sensitivity of a function is the largest difference in its output that can be caused by adding an arbitrary data vector to, or removing one from, the data set (Wei et al., 2021). Consequently, implementing a DP query F requires the injection of noise that is calibrated to conceal the sensitivity $\Delta F$.
Theorem 1: If $F: D_i \to S^G$ is a G-ary function with sensitivity $\Delta F$, then the function $F(D_i) + \mathcal{N}$ is (ϵ,δ)-DP, where $\mathcal{N}$ is a G-tuple of values sampled from Gaussian-distributed random noise with mean zero and standard deviation σ.
The standard deviation of the Gaussian noise (Dwork et al., 2016) is given by the expression:
$$\sigma = \frac{\Delta F \sqrt{2 \ln(1.25/\delta)}}{\epsilon}$$
Therefore, if the function is highly sensitive, or if ϵ is small, the noise will be significant. Besides the use of Gaussian noise, there are several other mechanisms that can be used to keep DP at a specific level. These are discussed as follows.
Laplace mechanism (ϵ-DP) (Dwork et al., 2016) – introduces additive noise from a Laplace distribution, resulting in an inherent trade-off between privacy and utility. The Laplace mechanism computes function F and adds noise drawn from the Laplace distribution to each coordinate. The noise scale is calibrated to $\Delta F/\epsilon$, where $\Delta F$ is the sensitivity of F, and δ is always equal to 0. The Laplace mechanism therefore maintains (ϵ,0)-DP, that is, ϵ-DP. This is consistent with the notion that the more sensitive the query and the stronger the intended privacy guarantee, the more noise is required to achieve that guarantee (Fan et al., 2020).
Gaussian mechanism – is an alternative to the Laplace mechanism that incorporates Gaussian noise rather than Laplace noise (Salim et al., 2024a). The Gaussian mechanism does not provide pure ϵ-DP, but it does provide (ϵ,δ)-DP.
Exponential mechanism – is a more generic mechanism that preserves ϵ-DP by selecting an output at random, with probabilities weighted by a scoring function (Liu, 2018). It is frequently adapted to the query at hand because it achieves ϵ-DP by expressing the query as a Probability Density Function (PDF). Although it is not confined to numerical queries, it is difficult to find PDFs in a compact form for a multivariate distribution.
Hybrid mechanism – While the preceding mechanisms are relatively simple because they only need to compute one value in basic queries, more sophisticated ones must be deconstructed into numerous simpler ones for easier computation (Wang et al., 2019b).
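The first three mechanisms can be sketched with the standard library alone. The sketch assumes scalar queries, a Laplace scale of sensitivity/ϵ, the classical Gaussian calibration σ = Δ·sqrt(2 ln(1.25/δ))/ϵ (usually stated for ϵ < 1), and an exponential mechanism whose selection weights are exp(ϵ·score/(2Δ)); the function names, the counting query and the parameter values are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """epsilon-DP release of a numeric answer with Laplace noise."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical noise calibration for the Gaussian mechanism (assumed form)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(true_answer, sensitivity, epsilon, delta, rng):
    """(epsilon, delta)-DP release of a numeric answer with Gaussian noise."""
    return true_answer + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))


def exponential_mechanism(candidates, score, score_sensitivity, epsilon, rng):
    """Pick one candidate with probability proportional to exp(eps * score / (2 * sens))."""
    weights = [math.exp(epsilon * score(c) / (2.0 * score_sensitivity)) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(42)
count = 120  # hypothetical counting query; one participant changes it by at most 1
for eps in (0.1, 1.0):
    print("laplace ", eps, round(laplace_mechanism(count, 1.0, eps, rng), 2))
    print("gaussian", eps, round(gaussian_mechanism(count, 1.0, eps, 1e-5, rng), 2))
```

Running the loop shows the expected pattern: the smaller ϵ is, the further the released count tends to stray from the true value.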
DP operates by introducing statistical noise, generated by one of these mechanisms, either into the input data or into the output. Based on where the noise is added, DP falls into two categories: Local DP (Zhao et al., 2020b; Liu et al., 2020) and Global DP (Wei et al., 2020).
3.2.5.1 Local differential privacy.
Local Differential Privacy (LDP) is an enhancement to the standard DP concept. Standard DP necessitates the use of trustworthy data collectors, whereas LDP does not (Zhao et al., 2020b). In LDP, the perturbation is performed by the data participant. To preserve participants’ privacy, each participant executes a random perturbation mechanism before sending the perturbed outcomes to the data collector/aggregator. Thus, LDP retains the features of standard DP protection while also extending the randomised perturbation technique to prevent privacy attacks posed by untrusted third-party data collectors (Liu et al., 2020). Moreover, LDP provides a better privacy guarantee: data owners perturb their sensitive information, by adding noise to the individual data points, to meet DP locally before disclosing it to an untrusted data aggregator or data consumer (Zhao et al., 2020b). A formal definition of LDP is given in Definition 3.
Definition 3: ((ϵ,δ)-LDP) (Lyu et al., 2020a). A randomised algorithm F fulfils (ϵ,δ)-LDP if and only if, for any pair of input data points $X_i$ and $X_i'$, and for any set of possible outputs S, we have
$$\Pr[F(X_i) \in S] \le e^{\epsilon} \Pr[F(X_i') \in S] + \delta$$
Furthermore, if the condition holds for δ = 0, F is said to maintain pure ϵ-LDP. Also, all noise mechanisms used for DP, such as the Laplace and Gaussian methods, may be applied independently by each participant to assure LDP in isolation (Lyu et al., 2020a; Zhao et al., 2020b). However, in a distributed environment, each participant must supply enough calibrated noise to assure LDP without the need for cryptographic technologies (Salim et al., 2022a). LDP’s appealing privacy properties consequently come with a significant utility loss, especially when significant numbers of individuals are involved.
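A minimal sketch of pure ϵ-LDP with the Laplace mechanism applied independently by each participant is shown below. It assumes every value is clipped to a publicly known range [lower, upper], so that any two inputs differ by at most (upper − lower); the function names, the range and the simulated data are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def local_report(value, lower, upper, epsilon, rng):
    """Runs on the participant's side: clip the value, then add Laplace noise.

    Two inputs in [lower, upper] differ by at most (upper - lower), so noise of
    scale (upper - lower) / epsilon gives epsilon-LDP for a single report.
    """
    clipped = min(max(value, lower), upper)
    return clipped + laplace_noise((upper - lower) / epsilon, rng)


def aggregate_mean(reports):
    """Runs on the untrusted aggregator: it only ever sees perturbed reports."""
    return sum(reports) / len(reports)


rng = random.Random(7)
true_values = [rng.random() for _ in range(20000)]  # hypothetical readings in [0, 1]
reports = [local_report(v, 0.0, 1.0, epsilon=1.0, rng=rng) for v in true_values]

estimate = aggregate_mean(reports)
true_mean = sum(true_values) / len(true_values)
print(round(true_mean, 4), round(estimate, 4))
```

Each individual report is very noisy, and the aggregate becomes useful only because the independent noise terms average out over many participants.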
Along this line, several previous works (Zhao et al., 2020b; Liu et al., 2020; Fan et al., 2020; Liu et al., 2022) have applied LDP in SM and IoT contexts. For example, Zhao et al. (2020b) developed four LDP mechanisms. These mechanisms efficiently maintain an individual’s privacy when acquiring data records and exhibiting reliable statistics in different data analysis operations. Furthermore, they linked the suggested LDP techniques with FL to allow IoT data-based applications to train an ML model while mitigating privacy threats and lowering communication costs. Although their mechanism outperforms existing ones, they did not test it with more sophisticated data analysis tasks.
Other studies that considered applying LDP to SM data have been presented. For instance, Liu et al. (2020) developed an LDP model for social network publishing that retains community structure information. Their model creates synthetic social network data as published versions while adhering to the structural limitations of edge probability reconstruction. However, while their model efficiently retains network structural features while assuring a high level of privacy, they only evaluated the preservation of node information in a static network not in dynamic social networks. Furthermore, by incorporating hybrid DP into FL, Liu et al. (2022) developed a reliable and secure FL method. They categorised users into two groups based on their varied privacy demands in an attempt to focus on the setting where the server is untrusted and has better privacy. Despite the fact that this study revealed an extended FL with DP that has a theoretical convergence guarantee, there is still work to be done in terms of adaptive privacy budget allocation and how to optimise utility and privacy.
3.2.5.2 Global differential privacy.
Global Differential Privacy (GDP) was originally designed for centralised environments where a trusted central data server, with direct access to all participants’ data, aims to respond to queries or publish statistics while preserving privacy by randomising the outcomes (Lyu et al., 2020a; Paul and Mishra, 2020). In the GDP model, the data collector or central aggregator introduces noise into the data’s outputs only once, at the end of the process, before releasing it to a third party (Wei et al., 2020). This approach helps protect user privacy against potential attacks from the data consumer. For standard DP, responses to queries only need to be insensitive to changes in the data of a single user (or a few users), which may make up a small portion of the overall data set, simplifying the construction of DP guarantees. As a result, the GDP model offers significantly better privacy/utility trade-offs compared to the LDP model.
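The central model can be sketched for a mean query as follows. The sketch assumes a trusted aggregator that sees raw values, a publicly known range [lower, upper], a fixed number of participants n, and neighbouring data sets that differ by replacing one record, so the sensitivity of the mean is (upper − lower)/n; all names and values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def central_dp_mean(values, lower, upper, epsilon, rng):
    """Trusted aggregator: compute the exact mean, then add noise once.

    Replacing one clipped record moves the mean by at most (upper - lower) / n,
    so Laplace noise of scale (upper - lower) / (n * epsilon) gives epsilon-DP.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(11)
raw_values = [rng.random() for _ in range(1000)]  # hypothetical readings in [0, 1]
released = central_dp_mean(raw_values, 0.0, 1.0, epsilon=1.0, rng=rng)
exact = sum(raw_values) / len(raw_values)
print(round(exact, 4), round(released, 4))
```

For the same ϵ and range, a participant adding Laplace noise locally would need a scale of (upper − lower)/ϵ on every report, whereas here a single draw with scale (upper − lower)/(nϵ) suffices, which is the source of the better privacy/utility trade-off.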
However, GDP is designed to deal with multiple data participants for training to converge and reach an appropriate trade-off between privacy and utility, likely to result in a convergence difficulty with a limited number of participants (Lyu et al., 2020a). Furthermore, GDP can only attain satisfactory performance with a large number of participants, making it unsuited to applications with a limited number of data participants. Meanwhile, in many applications, the assumption of a trustworthy server in GDP is impractical since it brings a single point of failure for data breaches (Lyu et al., 2020b). This unsuitability is because it imposes legal and ethical requirements on the trusted aggregator to keep users’ data protected. When the aggregator is untrustworthy, as is frequently the case in distributed contexts, LDP is anticipated to safeguard individuals’ data (Liu et al., 2022; Lyu et al., 2020b).
Until recently, there was no middle ground between LDP and GDP. The choice was either to accept the substantially higher level of noise that LDP entails, or to entrust raw data to a central aggregator, as GDP requires. However, this dichotomy is beginning to change as a result of recent work on a novel architecture termed ‘Distributed DP’ (Wang et al., 2019a; Lim and Kim, 2020), which is designed to bridge the gap between LDP and GDP while preserving each individual’s privacy by integrating cryptographic protocols (Jia et al., 2021; Mustafa et al., 2018; Wang et al., 2019a; Lim and Kim, 2020).
Table 2 compares various aspects of LDP and GDP based on their definitions, privacy guarantees, utility, performance, trust assumptions and recent advancements. In addition, Table 3 provides a concise comparison of the two types of PP techniques: reconstruction and heuristic-based. It highlights their key properties and differences.
Comparison of LDP and GDP
| Aspect | LDP | GDP |
|---|---|---|
| Data collection | Each participant perturbs their data before sending it to the aggregator | Data collector introduces noise into the aggregated data before releasing it to a third party |
| Privacy guarantee | Provides better privacy guarantee by perturbing individual data points before aggregation | Privacy protection relies on noise introduced by the data collector, ensuring privacy from third-party attacks |
| Utility | Significant utility loss due to perturbation of individual data points, especially with a large number of participants | Improved privacy/utility trade-offs compared to LDP, particularly in scenarios with a large number of participants |
| Performance | May face convergence difficulties with a limited number of participants | Requires a large number of participants for satisfactory performance, making it unsuitable for applications with a limited number of participants |
| Trust assumption | Does not require a trusted aggregator; suitable for scenarios with untrusted data collectors | Assumes a trusted aggregator, which may not be practical in distributed contexts and introduces a single point of failure |
Comparison of reconstruction and heuristic-based PP techniques
| Property | Reconstruction-based techniques | Heuristic-based techniques |
|---|---|---|
| Definition | Modify raw data to allow reconstruction by data collector | Alter records before public release for anonymity |
| Examples | Randomisation, data swapping, synthetic data generation | k-anonymity, l-diversity, t-closeness |
| Data utility | Preserves data utility, suitable for ML-based analysis | May result in loss of data utility due to anonymisation |
| ML algorithm requirement | May require specific ML algorithms but preserves data utility effectively | May require adjustments for certain ML algorithms due to anonymisation |
| Suitability | Protects individual data points, useful when data collector is untrustworthy | Effective for public data sharing while protecting individual identities |
| Complexity | Varies based on perturbation type and dataset size | Depends on specific heuristic used and level of anonymisation |
| Computational overhead | May have higher overhead due to individual perturbation | Generally lower overhead compared to reconstruction-based techniques |
3.3 Cryptographic and learning-based techniques
The convergence of various technologies, including IoT, social networking, edge and cloud computing, has significantly increased the complexity of distributed systems (Tabassum et al., 2021). This integration enables modern systems to leverage diverse data sources, enhancing functionality and service delivery. In scenarios where multiple entities seek aggregated insights without revealing individual data, Secure Multiparty Computation (SMC) provides a solution for secure data sharing and cryptographic operations (Du et al., 2018). SMC enables participants to jointly compute a function using private data, ensuring that only the final result is shared, while individual inputs remain confidential. Secure transfer protocols underpin this approach, supporting PP across distributed networks.
The design of distributed data platforms creates numerous complications in safeguarding user privacy without sacrificing data utility (Tabassum et al., 2021; Du et al., 2018; Mustafa et al., 2018). When designing robust PP techniques for a distributed data environment, two of the primary challenges are the extremely high dimensionality of the data and the massive distribution of data sources (Tabassum et al., 2021). PP techniques must therefore be efficient and able to operate in distributed data environments. Cryptographic techniques, such as the Oblivious Transfer Protocol (OTP) and Homomorphic Encryption, offer strong privacy guarantees by ensuring that data is not exposed during communication and computation (Esmaeilzade et al., 2021; Jia et al., 2021). Learning-based techniques, such as FL, enhance privacy by allowing models to be trained locally on distributed data without aggregating sensitive information (Mendes and Vilela, 2017). These approaches are essential for maintaining privacy in IoT-integrated SM networks, where data is widely distributed and requires sophisticated protection measures.
Cryptographic techniques for distributed privacy typically combine two ingredients: a set of secure protocols that prevent data disclosure during communication and computation between entities, and a set of primitive operations that recur in many ML algorithms and are therefore well suited to distributed settings. Examples of such primitive operations are the secure sum, the secure set union, the secure size of set intersection, the scalar product and the set intersection (Mendes and Vilela, 2017). To prevent data disclosure between entities, these operations may also use cryptographic techniques such as the OTP.
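As an illustration of one such primitive, the secure sum can be sketched with additive secret sharing. This is a minimal sketch under stated assumptions rather than the protocol of any cited work: the parties are assumed to be honest-but-curious and not to collude, the modulus is a hypothetical public constant larger than any possible total, and the function names are illustrative.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical public modulus, larger than any possible sum

def make_shares(value, n_parties):
    # Split `value` into n random shares that add up to it modulo MODULUS.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    n = len(private_values)
    # Row i holds the shares that party i hands out (one per party, itself included).
    distributed = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # The published partial sums reveal the total but not any single input.
    return sum(partial_sums) % MODULUS

private_values = [12, 7, 30]
total = secure_sum(private_values)
print(total)  # 49
```

Each individual share is uniformly random, so a party that sees fewer than all shares of an input learns nothing about it; only the combination of all partial sums reveals the aggregate.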
SMC is a key component in achieving PP-based secure data transfer in distributed data environments, and numerous novel SMC-based technologies are emerging in a wide range of applications. In this section, we discuss two well-known SMC-based technologies: FL and blockchain.
3.3.1 Federated learning.
The IoT, SM and smartphones are just a few of the distributed networks that currently generate vast amounts of data every day (Li et al., 2020). By its very nature, most personal data is created at the edge by end-user devices (e.g. smartphones, tablets and IoT devices). These end-user devices are increasingly capable computing devices with internet access. Given the widespread deployment of such personal devices, as well as the privacy concerns they raise, a trend towards decentralised artificial intelligence (AI) has emerged. This trend combines Edge Computing (EC) (Tabassum et al., 2021) with AI techniques to move intelligence from the cloud to the edge. Because of the increasing computational power of these devices, as well as concerns about transmitting private data, storing data locally and pushing network computation to the edge is becoming increasingly appealing. Indeed, query processing in sensor networks and computing at the edge have been topics of research for decades (Tabassum et al., 2021).
Furthermore, in many settings, traditional cloud-centric ML techniques are no longer suitable because of the need to comply with strict data protection regulations on the massive aggregation and processing of personal data (Truong et al., 2021). In this context, FL provides an alternative to the centralised ML approach: it allows an ML model to be trained collaboratively while the original data stays on local devices, potentially mitigating data privacy concerns (Mothukuri et al., 2021). It is a multidisciplinary technique, spanning computer science topics such as ML, distributed EC and data privacy, that enables end-users’ devices (i.e. participating entities) to train a global/shared ML model locally on their own data. For model aggregation and updates, only parameters from the model training phase are shared. FL is therefore a step forward in distributed learning, since it is designed to cope with imbalanced and non-independent and identically distributed (non-IID) data whose sizes can span many orders of magnitude. Such diverse data sets are scattered over a large number of devices with insecure connectivity and limited transmission bandwidth (Kairouz et al., 2021).
FL is a distributed ML paradigm for training ML models without sacrificing privacy (Salim et al., 2024b). Entities in FL collaboratively train a model, each computing model updates on its locally available training data set. Instead of sharing their raw data, entities upload only these abstract updates to a central server. Because these updates are much more difficult to interpret than raw data, this approach is a significant step forward in terms of privacy. For applications with large volumes of data, uploading model updates can also be less costly than sending the data directly. Applications based on SM and IoT are examples of this, particularly where users produce massive quantities of data through interaction with interfaces and devices (Salim et al., 2020). This data is often deeply sensitive and ought not to be revealed to a central server in its entirety. FL still enables a global model to be learned from all this data, without sacrificing computational resources or missing out on more sophisticated algorithms. Because more data is available, FL can even result in a better model than traditional ML techniques (Salim et al., 2025a).
Three significant elements have contributed to ML’s results and success in recent years (Mothukuri et al., 2021). The first and most important is the availability of a vast amount of data accumulated over time. The second is computational power: technology has moved away from traditional computing devices towards highly scalable and integrated microcircuit devices, which allow models to be trained more quickly and deployed directly on devices at lower computational cost. For example, smartphones with a pre-installed AI microchip are now available, making devices considerably smarter and helping humans in day-to-day work more effectively (Mothukuri et al., 2021). The third is DL models, which provide significant intelligence to ML models (Abay et al., 2018). Models based on self-taught DL algorithms have a high rate of success (Abay et al., 2018; Bu et al., 2020; Abri et al., 2019). Despite the enormous success of ML, several domains have been unable to make use of it. In many of these areas, two significant impediments prevent adoption:
Privacy and confidentiality of user data, as well as the regulations that govern them.
Difficulty in developing an ML model owing to insufficient data or high training cost and computation complexity entailed in training an ML model.
Although several cloud-based organisations offer well-trained ML models, delivering ML expertise and computational capacity at a cost, privacy and confidentiality concerns remain unresolved (Truong et al., 2021; Mothukuri et al., 2021). To overcome these concerns, the community has turned to FL, which offers a well-trained ML model without exposing training data. FL additionally addresses the issue of insufficient data by introducing a trust factor across heterogeneous domains (Mothukuri et al., 2021). FL’s PP features encourage various domains to adopt it, preserving user data privacy while realising the benefits of a model trained on data from a broader landscape.
3.3.2 Federated learning workflow cycle.
FL is viewed as an iterative process in which the central ML model is enhanced with each iteration. Generally, FL implementations may be broken down into the three phases listed below:
1. Model initiation: in this stage, the central pre-trained ML model (i.e. the initial global model) and its initial parameters are initialised, and the global ML model is subsequently shared with all entities in the FL environment.
2. Model training: following the distribution of the initial ML model and parameters to all entities, each participant trains its copy of the model (referred to as a local ML model) on its individual training data.
3. Model aggregation: once local models have been trained at the participant level, their updates are forwarded to the central server, which aggregates them to update the global ML model. The improved global model is then sent to the individual entities in preparation for the next iteration.
To keep the global ML model up to date across all entities, FL uses an iterative learning method that repeats phases 2 and 3 above. The FL design and learning technique are depicted in Figure 6.
Consider, for example, a generic FL framework with a single server and $N$ participants. Let $D_i$ denote the local data set held by participant $P_i$, where $i \in \{1, 2, \ldots, N\}$. At the server, the objective is to train a global model $M_G$ over data that resides at the $N$ participants. We describe the total size of the training data sets as $|D| = \sum_{i=1}^{N} |D_i|$, with $p_i = |D_i| / |D| \geq 0$ and $\sum_{i=1}^{N} p_i = 1$. Generally, local training SM and IoT data sets comprise input-output pairs $(x_i, y_i)$, where the input vector with $d$ features is $x_i \in \mathbb{R}^d$ and the labelled output for the input vector $x_i$ is $y_i \in \mathbb{R}$. In a typical learning setting, we want to learn the model parameter vector $w \in \mathbb{R}^d$ that characterises the output $y_i$ with a loss function (i.e. $f_i(w)$) when the training vector $x_i$ (i.e. the SM and IoT data) is given as input. The local models are then aggregated by a central server as:

$$M_G = \sum_{i=1}^{N} p_i M_i \qquad (7)$$

where $M_G$ is the resulting global model and $M_i$ is the update of the local model trained on the local data set of the $i$th participant. Consequently, the aim of FL is to learn the global model with a parameter aggregation method, subject to the local data sets being stored and processed by the participants, in a federated optimisation problem formulated as:

$$M_G^{*} = \arg\min_{M_G} \sum_{i=1}^{N} p_i F_i(M_G) \qquad (8)$$

where $F_i(\cdot)$ is the loss function of the local model of the $i$th participant. After a predefined number of communication rounds, in which the server and its participants alternate local training with the exchange of training results, the solution to the optimisation problem (8) converges to the globally optimal model.
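The iterative cycle of initiation, local training and aggregation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited framework: it assumes a one-parameter linear model, a squared-error loss, synthetic data, and an aggregation step that weights each local model by its share of the total training data; all names and hyperparameters are hypothetical.

```python
import random

def local_update(w_global, data, lr=0.5, epochs=5):
    # A participant refines the global weight on its own (x, y) pairs using
    # gradient descent on the mean squared error of the model y = w * x.
    w = w_global
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def aggregate(local_weights, sizes):
    # Server step: average the local models, weighted by local data set size.
    total = sum(sizes)
    return sum(w * n / total for w, n in zip(local_weights, sizes))

def federated_training(datasets, rounds=20):
    w_global = 0.0  # model initiation
    sizes = [len(d) for d in datasets]
    for _ in range(rounds):
        # Model training: raw data never leaves the participants.
        local_weights = [local_update(w_global, d) for d in datasets]
        # Model aggregation: only the trained parameters are combined.
        w_global = aggregate(local_weights, sizes)
    return w_global

rng = random.Random(1)

def make_data(n):
    # Synthetic local data: y = 3x plus small Gaussian noise, x in [0, 1].
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 3.0 * x + rng.gauss(0, 0.05)))
    return data

datasets = [make_data(20), make_data(200), make_data(50)]  # unbalanced participants
w = federated_training(datasets)
print(round(w, 2))
```

In this toy setting the aggregated weight approaches the slope used to generate the data even though the server only ever sees one number per participant per round.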
3.3.3 Federated learning types.
FL is highly beneficial in the development of ML models where data is spread across domains (Jiang et al., 2020). Its ability to keep data secure during training allows a wide range of domains to benefit from ML (Truong et al., 2021). FL helps overcome domain-specific restrictions on user data by letting multiple domains, which may have similar or dissimilar points of interest, cooperate and pool the benefits of their data. Knowledge, or more precisely statistical information derived from user data, is used in real time for a wide range of applications in many disciplines. Every click a user performs in cyberspace and on social networks is recorded to generate such derived statistical information, and the use cases built on it may belong to the same domain or to separate domains (Mothukuri et al., 2021). User data for this purpose can also be gathered from domains in other sectors. FL can thus be classified as horizontal, vertical or transfer learning according to how data is distributed in the FL environment, that is, which elements differ and which overlap across heterogeneous data, and how data flows between the participating entities (Mothukuri et al., 2021; Jiang et al., 2020). Figure 7 illustrates the difference between these three subcategories.
3.3.3.1 Horizontal federated learning.
Horizontal FL involves data sets that share the same features but differ in the instances they represent (Mothukuri et al., 2021). In this scenario, multiple entities with similar data characteristics collaborate to train a global model while maintaining data privacy.
One prominent example of this type is Google’s implementation of FL for its keyboard application. Google’s keyboard improves its predictive capabilities by learning from typing patterns across numerous Android devices without accessing individual users’ data directly (Hard et al., 2018). Updates are aggregated and averaged to enhance the model’s performance in predicting text input, demonstrating how horizontal FL can improve user experience while preserving privacy.
In the health-care domain, researchers are developing ML models to predict cancer risks using sensitive medical data. Given the privacy regulations surrounding medical records, horizontal FL allows the aggregation of insights from multiple institutions without sharing raw data (Xu et al., 2021). This approach ensures that critical medical data remains secure while contributing to the development of more accurate predictive models.
In both examples, the essence of horizontal FL is to aggregate information from multiple sources with shared features, enabling the development of robust models without compromising the privacy of individual data sources.
3.3.3.2 Vertical federated learning.
Vertical FL is a type of FL where data from different domains or sources is used to create a unified global learning model while preserving user privacy (Jiang et al., 2020). This approach is particularly effective when data sets have a high degree of overlap in the instances they cover but differ significantly in the features they provide. Vertical FL enables the integration of these diverse features into a cohesive model without the need for direct data sharing between entities (Mothukuri et al., 2021).
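The distinction between the horizontal and vertical settings comes down to how one logical table is partitioned across parties. The following minimal sketch uses entirely hypothetical toy records and helper names to show the two layouts: in the horizontal case each party holds different users with the full feature set, and in the vertical case each party holds the same users with only some of the features.

```python
# Hypothetical toy records: user id -> feature values.
records = {
    "u1": {"age": 34, "posts": 120, "steps": 8000},
    "u2": {"age": 27, "posts": 45, "steps": 12000},
    "u3": {"age": 41, "posts": 300, "steps": 3000},
    "u4": {"age": 19, "posts": 10, "steps": 15000},
}

def horizontal_split(records, user_groups):
    # Each party holds different users but every feature for those users.
    return [{u: dict(records[u]) for u in group} for group in user_groups]

def vertical_split(records, feature_groups):
    # Each party holds the same users but only a subset of the features.
    return [
        {u: {f: row[f] for f in group} for u, row in records.items()}
        for group in feature_groups
    ]

h_parties = horizontal_split(records, [["u1", "u2"], ["u3", "u4"]])
v_parties = vertical_split(records, [["age", "posts"], ["steps"]])

print(h_parties[0])  # users u1 and u2 with all three features
print(v_parties[1])  # all four users with the "steps" feature only
```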
In vertical FL, participating entities do not exchange information about the specific data they hold. Instead, each entity trains its local model on its unique data and shares only the model parameters with a central server. The server then aggregates these parameters to update the global model. Once the global model is refined, each entity receives only the updated model parameters relevant to their own data, rather than the raw data from other entities (Jiang et al., 2020). This process ensures that data remains confidential while allowing the collaborative enhancement of the model’s performance.
In the retail industry, vertical FL can be applied to combine data from different retail chains that have overlapping customer bases but different product offerings. For example, one retailer might have extensive data on customer purchase histories, whereas another may have detailed information on product preferences. By using vertical FL, these retailers can develop a comprehensive customer profile and improve personalised marketing strategies without exposing individual customer data to competitors (Mothukuri et al., 2021).
Similarly, in the banking sector, vertical FL can be used to enhance fraud detection models. Banks with different types of transaction data and customer behaviours can collaborate by training local models on their respective data and sharing aggregated model updates. This allows for the creation of a robust fraud detection system that benefits from diverse data sources while keeping sensitive information secure (Jiang et al., 2020).
Some vertical FL implementations rely on a third-party intermediary and encryption mechanisms to ensure that only common statistical information is shared (Jiang et al., 2020). However, recent studies have shown that effective vertical FL can be implemented without such intermediaries (Yang et al., 2019b). Industries such as smart retail and smart banking, where extensive and varied user data is available, stand to benefit significantly from vertical FL by leveraging shared insights to enhance model accuracy and decision-making capabilities while maintaining data privacy (Jiang et al., 2020).
3.3.3.3 Federated transfer learning.
Federated Transfer Learning (FTL) is a specialised application of the traditional transfer learning methodology within the context of FL environments. Transfer learning involves leveraging a pre-trained model, originally trained on a broad data set, to address a new but related problem with a smaller, domain-specific data set (Mothukuri et al., 2021). The core objective of transfer learning is to adapt an existing model to a new application by refining it with data specific to the target domain, thereby enhancing performance and efficiency compared to training a model from scratch (Jiang et al., 2020).
In FTL, a global model is pre-trained on a large, diverse data set and then fine-tuned on data from various local sources. This approach allows entities with limited local data to benefit from the rich features learned by the global model. By sharing only model updates rather than raw data, FTL maintains data privacy while enabling the development of effective, application-specific models (Chen et al., 2020b).
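The benefit of starting from a pre-trained model when local data is scarce can be shown with a toy example. This is a minimal sketch under stated assumptions, not an FTL framework from the cited literature: the model has a single parameter, the source and target data are synthetic, and the training budget and learning rate are hypothetical. Only the parameter `w` would be exchanged between parties; the target samples stay local.

```python
def mse(w, data):
    # Mean squared error of the model y = w * x on the given samples.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.5, steps=50):
    # Plain gradient descent on the mean squared error.
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Source domain: plentiful synthetic data following y = 2x.
source = [(i / 100, 2.0 * i / 100) for i in range(1, 101)]
# Target domain: only four local samples following y = 2.5x.
target = [(0.2, 0.5), (0.4, 1.0), (0.6, 1.5), (0.8, 2.0)]

w_pretrained = train(0.0, source)                    # broad pre-training
w_finetuned = train(w_pretrained, target, steps=5)   # brief local fine-tuning
w_scratch = train(0.0, target, steps=5)              # same budget, no pre-training

print(mse(w_pretrained, target), mse(w_finetuned, target), mse(w_scratch, target))
```

With the same small local training budget, the fine-tuned model fits the target data better than both the unadapted pre-trained model and a model trained from scratch, which is the effect FTL aims to exploit.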
Consider a health monitoring application where a global model is pre-trained on a vast data set of general health metrics. Individual health-care providers can then use their specific patient data to fine-tune this global model. This allows each provider to create a personalised health monitoring system for their patients without needing to share sensitive patient data. For instance, a global model trained on data from various health conditions can be adapted to predict specific health risks based on the data from individual medical facilities (Mothukuri et al., 2021).
In smart home environments, FTL can be used to personalise user experiences. A global model, trained on data from a wide range of smart home devices, can be adapted to cater to the specific needs of individual users. For example, a model trained to optimise energy usage across different homes can be fine-tuned to manage energy consumption effectively based on the unique usage patterns of a particular household. This approach enables the creation of highly personalised and efficient smart home systems while preserving user privacy (Jiang et al., 2020).
FTL offers substantial advantages by combining the broad knowledge of pre-trained models with the specificity of local data (Hassanin et al., 2025). It enhances the performance of ML applications in scenarios with limited data availability, making it particularly useful in federated environments where data privacy is crucial. By leveraging pre-trained models and fine-tuning them with local data, FTL enables the development of tailored solutions that are both effective and privacy-preserving (Chen et al., 2020b).
3.3.4 Federated learning and Internet of Things–integrated social media networks privacy.
In IoT-integrated SM networks, FL (Yang et al., 2019a) provides a PP framework for distributed learning models. This approach allows models to be trained across multiple participants without requiring direct access to their private data. By aggregating only model updates rather than raw data, FL aims to enhance privacy while developing global models.
Despite its PP intentions, FL is not immune to privacy risks. Private information can be inferred through inference attacks, even when data remains localised. Studies (Geiping et al., 2020; Shokri et al., 2017; Olatunji et al., 2021; Truong et al., 2021) have highlighted vulnerabilities such as model inversion attacks (Geiping et al., 2020), where malicious adversaries can deduce sensitive information by analysing the patterns encoded in the updates from local models. As the global model is built from these local updates, the patterns in the local data can be exploited to reconstruct sensitive information about individuals.
One of the primary methods to mitigate privacy risks is DP (Wei et al., 2020). DP techniques add noise to either the local model updates or the participants’ data sets (Zhao et al., 2020b; Wei et al., 2020). This approach ensures that only perturbed data is used, which helps obscure individual contributions. For example, Lo et al. (2021) proposed a DP-based FL architecture that offers formal privacy guarantees, significantly reducing the efficacy of model inversion attacks while maintaining acceptable accuracy levels.
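One common way to add noise to local model updates is to bound each update's norm and then perturb it with Gaussian noise. The sketch below illustrates that general pattern only; it is not the architecture of Lo et al. (2021) or any other cited study, the clipping norm and noise multiplier are hypothetical values, and translating them into a formal (epsilon, delta) guarantee requires a privacy accounting step that is not shown.

```python
import math
import random

def clip_update(update, clip_norm):
    # Scale the update down so that its L2 norm is at most clip_norm.
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def privatise_update(update, clip_norm, noise_multiplier, rng):
    # Clip first, then add Gaussian noise with std = noise_multiplier * clip_norm.
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [u + rng.gauss(0.0, sigma) for u in clipped]

rng = random.Random(42)
raw_update = [3.0, -4.0]                           # L2 norm of 5
clipped = clip_update(raw_update, clip_norm=1.0)   # approximately [0.6, -0.8]
noisy = privatise_update(raw_update, 1.0, 0.5, rng)
print(clipped, noisy)
```

Clipping limits how much any single participant can influence the aggregate, which is what allows the added noise to mask individual contributions; a larger noise multiplier strengthens privacy at the cost of accuracy.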
A more recent development in PP combines DP with SMC protocols (Truex et al., 2019). This hybrid approach protects the characteristics of data uploaded by each participant. However, SMC protocols can introduce substantial computational and communication overheads (Nguyen et al., 2021).
Beyond those risks and attacks, in large-scale FL systems, participants are often randomly selected for training rounds. This randomness can undermine privacy guarantees, as malicious actors might attempt to infer sensitive data or disrupt the process by corrupting or faking data/model updates (Truong et al., 2021; Tolpegin et al., 2020; Hu et al., 2019; Wang et al., 2021; Ning et al., 2021). Even participants who otherwise follow the protocol might compromise privacy by seeking to extract information from the global model (Wei et al., 2020).
Moreover, FL systems face challenges related to accountability and free-riding. In a decentralised setting, it is difficult to monitor and verify the accuracy of local models or the data they are trained on. This lack of oversight makes it challenging to detect and address issues such as data poisoning, where malicious participants deliberately degrade the performance of the global model (Truong et al., 2021; Lin et al., 2019). In addition, because local models are evaluated in isolation, it is hard to identify which participants are contributing negatively to the global model’s performance. This lack of accountability can hinder the effective use of FL (Lo et al., 2021).
To address these privacy and accountability challenges, blockchain technology (Le Nguyen et al., 2020) has been proposed as a complementary solution to FL (Ma et al., 2020). Blockchain offers an immutable and transparent transaction record, which helps ensure adherence to protocols and facilitates the detection of malicious activities, thereby improving privacy management and overall system integrity.
3.3.5 Blockchain technology.
Blockchain operates as a distributed, shared ledger, where blocks of data are chronologically linked to form an immutable chain (Le Nguyen et al., 2020). Each block references the previous one and once added, its transaction details cannot be altered or removed, preserving the system’s integrity. Blockchain architecture comprises multiple layers: the data layer, network layer, consensus layer, contract layer, incentive layer and application layer. Deployed within a peer-to-peer network, a blockchain system relies on a consensus mechanism to validate block data, ensuring consistent data storage across all participants (Zhang et al., 2021a). This consensus process is essential for maintaining trust and accuracy in the distributed network.
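The way each block references its predecessor, and why that makes later changes detectable, can be sketched with a simple hash chain. This is a minimal single-node illustration with hypothetical field names and SHA-256 as an assumed hash function; it omits the network, consensus, contract and incentive layers entirely.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder predecessor hash for the first block

def block_hash(index, prev_hash, data):
    # Hash the block contents together with the previous block's hash.
    payload = json.dumps({"index": index, "prev": prev_hash, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_block(chain, data):
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append({"index": index, "prev": prev_hash, "data": data,
                  "hash": block_hash(index, prev_hash, data)})

def is_valid(chain):
    # Every block must point at its predecessor and match its own stored hash.
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else GENESIS_PREV
        if block["prev"] != expected_prev:
            return False
        if block["hash"] != block_hash(block["index"], block["prev"], block["data"]):
            return False
    return True

chain = []
for record in ["sensor reading A", "post B", "model update C"]:
    append_block(chain, record)
print(is_valid(chain))   # True
chain[1]["data"] = "tampered"
print(is_valid(chain))   # False
```

Altering the data in any block changes its hash, so the check fails unless that block and every later block are recomputed, which in a real deployment would also have to be accepted by the other participants.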
The key advantage of a blockchain is its use of distributed computing technology, which helps resolve load-sharing issues (Zhang et al., 2021a). Because distributed computing provides graceful degradation, blockchain technology is regarded as highly trustworthy for handling sensitive information such as transaction processing, medical records, management operations or voting (Zhang et al., 2021a). Blockchain technology draws on cryptography, mathematics, algorithms and economic models. It is an integrated multi-field infrastructure architecture that combines peer-to-peer networks and uses distributed consensus algorithms to overcome classic distributed database synchronisation problems (Puthal et al., 2018). The six key features that characterise a blockchain are listed below.
Decentralised: the core property of blockchain, which implies that data may be captured, stored, and updated on various systems rather than relying on a centralised one.
Transparent: the blockchain system’s data record is transparent to each participating entity, and each of these entities can further update the data, making it transparent and trustworthy.
Autonomy: due to the consensus basis, any entity on the blockchain system may securely transmit or update data; the aim is to build confidence from a single entity to the entire network, and no one can intervene.
Trustworthiness: data transactions are constantly reviewed and authenticated, with their integrity ensured by a cryptographic method (hash) established for each one and stored in a block over the blockchain, ensuring that it cannot be changed or amended (persistence).
Immutable: any validated transaction record is preserved in perpetuity and cannot be modified unless an entity controlling a majority of the network (the so-called 51% attack) verifies and flags the altered blocks as valid at the same time (Zhang et al., 2021a).
Anonymity: blockchain technology addresses the peer-to-peer trust problem, allowing data transmission and even transactions to be anonymous; all that is needed is the entity’s blockchain address (Xinyi et al., 2018).
3.3.6 Consensus mechanisms.
In the blockchain, obtaining agreement (consensus) among untrustworthy entities is an instance of the Byzantine Generals (BG) Problem, which was raised in Lamport et al. (2019). In the BG Problem, a group of generals, each commanding a section of the Byzantine army, surround a city. Some generals want to attack, whereas others prefer to withdraw. However, if only some of the generals attack the city, the attack will fail. As a result, they must agree on whether to attack or withdraw. In a distributed environment, reaching such an agreement can be difficult. The same challenge arises in a blockchain network because it is decentralised: there is no central node that ensures all distributed ledgers are identical. Protocols are therefore required to guarantee that ledgers at multiple entities remain consistent. Several common approaches to reaching blockchain consensus are outlined next.
Proof of Work (PoW): is the most widely used consensus mechanism, notably in Bitcoin (Ma et al., 2020). PoW requires participants, known as miners, to solve complex mathematical puzzles to validate transactions and create new blocks. This process is computationally intensive and demands significant energy resources (Ma et al., 2020). The primary advantages of PoW include its high security, as the substantial computational work required makes attacks costly, and its promotion of decentralisation, as anyone with the necessary hardware can participate in mining. However, PoW also has notable disadvantages, including scalability issues as the network grows and high energy consumption, which raises environmental concerns (Le Nguyen et al., 2020).
Proof of Stake (PoS): offers an energy-efficient alternative to PoW. In a PoS system, validators are chosen based on the number of coins they hold and are willing to stake as collateral, rather than solving puzzles (Lamport et al., 2019). The advantages of PoS include its lower energy consumption compared to PoW and improved scalability due to reduced computational demands. However, PoS has its own set of disadvantages, such as the potential for centralisation if a few participants hold a large proportion of the stake and the risk of attacks like long-range attacks, due to the lower cost of entry for validators. Examples of PoS implementations include Blackcoin, which uses a combination of stake size and randomness for block generation (Vasin, 2014), and Peercoin, which incorporates stake age into its selection process (King and Nadal, 2012).
Practical Byzantine Fault Tolerance (PBFT): is designed to handle Byzantine faults and is used in permissioned blockchains such as Hyperledger Fabric (Castro et al., 1999; Androulaki et al., 2018). PBFT operates through a series of rounds in which a primary node proposes a block and the other nodes vote to reach a consensus. The advantages of PBFT include its strong security, as it tolerates Byzantine behaviour from fewer than one-third of the nodes (f faulty nodes out of at least 3f + 1), and its low computational requirements compared to PoW. However, PBFT also has disadvantages: it scales poorly to large, open networks because of its high communication overhead, and it offers limited decentralisation because it requires a known set of nodes.
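The PBFT fault bound can be made concrete with a few lines of arithmetic. The sketch below assumes the standard requirement of at least 3f + 1 replicas to tolerate f Byzantine replicas and a quorum of 2f + 1 matching messages; the function names are illustrative and are not taken from any PBFT implementation.

```python
def max_faulty(n: int) -> int:
    """Largest number f of Byzantine replicas that n replicas can tolerate (n >= 3f + 1)."""
    if n < 1:
        raise ValueError("need at least one replica")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Number of matching messages needed to make progress: 2f + 1."""
    return 2 * max_faulty(n) + 1


# The tolerated fraction approaches, but never reaches, one-third of the network.
for n in (4, 7, 10, 100):
    f = max_faulty(n)
    print(f"n={n:3d}  tolerated faults f={f:2d}  quorum={quorum_size(n):2d}  fraction={f / n:.3f}")
```

With four replicas only one fault is tolerated, which illustrates why the communication overhead of PBFT grows quickly with the size of the network.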
The advantages and disadvantages of these consensus mechanisms vary. A simple comparison of them is presented in Table 4.
Comparison of consensus mechanisms
| Property | PoW | PoS | PBFT |
|---|---|---|---|
| Blockchain type | Public/open | Public/open | Private/permissioned |
| Computational power | High | Medium | Low |
| Energy efficiency | Low | High | High |
| Scalability | Limited | High | Limited |
| Security | High (due to high computational cost) | Moderate (risk of centralisation) | High (tolerates fewer than one-third faulty nodes) |
| Decentralisation | High | Moderate (potential for centralisation) | Low (requires known participants) |
| Example | Bitcoin (Ma et al., 2020) | Peercoin (King and Nadal, 2012) | Hyperledger Fabric (Castro et al., 1999) |
3.3.7 Blockchain architecture and types.
Figure 8 depicts the structure of the blockchain. A blockchain’s structure may be characterised as a series of blocks that contain several transactions and each block is linked to the preceding block in a chain-like pattern. A block consists of two parts: a header and a body. The current block index and version number, timestamp, nonce and the preceding as well as the current block’s hash value information are all contained in the block header. The block body keeps track of transaction data, transaction metadata and the overall number of transactions in an ever-growing list, as described in Table 5.
Block body contents
| Property | Description |
|---|---|
| Block index | Sequential number assigned to each block in the chain |
| Block version | Specifies the set of block validation criteria/rules to be used |
| Timestamp | Time of block generation in universal time as seconds |
| Nonce | Value calculated for each block’s hash, aiding in block creation and verification |
| Prev hash | Hash value of the preceding block in the chain |
| Current hash | Hash value of the current block |
| Transactions | Data stored in each block, dependent on the blockchain application |
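To illustrate how the fields in Table 5 fit together, the following minimal sketch builds a toy chain in which each block stores the hash of its predecessor and a nonce found by a PoW-style search. The class and function names, the JSON serialisation and the leading-zeros difficulty rule are assumptions made for illustration; production blockchains hash a compact header containing a Merkle root rather than the raw transactions.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass
class Block:
    """Toy block holding the fields listed in Table 5 (names are illustrative)."""
    index: int
    version: int
    timestamp: int
    prev_hash: str
    transactions: list
    nonce: int = 0
    current_hash: str = ""

    def digest(self) -> str:
        # Hash every field except current_hash itself.
        payload = json.dumps(
            [self.index, self.version, self.timestamp,
             self.prev_hash, self.transactions, self.nonce]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mine(block: Block, difficulty: int) -> Block:
    """Increase the nonce until the digest starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    while True:
        candidate = block.digest()
        if candidate.startswith(target):
            block.current_hash = candidate
            return block
        block.nonce += 1


def append_block(chain: list, transactions: list, difficulty: int = 3) -> list:
    """Create, mine and append a block linked to the current chain tip."""
    prev_hash = chain[-1].current_hash if chain else "0" * 64
    block = Block(len(chain), 1, int(time.time()), prev_hash, transactions)
    chain.append(mine(block, difficulty))
    return chain


def is_valid(chain: list, difficulty: int = 3) -> bool:
    """Check stored hashes, the difficulty rule and the prev-hash links."""
    for i, block in enumerate(chain):
        if block.current_hash != block.digest():
            return False
        if not block.current_hash.startswith("0" * difficulty):
            return False
        if i > 0 and block.prev_hash != chain[i - 1].current_hash:
            return False
    return True


demo_chain = []
append_block(demo_chain, ["alice->bob:5"])
append_block(demo_chain, ["bob->carol:2"])
print("valid before tampering:", is_valid(demo_chain))
demo_chain[0].transactions = ["alice->mallory:500"]  # tamper with a stored record
print("valid after tampering:", is_valid(demo_chain))
```

Changing a stored transaction alters the block digest, so the stored hash no longer matches and validation fails; this is the mechanism behind the persistence property discussed earlier.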
Based on their architectural design and application scenario, blockchain systems can be categorised into three types: public, private and consortium blockchain. These three blockchain types are described in detail below and reviewed in Table 6.
Types of blockchain and their properties
| Property | Public | Consortium | Private |
|---|---|---|---|
| Consensus operator | Miners | Selected miners | One dominant entity |
| Read permission | Public | Public/restricted | Public/restricted |
| Consensus process | Permissionless | Permissioned | Permissioned |
| Tampering with records | Nearly impossible | Possible | Possible |
| Efficiency | Low | High | High |
| Centralised | No | Partial | Yes |
Public Blockchain: A public blockchain does not require any permissions; because anybody may join the network, it is also known as a permissionless blockchain. It is open-ended, which means the database is available to the world, and no single entity can own or control it. All entities may participate in reading, writing and verifying data, and any entity can join or leave the network at any time. Bitcoin is the most well-known public blockchain (Vujičić et al., 2018).
Private Blockchain: A private blockchain, also known as a permissioned blockchain, requires authorisation to use and necessitates the authentication of the network’s participants, who are normally known to one another (Lamport et al., 2019). It is a closed-ended blockchain operated by a single entity or private organisation with authority over the network. Only a limited number of entities may write data to a private blockchain, while all or some participants have the right to read. Entities cannot join the network unless the operator invites them and gives them permission. Hyperledger (Androulaki et al., 2018) is a prime illustration of this.
Consortium Blockchain: The consortium blockchain is a hybrid of public and private. Instead of a single entity, a pre-selected group of entities manages the consensus process. Read access may be open to all entities or restricted to a subset of participants. Permissioned deployments of Ethereum (Vujičić et al., 2018) are an example of a consortium blockchain; the Ethereum main network itself is public.
3.3.8 Smart contract and blockchain.
Smart contracts (SCs) are a cornerstone of blockchain technology (Macrinici et al., 2018). They are defined as “a set of promises specified in digital form, including agreements on which contract parties can fulfil these commitments” (Bocek and Stiller, 2018). The inherent characteristics of blockchain – decentralisation, immutability and transparency – make it an ideal platform for executing SCs. By using SCs, blockchain can facilitate automated and secure transactions without the need for intermediaries, ensuring that all processes are traceable and tamper-proof (Vacca et al., 2021).
They offer a range of practical applications across various industries:
Financial services: SCs are widely used in decentralised finance platforms (Macrinici et al., 2018). For instance, in lending platforms like Compound and Aave, SCs manage the collateral and loan repayment processes automatically, ensuring that terms are met without intermediaries.
Supply chain management: SCs enhance transparency and traceability in supply chains. For example, IBM’s Food Trust network uses SCs to track the journey of food products from farm to table, ensuring quality and authenticity (Bocek and Stiller, 2018).
Real estate: SCs streamline property transactions. Platforms like Propy use SCs to automate the transfer of ownership and manage escrow services, reducing paperwork and accelerating the buying and selling process (Bocek and Stiller, 2018; Vacca et al., 2021).
Health care: SCs manage patient data and consent. For instance, the MediLedger project uses SCs to track and verify pharmaceutical supply chains, improving drug safety and compliance (Bocek and Stiller, 2018).
Legal and compliance: SCs automate legal agreements and compliance checks. For example, OpenLaw uses SCs to create and enforce legal agreements, providing a decentralised platform for contract management (Bocek and Stiller, 2018; Macrinici et al., 2018).
These examples highlight how SCs can automate and secure transactions, enforce agreements without intermediaries and ensure transparency and traceability in various domains. By integrating SCs with blockchain technology, these applications benefit from reduced costs, enhanced security and immutable records.
SCs on the blockchain have the following properties:
contractual terms for blockchain transactions can be partially or completely enforced;
they provide public ways of sending future transactions to the blockchain, which may later be used to convey data;
they offer low-cost security for blockchain transactions;
they perform secure transactions without exhausting all of the resources of the mining entity, which would otherwise result in a denial of service (Bosamia and Patel, 2020);
they perform calculations, save data and make transfers from one account to another;
when mining a new block, all miners must execute the SC, which makes SC execution considerably costlier than a simple blockchain transaction; and
SCs have a time restriction for their execution; if this limit is surpassed, the transaction is halted and discarded, so the blockchain state remains untampered (Bosamia and Patel, 2020). These safeguards are in place to prevent malicious entities from restricting the deployment of new blocks on the blockchain.
Decentralisation provides an additional layer of security for blockchain systems. By maintaining copies of the blockchain across a widely distributed network, vulnerabilities are minimised. Attacking such a network requires immense computational power, making it highly resilient to attacks. Moreover, consensus protocols, which do not rely solely on mining, contribute further to network security. Access to the blockchain is controlled through authorisation, enhancing its robustness. Consequently, SCs function as a set of operational rules and data states embedded within the blockchain, supporting a secure and efficient system for managing digital agreements.
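The idea of an SC as a set of operational rules and data states, executed under a bounded budget, can be sketched as follows. This is a toy model under stated assumptions: the `EscrowContract`, `StepMeter` and `execute` names are hypothetical, the step budget is only a stand-in for the execution limit described above, and no real SC platform or language is being reproduced.

```python
import copy


class OutOfSteps(Exception):
    """Raised when a contract call exceeds its execution budget."""


class StepMeter:
    """Counts abstract execution steps and aborts once the limit is exceeded."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, steps: int) -> None:
        self.used += steps
        if self.used > self.limit:
            raise OutOfSteps(f"limit of {self.limit} steps exceeded")


class EscrowContract:
    """Hypothetical escrow: the buyer deposits, and funds are released on delivery."""

    def __init__(self, buyer: str, seller: str, price: int):
        self.state = {"buyer": buyer, "seller": seller, "price": price,
                      "deposited": 0, "delivered": False, "paid_out": False}

    def deposit(self, sender: str, amount: int, meter: StepMeter) -> None:
        meter.charge(1)
        if sender != self.state["buyer"] or amount != self.state["price"]:
            raise ValueError("only the buyer may deposit exactly the agreed price")
        self.state["deposited"] = amount

    def confirm_delivery(self, sender: str, meter: StepMeter) -> None:
        meter.charge(1)
        if sender != self.state["buyer"]:
            raise ValueError("only the buyer may confirm delivery")
        if self.state["deposited"] != self.state["price"]:
            raise ValueError("nothing has been deposited yet")
        self.state["delivered"] = True
        meter.charge(1)  # releasing the funds costs a further step
        self.state["paid_out"] = True


def execute(contract, method: str, *args, step_limit: int = 10) -> bool:
    """Run one call; on any failure, restore the previous state and report False."""
    snapshot = copy.deepcopy(contract.state)
    meter = StepMeter(step_limit)
    try:
        getattr(contract, method)(*args, meter)
        return True
    except (OutOfSteps, ValueError):
        contract.state = snapshot  # the halted call leaves no trace in the state
        return False


escrow = EscrowContract("alice", "bob", price=10)
print(execute(escrow, "deposit", "alice", 10))
print(execute(escrow, "confirm_delivery", "alice", step_limit=1))  # budget too small
print(escrow.state["delivered"])
```

The second call runs out of budget halfway through, so its partial state change is rolled back, mirroring the rule that a transaction exceeding its limit is halted and discarded.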
3.3.9 Blockchain and privacy preservation.
Many studies have sought to address the trustworthiness concerns of FL systems, with the vast majority relying on blockchain because of its immutability and transparency. For example, Ma et al. (2020) and Nguyen et al. (2021) proposed leveraging blockchain technologies to offer incentives to participants in the form of reputations/stakes and to track the quality of local model updates in a trustworthy manner. In the same context, Nagar (2019) allowed any internet-enabled device to join and add data to a global privacy-preserving data-sharing network built on blockchain technologies. This work harnessed the ever-increasing data of IoT devices and combined it with a protected global learning system backed by a blockchain network with the potential to preserve on-device privacy. While these studies used blockchain to prevent a single point of failure (Truong et al., 2021) by replacing the central server, they eventually introduced third parties, namely the blockchain miners, to store and aggregate models. Because the model parameters are visible to the miners, this results in a model leakage issue. In addition, these studies did not theoretically analyse the convergence of the loss function in blockchain-based FL.
Moreover, as a further enhancement, Zhao et al. (2020a) developed an FL scheme leveraging a reputation mechanism, and DP to help home appliance manufacturers to train an ML model based on their participants’ data while respecting their privacy. Also, to resolve security vulnerabilities in FL, Jia et al. (2021) designed an application model of blockchain-enabled FL in Industrial IoT, and formulated a data protection aggregation scheme based on this model. Specifically, the authors enabled multiple data protection in data and local model sharing through the use of DP and homomorphic encryption (Mendes and Vilela, 2017) methods with blockchain and FL.
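As a rough illustration of how DP can protect a local model update before it is shared, the sketch below clips the update to a bounded L2 norm and adds Gaussian noise. It is not the scheme of any of the cited works: the function names are hypothetical, the noise scale uses the classical Gaussian-mechanism calibration (valid for 0 < ε < 1), and the L2 sensitivity is assumed to equal the clipping norm.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale the update down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def gaussian_sigma(epsilon, delta, sensitivity):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatise_update(update, clip_norm, epsilon, delta, rng=random):
    """Clip the local update, then add independent Gaussian noise to each coordinate."""
    clipped = clip_update(update, clip_norm)
    # Assumption: one participant changes the clipped sum by at most clip_norm.
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip_norm)
    return [v + rng.gauss(0.0, sigma) for v in clipped]


local_update = [3.0, 4.0]  # L2 norm 5.0, clipped to norm 1.0 below
noisy_update = privatise_update(local_update, clip_norm=1.0, epsilon=0.5, delta=1e-5)
print(noisy_update)
```

Smaller values of ε give stronger privacy but larger noise, which is the accuracy trade-off that such schemes have to manage.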
Recently, Kumar et al. (2021) proposed a blockchain-based FL framework that uses up-to-date data to improve image recognition and shares the data among hospitals while preserving privacy. They also developed a data normalisation scheme to ensure accuracy as well as privacy. In addition, various platforms have lately been developed that extend blockchain-based FL with newer blockchain technologies such as SCs, for example, Le Nguyen et al. (2020), Lo et al. (2021) and El Rifai et al. (2020). More specifically, El Rifai et al. (2020) presented a blockchain-based FL architecture that combines the benefits of both technologies in the medical setting. In particular, they presented an SC implementation of a coordinating server for an FL algorithm to ensure transparency and consent while exchanging knowledge. Despite these conveniences, privacy issues still arise because intensive sharing of medical data is required to improve service performance and data accuracy in such real-world applications.
4. Cyberattacks, threats and vulnerabilities affecting privacy preservation techniques
In the early stages, PP models designed for data publishing led to unprecedented development in the domain of social and individual private data (Malik et al., 2012). However, the types of IoT data and SM information vary widely, as do the means of implementation (Du et al., 2018). Straightforward adaptation of such models to distributed data environments would therefore run into a number of difficult inconsistencies (Tabassum et al., 2021). At the same time, concerns about privacy and security are restricting the wider usage of SM and IoT data (Ma et al., 2020). For example, user data might be put to unintended uses (e.g. facial recognition, or location-based services for targeted social advertising and recommendation), creating direct or potential privacy threats. As a result, SM and IoT data should not be exchanged directly without accounting for privacy.
Also, although FL spares participants from explicitly revealing their private data, recent studies (Wei et al., 2021; Fredrikson et al., 2015; Geiping et al., 2020; Truong et al., 2021) show that sharing gradients/updates in FL might leak sensitive information about the participants’ private data through either passive or active inference attacks. Model updates, for example, or two consecutive snapshots of the FL model parameters, can leak unintended features of the participants’ training data to adversarial participants. Even worse, an attacker can use trained models or model updates to recover the original training data, as models with high capacity can simply memorise the training data (Wei et al., 2021). Modern Deep Learning (DL) models, in particular, have far more capacity than they require to perform well on their tasks (Ma et al., 2020), which the following attacks exploit.
4.1 Threat model
Before reviewing FL attacks in the context of IoT-integrated SM networks, it is essential to first define the threat models that underpin these attacks. The level of PP and security a technique provides is influenced by the types of adversaries and the adversarial behaviours under consideration. In the literature, two main types of threat models are commonly identified in FL: internal and external attacks, which can occur either during or after the training phase.
Internal attacks originate from entities within the FL framework, such as the FL server or participating clients. These adversaries have direct access to the training process and can manipulate local models or data sets, potentially leading to greater harm. Because internal attackers are part of the trusted environment, their actions often amplify the scope and severity of the attack. For example, they may engage in data poisoning (Tolpegin et al., 2020), where they intentionally introduce malicious data or model updates, ultimately degrading the global model’s performance or accuracy.
Internal attacks are more dangerous than external ones because the adversaries are already embedded in the system and can exploit their privileged access. This is why our analysis of FL threats focuses primarily on internal attacks. The impact of these attacks is particularly concerning during the training phase, where adversaries can perform a range of actions to disrupt the learning process.
External attacks, on the contrary, are initiated by adversaries outside the FL framework. These attackers are not part of the training process and instead target the communication channels between FL participants and the server or seek to exploit the final deployed FL model. Eavesdropping on data exchanges between clients and servers is a common external attack strategy. For instance, an external adversary might intercept gradient updates or model parameters during transmission to infer sensitive information.
In some cases, external attacks can also occur post-training, when an attacker seeks to exploit the final FL model after it has been deployed as a service. While external attacks are generally less invasive than internal ones, they still pose significant privacy and security challenges, especially when secure communication protocols are not in place.
Attacks during the training phase typically aim to compromise the global model, often by leveraging the privileged access that participants or the server have to local data and model updates. For example, an adversary might employ poisoning attacks (Tolpegin et al., 2020) to corrupt the quality of the data or introduce bias into the learning process, undermining the global model’s integrity. In addition, various inference attacks (Fredrikson et al., 2015) may be conducted during training, targeting local updates or aggregated models to extract private information from participants.
After the training phase, adversaries may attempt to uncover details about the model itself, raising privacy concerns. These attacks depend heavily on the adversary’s knowledge of the model, including its architecture and training data. The success of post-training attacks often hinges on the ability to analyse model parameters or outputs to infer sensitive information, as discussed in the context of model inversion and membership inference attacks.
In FL, adversaries can exhibit different behavioural models. The semi-honest (or honest-but-curious) adversary adheres to protocol but seeks to extract additional information from the data or the global model parameters during the training process (Lyu, 2018). This type of adversary is common in real-world scenarios, where entities may not intentionally disrupt the learning process but still aim to learn more from the data than they are entitled to.
Conversely, malicious adversaries deviate from the protocol by manipulating data, adding noise or introducing fake updates into the system (Lyu, 2018). These adversaries may even cooperate with other corrupted participants to amplify the damage, leading to more severe attacks, such as coordinated model poisoning. The malicious behaviour model allows for a wide range of sophisticated and destructive attacks, which can severely degrade the system’s performance.
4.2 Inference attack
As previously stated, a trained ML model contains unintended features that might be exploited to extract private data. Thus, an adversary can use local model parameters/gradients from an FL process, or the difference between two consecutive snapshots of FL model parameters, to infer private data, especially when combined with related background knowledge. Gradients can cause privacy leakage because they are computed from participants’ private training data, and a learning model may be regarded as a representation of the high-level statistics of the data set it was trained on (Lyu, 2018). In DL models, the gradients of a given layer are computed through backpropagation from that layer’s features and the error propagated back from the following layer (Melis et al., 2019). For consecutive fully connected layers, the gradients of the weights are the products of the current layer’s features and the error from the following layer. Likewise, the gradients of the weights for a convolutional layer are convolutions of the layer’s features and the error from the following layer (Melis et al., 2019).
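The leakage described above can be seen in the simplest possible case. The sketch below assumes a single fully connected layer with a bias term, y = Wx + b, and an update computed from a single training sample; under those assumptions the weight gradient is the outer product of the backpropagated error and the input features, and the bias gradient equals the error, so the input can be read off by division. The function names are illustrative, and updates averaged over a batch would not be inverted this directly.

```python
def dense_layer_gradients(features, error):
    """Gradients of one dense layer y = W x + b for a single sample.

    features: the layer input x (private data or features derived from it)
    error:    dLoss/dy, the error backpropagated from the following layer
    """
    grad_w = [[e * x for x in features] for e in error]  # outer product
    grad_b = list(error)                                 # dLoss/db equals the error
    return grad_w, grad_b


def recover_features(grad_w, grad_b):
    """Recover x from shared gradients using any row with a non-zero error term."""
    for row, e in zip(grad_w, grad_b):
        if e != 0:
            return [g / e for g in row]
    raise ValueError("all error terms are zero; nothing to recover")


private_x = [0.5, -2.0, 4.0]
backprop_error = [0.25, -0.5]
shared_grad_w, shared_grad_b = dense_layer_gradients(private_x, backprop_error)
print(recover_features(shared_grad_w, shared_grad_b))  # [0.5, -2.0, 4.0]
```

An observer of the shared gradients therefore learns the layer input exactly in this setting, without ever seeing the raw data.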
The practical implications of inference attacks are significant: they could allow an adversary to reconstruct training samples (model inversion attacks) (Fredrikson et al., 2015; Geiping et al., 2020) or determine whether specific data points were included in the training set (membership inference attacks) (Shokri et al., 2017; Olatunji et al., 2021). For users and organisations, this means that sensitive data could be exposed, leading to potential breaches of privacy and security. In the broader IoT-integrated SM network ecosystem, such attacks undermine trust in data handling processes and can lead to widespread information leakage.
4.2.1 Model inversion attack.
In a model inversion attack, an adversary or a curious server attempts to reconstruct or steal participants’ local data sets. Even though the $i$-th participant’s data set $D_i$ is held locally in FL, the local update $M_i$ must be exchanged with the server, which may disclose the participant’s private information, as demonstrated by model inversion attacks (Fredrikson et al., 2015). Fredrikson et al. (2015), for example, developed a model inversion attack that recovers images from a facial recognition system. Furthermore, the resulting global model $M_G$ might itself cause information leakage (Lu et al., 2019).
As illustrated in Figure 9, an adversary’s process to reconstruct the original training data set from local model updates typically follows these steps: collecting the exchanged local model updates, training an attack model that maps these updates back to raw input data and using the attack model to recover the raw data from the gathered model updates.
Along this line, an adversary may use model inversion to retrieve sensitive information from the training data set, for example, by reconstructing representatives of classes that characterise features in classification ML models. Model inversion attacks do not need active participation by the adversary in the training process (i.e. passive attacks or black box). For instance, by obtaining a properly weighted probability approximation for the target feature vectors, it is possible to reconstruct images of a specific individual from a face recognition model using a model inversion attack (Geiping et al., 2020). The experimental results in this instance reveal that the attack can construct images that are substantially comparable to the individuals’ personal images (Fredrikson et al., 2015). In the context of FL, adversaries might not only examine the trained model parameters but also take part in the training process to observe the updates to the global model across many consecutive communication rounds (i.e. active attacks or white-box), which would enhance the attack results.
Hitaj et al. (2017), for example, developed a Generative Adversarial Network (GAN) attack on deep FL models. A malicious participant in this attack intends to compromise any other participant. The GAN attack takes advantage of the real-time nature of the FL learning process, enabling the adversarial participant to train a GAN that generates prototype instances of the targeted private training data which appear to be drawn from the same distribution as the training data. The GAN technique is therefore intended to reconstruct only class representations rather than specific training records. It should be noted that the GAN attack assumes that all of the training data for a specific class comes from a single participant, in which case the GAN-constructed representations are comparable to the training data.
The practical implications of model inversion attacks are significant: if adversaries successfully reconstruct sensitive information, it can result in severe privacy breaches for individuals and undermine data security for organisations. In IoT-integrated systems, such attacks pose a substantial risk to network reliability and safety, as they expose sensitive data to potential exploitation, thereby compromising the integrity of the entire system (Tolpegin et al., 2020).
4.2.2 Membership inference attack.
A membership inference attack, as the name implies, infers whether a given record was part of the training data. Introduced by Shokri et al. (2017) as an inference attack applicable to any form of ML model, it lets an adversary exploit the global model to obtain knowledge of other participants’ training data. Given a data record and access to the target (trained) model, the adversary’s goal is to determine whether the record belongs to the training data of the target model or participant, i.e. its membership. An attack model is trained specifically to recognise differences in the target model’s predictions on data it was trained on versus data it did not encounter during training. The attack model performs a classification on a data record, indicating whether or not the record belongs to the target participant’s training data set, based on the class of the record and the output of the target model on it. The attack model is made up of a number of models, one for each output class. To train the attack model, shadow models with the same structure and output classes as the target model are constructed, so that they behave similarly to the target model.
Practically, this means that sensitive data can be inferred by an attacker, which may lead to privacy breaches for individuals whose data is exposed. For organisations, this attack can undermine trust in their data handling processes and expose proprietary or sensitive information. In the IoT-integrated SM network ecosystem, such attacks can result in significant privacy and security risks, as adversaries could identify and exploit vulnerabilities in the data handling mechanisms (Shokri et al., 2017).
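The intuition behind membership inference can be sketched with a deliberately simplified stand-in for the attack model. Instead of the per-class attack classifiers of Shokri et al. (2017), the code below fits a single confidence threshold on the outputs of a shadow model and then applies it to the target model’s output; the function names and the example probability vectors are hypothetical and chosen only to show the mechanics.

```python
def fit_threshold(member_outputs, non_member_outputs):
    """Choose the confidence threshold that best separates shadow members
    from shadow non-members; returns (threshold, accuracy on the shadow data)."""
    scored = [(max(p), True) for p in member_outputs]
    scored += [(max(p), False) for p in non_member_outputs]
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted({score for score, _ in scored}):
        correct = sum((score >= candidate) == is_member for score, is_member in scored)
        accuracy = correct / len(scored)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


def infer_membership(target_output, threshold):
    """Predict 'member' when the target model is at least as confident as the threshold."""
    return max(target_output) >= threshold


# Hypothetical shadow-model outputs: more confident on records it was trained on.
shadow_members = [[0.97, 0.03], [0.05, 0.95], [0.99, 0.01]]
shadow_non_members = [[0.60, 0.40], [0.45, 0.55], [0.70, 0.30]]

threshold, accuracy = fit_threshold(shadow_members, shadow_non_members)
print(threshold, accuracy)
print(infer_membership([0.98, 0.02], threshold))
print(infer_membership([0.62, 0.38], threshold))
```

The attack works only to the extent that the target model behaves differently on training and unseen records, which is why overfitted models are the most exposed.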
4.3 Poisoning attack
Poisoning attacks present a critical threat to the robustness of FL systems. Unlike privacy attacks that primarily target sensitive data privacy, poisoning attacks undermine the integrity and effectiveness of the FL process (Tolpegin et al., 2020). These attacks exploit the fact that each participant in FL has access to local training data, increasing the risk of malicious updates being introduced into the global model. During the training phase, poisoning attacks can corrupt either the training data set or the local model, thereby degrading the overall performance and accuracy of the global model. The aggregation of updates from numerous participants means that the impact of a poisoning attack can be significant, as even a small fraction of compromised updates can severely affect the global model’s performance.
In practical terms, poisoning attacks can have severe consequences for various stakeholders (Tolpegin et al., 2020). For users, these attacks can lead to degraded model performance, resulting in inaccurate predictions and reduced trust in the system. For organisations, the repercussions include diminished model reliability, potential financial losses and damage to reputation. The broader IoT-integrated SM network ecosystem is also at risk, as widespread poisoning attacks can undermine the collective efficacy of the network, leading to cascading failures and decreased system resilience.
Poisoning attacks are broadly categorised into two types based on the adversary’s intent: unintended (untargeted) and intended (targeted) poisoning attacks. During the training phase, both types can be conducted against either the data or the model. As shown in Figure 10, poisoned updates can originate from two kinds of poisoning: data poisoning during local data gathering, and model poisoning during local model training. At a high level, both aim to influence the FL model’s performance in an undesirable manner. Data poisoning attacks, however, are often less successful than model poisoning attacks due to the model-exchanging nature of FL with heterogeneous architectures. In general, in FL contexts, model poisoning subsumes data poisoning, since data poisoning attacks ultimately affect only a portion of the updates supplied to the model.
Unintended poisoning attacks, such as the Byzantine attack, involve arbitrary faulty model updates that aim to disrupt the global model’s performance. These attacks can cause significant practical issues by leading to system failures and reduced model accuracy. For instance, Byzantine attacks can degrade the system’s performance to the point where it becomes unreliable for critical applications, affecting both users and organisational operations (Blanchard et al., 2017). In such scenarios, users might experience erroneous predictions, and organisations might face operational disruptions and increased costs. A formal definition of a Byzantine attack may be found in Definition 3.
Definition 3: A truthful participant contributes the update $M_i$ computed from its local model, whereas a malicious participant can contribute any arbitrary update:

$$M_i = \begin{cases} \ast, & \text{if the } i\text{-th participant is Byzantine,} \\ \nabla L_i, & \text{otherwise,} \end{cases}$$

where $\ast$ stands for arbitrary updates and $L_i$ stands for the $i$-th participant’s local model loss function. Blanchard et al. (2017) demonstrated that if there is no privacy defence in the FL, the aggregation of FL can be fully compromised by a single Byzantine participant. Assume $N - 1$ honest participants and one Byzantine participant, with the server aggregating the updates by averaging, $\frac{1}{N}\sum_{i=1}^{N} M_i$. Let $Y$ be any vector in $\mathbb{R}^d$. A single Byzantine participant can always turn the aggregated model into $Y$. In particular, a single Byzantine participant can prevent convergence by uploading the following update:

$$M_N = N\,Y - \sum_{i=1}^{N-1} M_i$$
Such a simple attack reveals FL’s vulnerability to poisoning attacks. Chen et al. (2020a) demonstrated that Byzantine participants may conceal their models’ updates and launch effective attacks when they employ momentum-based optimisers like Adam (Bu et al., 2020) in the FL. Baruch et al. (2019) offered a perturbation range in which the attacker can modify the local model updates without being noticed. The authors also showed that modifications within this range are sufficient to interfere with the learning process and the whole FL system. Thus, they offered the first non-trivial non-omniscient attack usable for distributed learning, making the attack effective and more realistic, and stated that the same configuration of the attack could overcome all state-of-the-art statistics-based defences.
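The single-participant attack attributed to Blanchard et al. (2017) can be illustrated numerically. The following is a minimal sketch, assuming that the server uses plain unweighted averaging and that the attacker knows the honest updates (an omniscient attacker); the function names `mean_aggregate` and `byzantine_update` are hypothetical and are not taken from any FL library.

```python
import random


def mean_aggregate(updates):
    """Coordinate-wise average of a list of equal-length update vectors."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


def byzantine_update(honest_updates, target):
    """Craft one update that forces the average of all N updates to `target`.

    Assumption: the attacker can observe every honest update.
    """
    n = len(honest_updates) + 1  # N - 1 honest participants plus the attacker
    honest_sum = [sum(column) for column in zip(*honest_updates)]
    return [n * t - s for t, s in zip(target, honest_sum)]


rng = random.Random(0)
# Nine honest participants, each with a 3-dimensional update.
honest = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(9)]
target = [5.0, -5.0, 0.0]  # arbitrary vector chosen by the attacker

malicious = byzantine_update(honest, target)
aggregated = mean_aggregate(honest + [malicious])
print(aggregated)  # matches `target` up to floating-point rounding
```

Because averaging is linear, the attacker only needs to cancel the honest contributions and add a scaled copy of the target, which is why robust aggregation rules replace the plain mean in Byzantine-tolerant settings.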
In contrast, intended poisoning attacks are designed with specific goals in mind (Hu et al., 2019). These attacks include label-flipping attacks and backdoor poisoning (Hu et al., 2019; Bagdasaryan et al., 2020). Label-flipping attacks alter the labels of training data to mislead the model, resulting in incorrect classifications (Zhang et al., 2020). For example, a label-flipping attack might cause a model to incorrectly categorise spam emails as non-spam, impacting users’ experience and potentially leading to increased exposure to malicious content (Lyu et al., 2020b). Backdoor poisoning attacks involve injecting a trigger into the model that causes it to behave maliciously under certain conditions (Ning et al., 2021; Bagdasaryan et al., 2020). These attacks can be particularly insidious, as the model may perform normally on general data but exhibit targeted failures when the trigger is present. This can compromise the security and reliability of applications relying on the model, affecting both users and organisations (Zhang et al., 2020; Ning et al., 2021).
The intended poisoning attack can be carried out by any single FL participant, or by a group of colluding participants, against either the data or the local updates/gradients. Lyu et al. (2020b) showed that a single, non-colluding malicious participant can force the model to consistently misclassify a set of input data. In addition, poisoned local updates can be produced by training the local model on compromised local training data, as per Bagdasaryan et al. (2020), and a single attack might be enough to inject a backdoor into the resulting global model. Finally, we point out that the impact on the learning model depends on the amount of training data poisoned and the extent to which malicious participants are involved in the attacks, and that much of the earlier poisoning research has concentrated on Byzantine or backdoor attackers.
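The label-flipping attack described above can be sketched on a toy spam data set. This is an illustrative example only: the helper `flip_labels`, the sample messages and the label names are assumptions introduced here, and a real attack would be applied to a participant's local training data before local model training.

```python
def flip_labels(samples, source_label, target_label):
    """Return a poisoned copy in which every `source_label` becomes `target_label`.

    `samples` is a list of (features, label) pairs; the input list is not modified.
    """
    return [
        (features, target_label if label == source_label else label)
        for features, label in samples
    ]


# Hypothetical local training data held by a malicious participant.
clean = [
    ("win a prize now", "spam"),
    ("meeting at 10", "ham"),
    ("cheap pills", "spam"),
]

# The attacker relabels all spam as non-spam before local training.
poisoned = flip_labels(clean, source_label="spam", target_label="ham")
print([label for _, label in poisoned])
```

A local model trained on `poisoned` would learn to treat spam-like messages as legitimate, and its update would carry that bias into the global model.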
4.4 Free-riding attack
In the context of FL, free-riding attacks pose significant risks to the fairness and efficacy of collaborative learning processes. Free-riding occurs when participants in the FL network benefit from the global model without contributing meaningful data or computational effort. This behaviour can undermine the collaborative fairness of the system (Lin et al., 2019).
In practice, free-riders have several motivations for not contributing: they may lack the data needed to train a local model effectively; they might prioritise privacy concerns, opting to submit fake updates to avoid revealing their data; or they could be unwilling or unable to invest in the computational resources required for model training.
For users, the impact of free-riding is seen in diminished model performance and accuracy, as the contributions of free-riders do not reflect genuine data, potentially skewing the global model. Organisations that deploy FL systems face challenges in ensuring equitable contributions, which can result in inefficient use of resources and reduced overall system reliability. The broader IoT-integrated SM network ecosystem can experience weakened trust in collaborative systems, as free-riding undermines the value of collective data sharing and collaborative efforts. Addressing free-riding is crucial for maintaining the integrity and effectiveness of FL systems, requiring innovative solutions to incentivise genuine participation and penalise non-contributors (Lyu et al., 2020a).
5. Challenges and research opportunities in privacy preservation of social media and Internet of Things data
The new dynamic of real-time interaction between SM and IoT has turned almost every SM user’s activity into a consumable stream of data that can be analysed by ML algorithms (Salim et al., 2022b). With such an endless stream of SM and IoT data comes enormous complexity, along with privacy concerns (Ram Mohan Rao et al., 2018; Cai et al., 2016; He et al., 2017; Batra et al., 2020; Salim et al., 2022b). In this regard, FL has been proposed to address those concerns and offer collaborative ML with a PP mechanism (Wei et al., 2021; Yang et al., 2019a; Ma et al., 2020; Truong et al., 2021). FL ensures privacy by default by restricting the exchange of users’ data in the network. However, although this privacy-aware ML framework (i.e. FL) is appealing in theory, it is not immune to attacks, nor is its supporting technology yet mature enough to address all privacy concerns by default, at least for the time being (Lu et al., 2019; Geiping et al., 2020; Truong et al., 2021; Tolpegin et al., 2020; Lyu et al., 2020a; Baruch et al., 2019; Lin et al., 2019). More specifically, there are still open challenges that should be addressed to enhance the privacy and robustness of FL-based PP frameworks in IoT-integrated SM networks. Furthermore, multiple design objectives that are equally relevant to privacy and utility need to be addressed in PP.
In light of this reality, this section is dedicated to analysing the existing privacy concerns in IoT-integrated SM networks and significant recent advancements in FL-based PP frameworks to provide further insights for their future growth. Along with identifying several key findings and challenges, this section outlines future research directions specific to privacy in IoT-integrated SM networks.
Firstly, the key findings and open challenges related to the privacy of SM as well as IoT data are discussed as follows.
The availability of IoT-integrated SM network data: Data availability is a major challenge for evaluating PP techniques since many data sources do not release their data owing to privacy concerns. There is a scarcity of heterogeneous data sets that combine IoT with SM data, and only a limited number of data sets with ground truth against which the reliability of PP can be assessed. As a result, there is a genuine need for a realistic data set that includes SM and IoT data to assess the trustworthiness of novel PP mechanisms (Salim et al., 2020, 2022b; Mothukuri et al., 2021).
Adversarial zero-day attacks (Abri et al., 2019) and their accompanying mechanisms: FL’s current defence efforts are directed at known vulnerabilities and particular pre-specified malicious behaviours, rendering them less helpful in identifying attacks that fall outside those specified boundaries (Mothukuri et al., 2021). Although this issue applies to nearly every ML application’s defence systems (Abri et al., 2019; Mothukuri et al., 2021), the likelihood is higher in FL because there are not yet many deployments in practice in which the feasibility of multiple attacks has been demonstrated. Recent advances in PP have revealed potential options for preventing such attacks (Popoola et al., 2021; Bu and Cho, 2021).
Assurance of traceability and accountability: Traceability of the global learning model over the life cycle of the underlying learning process is a key issue in FL (Ma et al., 2020; Truong et al., 2021). For example, if the global model is poisoned, we would have to be able to determine which participants’ local data resulted in that poisoning. If the logic behind learning model behaviour is a black box, we are compelled to abandon logical reality and rely entirely on sentient AI (Salim et al., 2025b). There have been a few efforts that combine blockchain technology with FL to record and track the exchanged updates that are added to the global model, with the goal of achieving more transparent tracing of the training process in FL (Jia et al., 2021; Le Nguyen et al., 2020; Ma et al., 2020; Lu et al., 2019).
Optimisation of the privacy-utility trade-off: Current studies demonstrate how to improve the privacy level in FL without compromising the efficiency/accuracy of the data utility (Mothukuri et al., 2021). However, no study has been conducted to determine the optimal protection level or the amount of injected noise. If the level of privacy or the amount of injected noise is insufficient, the participants are exposed to privacy breaches. Conversely, if the privacy level is too high or there is too much noise in the data, the FL model tends to suffer from low accuracy (Geng et al., 2020).
FL participant selection: In FL, the learning process and the basis for selecting participants in each training communication round are critical. Although Zhang et al. (2021b) propose optimal approaches, there is still a need for a sustainable plan for each learning model use case in FL.
Techniques for optimising various ML algorithms: Developing the FL model requires established and standardised optimisation algorithms based on various ML techniques. Numerous aggregation and improvement algorithms have been suggested for optimising or enhancing FL, but there is still a need for dedicated research to develop FL-specific optimisation algorithms for all existing ML applications/use cases. This would make it easier for future FL users/adopters to provide FL-specific solutions (Mothukuri et al., 2021).
Ease of adoption of PP: There is no easy and straightforward method for setting up an FL environment. Although Bonawitz et al. (2019) suggest several aspects to examine before implementation, there is still a need for well-established criteria for deploying a new use case in FL or converting an existing ML environment to a decentralised FL paradigm.
In light of these challenges, several research directions are essential for advancing FL frameworks and improving PP in IoT-integrated SM networks. Firstly, addressing data availability and generation is crucial for evaluating and advancing PP techniques. Future research should prioritise creating and disseminating realistic data sets that integrate SM and IoT data, closely mirroring real-world scenarios, and establishing robust ground truth data sets for validating these techniques (Salim et al., 2020, 2022b; Mothukuri et al., 2021).
In addition, mitigating zero-day attacks remains a priority. Developing proactive strategies and advanced methods to detect and neutralise these attacks will be critical for enhancing the security and resilience of FL frameworks (Abri et al., 2019; Popoola et al., 2021; Bu and Cho, 2021).
Improving traceability and accountability within FL systems is also vital. Leveraging technologies such as blockchain could provide innovative solutions for tracking the impact of local data on the global model and addressing issues like model poisoning (Jia et al., 2021; Le Nguyen et al., 2020; Ma et al., 2020; Lu et al., 2019).
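The idea of tracing which participant submitted which update can be sketched with a simple hash-chained log. This is not a blockchain (there is no consensus protocol or distribution across nodes) and it is not the design of any of the cited works; it is a minimal sketch, assuming a single trusted log keeper, with the hypothetical helpers `record_update` and `verify_ledger`. It only shows how chaining hashes makes later tampering with a recorded update detectable.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def record_update(ledger, participant_id, update):
    """Append an update to the ledger, linked to the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    body = {"participant": participant_id, "update": update, "prev": prev_hash}
    ledger.append({**body, "hash": _digest(body)})


def verify_ledger(ledger):
    """Return True only if every entry is intact and correctly chained."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        body = {key: entry[key] for key in ("participant", "update", "prev")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
record_update(ledger, "participant-1", [0.10, -0.20])
record_update(ledger, "participant-2", [0.05, 0.30])
print(verify_ledger(ledger))  # chain is intact

ledger[0]["update"] = [9.9, 9.9]  # someone rewrites a recorded update
print(verify_ledger(ledger))  # tampering is detected
```

With such a record, a poisoned global model could in principle be traced back to the rounds and participants whose logged updates contributed to it.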
Balancing the privacy-utility trade-off should be a major focus, with research aimed at optimising privacy protection while maintaining high model accuracy (Mothukuri et al., 2021; Geng et al., 2020). Effective participant selection strategies and the development of FL-specific optimisation techniques are also necessary to enhance the efficiency and performance of FL models (Zhang et al., 2021b; Mothukuri et al., 2021). Finally, addressing the practical challenges of adopting PP techniques and providing clear guidelines for transitioning to decentralised FL paradigms will facilitate broader implementation and acceptance of these technologies (Bonawitz et al., 2019).
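The privacy-utility trade-off can be made concrete with a small experiment. The following is a minimal sketch, assuming the injected noise follows the standard Laplace mechanism from DP applied to a bounded mean; the function names, the synthetic readings and the chosen privacy budgets (epsilon values) are illustrative assumptions, not settings recommended by the cited studies.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP estimate of the mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(values, epsilon, trials=2000, seed=0):
    """Average absolute error of the private mean over repeated releases."""
    rng = random.Random(seed)
    true_mean = sum(values) / len(values)
    errors = [
        abs(private_mean(values, 0.0, 1.0, epsilon, rng) - true_mean)
        for _ in range(trials)
    ]
    return sum(errors) / trials


# Synthetic sensor-style readings, already within [0, 1].
readings = [(i % 10) / 10 for i in range(100)]

strong_privacy_error = mean_abs_error(readings, epsilon=0.1)  # more noise
weak_privacy_error = mean_abs_error(readings, epsilon=10.0)   # less noise
print(strong_privacy_error, weak_privacy_error)
```

A smaller epsilon gives stronger privacy but a larger expected error, so choosing the noise level amounts to choosing a point on this curve for the application at hand.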
To achieve this, it is crucial to prioritise advancing methodologies that tackle the current limitations of PP techniques. This includes developing realistic data sets that integrate IoT and SM data, devising innovative solutions to counter adversarial attacks, and refining optimisation algorithms tailored to FL. In addition, research should focus on establishing practical guidelines for deploying PP techniques effectively in real-world scenarios and fostering interdisciplinary collaboration to drive progress in this dynamic field.
By pursuing these research directions, the field can advance towards more robust, efficient and privacy-preserving FL frameworks that effectively integrate SM and IoT data, ultimately advancing the state of the art in PP ML.
5.1 Ethical considerations of privacy preservation techniques in Internet of Things-integrated social media networks
In addition to the technical challenges and research opportunities outlined above, it is crucial to address the ethical considerations surrounding PP techniques in IoT-integrated SM networks (Truong et al., 2021). Key ethical aspects include the following:
Informed consent: It is crucial for users to be fully aware of what data is being collected, its usage and the purposes for which it is intended (Lipschultz, 2020). Ethical privacy practices require clear, comprehensible communication and obtaining explicit consent from users, ensuring they are informed about the scope and implications of data collection.
Data minimisation and purpose limitation: Collecting only the data necessary for specific purposes is essential for reducing privacy risks (Herrmann, 2007). Ethical practices should ensure that data collection is restricted to what is essential for achieving the intended objectives, thereby minimising unnecessary exposure.
Transparency and accountability: Organisations must operate with transparency regarding their data handling practices and be accountable for any misuse or breaches (Truong et al., 2021). Implementing robust security measures and clearly communicating data management practices are critical for maintaining user trust and confidence.
Incorporating ethical considerations is essential to developing PP techniques that align with core ethical principles. Policymakers should implement robust regulations and standards that ensure transparency and accountability. This includes frameworks mandating clear communication about data practices, strict consent protocols and specific protections for vulnerable groups. These measures support user rights and foster responsible, respectful data handling in IoT-integrated SM networks.
6. Conclusion
This paper provides an in-depth exploration of PP within IoT-integrated SM networks, emphasising the need to address privacy challenges in these complex, data-intensive environments. It begins by defining SM, IoT and their integration, setting the stage for discussing PP and its associated metrics. A variety of PP techniques are explored, from data reconstruction methods to advanced cryptographic solutions, establishing a comprehensive framework for protecting user privacy. The paper also discusses cyber threats and vulnerabilities, outlining the dynamic threat landscape facing IoT-integrated SM networks. Issues such as unintended data usage and potential data leaks in privacy-aware methods like FL underscore the importance of strengthened security.
Overall, this study advances the understanding of privacy protection in IoT-integrated SM networks, offering valuable insights into privacy techniques, emerging threats and effective mitigation strategies.











