Unveiling influencer-driven PII disclosures in social media discourse

Rosado, Eidan J.; Wang, Ling; Dringus, Laurie; Sun, Junping

doi:10.1108/DTS-05-2025-0112

Purpose

This study examined how influencer power and tier relate to social media engagement and personal information sharing, noting fluctuations of both over time.

Design/methodology/approach

A combination of content analysis, change point detection methods (CUSUM Drift, Change Point Detection) and time series modeling (ARIMA) was used to analyze social media conversations and identify temporal trends in engagement and personal information disclosures.

Findings

The study demonstrates a strong correlation between influencer reach, increased engagement and disclosures, with a declining trend in both engagements (79.78% of threads) and disclosures (66.12% of threads) across 183 conversational threads. An observed 29.41% of posts contained personally identifiable information (PII), with conservative sensitivity analysis suggesting an adjusted prevalence of 15.00% to 20.27% after accounting for automated detection error.

Research limitations/implications

The cross-platform comparison revealed architectural differences between Twitter/X's follower hierarchy and Reddit's community voting structures that influence influencer effects on PII disclosure. Rethinking influencer identification on community-based platforms requires tailored models that consider community norms instead of just follower counts. Manual validation of automated PII detection was infeasible due to data access constraints.

Originality/value

This project provides exploratory insights into how platform architecture may moderate influencer dynamics, with implications for privacy-conscious platform design and future comparative studies.

Introduction

Social media has reshaped industries, crisis communication and socio-cultural research, transforming our understanding of human behavior (Valle-Cruz, López-Chau, & Sandoval-Almazán, 2020) while blurring lines between private and public sharing. Social media influencers play a crucial role in these dynamics, persuading large audiences and impacting marketing campaigns and elections (Okuah, Scholtz, & Snow, 2019). Followers often adopt similar hashtags and language, increasing the risk of identifiable information being shared. Third-party consumers can access users' posts through APIs, relying on consent not to disclose identities. Once public, posts can be archived and used by marketers or adversaries, raising concerns over personal data control (Tene & Polonetsky, 2013; Trepte, 2020), particularly as personal information in viral content can be exploited (Beigi, Shu, Zhang, & Liu, 2018).

Users must be cautious about sharing information online, while organizations collecting this data are responsible for protecting privacy. Poor data handling can result in identifying individuals from ostensibly anonymous posts (Beigi et al., 2018), making it essential to understand extractable data types and potential attack methods (Orooji, Rabbanian, & Knapp, 2023). Despite privacy concerns, incentives motivate users to share personal information (Bélanger & Crossler, 2011; Yang & Huang, 2019), driven by factors such as FOMO (Alutaybi, Al-Thani, McAlaney, & Ali, 2020), guilt (Yang & Huang, 2019) and the desire to manage public personas (Zhang et al., 2020a, b). While research highlights these behavioral factors, it often overlooks how they relate specifically to influencer networks (Farivar, Wang, & Turel, 2022), where engagement-driven frequent posting can diminish self-control and risk assessment as users emulate influencers as role models (Gross & von Wangenheim, 2018; Ki & Kim, 2019). A limited understanding remains of the actual behaviors associated with follower engagement in these networks (Farivar et al., 2022).

Literature review

Social media networks consist of individuals with varying degrees of relationships, often featuring influencers who significantly impact followers' engagement behaviors (Okuah et al., 2019). This influence can produce similar posting patterns, exposing identifiable data that third parties can exploit. Engaging with trends that encourage personal disclosure can lead to privacy-invasive behaviors (Valle-Cruz et al., 2020), and users may not realize their shared data can be aggregated to identify them or be misused by malicious actors (Keküllüoğlu et al., 2020; Moura & Serrão, 2016). This study critically reviews the literature on influencer identification, network analysis methods, privacy risks, personally identifiable information (PII) and social media disclosure.

Factors in privacy-adverse behavior

Trust and behavioral mimicry

Zhang and Choi (2022) observed that influencers often share entertaining content to capture audience attention before building trust to achieve their objectives. However, social networking apps can create a deceptive sense of security, leading users to believe they are sharing information safely when third-party access may be extensive (Waldman, 2016). Even privacy-concerned users may still participate in social media to gain social capital or fulfill psychological needs, as obtaining social capital often requires compromising privacy (Westin & Chiasson, 2019).

Social identity theory (Tajfel & Turner, 1979) suggests that users may disclose PII to align with perceived community norms or to impress the influencers they follow. If these norms evolve to discourage PII sharing, followers may alter their behavior accordingly. Social learning theory (Bandura, 1977) suggests that PII disclosures may increase when individuals observe influencers or peers engaging in this behavior and mimic it. Conversely, declines in PII sharing might result from decreased influencer relevance or shifting audience norms around disclosure.

Social media as a “technology of self”

Information shared on social media varies in potential risk. Yang and Huang (2019) explored types of self-disclosures and their motivations, introducing Guilty Information Disclosure. Their findings indicate that contradictory behaviors such as seeking confirmation and imitation are common, with social media serving as a means for self-expression where individuals convey guilt and other sensitive information.

When users reveal past actions or details to engage with others, they risk divulging PII such as names or location, sensitive health information (Liao, 2019) or relationship details (Yang & Huang, 2019). Such disclosures can lead to discrimination by entities like insurers or employers (Weston & Wells, 2020), and analysis of comment interactions can compound this risk through discrimination based on network connections (Beigi & Liu, 2020).

Influencers and influencer networks

The influencer marketing industry was projected to exceed $16 billion in 2022 (Shopify, 2023). Influencer marketing remains one of the most sought-after forms of marketing because these social media users can reach other users and assuage their hesitations in purchasing decisions. In a survey by Ki and Kim (2019), respondents admitted they would commit to a purchase if an influencer endorsed the product or service. Influencers are not limited to marketing activities, however. When targeting an audience, several types of influencers with specific goals and compatibility considerations exist.

Identifying influencers

Okuah et al. (2019) define an influencer as someone with credibility and a significant following, capable of impacting individuals' decisions. Rahman (2022) categorizes influencers by follower count into Nano (under 10,000), Micro (10,000–100,000), Macro (100,000–1 million) and Mega (1 million+) tiers, a classification crucial for marketing campaigns and consumer behavior research.
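The tier boundaries quoted from Rahman (2022) can be expressed as a small lookup. The following is a minimal Python sketch; the function name `influencer_tier` is hypothetical, and the treatment of counts that fall exactly on a shared boundary (100,000 and 1 million are assigned to the higher tier) is an assumption, since the quoted ranges overlap at their edges.

```python
def influencer_tier(follower_count: int) -> str:
    """Map a follower count to a tier label using the quoted thresholds.

    Boundary handling is an assumption: a count sitting exactly on a
    shared boundary is placed in the higher tier.
    """
    if follower_count < 0:
        raise ValueError("follower_count must be non-negative")
    if follower_count < 10_000:
        return "Nano"
    if follower_count < 100_000:
        return "Micro"
    if follower_count < 1_000_000:
        return "Macro"
    return "Mega"


# Hypothetical follower counts, one per tier.
for count in (850, 42_000, 310_000, 2_500_000):
    print(count, influencer_tier(count))
```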

Methods for calculating influencer power vary across studies, from tools like Klout (Rao, Spasojevic, Li, & Dsouza, 2015) to custom calculations based on network analysis metrics (Kumar, Choudhury, Rawat, & Jayaraman, 2016; Stieglitz & Dang-Xuan, 2013) or combinations of retweets, followers, mentions and favorites (Essaidi, Zaidouni, & Bellafkih, 2020; Sharma, Agarwal, & Sardana, 2018). Alternative approaches include influence scores using user contacts and followers (Lahuerta-Otero & Cordero-Gutiérrez, 2016). Essaidi et al. (2020) highlighted the follower-following ratio, finding that higher values indicate greater influence power. Due to limitations in third-party software, centrality measures emerged as the most feasible method for this study, particularly eigenvector centrality, which accounts for the influence of interconnected nodes (Gunaratne, Coomes, & Haghbayan, 2019).

Classifying private data and calculating risk

Information disclosure types and associated attacks

Even with private account features, users can inadvertently reveal information through photos or text exposure (Keküllüoğlu et al., 2020; Powale & Bhutkar, 2013). Beigi and Liu (2020) identify two disclosure types: identity disclosure, which maps a dataset instance to an individual, and attribute disclosure, in which an adversary infers information from released data. Two corresponding threats were modeled in this study: the Identity Disclosure Attack, using social network data to map users to known identities, and the Attribute Disclosure Attack, using social network data to infer attributes for users within a network group.

PII and sensitive data types

Data elements like birthdates, real names, addresses, phone numbers, emails and financial details are considered personal data, increasing privacy risks in social interactions (Milne, Pettinico, Hajjat, & Markos, 2017). Milne et al. (2017) introduced the Information Sensitivity Typology, categorizing information into Basic Demographics, Secure Identifiers, Contact Information, Financial Information, Community Interaction and Personal Preferences, aligned with NIST and Homeland Security standards for classifying direct PII or potentially linkable data. Their study identified four risk categories (monetary, social, physical and psychological), with consumers perceiving higher risks for Secure Identifiers and lower risks for Basic Demographics.

Rosado (2023) expanded this work with PII-Codex, a Python package for detecting and assessing PII tokens using Microsoft Presidio (Microsoft, 2018), which employs rule-based and named entity recognition (NER) models to assign categories and severities based on Milne et al.'s (2017) typology. NER models are effective for entity extraction but tend toward false positives, with one study reporting a precision of 0.82 and a recall of 0.81 (Macri et al., 2023).

Purpose of the study

This quantitative study examined how social media influencers affect followers' sharing of PII across platforms with different architectural designs. This impact is relevant for businesses choosing influencers and forming social media policies, as well as within private corporate networks where excessive sharing can expose sensitive information (Turban, Bolloju, & Liang, 2011). The research investigates Reddit's community-based architecture after an initial Twitter/X pilot, offering exploratory insights into how platform design influences influencer-driven disclosure patterns. No IRB approval was necessary since the study used publicly available posts, and no raw data were retained after analysis.

Research questions

RQ1.

How do influencer tier and influence power relate to follower engagement and PII disclosure?

RQ2.

Do these relationships vary between follower hierarchy platforms (Twitter/X) and community-based platforms (Reddit)?

RQ3.

Do engagement and PII disclosure rates exhibit temporal decay patterns in social media conversations?

Hypotheses

Past studies show that influence power, which incorporates follower count, tends to boost involvement in trending activities (Arora, Bansal, Kandpal, Aswani, & Dwivedi, 2019; Lahuerta-Otero & Cordero-Gutiérrez, 2016) and that engagement likelihood increases with influencer prominence (Rahman, 2022). Therefore, influencers with higher influence power and tier likely encourage more followers to share personal information through increased interaction, even when the original author does not share their own details.

H1.

Influence Power positively correlates with follower engagements.

H2.

Influence Power positively correlates with PII disclosure detections.

H3.

Influencer Tier positively correlates with follower engagements.

H4.

Influencer Tier positively correlates with PII disclosure detections.

Engagement has been observed to decline over time due to trend or influencer irrelevance (Zhang, Zhao, Yang, Paris, & Nepal, 2019), and analyzing this temporal pattern aids understanding of how attitudes and behavior change with social media usage (Saha et al., 2019). Given this decline, the same was hypothesized for PII disclosure rates.

H5.

As time elapses, the engagements in a cluster will decrease.

H6.

As time elapses, the PII disclosure detections in a cluster will decrease.

The research model is presented in Figure 1.

Figure 1


A diagram representing the research model of influencer impact on engagements and PII disclosures. The diagram includes three main components: Influence Power, Influencer Tier, and Time Elapsed, which are connected to two outcomes: Engagement and PII Disclosures. Influence Power is linked to Engagement through hypothesis H1 and to PII Disclosures through hypothesis H2. Influencer Tier is linked to Engagement through hypothesis H3 and to PII Disclosures through hypothesis H4. Time Elapsed is linked to Engagement through hypothesis H5 and to PII Disclosures through hypothesis H6. Arrows indicate the directional relationships between these components and outcomes.

Research model of influencer impact on engagements and PII disclosures

Methods

The following subsections describe the steps from data collection to final analysis and hypothesis testing.

Data collection

Throughout the pilot study, X (Twitter) served as the data source. Due to unprecedented changes in X's API offering, the data source was switched to Reddit for the main study. With both sources, some conversations may still have been developing when polling for trending topics; therefore, full conversational threads cannot be guaranteed, as conversations may evolve over time.

In the pilot study, posts from X were collected every 15 minutes on various days during the transition from legacy to new tier limits. The dataset used top conversations with keywords yielding four collections: Zelda, Jedi, Liverpool and Ferrari. When the final collection was retrieved, API limits and terms changed, prompting a pivot to Reddit. In the main study, Reddit posts were gathered at 30-minute intervals over 11 days (September 12–22, 2023), modeled after the X study methodology of Kim, Jang, Kim, and Wan (2018). Collections stayed within the free Reddit API's limit of 100 requests per minute (Reddit API, n.d.; Stoddard, 2021) and occurred from 7 AM to 10 PM Mountain Standard Time to permit human oversight of data collection. Posts and subsequent thread interactions were recorded by examining comments on initial posts and the replies that followed.

Data processing

PII identification and risk calculation

The PII identification and risk calculation allow the evaluation of the risk severity of information disclosed across the network graph. The PII-Codex (2023) was used for PII detections and categorizations in combination with the Not Identified, Identifiable and Identified categories by Schwartz and Solove (2011). The library uses a severity scale of 1, 2 and 3 for the categories of Not-Identified, Identifiable and Identified, respectively, to determine the risk score rs of a token. The isolated set of PII types and their associated categories and severities provided by Milne et al. (2017) and the PII-Codex Risk Values (2023) are presented in Table 1.

Table 1

Data typology for risk assessments with risk Enum coding from PII codex

Type	Cluster membership	NIST category	Homeland security category	PII-codex id	PII-codex risk value
Country of Citizenship	Basic Demographics	Linkable	Linkable	COUNTRY_OF_CITIZENSHIP	2
Zip code +4	Basic Demographics	Linkable	Not Mentioned	ZIPCODE	2
Gender	Basic Demographics	Linkable	Not Mentioned	GENDER	2
Birth Date	Basic Demographics	Linkable	Linkable	DATE	2
Online Screen Name	Personal Preferences	Directly PII	Not Mentioned	SCREEN_NAME	3
Religion	Personal Preferences	Linkable	Not Mentioned	NRP	2
Political Affiliation	Personal Preferences	Linkable	Not Mentioned	NRP	2
Email Address	Personal Preferences	Directly PII	Stand Alone PII	EMAIL_ADDRESS	3
IP Address	Contact Information	Directly PII	Not Mentioned	IP_ADDRESS	3
Phone Number	Contact Information	Directly PII	Stand Alone PII	PHONE_NUMBER	3
Address	Contact Information	Linkable	Not Mentioned	LOCATION	2
Social Network Profile	Community Interaction	Linkable	Not Mentioned	SCREEN_NAME	2
Credit Card Number	Financial Information	Directly PII	Stand Alone PII	CREDIT_CARD_NUMBER	3
Financial Account Numbers	Financial Information	Directly PII	Stand Alone PII	…Various	3
Home Address	Secure Identifiers	Directly PII	Stand Alone PII	LOCATION	3
Location	Secure Identifiers	Linkable	Not Mentioned	LOCATION	2

The risk score mean provided by the library was calculated using the mean severity score of each token detected in a text.

Each post's risk score mean value is then added to the collection's final calculation of the risk score using the mean of means formula:

\mu_{\overline{rs}} = \frac{\overline{rs}_{1} + \overline{rs}_{2} + \cdots + \overline{rs}_{n}}{n}

(1)

The min, median and max calculations of this mean risk score, alongside what types of PII were detected with the input, are provided per node and per cluster within the final dataset for future evaluation.
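The risk score aggregation described above can be illustrated with a short Python sketch. This is not the PII-Codex implementation: the function names and the example posts are hypothetical, the severity lookup copies only rows of Table 1 whose PII-Codex id maps to a single risk value, and posts without any detected token are left out because their handling is not specified here.

```python
from statistics import mean, median

# Severity values copied from unambiguous rows of Table 1.
RISK_VALUE = {
    "EMAIL_ADDRESS": 3,
    "PHONE_NUMBER": 3,
    "DATE": 2,
    "NRP": 2,
    "ZIPCODE": 2,
}


def post_risk_mean(token_types):
    """Mean severity of the PII tokens detected in a single post."""
    return mean(RISK_VALUE[t] for t in token_types)


def collection_risk_summary(posts):
    """Mean of per-post means (Equation 1) plus min, median and max."""
    post_means = [post_risk_mean(tokens) for tokens in posts]
    return {
        "mean": mean(post_means),
        "min": min(post_means),
        "median": median(post_means),
        "max": max(post_means),
    }


# Hypothetical detections for three posts.
posts = [
    ["EMAIL_ADDRESS", "DATE"],                    # mean 2.5
    ["NRP"],                                      # mean 2
    ["PHONE_NUMBER", "EMAIL_ADDRESS", "ZIPCODE"], # mean 8/3
]
print(collection_risk_summary(posts))
```

Averaging the per-post means, rather than pooling every token, gives each post equal weight regardless of how many tokens it contains.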

Metrics collected

Influencer power, influencer tier, disclosure detections and the associated cluster details were collected per node, as shown in Table 2.

Table 2

Node metrics collected

Column name	Type	Description
Node ID	UUID	Unique identifier for post (replaces original platform identifier)
User ID	UUID	Unique identifier assigned for user (replaces original platform identifier)
Cluster Name	Str	Composite ID for subgraph using collection name and subgraph index
Influence Power	Float	Eigenvector centrality
Influencer Tier	Str	Categorical label calculated by follower count
Collection Name	Str	Trend collection assigned based on search query
Hashtags	Set(str)	The set of hashtags included in the node
PII Disclosed	Bool	Whether or not PII was disclosed
PII Detected	Set(str)	The detected token types in post
PII Risk Score	Float	The PII score for all tokens in a post
Is Comment	Bool	Whether or not the post is a comment or reply
Is Text Starter	Bool	Whether or not the post has text content
Community	Str	The group, community, channel, etc. associated with the post
Timestamp	Timestamp	Creation timestamp (provided by social media API)
Time Elapsed	Int	Time elapsed (seconds) from original influencer's post

Metrics summarizing the cluster details were collected, including influencer summaries (for all influencers within the cluster), risk score statistics, disclosure and engagement counts and ratios, the periods of each cluster and the average time elapsed between responses within. These metrics and details are enumerated with their respective types in Table 3.

Table 3

Cluster metrics and summarizations

Column name	Type	Description
Cluster Name	Str	Composite ID for subgraph using collection name and subgraph index
Influencer Tiers Frequencies	List[dict]	Frequency of influencer tiers of all users in the cluster
Top Influence Power Score	Float	Eigenvector centrality of top influencer
Top Influencer Tier	Str	Size tier of top influencer
Collection Name	Str	Trend collection assigned based on search query
Hashtags	Set(str)	The set of hashtags included in the cluster
PII Detection Frequencies	List[dict]	The detected token types in post with frequencies
Node Count	Int	Count of all nodes in the influencer cluster
Node Disclosures	Int	Count of all nodes with mean_risk_score >1*
Disclosure Ratio	Float	Sum of nodes with confirmed disclosed PII divided by cluster size
Mean Risk Score	Float	The mean risk score for an entire network cluster
Median Risk Score	Float	The median risk score for an entire network cluster
Min Risk Score	Float	The min risk score for an entire network cluster
Max Risk Score	Float	The max risk score for an entire network cluster
Time Span	Float	Total Time Elapsed
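The Node Disclosures and Disclosure Ratio rows of Table 3 can be read as the following calculation. This is a minimal sketch with a hypothetical function name and made-up scores; it assumes that a node without detected PII carries a risk score of 1, in line with the Not-Identified severity level.

```python
def cluster_disclosure_summary(node_risk_scores):
    """Count disclosing nodes (risk score > 1) and their share of the cluster."""
    node_count = len(node_risk_scores)
    node_disclosures = sum(1 for score in node_risk_scores if score > 1)
    ratio = node_disclosures / node_count if node_count else 0.0
    return {
        "node_count": node_count,
        "node_disclosures": node_disclosures,
        "disclosure_ratio": ratio,
    }


# Hypothetical per-node mean risk scores for one cluster.
scores = [1, 1, 2.5, 3, 1]
print(cluster_disclosure_summary(scores))
```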

Constructing social graphs

The submission_name and parent_id attributes from Reddit posts drove the cluster construction in the study. Since identifiers like the conversation_id, submission_name, id and parent_id attribute types can be used to track down pieces of a conversation thread, an internal unique identifier labeled post_uuid was used instead to track relationships between the nodes, as shown in the pseudocode in Algorithm 1.

ALGORITHM 1.

Reddit Social Graph and Summary Construction Overview.

Function build_social_graph_summaries(Object posts)
    G = nx.Graph()
    forall child c in posts do
        if c's post_id is not in G then
            G.add_node(c)
        end
        forall replies r in c.comments do
            if r's post_id is not in G then
                G.add_node(r)
                G.add_edge(c, r)
            end
        end
    end
    influence_ratings = calculate_influence_ranks(G, posts)
    return build_graph_summaries(G, posts, influence_ratings)
end

Influencer identification and power score calculation

The two most referenced methods of ranking influencers are centrality values from graph analysis and custom score calculations using favorites, mentions and retweets. This study employs eigenvector centrality as the Influence Power score, calculated using the NetworkX library (Hagberg, Schult, & Swart, 2008). For a cluster graph G = (V, E), NetworkX calculates eigenvector centrality for every node n. Since each trend contains multiple clusters, centrality values and top-influencing node selection are evaluated per sub-graph, with each social graph independently determining its top-influencing node.
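To make the per-sub-graph evaluation concrete, the sketch below computes eigenvector centrality with a plain power iteration and selects the top node in each connected component. It is a standard-library stand-in for illustration only; the study itself used NetworkX, and every function name and the example edge list here are hypothetical. The iteration is run on the adjacency matrix shifted by the identity, which avoids the oscillation that unshifted power iteration shows on tree-shaped (bipartite) reply graphs.

```python
from math import sqrt


def build_adjacency(edges):
    """Undirected adjacency sets from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def eigenvector_centrality(adj, max_iter=1000, tol=1e-10):
    """Power iteration on (A + I) with L2 normalisation."""
    x = {n: 1.0 / len(adj) for n in adj}
    for _ in range(max_iter):
        nxt = dict(x)  # the "+ I" term: start from the previous vector
        for n, neighbours in adj.items():
            for m in neighbours:
                nxt[m] += x[n]
        norm = sqrt(sum(v * v for v in nxt.values())) or 1.0
        nxt = {n: v / norm for n, v in nxt.items()}
        if sum(abs(nxt[n] - x[n]) for n in adj) < tol:
            return nxt
        x = nxt
    raise RuntimeError("power iteration did not converge")


def connected_components(adj):
    """Node sets of each connected sub-graph (depth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components


def top_influencer_per_cluster(adj):
    """(top node, score) for every connected sub-graph."""
    tops = []
    for comp in connected_components(adj):
        sub = {n: adj[n] & comp for n in comp}
        scores = eigenvector_centrality(sub)
        top = max(scores, key=scores.get)
        tops.append((top, scores[top]))
    return tops


# Two hypothetical threads: a post with three comments, and a short reply chain.
edges = [("p1", "c1"), ("p1", "c2"), ("p1", "c3"), ("p2", "c4"), ("c4", "c5")]
print(top_influencer_per_cluster(build_adjacency(edges)))
```

In the second thread the highest score goes to the middle comment rather than the original post, which shows why a centrality-based score can differ from a ranking based on authorship or follower count.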

Calculating engagements

Engagements in this study included shares and replies on X (Twitter) and comments on Reddit, which enhance post visibility based on platform algorithms. Only response-type engagements may reveal new PII disclosures, as favorites/likes on X or upvotes/downvotes on Reddit do not allow for new text sharing. Two metrics were used: Response Type Engagements, which count only responses, and Total Engagements, which include all engagement types. Limitations in X's API may restrict access to full conversation archives, whereas Reddit allows polling for submissions, though issues may arise from post or user deletions and late participants in discussions.

Data groupings and isolating variable effect

Time series analyses were performed on engagements and PII disclosures per cluster to test whether both trend downward over time (H5, H6). Cluster-wide metrics included total disclosures, total engagements, rates of both, mean risk score, total time span, primary influence power score and influencer tier. These metrics informed descriptive statistics and some analyses, but time series testing required individual node-level data points extracted per cluster.

Data were resampled into 5-, 10- or 15-minute time bins based on data density: sparser conversations required larger bins (15 minutes) to ensure sufficient observations for stationarity testing, while denser conversations used smaller bins (5 minutes) to capture finer temporal dynamics without over-smoothing. Stationarity was assessed using the Augmented Dickey-Fuller test at α = 0.05, with differencing applied as needed. The differencing order, combined with autocorrelation and partial autocorrelation results, informed the ARIMA model parameters, and outcomes were plotted using Plotly.
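The binning and differencing steps can be sketched in plain Python as follows. The function names are hypothetical, and the density cut-offs in `choose_bin_minutes` are placeholder values chosen for illustration, since the cut-offs used in the study are not reported. The Augmented Dickey-Fuller test and the ARIMA fit are not reproduced here.

```python
from collections import Counter


def choose_bin_minutes(n_observations, dense=200, moderate=60):
    """Pick a 5-, 10- or 15-minute bin width from the observation count.

    The `dense` and `moderate` cut-offs are hypothetical placeholders.
    """
    if n_observations >= dense:
        return 5
    if n_observations >= moderate:
        return 10
    return 15


def bin_counts(elapsed_seconds, bin_minutes):
    """Count observations per fixed-width bin, keeping empty bins as zeros."""
    width = bin_minutes * 60
    counts = Counter(int(t // width) for t in elapsed_seconds)
    if not counts:
        return []
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def difference(series):
    """First-order differencing, applied when a series is non-stationary."""
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical Time Elapsed values (seconds since the original post).
elapsed = [0, 30, 299, 300, 1200]
binned = bin_counts(elapsed, bin_minutes=5)
print(binned)              # [3, 1, 0, 0, 1]
print(difference(binned))  # [-2, -1, 0, 1]
```

Keeping empty bins as explicit zeros matters for the later trend tests, because a gap in activity is itself part of the decay pattern.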

To assess robustness, we incorporated CUSUM (Cumulative Sum Control Chart), Change Point Detection (CPD) and the Mann–Kendall test as comparative baselines. While ARIMA is parametric and sensitive to local fluctuations, the Mann–Kendall test provides a model-free evaluation of monotonic trends. CUSUM and CPD complement both approaches by detecting gradual shifts and distinct breakpoints in time-binned social media data that traditional methods may miss.
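As an illustration of the model-free baseline, the following sketch implements the standard two-sided Mann–Kendall test with tie correction and a normal approximation. It is a textbook formulation written for this example, not the code used in the study, and the input series is hypothetical.

```python
from collections import Counter
from math import erfc, sqrt


def mann_kendall(series):
    """Return (S, z, p) for the two-sided Mann-Kendall trend test."""
    n = len(series)
    if n < 3:
        raise ValueError("at least three observations are required")

    # S sums the signs of all pairwise later-minus-earlier differences.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)

    # Variance of S, corrected for groups of tied values.
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(series).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18

    # Continuity-corrected z statistic.
    if var_s == 0 or s == 0:
        z = 0.0
    elif s > 0:
        z = (s - 1) / sqrt(var_s)
    else:
        z = (s + 1) / sqrt(var_s)

    p = erfc(abs(z) / sqrt(2))  # two-sided p-value under the normal approximation
    return s, z, p


# Hypothetical binned engagement counts that fall at every step.
s, z, p = mann_kendall([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
print(s, round(z, 3), p < 0.05)
```

Because the statistic depends only on the ordering of pairs, a series that spikes and then decays irregularly produces many offsetting signs, which is one reason a monotonic-trend test can stay silent on bursty conversation data.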

Analyzing within and between clusters

After data collection and processing, each daily collection contained multiple conversation groups, with each collection c having k clusters and every cluster containing n nodes (observations). Each trend collection represents a sample from the broader platform population, with individual groupings within. Figure 2 presents the data grouping breakdown.

Figure 2

A diagram of data groupings using clusters within trend collections.


The diagram illustrates the process of data groupings using clusters within trend collections. It starts with social media data as the population, from which samples are taken to form collections. These collections are further divided into clusters. The diagram shows two main collections, each split into two clusters. Each cluster then leads to a series of outcomes, representing observations over time. The relationships between clusters and within clusters are depicted, showing the flow of data from collections to outcomes.

Data groupings using clusters within trend collections

With clustered data, correlation within the hierarchical structure derived from aggregated user-level interactions may violate independence assumptions (Nielsen, Smink, & Fox, 2021). Accounting for this structure with random effects requires sufficient sample sizes, both in cluster count and in nodes per cluster. Between-cluster analysis used Spearman's Rank correlation, given non-normal distributions, confirmed via the Anderson-Darling test. Both tests used α = 0.05 and were conducted using SciPy.

Hypothesis testing

Spearman's correlation tests assessed the relationship between the primary influencer's power index and the dependent variables (Engagement and PII Disclosures), given non-normal distributions. Hypothesis pairs H1/H2, H3/H4 and H5/H6 were tested at the collection level using α = 0.05. Time series analysis was performed per cluster to visualize trends in engagements and disclosures.
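Spearman's coefficient is the Pearson correlation of the ranked values, which the sketch below spells out using only the standard library. The study ran its tests in SciPy; this version is an illustrative stand-in that returns the coefficient without a p-value, and the function names and sample values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cluster-level values: top influence power vs. engagement count.
influence_power = [0.12, 0.35, 0.48, 0.61, 0.90]
engagements = [40, 25, 310, 180, 950]
print(round(spearman_rho(influence_power, engagements), 2))  # 0.8
```

Working on ranks makes the coefficient insensitive to the heavy right skew typical of engagement counts, which is the reason a rank-based test suits non-normal data.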

Results

Data collection and composition

The pilot analysis used X (Twitter) as the data source, pulling 10,259 posts, many of which were one-off posts unrelated to conversational exchanges. Following X's API tier restructuring, which moved key endpoints to Pro and Enterprise tiers, the data source pivoted to Reddit. The main study collected 122,904 posts and subscriber/follower data for 93,982 users across 285 conversation clusters from Reddit (September 12–22, 2023). Table 4 presents the full dataset composition.

Table 4

Main study: Reddit dataset composition

Collection	Clusters	Posts	Users	String tokens	PII tokens	% nodes disclosed
2023-09-12	15	7,538	5,664	178,054	3,630	30.8835
2023-09-13	43	9,102	6,934	241,482	3,851	27.3237
2023-09-14	26	12,324	9,266	329,866	6,014	29.3817
2023-09-15	26	11,930	8,785	294,454	6,568	31.8022
2023-09-16	30	14,816	12,260	282,021	6,355	27.6795
2023-09-17	20	12,237	8,899	288,046	4,998	26.2074
2023-09-18	24	10,752	8,226	265,549	6,550	34.0402
2023-09-19	16	12,529	10,840	244,281	3,585	29.4668
2023-09-20	46	11,619	8,485	339,404	6,971	33.6512
2023-09-21	10	6,852	4,449	187,557	5,497	33.4501
2023-09-22	29	13,205	10,174	321,730	6,689	29.5798

Across daily collections, Person types (strings identified as potential individual names) constituted the vast majority of PII detections, followed by Datetime, NRP (Nationality, Religious or Political mentions), Location and one Medical License identification determined to be a false positive.

Main study results

The following sections present the individual analyses of the main study and their results.

Time series analysis

There were 183 total clusters when using the 30-node minimum threshold for clusters. Due to the volume of graphs, summaries are provided for each collection in lieu of every graph associated with the clusters. While some collections held a small set of clusters, others held significantly more, as illustrated by the September 15th collection in Figure 3.

Figure 3

Two line graphs showing engagements and disclosures over time on Reddit.

Two line graphs display engagements and disclosures across Reddit conversations on September 15th. The top graph shows engagements with multiple colored lines representing different data sets, peaking around 10:00. The bottom graph shows disclosures with similar colored lines, also peaking around 10:00. Both graphs have time on the x-axis in 15-minute intervals and counts on the y-axis. A dashed black line represents the average for each graph. The highest average engagements and disclosures occur around 10:00. All values are approximated.

Engagements and disclosures across Reddit conversations on September 15th

Of the 183 clusters analyzed, ARIMA identified descending trends in 146 (79.78%) for engagements, with 73 showing statistically significant AR and MA coefficients. For disclosure-active clusters, 121 showed descending trends, with 68 statistically significant. CUSUM Drift detected declining trends in 73 clusters for engagements and 68 for disclosures; Change Point Decline detected 182 and 153, respectively; Mann-Kendall detected none. When requiring agreement among at least three methods, 61 clusters showed declining engagement trends, and 50 showed declining disclosure trends.
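The agreement rule described above can be sketched as a simple vote count per cluster. The following is a minimal illustration rather than the study's pipeline: the method keys, cluster names and boolean flags are hypothetical placeholders. Because Mann-Kendall flagged no declines, a three-method threshold in practice requires ARIMA, CUSUM Drift and Change Point Decline to all agree.

```python
from typing import Dict

# Hypothetical method keys; the study's internal naming is not given in the text.
METHODS = ("arima", "cusum_drift", "change_point", "mann_kendall")


def agreement_score(flags: Dict[str, bool]) -> int:
    """Count how many methods flagged a declining trend for one cluster."""
    return sum(bool(flags.get(method, False)) for method in METHODS)


def has_consensus(flags: Dict[str, bool], minimum: int = 3) -> bool:
    """True when at least `minimum` methods agree on a decline."""
    return agreement_score(flags) >= minimum


# Illustrative flags only; these are not values from the study dataset.
clusters = {
    "cluster_a": {"arima": True, "cusum_drift": True, "change_point": True, "mann_kendall": False},
    "cluster_b": {"arima": True, "cusum_drift": False, "change_point": True, "mann_kendall": False},
    "cluster_c": {"arima": False, "cusum_drift": False, "change_point": True, "mann_kendall": False},
}

scores = {name: agreement_score(flags) for name, flags in clusters.items()}
consensus = [name for name, flags in clusters.items() if has_consensus(flags)]
average_agreement = sum(scores.values()) / len(scores)

print(scores)             # per-cluster agreement counts
print(consensus)          # clusters meeting the three-method threshold
print(average_agreement)  # mean agreement across the collection
```

Averaging the per-cluster scores within a collection gives a collection-level agreement figure of the kind summarized per collection date.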

For each collection (Figure 4), consensus across decline detection methods was assessed, with maximum average agreement just under 3. As shown in Figure 5, Mann–Kendall contributed zero decline detections across all collections for both engagement and disclosure data, likely due to the noisy, non-monotonic nature of time-binned social media activity associated with cross-sectional sampling. Change Point Decline and ARIMA consistently identified declines, followed by CUSUM Drift.
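One way to see why a monotonic trend test can stay silent on bursty activity is to compute the Mann-Kendall statistic by hand. The sketch below is a plain-Python version of the standard test without a tie correction; the two short series are hypothetical and were chosen only to contrast a rise-and-fall burst with a steady decline.

```python
import math
from typing import Sequence, Tuple


def mann_kendall(series: Sequence[float]) -> Tuple[int, float, float]:
    """Return (S, Z, two-sided p) for the Mann-Kendall trend test.

    No tie correction is applied to the variance, so this is a simplified
    sketch rather than a full implementation.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of each pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return s, z, p


# Hypothetical 15-minute bins: a burst that rises and then falls back.
burst = [1, 2, 3, 4, 3, 2, 1]
# Hypothetical bins that only ever decrease.
decline = [7, 6, 5, 4, 3, 2, 1]

print(mann_kendall(burst))
print(mann_kendall(decline))
```

In the burst series the rising and falling pairs cancel, so S is zero and no trend is reported even though the second half is a clear decline. Only the strictly decreasing series yields a significant result, which is consistent with the explanation offered above for the absence of Mann-Kendall detections.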

Figure 4

A bar graph comparing average agreement scores for engagement and disclosure over different collection dates.

A bar graph compares average agreement scores for engagement and disclosure over different collection dates. The horizontal axis represents the collection dates from 2023-09-12 to 2023-09-22, and the vertical axis represents the average agreement score ranging from 0 to 2.5. The graph features two sets of bars for each date: one in purple representing engagement average agreement and one in green representing disclosure average agreement. Notable trends include a peak in disclosure average agreement on 2023-09-16 and relatively consistent engagement average agreement scores across most dates. The highest engagement average agreement is observed on 2023-09-20, while the lowest disclosure average agreement is on 2023-09-18.

Agreement on declining trend detections across detection methods

Figure 5

A radar chart showing aggregate method sensitivity across conversations with two data series labeled Engagements and Disclosures.

A radar chart with four axes labeled CUSUM Drift, Mann-Kendall, ARIMA, and Change Point Decline. The chart features two data series: Engagements in purple and Disclosures in green. The axes are marked with values ranging from 0 to 200. The Engagements series shows higher values on the Change Point Decline axis, while the Disclosures series shows higher values on the Mann-Kendall axis. Both series have lower values on the CUSUM Drift and ARIMA axes.

Aggregate method sensitivity across conversations

Correlation analysis

Reddit data clusters were more comprehensive than Twitter's, with the largest containing 7,470 nodes (September 16th) and the second largest 6,672 (September 19th). Of the 285 clusters, 210 contained more than one node, 183 had at least 30 nodes, 171 had at least 100 and only 30 (10.53%) exceeded 1,000. A baseline of 30 nodes was established to address sample size issues after time-of-post adjustments. Table 5 summarizes correlation results across varying size thresholds.

Table 5

Main study correlation results using dataset segmented by cluster size

	Run 1 (N = 183)	Run 2 (N = 182)	Run 3 (N = 30)
Influencer Tier & Engagements	0.1952**	0.1763*	−0.3025
Influencer Tier & Disclosures	0.1425	0.1223	−0.0837
Influence Power & Engagements	0.5728***	0.5658***	0.4972**
Influence Power & Disclosures	0.4646***	0.4559***	0.0527
Time Elapsed & Engagements	0.1608*	0.1677*	−0.1880
Time Elapsed & Disclosures	0.1645*	0.1711*	−0.1782

Note(s): Asterisks denote significance levels: *p < 0.05, **p < 0.01 and ***p < 0.001

Anderson-Darling tests confirmed non-normal distributions, necessitating Spearman's Rank Correlation across three robustness analyses. Run 1 (n = 183, 30-node minimum) ensured adequate sample size for time series resampling. Run 2 (n = 182, 50-node threshold) increased statistical power while maintaining breadth, showing consistently positive coefficients that were significant (p < 0.05) for every pairing except influencer tier and disclosures. Run 3 (n = 30, 1,000+ nodes) isolated the largest conversations to test whether fuller thread capture strengthened effects. Influence power maintained a significant correlation with engagement (p < 0.01), but its correlation with disclosures was no longer significant, and the influencer tier and time elapsed pairings showed negative, non-significant coefficients, suggesting measurement challenges at this threshold.
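For readers unfamiliar with rank correlation, the following sketch shows how Spearman's coefficient can be computed from average ranks, which is why it tolerates the heavy-tailed, non-normal counts typical of engagement data. It is a dependency-free illustration and not the study's code; the function names and the five sample values are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


# Hypothetical per-cluster values; the extreme engagement count does not
# distort the result because only the ordering matters.
influence_power = [0.02, 0.10, 0.05, 0.40, 0.25]
engagements = [3, 40, 12, 900, 95]

rho = spearman_rho(influence_power, engagements)
print(rho)
```

Significance testing is omitted here; in practice a statistics library would typically supply the p-values reported alongside the coefficients.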

Discussion

Platform architecture and model transferability

The unplanned platform transition from Twitter/X to Reddit due to API access changes created non-equivalent comparison groups with different sampling methods and user populations. However, this pivot revealed how platform architecture may shape influencer disclosure relationships. Twitter/X's follower hierarchy model showed clear correlations between influencer tier and both engagement (r = 0.26, p < 0.01) and disclosures (r = 0.20, p < 0.01), while Reddit's community-first architecture, where users follow subreddits rather than individuals, showed weaker tier correlations.

This architectural difference created measurement challenges: Reddit's “active user count” (15-min interaction window) proved unstable for tier classification compared to Twitter/X's persistent follower counts. Despite this, influence power (eigenvector centrality) maintained significant positive correlations with both engagement and disclosures across platforms, suggesting network position matters regardless of architecture. These exploratory observations suggest community-based platforms may diffuse influencer power differently than follower hierarchy models, but designed studies with equivalent sampling are needed to confirm this. The Mann–Kendall test contributed no decline detections, consistent with research showing monotonic trend tests are poorly suited to bursty social media patterns (Lehmann, Gonçalves, Ramasco, & Cattuto, 2012; Mathioudakis & Koudas, 2010), supporting the need for adaptive methods like CPD and ARIMA in social media analysis. Reddit's fuller conversation threads also made it more straightforward to observe engagement fluctuations and eventual decline, with more complete and numerous clusters facilitating the time series analyses. Using active user counts for Reddit tier classification introduces an additional reliability concern beyond architectural non-equivalence: these counts capture a 15-min interaction window and fluctuate substantially with time of day and algorithmic promotion, making tier assignments transient rather than stable proxies for audience size; subscriber counts or longitudinally averaged activity metrics would provide more consistent classifications in future studies.

Hypothesis testing and results

The dataset showed diverse correlations, with filtering revealing predominantly positive relationships among variable pairings. Some results may have limited reliability due to the mixed-media nature of threads (text, images, videos) and limited understanding of community context.

On Reddit, community groups rather than individual users are emphasized. Individual subscriber data were limited, so the active member count (accounts interacting within a 15-min window) was used instead. However, this count fluctuates over time, making influencer tier classification transient and less reliable than Twitter/X's persistent follower counts.

In the Twitter/X corpus, significant correlations (p < 0.01) were found between influencer tier and both engagements (r = 0.2569) and disclosures (r = 0.2001). Reddit's correlations were positive (r = 0.3887, r = 0.3133) but both p-values exceeded 0.05. Given these measurement challenges, influencer tier results (H3, H4) should be interpreted cautiously, and the cross-platform comparison suggests tier effects may be platform-dependent, with stronger effects in follower hierarchy architectures. Researchers studying influencer tiers on community-based platforms may not benefit from the tier assignment method used here.

For H1 and H2, influence power showed positive correlations with engagements and PII disclosures in the Reddit dataset, though variations may exist between image/video-initiated and text-initiated clusters. In the Twitter/X dataset, this was confirmed only after filtering smaller clusters, as eigenvector centrality values were significantly affected by cluster size.
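Eigenvector centrality, used here as the measure of influence power, can be approximated by power iteration. The sketch below is a simplified stand-in and not the study's implementation: it assumes an undirected reply graph, iterates on the adjacency matrix plus the identity so that star-shaped graphs converge, and uses hypothetical node names.

```python
import math
from typing import Dict, List


def eigenvector_centrality(
    adjacency: Dict[str, List[str]],
    iterations: int = 200,
    tol: float = 1e-10,
) -> Dict[str, float]:
    """Approximate eigenvector centrality by power iteration on (A + I).

    `adjacency` is assumed to describe an undirected graph, so every edge
    must be listed in both directions.
    """
    nodes = list(adjacency)
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Each node keeps its own score and adds the scores of its neighbours.
        new = {
            node: score[node] + sum(score[nbr] for nbr in adjacency[node])
            for node in nodes
        }
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - score[node]) for node in nodes) < tol:
            return new
        score = new
    return score


# Hypothetical cluster: one original poster ("op") with four direct repliers.
star = {
    "op": ["r1", "r2", "r3", "r4"],
    "r1": ["op"],
    "r2": ["op"],
    "r3": ["op"],
    "r4": ["op"],
}

centrality = eigenvector_centrality(star)
print(centrality)
```

In this toy star the hub scores twice as high as each replier. Because the scores are normalized over all nodes in the cluster, the values assigned to peripheral nodes shrink as more repliers are added, which is one possible reading of the sensitivity to cluster size noted above.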

For H5 and H6, time series analysis indicated post-peak declines for both engagements and disclosures, with 79.78% of Reddit clusters showing engagement declines and 66.12% showing disclosure declines. Among declining clusters, roughly two in five (61 of 146 for engagements and 50 of 121 for disclosures) achieved consensus across at least three trend detection methods. These hypotheses are supported, though caution is advised given that many conversations were still in progress when data collection ended.

Across all analyses, Twitter/X showed significant relationships for influence power with engagements and disclosures after filtering smaller clusters, while Reddit showed predominantly positive correlations at p < 0.05 or p < 0.01 across variable pairings. ARIMA analysis confirmed descending trends in most Reddit clusters and approximately half of the pilot dataset. Table 6 summarizes hypotheses and associated results per phase.

Table 6

Hypothesis testing and results summary

Hypothesis	Result
H1: Influence Power positively correlates with follower engagements	Supported
H2: Influence Power positively correlates with PII disclosure detections	Supported
H3: Influencer Tier positively correlates with follower engagements	Supported*
H4: Influencer Tier positively correlates with PII disclosure detections	Supported*
H5: As time elapses, the engagements in a cluster will decrease	Supported
H6: As time elapses, the PII disclosure detections in a cluster will decrease	Supported

Note(s): Asterisks (*) denote reliability concerns arising from platform architectural differences; caution is advised

Limitations

This study faced several limitations. First, the study initially focused exclusively on a single social media platform, X (Twitter), which underwent several changes during data collection, leading to issues such as the unavailability of full conversation histories and reduced post availability due to updated API limits. The analysis then expanded to Reddit's community-based architecture, providing exploratory cross-platform insights while introducing measurement challenges for influencer tier classification.

Additionally, the study relied solely on eigenvector centrality to measure influence, thereby overlooking factors such as community involvement and cross-platform activity. Manual validation of automated PII detection was not feasible due to the dataset size. As with any NER-based approach, automated detection introduces false positives and false negatives (Macri et al., 2023). Published evaluations of Microsoft Presidio report variable performance depending on configuration and domain, with precision estimates ranging from approximately 0.51 in baseline clinical evaluations to about 0.89 in customized implementations (Alrazihi, Biswas, & George, 2025; Kotevski et al., 2022).

To provide conservative uncertainty bounds using parameters reported within a single empirical evaluation, precision and recall estimates from the baseline study (precision = 0.51, recall = 0.74) were applied to the observed disclosure rate of 29.41%. This yields a precision-adjusted lower estimate of approximately 15.00%, representing confirmed true detections, and a recall-corrected prevalence estimate of approximately 20.27% that accounts for likely missed cases. Although the observed rate may overestimate absolute prevalence under conservative assumptions, the dominant PII categories identified in this study, including person names, datetime expressions and professional identifiers, are commonly reported in NER research to achieve comparatively higher precision due to stronger contextual cues (Macri et al., 2023). Since the same detection pipeline was consistently used throughout the dataset, it would be expected to introduce predominantly non-differential measurement error, which may reduce effect sizes but is unlikely to invalidate the correlational findings.
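The adjustment described above is simple arithmetic and can be reproduced directly. The sketch below uses only the figures stated in the text (observed rate 29.41%, precision 0.51, recall 0.74); the function name is a hypothetical label for illustration.

```python
from typing import Tuple


def adjusted_prevalence(
    observed_rate: float, precision: float, recall: float
) -> Tuple[float, float]:
    """Return (precision-adjusted lower estimate, recall-corrected estimate).

    observed_rate is the share of posts flagged by the detector, in percent.
    """
    # Share of flagged posts expected to be true detections.
    lower = observed_rate * precision
    # Scale up for true cases the detector is expected to have missed.
    corrected = lower / recall
    return lower, corrected


lower, corrected = adjusted_prevalence(observed_rate=29.41, precision=0.51, recall=0.74)
print(f"{lower:.2f}% to {corrected:.2f}%")  # prints 15.00% to 20.27%
```

Both bounds inherit the assumption that precision and recall measured on clinical text transfer to social media posts, so they are best read as a sensitivity range rather than a point estimate.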

Data collection was intentionally limited to 7 AM through 10 PM Mountain Standard Time to allow for human oversight during data pulls, excluding overnight trending submissions. Reddit's “hot submissions” feature was selected as the closest analog to Twitter/X's trending conversations to maintain methodological consistency across platforms despite architectural differences. A submission's hot score is calculated based on how many upvotes versus downvotes it has received (Kuutila, Rantala, Li, Hosio, & Mäntylä, 2024). This approach resulted in some clusters being captured before peak activity, potentially obscuring full trend patterns. The study also did not account for varying content types in trending posts or the impact of post length differences, as Reddit allows much longer posts compared to X (Twitter).

Finally, the study was completed entirely without obtaining input from platform users as to why they were interacting with a given post or user. An individual's motivations to engage will differ throughout their time on the platform. Without such input to combine with the observed behavior, the study could not account for individual differences, motivators and psychological factors.

Implications for theory and practice

This study shows a connection between influencer reach, measured by eigenvector centrality, and the sharing of PII on social media. As reach grows, so does the chance for engagement and information disclosure, with an average of 29.41% of 122,904 posts containing PII (adjusted estimate: 15.00–20.27% after accounting for automated detection errors) in higher-interaction threads. Though viral discussions can start from non-influencers and platform algorithms may also play a role, influencers and marketers should be aware that some exchanges might put their followers' privacy at risk. Importantly, it was not determined whether conversation starters actively encouraged sharing or whether sensitive data were exchanged through media. These disclosures were made publicly by users; while regulations like GDPR and CCPA control how platforms handle data, user-initiated public disclosures create unique privacy challenges that go beyond current laws.

The observed behavioral shifts can be understood through Social Learning and Social Identity theories. Social Learning Theory suggests influencers can be targeted to model desirable privacy behaviors, while also accounting for undesirable disclosure patterns. Social Identity Theory suggests that group identity and shared norms-based campaigns can drive privacy-aware behaviors as part of a community's core values. Platform algorithms that prioritize high-engagement content may inadvertently amplify PII-containing threads, compounding individual privacy risks through increased visibility. Users must adopt privacy-conscious practices to reduce exposure to fraud or discrimination, and targeted campaigns (influencer-led or otherwise) can address these challenges when traditional educational efforts fall short.

A between-subjects experiment could assign participants to simulated social media threads where an influencer either discloses or withholds PII (independent variable 1: influencer disclosure condition) across hierarchical and community-based platform designs (independent variable 2: platform architecture). This design directly operationalizes the two key variables that emerged from the observational findings, providing the controlled, comparable sampling test needed to establish causal inference where the unplanned platform transition could not. Participant disclosure behavior, measured by the number and sensitivity of PII items shared in a mock posting task, would serve as the primary dependent variable, with a secondary self-report measure of perceived disclosure norms to capture attitudinal shifts. Based on the present findings, we predict that influencer disclosure increases follower PII sharing and that this effect is stronger under hierarchical designs, consistent with the observed attenuation on Reddit relative to Twitter/X.

Conclusion

This study examined how social media influencers impact followers' PII disclosures, revealing that influence power (eigenvector centrality) significantly and positively correlates with both engagement and disclosure rates across platforms (RQ1). Influencers with greater network reach drove higher engagement and disclosure rates, with 29.41% of posts in the Reddit dataset containing PII (adjusted estimate: 15.00% to 20.27% after accounting for automated detection error). Temporal analysis showed declining trends in both engagement (79.78% of threads) and disclosures (66.12% of threads), supporting H5 and H6.

The unplanned platform transition from Twitter/X to Reddit provided exploratory insights into how platform architecture moderates influencer effects (RQ2). Twitter/X's follower hierarchy model showed stronger correlations between influencer tier and both engagement (r = 0.26) and disclosures (r = 0.20; both p < 0.01) compared to Reddit's community-based structure, suggesting community-driven platforms may inherently diffuse influencer power. However, influence power maintained significant positive correlations across both architectures, indicating network position matters regardless of platform design. Temporal patterns were examined using a triangulated approach combining ARIMA, CUSUM Drift and CPD, each capturing distinct aspects of nonlinear, bursty engagement data. Convergent findings across methods strengthen confidence in the observed trends and offer a replicable framework for short-window social media analysis. The Reddit tier findings (H3, H4) warrant caution, as active user counts used for tier classification fluctuate within short windows, producing transient assignments less stable than Twitter/X's persistent follower counts. Likewise, the 29.41% PII disclosure rate reflects automated NER output, and the precision-adjusted estimate of 15.00% to 20.27% is the more defensible figure for downstream policy or design recommendations.

Future research should use designed platform comparisons with equivalent sampling to test whether community-based architectures offer privacy-protective effects. Longitudinal studies with complete conversation threads would clarify patterns of temporal decay, and enhanced PII detection using machine learning classifiers beyond NER could lower false positive rates while maintaining PII-Codex's categorization framework. Including user motivations through survey data would complement behavioral observations, addressing why users disclose PII despite privacy concerns. Standardizing influencer tier classification on community-based platforms through subscriber counts or longitudinally averaged activity metrics, rather than instantaneous active user counts, would also resolve the measurement instability identified here and enable more reliable cross-platform comparisons.

Ethics statement

This research used publicly accessible social media posts from Twitter/X and Reddit. Raw post content was not retained following analysis. All datasets (Rosado, 2024) were sanitized by replacing original platform identifiers such as user IDs, post IDs, usernames and similar identifiers with unique internal identifiers to prevent re-identification. No IRB approval was required as the study involved no direct interaction with human participants.

References

Alrazihi, L. A., Biswas, S., & George, J. (2025). Evaluating the accuracy of automated and semi-automated anonymization tools for unstructured health records. Surgical Neurology International, 16, 313. doi: https://doi.org/10.25259/SNI_459_2025.

Alutaybi, A., Al-Thani, D., McAlaney, J., & Ali, R. (2020). Combating fear of missing out (FOMO) on social media: The fomo-r method. International Journal of Environmental Research and Public Health, 17(17), 6128. doi: https://doi.org/10.3390/ijerph17176128.

Arora, A., Bansal, S., Kandpal, C., Aswani, R., & Dwivedi, Y. (2019). Measuring social media influencer index- insights from Facebook, Twitter and Instagram. Journal of Retailing and Consumer Services, 49, 86–101. doi: https://doi.org/10.1016/j.jretconser.2019.03.012.

Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice Hall.

Beigi, G., & Liu, H. (2020). A survey on privacy in social media. ACM/IMS Transactions on Data Science, 1(1), 7–38. doi: https://doi.org/10.1145/3343038.

Beigi, G., Shu, K., Zhang, Y., & Liu, H. (2018). Securing social media user data. In Proceedings of the 29th ACM Conference on Hypertext and Social Media (pp. 165–173). doi: https://doi.org/10.1145/3209542.3209552.

Bélanger, F., & Crossler, R. E. (2011). Privacy in the digital age: A review of information privacy research in information systems. MIS Quarterly, 35(4), 1017–1042. doi: https://doi.org/10.2307/41409971.

Essaidi, A., Zaidouni, D., & Bellafkih, M. (2020). New method to measure the influence of Twitter users. In Proceedings of the 4th International Conference on Intelligent Computing in Data Sciences (ICDS) (pp. 1–5). doi: https://doi.org/10.1109/icds50568.2020.9268726.

Farivar, S., Wang, F., & Turel, O. (2022). Followers’ problematic engagement with influencers on social media: An attachment theory perspective. Computers in Human Behavior, 133, 107288. doi: https://doi.org/10.1016/j.chb.2022.107288.

Gross, J., & Wangenheim, F. V. (2018). The big four of influencer marketing. A typology of influencers. Marketing Review St. Gallen, 35(2), 30–38.

Gunaratne, K., Coomes, E. A., & Haghbayan, H. (2019). Temporal trends in anti-vaccine discourse on Twitter. Vaccine, 37(35), 4867–4871. doi: https://doi.org/10.1016/j.vaccine.2019.06.086.

Hagberg, A. A., Schult, D. A., & Swart, P. J. (2008). Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science Conference (SciPy2008), 11–15.

Keküllüoglu, D., Magdy, W., & Vaniea, K. (2020). Analysing privacy leakage of life events on Twitter. In Proceedings of the 12th ACM Conference on Web Science (pp. 287–294). doi: https://doi.org/10.1145/3394231.3397919.

Ki, C. W. ‘C.’, & Kim, Y. K. (2019). The mechanism by which social media influencers persuade consumers: The role of consumers’ desire to mimic. Psychology and Marketing, 36(10), 905–922. doi: https://doi.org/10.1002/mar.21244.

Kim, H., Jang, S. M., Kim, S.-H., & Wan, A. (2018). Evaluating sampling methods for content analysis of Twitter data. Social Media + Society, 4(2), 1–10. doi: https://doi.org/10.1177/2056305118772836.

Kotevski, D. P., Smee, R. I., Field, M., Nemes, Y. N., Broadley, K., & Vajdic, C. M. (2022). Evaluation of an automated Presidio anonymisation model for unstructured radiation oncology electronic medical records in an Australian setting. International Journal of Medical Informatics, 168, 104880. doi: https://doi.org/10.1016/j.ijmedinf.2022.104880.

Kumar, P., Choudhury, T., Rawat, S., & Jayaraman, S. (2016). Analysis of various machine learning algorithms for enhanced opinion mining using Twitter data streams. In Proceedings of 2016 International Conference on Micro-Electronics and Telecommunication Engineering (ICMETE) (pp. 265–270). doi: https://doi.org/10.1109/icmete.2016.19.

Kuutila, M., Rantala, L., Li, J., Hosio, S., & Mäntylä, M. (2024). What makes programmers laugh? Exploring the submissions of the subreddit r/ProgrammerHumor. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ’24) (pp. 371–381). Association for Computing Machinery. doi: https://doi.org/10.1145/3674805.3686696.

Lahuerta-Otero, E., & Cordero-Gutiérrez, R. (2016). Looking for the perfect tweet. The use of data mining techniques to find influencers on Twitter. Computers in Human Behavior, 64, 575–583. doi: https://doi.org/10.1016/j.chb.2016.07.035.

Lehmann, J., Gonçalves, B., Ramasco, J. J., & Cattuto, C. (2012). Dynamical classes of collective attention in Twitter. In Proceedings of the 21st International Conference on World Wide Web (pp. 251–260). Association for Computing Machinery. doi: https://doi.org/10.1145/2187836.2187871.

Liao, Y. (2019). Sharing personal health information on social media. In Proceedings of the 10th International Conference on Social Media + Society (pp. 194–204). doi: https://doi.org/10.1145/3328529.3328560.

Macri, C. Z., Teoh, S. C., Bacchi, S., Tan, I., Casson, R., Sun, M. T., … Chan, W. (2023). A case study in applying artificial intelligence-based named entity recognition to develop an automated ophthalmic disease registry. Graefe's Archive for Clinical and Experimental Ophthalmology = Albrecht von Graefes Archiv fur klinische und experimentelle Ophthalmologie, 261(11), 3335–3344. doi: https://doi.org/10.1007/s00417-023-06190-2.

Mathioudakis, M., & Koudas, N. (2010). TwitterMonitor: Trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 1155–1158). Association for Computing Machinery. doi: https://doi.org/10.1145/1807167.1807306.

Microsoft (2018). Microsoft Presidio: Context aware, pluggable and customizable PII anonymization service for text and images. Available from: https://microsoft.github.io/presidio/

Milne, G. R., Pettinico, G., Hajjat, F. M., & Markos, E. (2017). Information sensitivity typology: Mapping the degree and type of risk consumers perceive in personal data sharing. Journal of Consumer Affairs, 51(1), 133–161. doi: https://doi.org/10.1111/joca.12111.

Moura, J., & Serrão, C. (2016). Security and privacy issues of big data. In Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence (pp. 20–52). IGI Global. doi: https://doi.org/10.4018/978-1-4666-8505-5.

Nielsen, N. M., Smink, W. A., & Fox, J.-P. (2021). Small and negative correlations among clustered observations: Limitations of the linear mixed effects model. Behaviormetrika, 48(1), 51–77. doi: https://doi.org/10.1007/s41237-020-00130-8.

Okuah, O., Scholtz, B. M., & Snow, B. (2019). A grounded theory analysis of the techniques used by social media influencers and their potential for influencing the public regarding environmental awareness. Proceedings of the South African Institute of Computer Scientists and Information Technologists 2019, 36, 1–10. doi: https://doi.org/10.1145/3351108.3351145.

Orooji, M., Rabbanian, S. S., & Knapp, G. M. (2023). Flexible adversary disclosure risk measure for identity and attribute disclosure attacks. International Journal of Information Security, 10(3), 631–645. doi: https://doi.org/10.1007/s10207-022-00654-y.

Powale, P. I., & Bhutkar, G. D. (2013). Overview of privacy in social networking sites (SNS). International Journal of Computer Applications, 74(19), 39–46. doi: https://doi.org/10.5120/13005-0311.

Rahman, K. T. (2022). Influencer marketing and behavioral outcomes: How types of influencers affect consumer mimicry? SEISENSE Business Review, 2(1), 43–54. doi: https://doi.org/10.33215/sbr.v2i1.792.

Rao, A., Spasojevic, N., Li, Z., & Dsouza, T. (2015). Klout score: Measuring influence across multiple social networks. In Proceedings of the 2015 IEEE International Conference on Big Data (Big Data) (pp. 2282–2289). doi: https://doi.org/10.1109/bigdata.2015.7364017.

Reddit.com (n.d.-b). Available from: https://www.reddit.com/wiki/api/

Rosado, E. J. (2023). PII-Codex: A Python library for PII detection, categorization, and severity assessment. Journal of Open Source Software, 8(86), 5402. doi: https://doi.org/10.21105/joss.05402.

Rosado, E. J. (2024). Privacy vs. Social Capital: Social Media PII Disclosure Analyses (0.0.1). Zenodo. Available from: https://doi.org/10.5281/zenodo.13133302

Saha, K., Bayraktaroglu, A. E., Campbell, A. T., Chawla, N. V., De Choudhury, M., D’Mello, S. K., Dey, A. K., Gao, G., Gregg, J. M., Jagannath, K., Mark, G., Martinez, G. J., Mattingly, S. M., Moskal, E., Sirigiri, A., Striegel, A., & Yoo, D. W. (2019). Social media as a passive sensor in longitudinal studies of human behavior and wellbeing. Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 1–8. doi: https://doi.org/10.1145/3290607.3299065

Schwartz, P. M., & Solove, D. J. (2011). The PII problem: Privacy and a new concept of personally identifiable information. New York University Law Review, 86, 1814.

Sharma, P., Agarwal, A., & Sardana, N. (2018). Extraction of influencers across Twitter using credibility and trend analysis. In Proceedings of the 11th International Conference on Contemporary Computing (Vol. IC3, pp. 1–3). doi: https://doi.org/10.1109/ic3.2018.8530462

Shopify. (2023). The ROI of influencer marketing: How to measure and get the most out of your influencer efforts. Available from: https://www.shopify.com/enterprise/roi-influencer-marketing

Stieglitz, S., & Dang-Xuan, L. (2013). Social media and political communication: A social media analytics framework. Social Network Analysis and Mining, 3(4), 1277–1291. doi: https://doi.org/10.1007/s13278-012-0079-3

Stoddard, G. (2021). Popularity dynamics and intrinsic quality in Reddit and Hacker News. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 9, pp. 416–425). doi: https://doi.org/10.1609/icwsm.v9i1.14636

Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. G. Austin & S. Worchel (Eds.), The Social Psychology of Intergroup Relations (pp. 33–47). Monterey, CA: Brooks/Cole.

Tene, O., & Polonetsky, J. (2013). Big data for all: Privacy and user control in the age of analytics. Northwestern Journal of Technology and Intellectual Property, 11(5), 1.

Trepte, S. (2020). The social media privacy model: Privacy and communication in the light of social media affordances. Communication Theory, 31(4), 549–570. doi: https://doi.org/10.1093/ct/qtz035

Turban, E., Bolloju, N., & Liang, T.-P. (2011). Enterprise social networking: Opportunities, adoption, and risk mitigation. Journal of Organizational Computing and Electronic Commerce, 21(3), 202–220. doi: https://doi.org/10.1080/10919392.2011.590109

Valle-Cruz, D., López-Chau, A., & Sandoval-Almazán, R. (2020). Impression analysis of trending topics in Twitter with classification algorithms. In Proceedings of the 13th International Conference on Theory and Practice of Electronic Governance (pp. 430–441). doi: https://doi.org/10.1145/3428502.3428570

Waldman, A. E. (2016). Privacy, sharing, and trust: The Facebook study. Case Western Reserve Law Review, 67(1). doi: https://doi.org/10.2139/ssrn.2726929

Westin, F., & Chiasson, S. (2019). Opt out of privacy or Go home. In Proceedings of the New Security Paradigms Workshop (pp. 57–67). doi: https://doi.org/10.1145/3368860.3368865

Weston, H., & Wells, B. P. (2020). Social media as a factor in personal injury underwriting: Risk, rate and regulation. Journal of Insurance Regulation, 39(1). doi: https://doi.org/10.52227/22019.2020

Yang, Y., & Huang, Y. (2019). Dumping the closet skeletons online: Exploring the guilty information disclosure behavior on social media. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 13, pp. 663–666). doi: https://doi.org/10.1609/icwsm.v13i01.3267

Zhang, Z., Zhao, W., Yang, J., Paris, C., & Nepal, S. (2019). Learning influence probabilities and modelling influence diffusion in Twitter. In Companion Proceedings of the 2019 World Wide Web Conference (pp. 1087–1094). doi: https://doi.org/10.1145/3308560.3316701

Zhang, M., Beltran, F., & Liu, J. (2020a). Incentive mechanism for social network data pricing under privacy preservation. In Proceedings of the 2nd ACM International Symposium on Blockchain and Secure Critical Infrastructure (pp. 85–95). doi: https://doi.org/10.1145/3384943.3409425

Zhang, X., & Choi, J. (2022). The importance of social influencer-generated contents for user cognition and emotional attachment: An information relevance perspective. Sustainability, 14(11), 6676. doi: https://doi.org/10.3390/su14116676

Zhang, Z., Jiménez, F. R., & Cicala, J. E. (2020b). Fear of missing out scale: A self-concept perspective. Psychology and Marketing, 37(11), 1619–1634. doi: https://doi.org/10.1002/mar.21406
