This study examined how influencer power and tier relate to social media engagement and personal information sharing, noting fluctuations of both over time.
A combination of content analysis, change point detection methods (CUSUM Drift, Change Point Detection) and time series modeling (ARIMA) was used to analyze social media conversations and identify temporal trends in engagement and personal information disclosures.
The study found a strong correlation between influencer reach and both engagement and disclosures, alongside a declining trend in engagements (79.78% of threads) and disclosures (66.12% of threads) across 183 conversational threads. An observed 29.41% of posts contained personally identifiable information (PII), with conservative sensitivity analysis suggesting an adjusted prevalence of 15.00% to 20.27% after accounting for automated detection error.
The cross-platform comparison revealed architectural differences between Twitter/X's follower hierarchy and Reddit's community voting structures that influence influencer effects on PII disclosure. Rethinking influencer identification on community-based platforms requires tailored models that consider community norms instead of just follower counts. Manual validation of automated PII detection was infeasible due to data access constraints.
This project provides exploratory insights into how platform architecture may moderate influencer dynamics, with implications for privacy-conscious platform design and future comparative studies.
Introduction
Social media has reshaped industries, crisis communication and socio-cultural research, transforming our understanding of human behavior (Valle-Cruz, López-Chau, & Sandoval-Almazán, 2020) while blurring lines between private and public sharing. Social media influencers play a crucial role in these dynamics, persuading large audiences and impacting marketing campaigns and elections (Okuah, Scholtz, & Snow, 2019). Followers often adopt similar hashtags and language, increasing the risk of identifiable information being shared. Third-party consumers can access users' posts through APIs, relying on consent not to disclose identities. Once public, posts can be archived and used by marketers or adversaries, raising concerns over personal data control (Tene and Polonetsky, 2013; Trepte, 2020), particularly as personal information in viral content can be exploited (Beigi, Shu, Zhang, & Liu, 2018).
Users must be cautious about sharing information online, while organizations collecting this data are responsible for protecting privacy. Poor data handling can result in identifying individuals from ostensibly anonymous posts (Beigi et al., 2018), making it essential to understand extractable data types and potential attack methods (Orooji, Rabbanian, & Knapp, 2023). Despite privacy concerns, incentives motivate users to share personal information (Bélanger & Crossler, 2011; Yang & Huang, 2019), driven by factors such as FOMO (Alutaybi, Al-Thani, McAlaney, & Ali, 2020), guilt (Yang & Huang, 2019) and the desire to manage public personas (Zhang et al., 2020a, b). While research highlights these behavioral factors, it often overlooks how they relate specifically to influencer networks (Farivar, Wang, & Turel, 2022), where engagement-driven frequent posting can diminish self-control and risk assessment as users emulate influencers as role models (Gross & von Wangenheim, 2018; Ki & Kim, 2019). A limited understanding remains of the actual behaviors associated with follower engagement in these networks (Farivar et al., 2022).
Literature review
Social media networks consist of individuals with varying degrees of relationships, often featuring influencers who significantly impact followers' engagement behaviors (Okuah et al., 2019). This influence can produce similar posting patterns, exposing identifiable data that third parties can exploit. Engaging with trends that encourage personal disclosure can lead to privacy-invasive behaviors (Valle-Cruz et al., 2020), and users may not realize their shared data can be aggregated to identify them or be misused by malicious actors (Keküllüoğlu et al., 2020; Moura and Serrão, 2016). This study critically reviews the literature on influencer identification, network analysis methods, privacy risks, personally identifiable information (PII) and social media disclosure.
Factors in privacy-adverse behavior
Trust and behavioral mimicry
Zhang and Choi (2022) observed that influencers often share entertaining content to capture audience attention before building trust to achieve their objectives. However, social networking apps can create a deceptive sense of security, leading users to believe they are sharing information safely when third-party access may be extensive (Waldman, 2016). Even privacy-concerned users may still participate in social media to gain social capital or fulfill psychological needs, as obtaining social capital will often require compromising privacy (Westin & Chiasson, 2019).
Under social identity theory (Tajfel & Turner, 1979), users may disclose PII to align with perceived community norms or impress influencers they follow. If these norms evolve to discourage PII sharing, followers may alter their behavior accordingly. Under social learning theory (Bandura, 1977), PII disclosures may increase when individuals observe influencers or peers engaging in this behavior and mimic it. Conversely, declines in PII sharing might result from decreased influencer relevance or shifting audience norms around disclosure.
Social media as a “technology of self”
Information shared on social media varies in potential risk. Yang and Huang (2019) explored types of self-disclosures and their motivations, introducing Guilty Information Disclosure. Their findings indicate that contradictory behaviors such as seeking confirmation and imitation are common, with social media serving as a means for self-expression where individuals convey guilt and other sensitive information.
When users reveal past actions or details to engage with others, they risk divulging PII such as names or location, sensitive health information (Liao, 2019) or relationship details (Yang & Huang, 2019). Such disclosures can lead to discrimination by entities like insurers or employers (Weston & Wells, 2020), and analysis of comment interactions can compound this risk through discrimination based on network connections (Beigi & Liu, 2020).
Influencers and influencer networks
The influencer marketing industry was projected to exceed $16 billion in 2022 (Shopify, 2023). Influencer marketing remains one of the most sought-after forms of marketing because influencers can reach other users and assuage their hesitations about purchasing decisions. In a survey by Ki and Kim (2019), respondents admitted they would commit to a purchase if an influencer endorsed the product or service. Influencers are not limited to marketing activities, however. When targeting an audience, several types of influencers with specific goals and compatibility considerations exist.
Identifying influencers
Okuah et al. (2019) define an influencer as someone with credibility and a significant following, capable of impacting individuals' decisions. Rahman (2022) categorizes influencers by follower count into Nano (under 10,000), Micro (10,000–100,000), Macro (100,000–1 million) and Mega (1 million+) tiers, a classification crucial for marketing campaigns and consumer behavior research.
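The tier boundaries above can be expressed as a small lookup. This is a minimal sketch, not the study's code: the function name is hypothetical, and because the ranges as stated share their endpoints (100,000 and 1 million), the sketch assumes each boundary value belongs to the higher tier.

```python
def influencer_tier(follower_count: int) -> str:
    """Map a follower count to a tier label using Rahman's (2022) ranges.

    Assumption: boundary values (10,000; 100,000; 1,000,000) fall into the
    higher tier, since the ranges as stated share their endpoints.
    """
    if follower_count < 0:
        raise ValueError("follower_count must be non-negative")
    if follower_count < 10_000:
        return "Nano"
    if follower_count < 100_000:
        return "Micro"
    if follower_count < 1_000_000:
        return "Macro"
    return "Mega"
```

For example, under this assumption an account with 250,000 followers would be labeled Macro.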
Methods for calculating influencer power vary across studies, from tools like Klout (Rao, Spasojevic, Li, & Dsouza, 2015) to custom calculations based on network analysis metrics (Kumar, Choudhury, Rawat, & Jayaraman, 2016; Stieglitz & Dang-Xuan, 2013) or combinations of retweets, followers, mentions and favorites (Essaidi, Zaidouni, & Bellafkih, 2020; Sharma, Agarwal, & Sardana, 2018). Alternative approaches include influence scores using user contacts and followers (Lahuerta-Otero & Cordero-Gutiérrez, 2016). Essaidi et al. (2020) highlighted the follower-following ratio, finding that higher values indicate greater influence power. Due to limitations in third-party software, centrality measures emerged as the most feasible method for this study, particularly eigenvector centrality, which accounts for the influence of interconnected nodes (Gunaratne, Coomes, & Haghbayan, 2019).
Classifying private data and calculating risk
Information disclosure types and associated attacks
Even with private account features, users can inadvertently reveal information through photos or text exposure (Keküllüoğlu et al., 2020; Powale & Bhutkar, 2013). Beigi and Liu (2020) identify two disclosure types: identity disclosure, which maps a dataset instance to an individual, and attribute disclosure, where an adversary infers information from released data. Two corresponding threats were modeled in this study: the Identity Disclosure Attack, using social network data to map users to known identities, and the Attribute Disclosure Attack, using social network data to infer attributes for users within a network group.
PII and sensitive data types
Data elements like birthdates, real names, addresses, phone numbers, emails and financial details are considered personal data, increasing privacy risks in social interactions (Milne, Pettinico, Hajjat, & Markos, 2017). Milne et al. (2017) introduced the Information Sensitivity Typology, categorizing information into Basic Demographics, Secure Identifiers, Contact Information, Financial Information, Community Interaction and Personal Preferences, aligned with NIST and Homeland Security standards for classifying direct PII or potentially linkable data. Their study identified four risk categories (monetary, social, physical and psychological), with consumers perceiving higher risks for Secure Identifiers and lower risks for Basic Demographics.
Rosado (2023) expanded this work with PII-Codex, a Python package for detecting and assessing PII tokens using Microsoft Presidio (Microsoft, 2018), which employs rule-based and named entity recognition (NER) models to assign categories and severities based on Milne et al.'s (2017) typology. NER models are effective for entity extraction but tend toward false positives, with one study reporting a precision of 0.82 and a recall of 0.81 (Macri et al., 2023).
Purpose of the study
This quantitative study examined how social media influencers affect followers' sharing of PII across platforms with different architectural designs. This impact is relevant for businesses choosing influencers and forming social media policies, as well as within private corporate networks where excessive sharing can expose sensitive information (Turban, Bolloju, & Liang, 2011). The research investigates Reddit's community-based architecture after an initial Twitter/X pilot, offering exploratory insights into how platform design influences influencer-driven disclosure patterns. No IRB approval was necessary since the study used publicly available posts, and no raw data were retained after analysis.
Research questions
How do influencer tier and influence power relate to follower engagement and PII disclosure?
Do these relationships vary between follower hierarchy platforms (Twitter/X) and community-based platforms (Reddit)?
Do engagement and PII disclosure rates exhibit temporal decay patterns in social media conversations?
Hypotheses
Past studies show that influence power, which incorporates follower count, tends to boost involvement in trending activities (Arora, Bansal, Kandpal, Aswani, & Dwivedi, 2019; Lahuerta-Otero & Cordero-Gutiérrez, 2016) and engagement likelihood increases with influencer prominence (Rahman, 2022). Therefore, influencers with higher influence power and tier likely encourage more followers to share personal information through increased interaction, even when the original author does not share their own details.
H1. Influence Power positively correlates with follower engagements.
H2. Influence Power positively correlates with PII disclosure detections.
H3. Influencer Tier positively correlates with follower engagements.
H4. Influencer Tier positively correlates with PII disclosure detections.
Engagement has been observed to decline over time due to trend or influencer irrelevance (Zhang, Zhao, Yang, Paris, & Nepal, 2019), and analyzing this temporal pattern aids understanding of how attitudes and behavior change with social media usage (Saha et al., 2019). Given this decline, the same was hypothesized for PII disclosure rates.
H5. As time elapses, the engagements in a cluster will decrease.
H6. As time elapses, the PII disclosure detections in a cluster will decrease.
The research model is presented in Figure 1.
[Figure 1: Research model of influencer impact on engagements and PII disclosures. Three components (Influence Power, Influencer Tier and Time Elapsed) are connected by directional arrows to two outcomes (Engagement and PII Disclosures). Following the hypotheses as stated, Influence Power links to the outcomes through H1 and H2, Influencer Tier through H3 and H4, and Time Elapsed through H5 and H6.]
Methods
The following subsections describe the steps from data collection through final analysis and hypothesis testing.
Data collection
Throughout the pilot study, X (Twitter) served as the data source. Due to unprecedented changes in X's API offering, the data source was switched to Reddit for the main study. With both sources, some conversations may still have been developing when trending topics were polled, so complete conversational threads cannot be guaranteed.
In the pilot study, posts from X were collected every 15 minutes on various days during the transition from legacy to new tier limits. The dataset used top conversations with keywords yielding four collections: Zelda, Jedi, Liverpool and Ferrari. When the final collection was retrieved, API limits and terms had changed, prompting a pivot to Reddit. In the main study, Reddit posts were gathered at 30-minute intervals over 11 days (September 12–22, 2023), modeled after the X study methodology of Kim, Jang, Kim, and Wan (2018). Collections stayed within the Free Reddit API's limit of 100 requests per minute (Reddit API, n.d.; Stoddard, 2021) and occurred from 7 AM to 10 PM Mountain Standard Time to permit human oversight of data collection. Posts and subsequent thread interactions were recorded by examining comments on initial posts and replies that followed.
Data processing
PII identification and risk calculation
The PII identification and risk calculation allow the evaluation of the risk severity of information disclosed across the network graph. The PII-Codex (2023) was used for PII detections and categorizations in combination with the Not Identified, Identifiable and Identified categories by Schwartz and Solove (2011). The library uses a severity scale of 1, 2 and 3 for the categories of Not-Identified, Identifiable and Identified, respectively, to determine the risk score rs of a token. The isolated set of PII types and their associated categories and severities provided by Milne et al. (2017) and the PII-Codex Risk Values (2023) are presented in Table 1.
Data typology for risk assessments with risk Enum coding from PII codex
| Type | Cluster membership | NIST category | Homeland security category | PII-codex id | PII-codex risk value |
|---|---|---|---|---|---|
| Country of Citizenship | Basic Demographics | Linkable | Linkable | COUNTRY_OF_CITIZENSHIP | 2 |
| Zip code +4 | Basic Demographics | Linkable | Not Mentioned | ZIPCODE | 2 |
| Gender | Basic Demographics | Linkable | Not Mentioned | GENDER | 2 |
| Birth Date | Basic Demographics | Linkable | Linkable | DATE | 2 |
| Online Screen Name | Personal Preferences | Directly PII | Not Mentioned | SCREEN_NAME | 3 |
| Religion | Personal Preferences | Linkable | Not Mentioned | NRP | 2 |
| Political Affiliation | Personal Preferences | Linkable | Not Mentioned | NRP | 2 |
| Email Address | Personal Preferences | Directly PII | Stand Alone PII | EMAIL_ADDRESS | 3 |
| IP Address | Contact Information | Directly PII | Not Mentioned | IP_ADDRESS | 3 |
| Phone Number | Contact Information | Directly PII | Stand Alone PII | PHONE_NUMBER | 3 |
| Address | Contact Information | Linkable | Not Mentioned | LOCATION | 2 |
| Social Network Profile | Community Interaction | Linkable | Not Mentioned | SCREEN_NAME | 2 |
| Credit Card Number | Financial Information | Directly PII | Stand Alone PII | CREDIT_CARD_NUMBER | 3 |
| Financial Account Numbers | Financial Information | Directly PII | Stand Alone PII | …Various | 3 |
| Home Address | Secure Identifiers | Directly PII | Stand Alone PII | LOCATION | 3 |
| Location | Secure Identifiers | Linkable | Not Mentioned | LOCATION | 2 |
The risk score mean provided by the library was calculated using the mean severity score of each token detected in a text.
Each post's risk score mean value is then added to the collection's final calculation of the risk score using the mean of means formula:
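One reading of this calculation that is consistent with the description is given here. The symbols other than $rs$ are assumptions introduced for illustration: a collection holds $N$ posts, post $i$ has $T_i$ scored tokens, and token $j$ of post $i$ has severity $rs_{i,j} \in \{1, 2, 3\}$.

```latex
\overline{rs}_i = \frac{1}{T_i} \sum_{j=1}^{T_i} rs_{i,j},
\qquad
\overline{rs}_{\text{collection}} = \frac{1}{N} \sum_{i=1}^{N} \overline{rs}_i
```

Under this reading every post contributes equally to the collection score regardless of how many tokens it contains, which is what distinguishes a mean of means from a pooled mean over all tokens.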
The min, median and max calculations of this mean risk score, alongside what types of PII were detected with the input, are provided per node and per cluster within the final dataset for future evaluation.
Metrics collected
Influencer power, influencer tier, disclosure detections and the associated cluster details were collected per node, as shown in Table 2.
Node metrics collected
| Column name | Type | Description |
|---|---|---|
| Node ID | UUID | Unique identifier for post (replaces original platform identifier) |
| User ID | UUID | Unique identifier assigned for user (replaces original platform identifier) |
| Cluster Name | Str | Composite ID for subgraph using collection name and subgraph index |
| Influence Power | Float | Eigenvector centrality |
| Influencer Tier | Str | Categorical label calculated by follower count |
| Collection Name | Str | Trend collection assigned based on search query |
| Hashtags | Set(str) | The set of hashtags included in the node |
| PII Disclosed | Bool | Whether or not PII was disclosed |
| PII Detected | Set(str) | The detected token types in post |
| PII Risk Score | Float | The PII score for all tokens in a post |
| Is Comment | Bool | Whether or not the post is a comment or reply |
| Is Text Starter | Bool | Whether or not the post has text content |
| Community | Str | The group, community, channel, etc. associated with |
| Timestamp | Timestamp | Creation timestamp (provided by social media API) |
| Time Elapsed | Int | Time elapsed (seconds) from original influencer's post |
Metrics summarizing the cluster details were collected, including influencer summaries (for all influencers within the cluster), risk score statistics, disclosure and engagement counts and ratios, the periods of each cluster and the average time elapsed between responses within. These metrics and details are enumerated with their respective types in Table 3.
Cluster metrics and summarizations
| Column name | Type | Description |
|---|---|---|
| Cluster Name | Str | Composite ID for subgraph using collection name and subgraph index |
| Influencer Tiers Frequencies | List[dict] | Frequency of influencer tiers of all users in the cluster |
| Top Influence Power Score | Float | Eigenvector centrality of top influencer |
| Top Influencer Tier | Str | Size tier of top influencer |
| Collection Name | Str | Trend collection assigned based on search query |
| Hashtags | Set(str) | The set of hashtags included in the cluster |
| PII Detection Frequencies | List[dict] | The detected token types in post with frequencies |
| Node Count | Int | Count of all nodes in the influencer cluster |
| Node Disclosures | Int | Count of all nodes with mean_risk_score >1* |
| Disclosure Ratio | Float | Sum of nodes with confirmed disclosed PII divided by cluster size |
| Mean Risk Score | Float | The mean risk score for an entire network cluster |
| Median Risk Score | Float | The median risk score for an entire network cluster |
| Min Risk Score | Float | The min risk score for an entire network cluster |
| Max Risk Score | Float | The max risk score for an entire network cluster |
| Time Span | Float | Total Time Elapsed |
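The risk-related cluster columns can be derived from per-node scores along these lines. This is a minimal sketch rather than the study's implementation: the function name and dictionary keys are hypothetical, and it assumes each node carries a single mean risk score where a value above 1 marks a disclosure, as the Node Disclosures description suggests.

```python
from statistics import mean, median


def summarize_cluster(node_risk_scores):
    """Summarize per-node mean risk scores for one cluster.

    Assumption: a node counts as a disclosure when its mean risk score
    is greater than 1 (1 being the Not-Identified severity).
    """
    if not node_risk_scores:
        raise ValueError("cluster has no nodes")
    node_count = len(node_risk_scores)
    node_disclosures = sum(1 for score in node_risk_scores if score > 1)
    return {
        "node_count": node_count,
        "node_disclosures": node_disclosures,
        "disclosure_ratio": node_disclosures / node_count,
        "mean_risk_score": mean(node_risk_scores),
        "median_risk_score": median(node_risk_scores),
        "min_risk_score": min(node_risk_scores),
        "max_risk_score": max(node_risk_scores),
    }
```

A cluster with node scores `[1.0, 1.0, 2.0, 3.0]` would yield two disclosures and a disclosure ratio of 0.5.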
Constructing social graphs
The submission_name and parent_id attributes from Reddit posts drove the cluster construction in the study. Since identifiers like the conversation_id, submission_name, id and parent_id attribute types can be used to track down pieces of a conversation thread, an internal unique identifier labeled post_uuid was used instead to track relationships between the nodes, as shown in the pseudocode in Algorithm 1.
Reddit Social Graph and Summary Construction Overview.
Function build_social_graph_summaries(Object posts)
    G = nx.Graph()
    forall child c in posts do
        if c's post_id is not in G then
            G.add_node(c)
        end
        forall replies r in c.comments do
            if r's post_id is not in G then
                G.add_node(r)
                G.add_edge(c, r)
            end
        end
    end
    influence_ratings = calculate_influence_ranks(G, posts)
    return build_graph_summaries(G, posts, influence_ratings)
end
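The graph-building step of Algorithm 1 can be sketched in plain Python. Several things here are assumptions for illustration: the input shape (a list of dictionaries with a `post_uuid` key and a `comments` list), the function names, the use of an adjacency map in place of a NetworkX graph, and the treatment of each connected component as one cluster. Like the pseudocode, the sketch handles only one level of replies.

```python
def build_social_graph(posts):
    """Return an undirected adjacency map {post_uuid: set of neighbor uuids}."""
    graph = {}
    for post in posts:
        parent = post["post_uuid"]
        graph.setdefault(parent, set())  # register the post even with no replies
        for reply in post.get("comments", []):
            child = reply["post_uuid"]
            graph.setdefault(child, set()).add(parent)
            graph[parent].add(child)
    return graph


def connected_components(graph):
    """Split the adjacency map into clusters (assumed: one per component)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        clusters.append(members)
    return clusters
```

Keeping graph construction separate from cluster extraction means centrality and summary statistics can then be computed one cluster at a time.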
Influencer identification and power score calculation
The two most referenced methods of ranking influencers are centrality values from graph analysis and custom score calculations using favorites, mentions and retweets. This study employs eigenvector centrality as the Influence Power score, calculated using the NetworkX library (Hagberg, Schult, & Swart, 2008). For a cluster graph G = (V, E), NetworkX calculates eigenvector centrality for every node n. Since each trend contains multiple clusters, centrality values and top-influencing node selection are evaluated per sub-graph, with each social graph independently determining its top-influencing node.
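To make the Influence Power score concrete, the following sketch computes eigenvector centrality by power iteration over an adjacency map. It is a plain-Python stand-in, not the NetworkX routine the study used (NetworkX exposes this as `networkx.eigenvector_centrality`); the function name, parameters and input shape are assumptions, and results may differ slightly from the library's in tolerance and iteration limits.

```python
import math


def eigenvector_centrality(graph, max_iter=1000, tol=1e-9):
    """Power iteration on (A + I) for an undirected adjacency map.

    graph: {node: set of neighbor nodes}. Scores are scaled to unit
    Euclidean norm. The identity shift avoids oscillation on bipartite
    graphs such as a single post with many direct replies.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        updated = {n: scores[n] + sum(scores[m] for m in graph[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in updated.values())) or 1.0
        updated = {n: v / norm for n, v in updated.items()}
        if sum(abs(updated[n] - scores[n]) for n in nodes) < tol * len(nodes):
            return updated
        scores = updated
    raise RuntimeError("power iteration did not converge")


def top_influencer(graph):
    """Return the node with the highest centrality in one cluster."""
    scores = eigenvector_centrality(graph)
    return max(scores, key=scores.get)
```

In a thread where one post receives every reply directly, that post gets the highest score, which matches the intuition of selecting one top-influencing node per sub-graph.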
Calculating engagements
Engagements in this study included shares and replies on X (Twitter) and comments on Reddit, which enhance post visibility based on platform algorithms. Only response-type engagements may reveal new PII disclosures, as favorites/likes on X or upvotes/downvotes on Reddit do not allow for new text sharing. Two metrics were used: Response Type Engagements, which count only responses, and Total Engagements, which include all engagement types. Limitations in X's API may restrict access to full conversation archives, whereas Reddit allows polling for submissions, though issues may arise from post or user deletions and late participants in discussions.
Data groupings and isolating variable effect
Time series analyses were performed on engagements and PII disclosures per cluster to test whether both trend downward over time (H5, H6). Cluster-wide metrics included total disclosures, total engagements, rates of both, mean risk score, total time span, primary influence power score and influencer tier. These metrics informed descriptive statistics and some analyses, but time series testing required individual node-level data points extracted per cluster.
Data were resampled into 5-, 10- or 15-minute time bins based on data density: sparser conversations required larger bins (15 minutes) to ensure sufficient observations for stationarity testing, while denser conversations used smaller bins (5 minutes) to capture finer temporal dynamics without over-smoothing. Stationarity was assessed using the Augmented Dickey-Fuller test at α = 0.05, with differencing applied as needed. The differencing order, combined with autocorrelation (ACF) and partial autocorrelation (PACF) results, informed the ARIMA model parameters, and outcomes were plotted using Plotly.
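The binning step can be illustrated as follows. This is a sketch under stated assumptions: the function name is hypothetical, inputs are the per-node Time Elapsed values in seconds, and the bin width is passed in by the caller because the density thresholds used to choose between bin sizes are not specified.

```python
from collections import Counter


def bin_counts(elapsed_seconds, bin_minutes=5):
    """Count events per fixed-width time bin, keeping empty bins as zeros.

    elapsed_seconds: non-negative offsets (seconds) from the original post.
    Returns a list whose index k holds the count for bin k.
    """
    if bin_minutes <= 0:
        raise ValueError("bin_minutes must be positive")
    if not elapsed_seconds:
        return []
    width = bin_minutes * 60
    counts = Counter(int(t // width) for t in elapsed_seconds)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]
```

Retaining the zero-count bins matters for the later tests, since a gap in activity is itself part of the temporal pattern being modeled.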
To assess robustness, we incorporated CUSUM (Cumulative Sum Control Chart), Change Point Detection (CPD) and the Mann–Kendall test as comparative baselines. While ARIMA is parametric and sensitive to local fluctuations, the Mann–Kendall test provides a model-free evaluation of monotonic trends. CUSUM and CPD complement both approaches by detecting gradual shifts and distinct breakpoints in time-binned social media data that traditional methods may miss.
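The two simplest baselines can be sketched in plain Python. These are textbook forms written for illustration, not the implementations used in the study: the function names are hypothetical, the Mann–Kendall p-value uses the normal approximation with a tie correction, and the CUSUM shown is the basic cumulative sum of deviations from the series mean.

```python
import math
from collections import Counter


def mann_kendall(series):
    """Return (S, Z, two-sided p) for a monotonic trend in `series`."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")
    # S counts concordant minus discordant pairs over all i < j.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    tie_sizes = Counter(series).values()
    var_s = (
        n * (n - 1) * (2 * n + 5)
        - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)
    ) / 18
    if var_s == 0:  # every value identical: no trend information
        return s, 0.0, 1.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z, math.erfc(abs(z) / math.sqrt(2))


def cusum(series):
    """Cumulative sum of deviations from the mean; extremes hint at shifts."""
    if not series:
        return []
    center = sum(series) / len(series)
    total, path = 0.0, []
    for value in series:
        total += value - center
        path.append(total)
    return path
```

A negative Z with p below 0.05 would indicate a declining trend in the binned counts, while the index where the CUSUM path peaks or troughs suggests where the level of activity changed.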
Analyzing within and between clusters
After data collection and processing, each daily collection contained multiple conversation groups, with each collection c having k clusters and every cluster containing n nodes (observations). Each trend collection represents a sample from the broader platform population, with individual groupings within. Figure 2 presents the data grouping breakdown.
[Figure 2: Data groupings using clusters within trend collections. Social media data (the population) is sampled into collections; each collection is divided into clusters, and each cluster yields a series of observations over time, supporting comparisons both between and within clusters.]
With clustered data, correlation within the hierarchical structure derived from aggregated user-level interactions may violate independence assumptions (Nielsen, Smink, & Fox, 2021). This approach requires sufficient sample sizes to estimate random effects, both in cluster count and nodes per cluster. Between-cluster analysis used Spearman's Rank correlation, given non-normal distributions, confirmed via the Anderson-Darling test. Both tests used α = 0.05 and were conducted using SciPy.
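For reference, Spearman's rank correlation is the Pearson correlation of the rank-transformed values. The sketch below shows the coefficient only, in plain Python with hypothetical function names; the study itself used SciPy, whose `scipy.stats.spearmanr` also returns the p-value needed for testing at α = 0.05.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("rho is undefined when an input is constant")
    return cov / (spread_x * spread_y)
```

Because only ranks enter the calculation, the coefficient is unaffected by the skewed, non-normal distributions of centrality scores and engagement counts.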
Hypothesis testing
Spearman's correlation tests assessed the relationship between the primary influencer's power index and the dependent variables (Engagement and PII Disclosures), given non-normal distributions. Hypothesis pairs H1/H2, H3/H4 and H5/H6 were tested at the collection level using α = 0.05. Time series analysis was performed per cluster to visualize trends in engagements and disclosures.
Results
Data collection and composition
The pilot analysis used X (Twitter) as the data source, pulling 10,259 posts, many of which were one-off posts unrelated to conversational exchanges. Following X's API tier restructuring, which moved key endpoints to Pro and Enterprise tiers, the data source pivoted to Reddit. The main study collected 122,904 posts and subscriber/follower data for 93,982 users across 285 conversation clusters from Reddit (September 12–22, 2023). Table 4 presents the full dataset composition.
Main study: Reddit dataset composition
| Collection | Clusters | Posts | Users | String tokens | PII tokens | % nodes disclosed |
|---|---|---|---|---|---|---|
| 2023–09–12 | 15 | 7,538 | 5,664 | 178,054 | 3,630 | 30.8835 |
| 2023–09–13 | 43 | 9,102 | 6,934 | 241,482 | 3,851 | 27.3237 |
| 2023–09–14 | 26 | 12,324 | 9,266 | 329,866 | 6,014 | 29.3817 |
| 2023–09–15 | 26 | 11,930 | 8,785 | 294,454 | 6,568 | 31.8022 |
| 2023–09–16 | 30 | 14,816 | 12,260 | 282,021 | 6,355 | 27.6795 |
| 2023–09–17 | 20 | 12,237 | 8,899 | 288,046 | 4,998 | 26.2074 |
| 2023–09–18 | 24 | 10,752 | 8,226 | 265,549 | 6,550 | 34.0402 |
| 2023–09–19 | 16 | 12,529 | 10,840 | 244,281 | 3,585 | 29.4668 |
| 2023–09–20 | 46 | 11,619 | 8,485 | 339,404 | 6,971 | 33.6512 |
| 2023–09–21 | 10 | 6,852 | 4,449 | 187,557 | 5,497 | 33.4501 |
| 2023–09–22 | 29 | 13,205 | 10,174 | 321,730 | 6,689 | 29.5798 |
Across daily collections, Person types (strings identified as potential individual names) constituted the vast majority of PII detections, followed by Datetime, NRP (Nationality, Religious or Political mentions), Location and one Medical License identification determined to be a false positive.
Main study results
The following sections present the individual analyses conducted in the main study and their results.
Timeseries analysis
There were 183 total clusters when using the 30-node minimum threshold for clusters. Due to the volume of graphs, summaries for every collection are provided in lieu of providing every graph associated with the clusters. While some collections held a small set of clusters, others held significantly more, as presented by the September 15th collection in Figure 3.
Two line graphs display engagements and disclosures across Reddit conversations on September 15th. The top graph shows engagements with multiple colored lines representing different data sets, peaking around 10:00. The bottom graph shows disclosures with similar colored lines, also peaking around 10:00. Both graphs have time on the x-axis in 15-minute intervals and counts on the y-axis. A dashed black line represents the average for each graph. The highest average engagements and disclosures occur around 10:00. All values are approximated.
Engagements and disclosures across Reddit conversations on September 15th
Of the 183 clusters analyzed, ARIMA identified descending trends in 146 (79.78%) for engagements, with 73 showing statistically significant AR and MA coefficients. For disclosure-active clusters, 121 showed descending trends, with 68 statistically significant. CUSUM Drift detected declining trends in 73 clusters for engagements and 68 for disclosures; Change Point Decline detected 182 and 153, respectively; Mann-Kendall detected none. When requiring agreement among at least three methods, 61 clusters showed declining engagement trends, and 50 showed declining disclosure trends.
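The agreement rule in the preceding paragraph can be sketched as a simple vote count per cluster. This is an illustrative sketch under stated assumptions, not the study's implementation: the method keys, the `consensus_declines` helper and the three example clusters are hypothetical, while the counts 146, 121 and 183 are taken from the text.

```python
# Hypothetical keys for the four decline detectors named in the text.
METHODS = ("arima", "cusum_drift", "change_point", "mann_kendall")


def consensus_declines(flags_by_cluster, min_agree=3):
    """Return cluster ids where at least `min_agree` methods flag a decline."""
    agreeing = []
    for cluster_id, flags in flags_by_cluster.items():
        votes = sum(1 for method in METHODS if flags.get(method, False))
        if votes >= min_agree:
            agreeing.append(cluster_id)
    return agreeing


# Hypothetical per-cluster detector outputs, for illustration only.
flags = {
    "c1": {"arima": True, "cusum_drift": True, "change_point": True, "mann_kendall": False},
    "c2": {"arima": True, "cusum_drift": False, "change_point": True, "mann_kendall": False},
    "c3": {"arima": False, "cusum_drift": False, "change_point": True, "mann_kendall": False},
}
consensus = consensus_declines(flags)

# Shares reported in the text, recomputed from the stated counts.
engagement_share = round(146 / 183 * 100, 2)
disclosure_share = round(121 / 183 * 100, 2)
print(consensus, engagement_share, disclosure_share)
```

Since Mann-Kendall flagged no declines, a three-method consensus in this dataset effectively requires ARIMA, CUSUM Drift and Change Point Decline to all agree.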
For each collection (Figure 4), consensus across decline detection methods was assessed, with maximum average agreement just under 3. As shown in Figure 5, Mann–Kendall contributed zero decline detections across all collections for both engagement and disclosure data, likely due to the noisy, non-monotonic nature of time-binned social media activity associated with cross-sectional sampling. Change Point Decline and ARIMA consistently identified declines, followed by CUSUM Drift.
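A small sketch may help show why a monotonic trend test can stay silent on rise-then-fall activity. It computes the Mann-Kendall S statistic and a normal-approximation Z score. The variance formula used here omits the tie correction, which is a simplifying assumption, and both example series are hypothetical rather than drawn from the study.

```python
from math import sqrt


def mann_kendall_s(series):
    """Sum of signs of all later-minus-earlier differences."""
    s = 0
    for i in range(len(series) - 1):
        for j in range(i + 1, len(series)):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    return s


def mann_kendall_z(series):
    """Normal-approximation Z score; tie correction omitted (assumption)."""
    n = len(series)
    s = mann_kendall_s(series)
    variance = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / sqrt(variance)
    if s < 0:
        return (s + 1) / sqrt(variance)
    return 0.0


# Hypothetical binned counts: a symmetric burst and a steady decline.
burst = [1, 2, 3, 4, 3, 2, 1]
steady_decline = [7, 6, 5, 4, 3, 2, 1]

print(mann_kendall_s(burst), mann_kendall_z(burst))
print(mann_kendall_s(steady_decline), mann_kendall_z(steady_decline))
```

In the burst series the rising half cancels the falling half, so S is zero and no trend is reported even though the second half clearly declines. The steadily declining series produces a strongly negative Z.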
A bar graph compares average agreement scores for engagement and disclosure over different collection dates. The horizontal axis represents the collection dates from 2023-09-12 to 2023-09-22, and the vertical axis represents the average agreement score ranging from 0 to 2.5. The graph features two sets of bars for each date: one in purple representing engagement average agreement and one in green representing disclosure average agreement. Notable trends include a peak in disclosure average agreement on 2023-09-16 and relatively consistent engagement average agreement scores across most dates. The highest engagement average agreement is observed on 2023-09-20, while the lowest disclosure average agreement is on 2023-09-18.
Agreement on declining trend detections across detection methods
A radar chart with four axes labeled CUSUM Drift, Mann-Kendall, ARIMA, and Change Point Decline. The chart features two data series: Engagements in purple and Disclosures in green. The axes are marked with values ranging from 0 to 200. The Engagements series shows higher values on the Change Point Decline axis, while the Disclosures series shows higher values on the Mann-Kendall axis. Both series have lower values on the CUSUM Drift and ARIMA axes.
Aggregate method sensitivity across conversations
Correlation analysis
Reddit data clusters were more comprehensive than Twitter's, with the largest containing 7,470 nodes (September 16th) and the second largest 6,672 (September 19th). Of the 285 clusters, 210 contained more than one node, 183 had at least 30 nodes, 171 had at least 100 and only 30 (10.53%) exceeded 1,000. A baseline of 30 nodes was established to address sample size issues after time-of-post adjustments. Table 5 summarizes correlation results across varying size thresholds.
Main study correlation results using dataset segmented by cluster size
| Variable pairing | Run 1 (N = 183) | Run 2 (N = 182) | Run 3 (N = 30) |
|---|---|---|---|
| Influencer Tier & Engagements | 0.1952** | 0.1763* | −0.3025 |
| Influencer Tier & Disclosures | 0.1425 | 0.1223 | −0.0837 |
| Influence Power & Engagements | 0.5728*** | 0.5658*** | 0.4972** |
| Influence Power & Disclosures | 0.4646*** | 0.4559*** | 0.0527 |
| Time Elapsed & Engagements | 0.1608* | 0.1677* | −0.1880 |
| Time Elapsed & Disclosures | 0.1645* | 0.1711* | −0.1782 |
Note(s): Asterisks denote p-value. *p < 0.05, **p < 0.01 and ***p < 0.001
Anderson-Darling tests confirmed non-normal distributions, necessitating Spearman's Rank Correlation across three robustness analyses. Run 1 (n = 183, 30-node minimum) ensured adequate sample size for time series resampling. Run 2 (n = 182, 50-node threshold) increased statistical power while maintaining breadth, showing consistently positive coefficients (p < 0.05). Run 3 (n = 30, 1,000+ nodes) isolated the largest conversations to test whether fuller thread capture strengthened effects. Influence power maintained significant correlations with engagement (p < 0.01), but influencer tier and time elapsed pairings showed negative coefficients with reduced significance, suggesting measurement challenges at this threshold.
Discussion
Platform architecture and model transferability
The unplanned platform transition from Twitter/X to Reddit due to API access changes created non-equivalent comparison groups with different sampling methods and user populations. However, this pivot revealed how platform architecture may shape influencer disclosure relationships. Twitter/X's follower hierarchy model showed clear correlations between influencer tier and both engagement (r = 0.26, p < 0.01) and disclosures (r = 0.20, p < 0.01), while Reddit's community-first architecture, where users follow subreddits rather than individuals, showed weaker tier correlations.
This architectural difference created measurement challenges: Reddit's “active user count” (15-min interaction window) proved unstable for tier classification compared to Twitter/X's persistent follower counts. Despite this, influence power (eigenvector centrality) maintained significant positive correlations with both engagement and disclosures across platforms, suggesting network position matters regardless of architecture. These exploratory observations suggest community-based platforms may diffuse influencer power differently than follower hierarchy models, but designed studies with equivalent sampling are needed to confirm this. The Mann–Kendall test contributed no decline detections, consistent with research showing monotonic trend tests are poorly suited to bursty social media patterns (Lehmann, Gonçalves, Ramasco, & Cattuto, 2012; Mathioudakis & Koudas, 2010), supporting the need for adaptive methods like CPD and ARIMA in social media analysis. Reddit's fuller conversation threads also made it more straightforward to observe engagement fluctuations and eventual decline, with more complete and numerous clusters facilitating the time series analyses. Using active user counts for Reddit tier classification introduces an additional reliability concern beyond architectural non-equivalence: these counts capture a 15-min interaction window and fluctuate substantially with time of day and algorithmic promotion, making tier assignments transient rather than stable proxies for audience size; subscriber counts or longitudinally averaged activity metrics would provide more consistent classifications in future studies.
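The instability of snapshot-based tier assignment can be illustrated with a short sketch. The tier names, cut-offs and active-user counts below are entirely hypothetical and are not the study's tier definitions. They only show how a count hovering near a boundary can flip tiers between 15-min snapshots while an averaged count yields a single assignment.

```python
# HYPOTHETICAL tier cut-offs, for illustration only (not the study's tiers).
TIERS = [(100_000, "high"), (10_000, "middle"), (0, "low")]


def tier_for(count):
    """Return the first tier whose minimum the count meets or exceeds."""
    for minimum, name in TIERS:
        if count >= minimum:
            return name
    return TIERS[-1][1]


# Hypothetical active-user counts from successive 15-min snapshots.
snapshots = [8_500, 12_400, 9_100, 15_000, 7_800]

instant_tiers = [tier_for(count) for count in snapshots]
averaged_tier = tier_for(sum(snapshots) / len(snapshots))

print(instant_tiers)   # tier assigned from each snapshot separately
print(averaged_tier)   # tier assigned from the averaged count
```

Averaging trades responsiveness for stability, so the choice of averaging window would itself need justification in a designed study.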
Hypothesis testing and results
The dataset showed diverse correlations, with filtering revealing predominantly positive relationships among variable pairings. Some results may have limited reliability due to the mixed-media nature of threads (text, images, videos) and limited understanding of community context.
On Reddit, community groups rather than individual users are emphasized. Individual subscriber data were limited, so the active member count (accounts interacting within a 15-min window) was used instead. However, this count fluctuates over time, making influencer tier classification transient and less reliable than Twitter/X's persistent follower counts.
In the Twitter/X corpus, significant correlations (p < 0.01) were found between influencer tier and both engagements (r = 0.2569) and disclosures (r = 0.2001). Reddit's correlations were positive (r = 0.3887, r = 0.3133) but both p-values exceeded 0.05. Given these measurement challenges, influencer tier results (H3, H4) should be interpreted cautiously, and the cross-platform comparison suggests tier effects may be platform-dependent, with stronger effects in follower hierarchy architectures. Researchers studying influencer tiers on community-based platforms may not benefit from the tier assignment method used here.
For H1 and H2, influence power showed positive correlations with engagements and PII disclosures in the Reddit dataset, though variations may exist between image/video-initiated and text-initiated clusters. In the Twitter/X dataset, this was confirmed only after filtering smaller clusters, as eigenvector centrality values were significantly affected by cluster size.
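For readers unfamiliar with the measure, eigenvector centrality can be sketched with a plain power iteration. This is a minimal illustration under stated assumptions rather than the study's pipeline: the graph is a hypothetical five-node star, the function name is illustrative, and the iteration is applied to the adjacency matrix plus the identity so that it converges on bipartite graphs such as a star.

```python
from math import sqrt


def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration on (A + I), with scores normalized to unit length."""
    nodes = list(adjacency)
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Each node keeps its own score and adds its neighbors' scores.
        updated = {
            node: score[node] + sum(score[nbr] for nbr in adjacency[node])
            for node in nodes
        }
        norm = sqrt(sum(value * value for value in updated.values()))
        score = {node: value / norm for node, value in updated.items()}
    return score


# Hypothetical conversation cluster: one original poster with four repliers.
star = {
    "op": ["a", "b", "c", "d"],
    "a": ["op"],
    "b": ["op"],
    "c": ["op"],
    "d": ["op"],
}
centrality = eigenvector_centrality(star)
top_node = max(centrality, key=centrality.get)
print(top_node, round(centrality["op"], 4), round(centrality["a"], 4))
```

Scores are normalized within each graph, so a node's value depends on how many other nodes share the total. This may be one reason the centrality values were sensitive to cluster size.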
For H5 and H6, time series analysis indicated post-peak declines for both engagements and disclosures, with 79.78% of Reddit clusters showing engagement declines and 66.12% showing disclosure declines. Among declining clusters, roughly two in five (61 of 146 for engagements and 50 of 121 for disclosures) achieved consensus across at least three trend detection methods. These hypotheses are supported, though caution is advised given that many conversations were still in progress when data collection ended.
Across all analyses, Twitter/X showed significant relationships for influence power with engagements and disclosures after filtering smaller clusters, while Reddit showed predominantly positive correlations at p < 0.05 or p < 0.01 across variable pairings. ARIMA analysis confirmed descending trends in most Reddit clusters and approximately half of the pilot dataset. Table 6 summarizes hypotheses and associated results per phase.
Hypothesis testing and results summary
| Hypothesis | Result |
|---|---|
| H1: Influence Power positively correlates with follower engagements | Supported |
| H2: Influence Power positively correlates with PII disclosure detections | Supported |
| H3: Influencer Tier positively correlates with follower engagements | Supported* |
| H4: Influencer Tier positively correlates with PII disclosure detections | Supported* |
| H5: As time elapses, the engagements in a cluster will decrease | Supported |
| H6: As time elapses, the PII disclosure detections in a cluster will decrease | Supported |
Note(s): Asterisks (*) denote reliability concerns arising from platform architectural differences; caution is advised
Limitations
This study faced several limitations. Firstly, the focus started exclusively on a single social media platform, X (Twitter), which underwent various changes during the data collection, leading to issues such as the unavailability of full conversation histories and reduced post availability due to updated API limits. The analysis expanded to Reddit's community-based architecture, providing exploratory cross-platform insights while introducing measurement challenges for influencer tier classification.
Additionally, the study relied solely on eigenvector centrality to measure influence, thereby overlooking factors such as community involvement and cross-platform activity. Manual validation of automated PII detection was not feasible due to the dataset size. As with any NER-based approach, automated detection introduces false positives and false negatives (Macri et al., 2023). Published evaluations of Microsoft Presidio report variable performance depending on configuration and domain, with precision estimates ranging from approximately 0.51 in baseline clinical evaluations to about 0.89 in customized implementations (Alrazihi, Biswas, & George, 2025; Kotevski et al., 2022).
To provide conservative uncertainty bounds using parameters reported within a single empirical evaluation, precision and recall estimates from the baseline study (precision = 0.51, recall = 0.74) were applied to the observed disclosure rate of 29.41%. This yields a precision-adjusted lower estimate of approximately 15.00%, representing confirmed true detections, and a recall-corrected prevalence estimate of approximately 20.27% that accounts for likely missed cases. Although the observed rate may overestimate absolute prevalence under conservative assumptions, the dominant PII categories identified in this study, including person names, datetime expressions and professional identifiers, are commonly reported in NER research to achieve comparatively higher precision due to stronger contextual cues (Macri et al., 2023). Since the same detection pipeline was consistently used throughout the dataset, it would be expected to introduce predominantly non-differential measurement error, which may reduce effect sizes but is unlikely to invalidate the correlational findings.
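The arithmetic behind these bounds can be reproduced in a few lines. The observed rate, precision and recall are the values stated above; the variable names are illustrative, and treating a single precision and recall pair as applying uniformly to every post is the simplifying assumption of the sensitivity analysis.

```python
# Values stated in the text.
observed_rate = 0.2941   # share of posts flagged as containing PII
precision = 0.51         # baseline precision estimate
recall = 0.74            # baseline recall estimate

# Share of posts expected to be true detections among those flagged.
precision_adjusted = observed_rate * precision

# Scale the true detections up to account for cases the detector missed.
recall_corrected = precision_adjusted / recall

print(f"precision-adjusted lower estimate: {precision_adjusted * 100:.2f}%")
print(f"recall-corrected prevalence:       {recall_corrected * 100:.2f}%")
```

The two printed figures match the 15.00% and 20.27% bounds reported in the text.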
Data collection was intentionally limited to 7 AM through 10 PM Mountain Standard Time to allow for human oversight during data pulls, excluding overnight trending submissions. Reddit's “hot submissions” feature was selected as the closest analog to Twitter/X's trending conversations to maintain methodological consistency across platforms despite architectural differences. A submission's hot score is calculated from the number of upvotes versus downvotes it has received (Kuutila, Rantala, Li, Hosio, & Mäntylä, 2024). This approach resulted in some clusters being captured before peak activity, potentially obscuring full trend patterns. The study also did not account for varying content types in trending posts or the impact of post length differences, as Reddit allows much longer posts compared to X (Twitter).
Finally, the study was completed without obtaining input from platform users about why they were interacting with a post or user. An individual's motivations to engage will differ throughout their time on the platform. Because such input was not collected alongside the observed behavior, the study could not account for individual differences, motivators and psychological factors.
Implications for theory and practice
This study shows a connection between influencer reach, measured by eigenvector centrality, and the sharing of PII on social media. As reach grows, so does the chance for engagement and information disclosure, with an average of 29.41% of 122,904 posts containing PII (adjusted estimate: 15.00–20.27% after accounting for automated detection errors) in higher-interaction threads. Though viral discussions can start from non-influencers and platform algorithms may also play a role, influencers and marketers should be aware that some exchanges might put their followers' privacy at risk. Importantly, it was not determined whether conversation starters actively encouraged sharing or whether sensitive data were exchanged through media. These disclosures were made publicly by users; while regulations like GDPR and CCPA control how platforms handle data, user-initiated public disclosures create unique privacy challenges that go beyond current laws.
The observed behavioral shifts can be understood through Social Learning and Social Identity theories. Social Learning Theory suggests influencers can be targeted to model desirable privacy behaviors, while also accounting for undesirable disclosure patterns. Social Identity Theory suggests that group identity and shared norms-based campaigns can drive privacy-aware behaviors as part of a community's core values. Platform algorithms that prioritize high-engagement content may inadvertently amplify PII-containing threads, compounding individual privacy risks through increased visibility. Users must adopt privacy-conscious practices to reduce exposure to fraud or discrimination, and targeted campaigns (influencer-led or otherwise) can address these challenges when traditional educational efforts fall short.
A between-subjects experiment could assign participants to simulated social media threads where an influencer either discloses or withholds PII (independent variable 1: influencer disclosure condition) across hierarchical and community-based platform designs (independent variable 2: platform architecture). This design directly operationalizes the two key variables that emerged from the observational findings, providing the controlled, comparable sampling test needed to establish causal inference where the unplanned platform transition could not. Participant disclosure behavior, measured by the number and sensitivity of PII items shared in a mock posting task, would serve as the primary dependent variable, with a secondary self-report measure of perceived disclosure norms to capture attitudinal shifts. Based on the present findings, we predict that influencer disclosure increases follower PII sharing and that this effect is stronger under hierarchical designs, consistent with the observed attenuation on Reddit relative to Twitter/X.
Conclusion
This study examined how social media influencers impact followers' PII disclosures, revealing that influence power (eigenvector centrality) significantly and positively correlates with both engagement and disclosure rates across platforms (RQ1). Influencers with greater network reach drove higher engagement and disclosure rates, with 29.41% of posts in the Reddit dataset containing PII (adjusted estimate: 15.00% to 20.27% after accounting for automated detection error). Temporal analysis showed declining trends in both engagement (79.78% of threads) and disclosures (66.12% of threads), supporting H5 and H6.
The unplanned platform transition from Twitter/X to Reddit provided exploratory insights into how platform architecture moderates influencer effects (RQ2). Twitter/X's follower hierarchy model showed stronger correlations between influencer tier and disclosures (r = 0.26, p < 0.01) compared to Reddit's community-based structure, suggesting community-driven platforms may inherently diffuse influencer power. However, influence power maintained significant positive correlations across both architectures, indicating network position matters regardless of platform design. Temporal patterns were examined using a triangulated approach combining ARIMA, CUSUM Drift and CPD, each capturing distinct aspects of nonlinear, bursty engagement data. Convergent findings across methods strengthen confidence in the observed trends and offer a replicable framework for short-window social media analysis. The Reddit tier findings (H3, H4) warrant caution, as active user counts used for tier classification fluctuate within short windows, producing transient assignments less stable than Twitter/X's persistent follower counts. Likewise, the 29.41% PII disclosure rate reflects automated NER output, and the precision-adjusted estimate of 15.00% to 20.27% is the more defensible figure for downstream policy or design recommendations.
Future research should use designed platform comparisons with equivalent sampling to test whether community-based architectures offer privacy-protective effects. Longitudinal studies with complete conversation threads would clarify patterns of temporal decay, and enhanced PII detection using machine learning classifiers beyond NER could lower false positive rates while maintaining PII-Codex's categorization framework. Including user motivations through survey data would complement behavioral observations, addressing why users disclose PII despite privacy concerns. Standardizing influencer tier classification on community-based platforms through subscriber counts or longitudinally averaged activity metrics, rather than instantaneous active user counts, would also resolve the measurement instability identified here and enable more reliable cross-platform comparisons.
Ethics statement
This research used publicly accessible social media posts from Twitter/X and Reddit. Raw post content was not retained following analysis. All datasets (Rosado, 2024) were sanitized by replacing original platform identifiers such as user IDs, post IDs, usernames and similar identifiers with unique internal identifiers to prevent re-identification. No IRB approval was required as the study involved no direct interaction with human participants.

