Synthetic maintenance data generation for industrial assets based on historic statistical distribution using pseudo-random algorithm

Diaz Vivas, Sebastian; Tarantino Alvarado, Rocco; Aranguren Zambrano, Sandra; Tabares Pozos, Alejandra

doi:10.1108/JQME-03-2025-0020

Purpose

The article addresses the partial or complete absence of maintenance data records for industrial assets by generating synthetic maintenance data under the high-quality maintenance data structure established in International Organization for Standardization (ISO) 14224:2016. It contributes to maintenance engineering a strategy for obtaining meaningful synthetic data for maintenance management analysis without exposing industrial assets to failures that may lead to undesired consequences.

Design/methodology/approach

The research was conducted as an experimental study aimed at generating synthetic maintenance data from historical statistical distributions of industrial assets. Based on the criticality of the studied process context, the experiment was carried out on a centrifugal pump, with the Offshore Reliability Data Handbook (OREDA) as the primary data source. From OREDA, the four failure modes with the highest failure rate were selected, together with the non-maintainable components associated with each failure mode by occurrence probability. The data were processed in Python 3.10.12 following a methodology that standardizes the data structure, for which pseudocode was established.

Findings

The article addresses the generation of synthetic maintenance data using historical statistical distributions from OREDA. Two synthetic datasets were obtained for a centrifugal pump. The second set defines the maximum failure rate as the mean of the global failure rate of the reference data, and its mean failure rate matches the reference with an error of 1.96%. This approach supports objective decision-making when forecasting different scenarios, because the synthetic dataset acquires its own dynamics, which depend on the statistical distribution of the failure rate by failure mode, as evidenced by the error in the standard deviation.

Originality/value

The article focuses on generating synthetic maintenance data by developing an algorithm based on internationally recognized statistical distributions aligned with the international standards of ISO 14224:2016. This approach aims to create a synthetic maintenance dataset with maintenance records from which maintenance variables and indicators can be derived. These derived insights enable maintenance optimization through data-driven decision-making feedback loops.

1. Introduction

Technological evolution has transformed maintenance management into a critical function for industrial competitiveness, leading companies to prioritize strategies that maximize asset performance and operational profitability (Mora, 2009). In this context, data has emerged as a pivotal industrial asset, enabling valuable insights and informed decision-making through diverse analytical strategies (Merkt, 2019; Bekar et al., 2020; Bousdekis et al., 2021; Filz et al., 2021; Sajid et al., 2021; Abbate et al., 2022; Cui et al., 2022). These data-driven approaches facilitate the iterative optimization of maintenance management via continuous feedback mechanisms (Ciliberti et al., 2019; Tarantino, 2021; Diaz et al., 2023).

However, a significant obstacle to implementing these advanced analytics is the frequent scarcity or complete absence of high-quality maintenance data records. This often forces a reliance on established, yet inherently limited, maintenance techniques like TPM, RCM, and FMEA (Mora, 2009; International Organization for Standardization, 2016). While foundational, such approaches fall short of the comprehensive, data-driven engineering analysis required for holistic asset management, which is necessary to optimize strategies without compromising assets, production, or safety (Jones, 1995; SAE International, 2009). This challenge is critical in high-consequence environments where fault impacts and data integration are vital (Hannam, 1997). A further complication involves ensuring data quality and reliability, which demands a well-planned acquisition process that prioritizes essential assets and protects their integrity (Díaz, 2023; Diaz et al., 2023; International Organization for Standardization, 2016).

To address this, ISO 14224:2016 outlines a structured framework for data acquisition. Figure 1 illustrates the planning procedure, which involves steps such as determining data acquisition processes, verifying and researching data sources, defining the maintenance data to acquire, setting time, population, and operation parameters, ensuring uniformity in failure definition and classification, and training staff. Subsequently, Figure 2 depicts the acquisition procedure, which includes accessing consequential data, interpreting data, transferring data to a database, and evaluating and analyzing gained insights.

Figure 1

A vertical flow diagram shows sequential steps in data acquisition and preparation.

The vertical flow starts at the top with a box labeled “Determining Data Acquisition Processes”. A downward arrow leads to the second box labeled “Verifying and Researching Data Sources”. A downward arrow then connects to the third box labeled “Determining Maintenance Data to Acquire”. From this box, another downward arrow leads to the fourth box labeled “Determining Time, Population, and Operation Parameters”. A further downward arrow connects to the fifth box labeled “Determining Uniformity in Failure Definition and Classification”. Finally, a downward arrow leads to the last box at the bottom labeled “Training Staff”.

Planning procedure for acquiring quality data according to ISO 14224:2016

Figure 2

A vertical flow diagram shows steps for acquiring quality data.

The vertical flow starts at the top with a box labeled “Accessing Consequential Data”. A downward arrow leads to the second box labeled “Interpreting Data”. Another downward arrow connects to the third box labeled “Data Transfer to Database”. From this box, a further downward arrow leads to the final box at the bottom labeled “Evaluating and Analysis Gained Insights”.

Procedure for acquiring quality data according to ISO 14224:2016

The escalating value of data as a convertible asset for business intelligence underscores the need to overcome inherent data issues, which can be categorized into concerns regarding privacy release and sensitivity, data bias and variance, the need for increased data robustness, and limited or zero data availability (Cole et al., 2015; Mannino and Abouzied, 2019; Dankar and Ibrahim, 2021; Jordon et al., 2022). Synthetic data emerges as a viable solution to these challenges, defined as artificially generated data that acquires the statistical and phenomenological properties of a real dataset (Dankar and Ibrahim, 2021; Jordon et al., 2022). Methodological advancements highlight that synthetic datasets vary in fitness for purpose, necessitating tailored evaluation protocols and practical metrics to enhance reproducibility (Lautrup et al., 2024; Giuffrè and Shung, 2023).

Synthetic data offers significant benefits, including privacy protection through the absence of personally identifiable information, improved accessibility by creating larger datasets, flexibility to replicate specific statistical characteristics, and the potential to enhance original data quality (El Emam et al., 2020). Various generation techniques exist, such as those based on Bayesian networks, copulas, parametric fitting (e.g. Monte Carlo methods), and non-parametric trees (Mannino and Abouzied, 2019; Dankar and Ibrahim, 2021; El Emam et al., 2020; Li et al., 2020; Jordon et al., 2022; Okagbue et al., 2020).

Within reliability and maintenance engineering, synthetic data generation employs diverse strategies, including machine learning methods such as generative adversarial networks (GANs) and mathematical functions that maintain statistical fidelity (Lakshmanan et al., 2023; Martínez-Heredia and Ventura, 2025). Contemporary research trends encompass digital twin frameworks for physics-informed data streams and Bayesian methods for uncertainty-aware reliability estimates, providing a context where a parameterized pseudo-random generator offers a reproducible baseline for transparent benchmarking (Zio and Miqueles, 2024; Liu et al., 2024; Pan et al., 2024; Zheng et al., 2024). To address the challenge of absent maintenance records, this work develops a pseudo-random algorithm grounded in the high-quality data attributes prescribed by ISO 14224:2016 (Diaz et al., 2023; Díaz, 2023) and leveraging statistical distributions from the OREDA database (SINTEF, 2009), where failure rates follow a gamma distribution. This methodology is intended to generate valid and practically useful synthetic maintenance data for industrial assets.

Applied to centrifugal pumps in an operational chemical engineering setting (specifically, the ethanol-water separation process at the University of Pamplona), our approach generated two distinct datasets ($X_{1}$ and $X_{2}$). Dataset $X_{2}$ demonstrated strong alignment with OREDA's mean failure rate (exhibiting only a 1.96% error), supporting its statistical fidelity. Despite higher deviations in standard deviation and maximum values, both datasets yield realistic maintenance records suitable for engineering analysis and decision-making without jeopardizing actual assets or operations.

2. Methodology

The generation of synthetic maintenance data requires a robust statistical foundation and a structured methodology. This study utilizes the OREDA database as its primary source for failure mode distributions, providing historically validated reliability data from major petrochemical companies (SINTEF, 2009). OREDA's structure organizes failure data by asset taxonomy, with statistical distributions categorized by failure mode severity (critical, degraded and incipient) and including key metrics such as failure rates, active repair hours, and man-hours.

For this research, the critical risk state was selected, with failure rates following a gamma distribution as confirmed by Kolmogorov-Smirnov and χ² goodness-of-fit tests (ibid.). The probability density function is given by:

f(x) = \frac{\left(\frac{θ}{\hat{σ}}\right)^{\frac{θ^{2}}{\hat{σ}}}}{Γ\left(\frac{θ^{2}}{\hat{σ}}\right)} x^{\frac{θ^{2}}{\hat{σ}} - 1} e^{-\frac{θ}{\hat{σ}} x},

(1)

and its cumulative distribution function by:

F (x) = \frac{γ (\frac{θ^{2}}{\hat{σ}}, \frac{θ}{\hat{σ}} x)}{Γ (\frac{θ^{2}}{\hat{σ}})},

(2)

where θ represents the mean failure rate and $\hat{σ}$ the standard deviation.
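As a rough numerical illustration of (1) and (2), the sketch below evaluates the density and the cumulative function using only the Python standard library. With shape θ²/σ̂ and rate θ/σ̂, the distribution has mean θ and variance σ̂, so the sketch passes the squared standard deviation as the spread term; that reading is an assumption. The function names are hypothetical, and the input values are taken from the BRD row of Table 1 purely as an example.

```python
import math


def gamma_shape_rate(mean, variance):
    """Shape and rate in the form used by (1): mean^2/variance and mean/variance."""
    return mean ** 2 / variance, mean / variance


def gamma_pdf(x, shape, rate):
    """Density (1), evaluated in log space to avoid overflow."""
    if x <= 0:
        return 0.0
    log_f = (shape * math.log(rate) - math.lgamma(shape)
             + (shape - 1.0) * math.log(x) - rate * x)
    return math.exp(log_f)


def gamma_cdf(x, shape, rate, max_terms=1000):
    """CDF (2) via the power series of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = rate * x
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= z / (shape + n)
        total += term
        if term < 1e-17 * total:
            break
    scale = math.exp(shape * math.log(z) - z - math.lgamma(shape))
    return min(1.0, scale * total)


# Illustrative input: BRD row of Table 1 (mean 2.45, standard deviation 1.63).
shape, rate = gamma_shape_rate(2.45, 1.63 ** 2)
print(f"shape={shape:.4f}, rate={rate:.4f}")
print(f"f(2.45)={gamma_pdf(2.45, shape, rate):.4f}, "
      f"F(2.45)={gamma_cdf(2.45, shape, rate):.4f}")
```

The series form of the incomplete gamma function is chosen only to keep the example dependency-free; a statistics library would normally be used instead.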

The methodology also incorporates probability distributions for failure modes relative to non-maintainable components, ensuring the total probability sums to 100% as expressed by:

\sum_{i = 1}^{n} \sum_{j = 1}^{m} X_{ji} = 100\%,

(3)

where j denotes failure modes, i non-maintainable components, and $X_{ji}$ the occurrence probability.
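A check of constraint (3) could be sketched as follows. The table entries are made-up placeholders rather than OREDA values, and the nested-dictionary layout and helper names are assumptions introduced for illustration.

```python
import math

# Hypothetical joint occurrence probabilities X_ji in percent:
# outer key = failure mode j, inner key = non-maintainable component i.
occurrence = {
    "MODE_A": {"component_1": 30.0, "component_2": 10.0},
    "MODE_B": {"component_1": 0.0, "component_2": 45.0},
    "MODE_C": {"component_1": 15.0, "component_2": 0.0},
}


def total_probability(table):
    """Double sum over failure modes and components, as in (3)."""
    return math.fsum(p for row in table.values() for p in row.values())


def satisfies_constraint(table, tol=1e-9):
    """True when every entry is non-negative and the total equals 100 %."""
    entries = [p for row in table.values() for p in row.values()]
    return all(p >= 0.0 for p in entries) and abs(total_probability(table) - 100.0) <= tol


print(total_probability(occurrence), satisfies_constraint(occurrence))
```

Running such a check before sampling would catch a probability table that was transcribed incompletely.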

The synthetic data generation workflow, depicted in Figure 3, is implemented through Algorithm 1. This algorithm details the modular steps from statistical initialization to dataset cleaning, ensuring reproducibility and transparency.

Figure 3

A vertical flow diagram illustrates the methodology for generating synthetic maintenance data.

The vertical flowchart begins at the top with a small oval labeled “Start”. A downward arrow leads to a parallelogram labeled “Data Structure Based on Dictionaries”. Another downward arrow connects to a parallelogram labeled “Static Operational Data”. From this box, a downward arrow leads to a parallelogram labeled “Iterate equals 0, Maximum Failure Rate Area, Number of Desired Records”. A downward arrow then leads to a rectangular process box labeled “Generate Pseudo-randomly Failure Rate by Failure Mode”. This is followed by another rectangular box labeled “Pseudo-random Selection of Failure Mode”. A downward arrow connects to the next rectangular box labeled “Pseudo-random Selection of Maintenance Required Occurrence”. Another downward arrow leads to a rectangular box labeled “Generate Pseudo-randomly of Maintenance Required Data”. Below this, a rectangular box is labeled “Pseudo-random Selection of Maintenance Type (Unplanned)”. A downward arrow then leads to a rectangular box labeled “Pseudo-random generation of Active Maintenance Hours”. The next rectangular box below is labeled “Maintenance Cost Generation”. A further downward arrow connects to a rectangular box labeled “Add Maintenance Record to the Storage Dataset”. From this box, a downward arrow leads to a diamond-shaped decision box labeled “Iterate less than Number of Desired Records”. The decision has two branches: the “No” branch on the left continues downward to a rectangular box labeled “Maintenance Window Calculation”, while the “Yes” branch on the right loops back upward to “Generate Pseudo-randomly Failure Rate by Failure Mode”. A downward arrow from “Maintenance Window Calculation” then leads to a rectangular box labeled “Filter Dataset”. Below this, another rectangular box is labeled “Asset Failure Rate Calculation”. A downward arrow connects to a second diamond-shaped decision box labeled “lambda less than lambda subscript max”. From this decision, the “No” branch loops back upward to “Iterate equals 0, Maximum Failure Rate Area, Number of Desired Records”, while the “Yes” branch continues downward to a small oval at the bottom labeled “End”.

Methodology for synthetic maintenance data generation

Algorithm 1.

Synthetic Data Generation for Maintenance Asset Records

Function SyntheticDataGeneration()
    Input: Failure rate distribution parameters, component costs, maintenance parameters
    Output: Synthetic maintenance dataset including Failure Mode, Component, TBF, Maintenance Cost, etc.
    # Step 1: Initialization
    Define constants and distributions (e.g. mean and standard deviation for failure rates)
    Initialize data structures for storing generated records
    # Step 2: Data Sampling
    Generate random values [r₁, r₂, r₃, r₄, r₅] from N(0, 1)
    # Step 3: Failure Mode and Component Selection
    Sample Failure Mode and Component using probabilistic sampling with r₁
    Sample Failure Description using probabilistic sampling with r₂
    # Step 4: Failure Rate and TBF
    Compute parameters $β = \frac{μ}{σ}$ and $α = β \cdot μ$ based on failure rate distributions
    Calculate Failure Rate $λ = F_{X}^{-1}(r_{3}, β, α)$
    Calculate $TBF = -\ln(\frac{r_{4}}{λ})$
    # Step 5: Dataset Filtering and Adjustments
    while $\frac{1}{λ_{max}} \leq TBF \leq λ_{min}$ do
        Filter dataset records based on TBF constraints
    end while
    Select subset n₂ from n₁ records
    # Step 6: Calculate Maintenance Costs
    Calculate Failure Date = Start Operation Date + TBF
    Sort the dataset by Failure Date in ascending order
    Calculate ManHours = $F_{X}^{-1}(r_{5}, μ, σ)$ based on manhour distribution
    Compute Maintenance Cost = Component Cost[Component] + (ManHours × ManHour Cost)
    Update TBF = Current Failure Date - Last Failure Date (in hours)
    Calculate Downtime = ManHours + (1 + Average Administrative Time Percentage)
    # Step 7: Repeat and Clean Dataset
    while Additional Filtering Needed do
        Reapply filters to the dataset
    end while
    Clean dataset to finalize output
    Return Synthetic maintenance dataset

Initialization: The process begins with defining the key statistical distributions and parameters, such as the mean and variance for failure rates and component costs. These parameters will shape the random variables generated later in the process. Next, data structures are initialized to store records that will eventually contain attributes such as failure modes, components, Time Between Failures (TBF), and maintenance costs.

Sampling Random Values: Once initialization is complete, random values are generated to simulate various aspects of the maintenance process. Values such as r₁, r₂, r₃, r₄, and r₅ are drawn from a standard normal distribution, N(0, 1), which will later be used in probabilistic sampling.

Failure Mode and Component Selection: Using the generated random values, failure modes and components are selected for each record. The value r₁ is applied within cumulative probability distributions for failure modes and components, allowing for probabilistic sampling. Similarly, r₂ is used to choose a failure description based on the cumulative probabilities associated with each failure mode.
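The cumulative-probability selection described here can be sketched as a roulette-wheel lookup. The sketch assumes the random value is a uniform draw on [0, 1), since a cumulative probability table can only be indexed by a number in that range; the weights and function names are hypothetical and are not taken from OREDA.

```python
import bisect
import itertools
import random


def build_cumulative(weights):
    """Return the labels and their normalized cumulative probabilities."""
    labels = list(weights)
    total = sum(weights.values())
    cumulative = list(itertools.accumulate(weights[k] / total for k in labels))
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return labels, cumulative


def select(labels, cumulative, r):
    """Pick the first label whose cumulative probability exceeds r."""
    return labels[bisect.bisect_right(cumulative, r)]


# Hypothetical weights for the four failure modes of the case study.
mode_weights = {"BRD": 0.25, "VIB": 0.35, "ELU": 0.30, "AIR": 0.10}
labels, cumulative = build_cumulative(mode_weights)

rng = random.Random(42)  # fixed seed so the run is reproducible
r1 = rng.random()        # uniform draw on [0, 1)
print(r1, select(labels, cumulative, r1))
```

The same lookup could be reused for the component and the failure description by swapping in the corresponding conditional probability table.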

Compute Failure Rate and TBF: After selecting the failure mode and component, the next step is to compute the failure rate (λ) and TBF for each record. Parameters β and α are calculated based on the mean and variance of the failure rate distributions. Using these parameters along with r₃, the failure rate λ is determined by inverting the cumulative distribution function, $F_{X}^{-1}(r_{3}, β, α)$. The TBF is then calculated as $TBF = -\ln(\frac{r_{4}}{λ})$.
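A minimal sketch of this step, using only the standard library, is given below. It rests on several assumptions that go beyond the text: r₃ and r₄ are treated as uniform draws on (0, 1) because an inverse CDF takes a probability as its argument; the TBF uses the textbook exponential inverse transform −ln(r₄)/λ, which is one reading of the expression above; λ is taken in failures per 10⁶ h and converted to hours; and the inverse CDF is obtained by bisection. All function names are hypothetical, and the BRD values from Table 1 serve only as example input.

```python
import math
import random


def gamma_cdf(x, shape, rate, max_terms=2000):
    """Gamma CDF via the series of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = rate * x
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= z / (shape + n)
        total += term
        if term < 1e-17 * total:
            break
    scale = math.exp(shape * math.log(z) - z - math.lgamma(shape))
    return min(1.0, scale * total)


def gamma_ppf(p, shape, rate, tol=1e-10):
    """Inverse CDF by bisection for 0 < p < 1."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape, rate) < p:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tbf_hours(r, rate_per_1e6_h):
    """Exponential inverse transform -ln(r)/lambda, converted to hours."""
    return -math.log(r) / rate_per_1e6_h * 1e6


def open_uniform(rng):
    """Uniform draw strictly inside (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


mean, variance = 2.45, 1.63 ** 2   # BRD row of Table 1; variance = std^2 (assumed)
beta = mean / variance             # rate parameter
alpha = beta * mean                # shape parameter

rng = random.Random(2024)          # fixed seed for reproducibility
r3, r4 = open_uniform(rng), open_uniform(rng)
failure_rate = gamma_ppf(r3, alpha, beta)   # failures per 10^6 h
print(f"lambda={failure_rate:.4f} per 1e6 h, TBF={tbf_hours(r4, failure_rate):.1f} h")
```

Bisection is slower than a library quantile function but keeps the example self-contained and makes the inversion explicit.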

Dataset Filtering and Adjustments: With preliminary data generated, the dataset undergoes filtering based on the TBF values to ensure that records meet realistic operational limits. Only TBF values that fall within a specified range (between $\frac{1}{λ_{max}}$ and λ_min) are kept. This step ensures that the synthetic data aligns with practical operational expectations.
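The filter can be pictured with the short sketch below. It assumes that both limits are reciprocals of the bounding failure rates, that is, TBF is kept when it lies between 1/λ_max and 1/λ_min, with rates expressed per 10⁶ h; this is an interpretation rather than the authors' stated rule. The record layout and the TBF values are hypothetical, while the two rate limits are those given in (4).

```python
def filter_by_tbf(records, rate_min, rate_max, hours_per_unit=1e6):
    """Keep records whose TBF (hours) lies between 1/rate_max and 1/rate_min.

    Rates are in failures per `hours_per_unit` hours (10^6 h by default).
    """
    lower = hours_per_unit / rate_max
    upper = hours_per_unit / rate_min
    return [rec for rec in records if lower <= rec["tbf_h"] <= upper]


# Hypothetical records; only the TBF field matters for this step.
records = [
    {"id": 0, "tbf_h": 1200.0},
    {"id": 1, "tbf_h": 8000.0},
    {"id": 2, "tbf_h": 52000.0},
    {"id": 3, "tbf_h": 5.0e9},
]

kept = filter_by_tbf(records, rate_min=3e-4, rate_max=136.71)
print([rec["id"] for rec in kept])
```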

Calculate Maintenance Costs and Downtime: The maintenance cost and downtime associated with each failure are then calculated. The failure date is determined by adding TBF to the start operation date. Man-hours are computed using distribution-based calculations. Maintenance cost is calculated by combining the component cost with the product of man-hours and the man-hour cost rate. Downtime is calculated by adding man-hours and an additional administrative time percentage.
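The cost and downtime arithmetic can be illustrated as follows. The component prices are the Bearing and Casing entries of Table 3; the man-hour rate, the administrative fraction, and the start date are hypothetical. The sketch also assumes that the administrative percentage scales the man-hours multiplicatively, which is one possible reading of the description.

```python
from datetime import datetime, timedelta

# Unit prices in COP from Table 3 (two entries only).
component_cost = {"Bearing": 150_000.0, "Casing": 50_000.0}


def failure_date(start, tbf_h):
    """Failure date = start of operation + time between failures (hours)."""
    return start + timedelta(hours=tbf_h)


def maintenance_cost(component, man_hours, man_hour_cost):
    """Component price plus labor cost."""
    return component_cost[component] + man_hours * man_hour_cost


def downtime(man_hours, admin_fraction):
    """Man-hours inflated by the administrative time fraction (assumed form)."""
    return man_hours * (1.0 + admin_fraction)


start = datetime(2020, 1, 1)   # hypothetical start of operation
man_hour_cost = 8_000.0        # hypothetical labor rate in COP per hour
admin_fraction = 0.10          # hypothetical administrative overhead

print(failure_date(start, 3273.45))
print(maintenance_cost("Casing", 15.0, man_hour_cost))
print(downtime(15.0, admin_fraction))
```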

Final Filtering and Dataset Cleaning: After calculating costs and downtime, the dataset undergoes a final filtering and cleaning stage. Additional filters are reapplied if necessary to ensure that all records meet the predefined conditions. The dataset is then organized and prepared for output.

This structured process yields a synthetic maintenance dataset with realistic attributes, including varied failure modes, components, times between failures, costs, and downtimes, based on statistically grounded parameters and distributions.

3. Case study and results

3.1 Operational context

The study was carried out in the chemical engineering laboratory at the University of Pamplona, where the ethanol-water mixture separation subprocess was selected. The subprocess centers on a plate column with liquid feed and direct steam injection; the top product can be withdrawn as vapor or liquid through a condenser that can operate as a total or partial condenser. The column feed can come from a feed tank or from the rectification column (see Figure 4).

Figure 4

A diagram presents a schematic of a boiler steam and heat exchange process.

The diagram shows a detailed process piping and instrumentation layout arranged horizontally, with multiple process units, pipelines, valves, instruments, and control loops interconnected. On the left side, a heat exchanger labeled “INTERCAMBIADORE-300” is shown at the top. An inlet line labeled “INVINODEFERMENTACIÓN” enters the system and passes through control and manual valves, including valves labeled “G V-202” and “G V-201”, before entering a vertical storage tank labeled “T K-400, 250 L”. The tank includes level instrumentation with a level transmitter labeled “L T 400” and a level controller labeled “L C 400”, connected by dashed signal lines to a set point labeled “S P”. Multiple inlet and outlet nozzles are shown on the tank, each fitted with valves labeled “B V-101”, “B V-109”, and “B V-104”. Below the left section, a pump labeled “P-400” is connected via pipelines and valves, including “G V-207” and “G V-205”. A speed controller labeled “S C 400” is shown connected to the pump with dashed control lines and a set point indicator. Further down, a second pump labeled “P-405” is shown with a corresponding speed controller labeled “S C 405”, also connected by dashed signal lines and a set point. In the central section, a large header labeled “VAPORDECALDERA” runs horizontally across the diagram, representing the boiler steam line. Along this line are several control and measurement instruments, including a flow controller labeled “F C-300”, a flow transmitter labeled “F T-300”, and temperature instruments labeled “T C-101” and “T E-101”. A control valve assembly highlighted in the diagram includes valves labeled “B V-109”, “F C V-100”, “B V-110”, and “G V-204”, connected in series on the steam line. On the right side, a tall vertical column labeled “C-100” is shown. The column includes multiple side connections with valves and temperature elements labeled “T E-103”, as well as a level transmitter labeled “L T-100” near the lower section. 
A level controller labeled “L C-100” is connected to the column by dashed control lines and a set point labeled “S P”. At the lower right, a storage tank labeled “T K-405, 250 L” is shown. The tank includes a level transmitter labeled “L T-405” and an outlet valve labeled “B V-111”. The tank is connected to upstream process lines via valves and piping.

Process P&ID

Based on an analysis conducted during the research, the high-criticality assets involved in the subprocess are the centrifugal pumps P-400 and P-405, shown in the P&ID of Figure 4.

Four failure modes that are common in these assets were likewise selected for the study: External Leak (ELU), Abnormal Instrument Reading (AIR), Vibration (VIB), and Breakdown (BRD) ^[1].

For this reason, a synthetic maintenance dataset for centrifugal pumps with these four failure modes was generated using the methodology and the algorithm presented in Section 2.

3.2 Statistical information characterization

OREDA provides the statistical information as statistical distributions per failure mode and percentage distributions per non-maintainable component (SINTEF, 2009, pp. 138-145).

To satisfy (2) so that the data resemble actual behavior, the limits and parameters in (4) are taken from the statistical distribution of the overall failure rate that OREDA reports for the critical state of the asset. The values are failure rates per million hours (10⁶ h).

\begin{align} λ \in [Λ_{\min}; Λ_{\max}] : Λ_{\min} & = 3 \times 10^{-4}, \\ Λ_{\max} & = 136.71, \\ μ & = 28.08, \\ σ & = 56.95. \end{align}

(4)

These distributions were characterized in tabular form, which allowed the data structures used by the algorithms presented in this article to be defined.

Table 1 shows the statistical distribution of failure rates for the selected failure modes of the asset.

Table 1

Statistical distribution of failure rate by failure mode

Failure mode	min	μ	max	σ
BRD	0.51	2.45	5.59	1.63
VIB	0.51	3.82	9.68	3.0
ELU	0.0	3.24	11.62	15.60
AIR	0.0	0.31	1.72	0.89
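One way to hold Table 1 in the dictionary-based data structure named in Figure 3 is sketched below. The key names and the helper are hypothetical; the gamma parameters follow the algorithm's β = μ/σ and α = β·μ with the spread term taken as the variance (the squared σ column), which is an assumption.

```python
# Table 1 as a nested dictionary: failure mode -> summary statistics.
failure_rate_stats = {
    "BRD": {"min": 0.51, "mean": 2.45, "max": 5.59, "std": 1.63},
    "VIB": {"min": 0.51, "mean": 3.82, "max": 9.68, "std": 3.0},
    "ELU": {"min": 0.0, "mean": 3.24, "max": 11.62, "std": 15.60},
    "AIR": {"min": 0.0, "mean": 0.31, "max": 1.72, "std": 0.89},
}


def gamma_parameters(stats):
    """Return (alpha, beta): shape and rate matched to the mean and variance."""
    variance = stats["std"] ** 2
    beta = stats["mean"] / variance
    alpha = beta * stats["mean"]
    return alpha, beta


for mode, stats in failure_rate_stats.items():
    alpha, beta = gamma_parameters(stats)
    print(f"{mode}: alpha={alpha:.4f}, beta={beta:.4f}")
```

Keeping the statistics keyed by failure mode lets the later sampling steps look up the parameters of whichever mode was selected.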

Table 2 shows the probability of a failure mode caused by a non-maintainable asset component.

Table 2

Occurrence probability of failure mode per non-maintainable component

Failure mode	Non-maintainable component	Failure probability [0; 1]
BRD	Bearing	0
BRD	Cabling and junction boxes	0
BRD	Casing	0.41
BRD	Control unit	0
…	…	…
AIR	Wiring	0.89
AIR	Valves	0

Based on a market analysis, the replacement cost of each non-maintainable component of the asset was defined in Colombian pesos (COP) (see Table 3).

Table 3

Associated costs per non-maintainable component

Component	Unit price (COP)
Bearing	150,000.00
Cabling and junction boxes	100,000.00
Casing	50,000.00
Control unit	300,000.00
…	…
Wiring	30,000.00
Valves	40,000.00

Finally, Table 4 defines statistical information regarding the active maintenance hours required per failure mode of the asset.

Table 4

Active maintenance hours

Failure mode	μ	max
BRD	15	30
VIB	27	77
ELU	15	45
AIR	44	44

3.3 Synthetic maintenance dataset obtained

Two synthetic maintenance datasets were generated for the centrifugal pump under study based on the four identified failure modes and the provided statistical information. The datasets include maintenance records with relevant and non-redundant maintenance data, as required by ISO 14224:2016. They are referred to as $X_{1}$ and $X_{2}$, respectively, and both are included in the online Supplementary Material.

The two datasets were generated with different statistical properties. Dataset $X_{1}$ satisfies the ranges of the observed set in (4), while dataset $X_{2}$ follows the behavior defined in (5). Even so, dataset $X_{2}$ retains statistical characteristics that can be regarded as realistic for synthetic data, since it also complies with (4).

\begin{align} λ \in [Λ_{\min}; Λ_{\max}] : Λ_{\min} & = 3 \times 10^{-4}, \\ Λ_{\max} & = μ, \\ μ & = 28.08, \\ σ & = 56.95. \end{align}

(5)

During code execution, technological limitations made it impossible to increase the expected number of maintenance records (K) without affecting the expected statistical properties; the resulting datasets are described by (6) and (7).

X_{1} = \{x_{i, j} ∣ i = 1,2, \dots, n; j = 1,2, \dots, K; 2 \leq K \leq 140\}

(6)

X_{2} = \{x_{i, j} ∣ i = 1,2, \dots, n; j = 1,2, \dots, K; 2 \leq K \leq 30\}

(7)

When the variable filtering process is applied to the maintenance datasets in (6) and (7), synthetic maintenance datasets with similar statistical and business characteristics are obtained (see (2), (4), (8), and (9)).

H_{1} = \{x_{i, j} ∣ i = 1,2, \dots, n; j = 1,2, \dots, m; m \leq K\}

(8)

H_{2} = \{x_{i, j} ∣ i = 1,2, \dots, n; j = 1,2, \dots, m; m \leq K\}

(9)

Table 5 presents a representative subset of the records generated for dataset $X_{1}$. This subset illustrates different maintenance records, whether corrective or preventive, with all the corresponding data for each record.

Table 5

Subset of synthetic maintenance data generated for centrifugal pump

	Registration date	Maintenance Type	Replaced component	Failure mode	Active maintenance hours	Costs	UT	DT	TTR	TBF	OT
0	2016-08-27 15:30:11.141296	Corrective	Instrument, pressure	AIR	44.0	405752.0	3273.451392	44.0	44.0	3273.451392	3317.451392
1	2029-10-23 15:34:54.101196	Preventive	Instrument, flow	AIR	44.0	395752.0	1536.894218	44.0	44.0	1536.894218	1580.894218
2	2078-10-21 19:13:21.852802	Corrective	Instrument, flow	AIR	44.0	395752.0	8748.577526	44.0	44.0	8748.577526	8792.577526
3	2085-02-13 11:29:20.306991	Preventive	Instrument, pressure	AIR	44.0	405752.0	168077.471786	44.0	44.0	168077.471786	168121.471786

One hundred and seventeen (117) filtered synthetic maintenance records were obtained for dataset $X_{1}$, and twenty-six (26) for dataset $X_{2}$.

The statistical comparison between the synthetic datasets and the OREDA benchmark is summarized in Table 6, which presents key failure rate metrics. Subsequently, the Time Between Failures (TBF) distributions are visualized in Figures 5–7.

Table 6

Global failure rate (per 10⁶ h)

Dataset	Min	μ	σ	Max
OREDA	3 × 10⁻⁴	28.08	56.95	136.71
$X_{1}$	5.70	129.20	38.08	84388.18
$X_{2}$	2.74	27.53	13.72	536.80
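The comparison in Table 6 can be reproduced with a short calculation. The sketch assumes the reported error is the relative error against the OREDA value, |x − x_ref| / x_ref; under that assumption the mean of $X_{2}$ gives the 1.96% quoted in the text. The dictionary layout and the function name are illustrative.

```python
# Values copied from Table 6 (failure rates per 10^6 h).
reference = {"mean": 28.08, "std": 56.95, "max": 136.71}
synthetic = {
    "X1": {"mean": 129.20, "std": 38.08, "max": 84388.18},
    "X2": {"mean": 27.53, "std": 13.72, "max": 536.80},
}


def relative_error_pct(value, ref):
    """Relative error of a synthetic statistic against the reference, in percent."""
    return abs(value - ref) / ref * 100.0


errors = {
    name: {key: relative_error_pct(stats[key], reference[key]) for key in reference}
    for name, stats in synthetic.items()
}

for name, row in errors.items():
    print(name, {key: round(err, 2) for key, err in row.items()})
```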

Figure 5

A probability density curve shows distribution of time between failures.

The horizontal axis labeled “Time Between Failures” ranges from 0 to 1.5 multiplied by 10 to the 5th power, with tick marks at 0, 0.5, 1.0, and 1.5. The vertical labeled “Probability Density” axis ranges from 0 to 2 multiplied by 10 to the negative 5th power, with tick marks at 0, 1, and 2. A smooth probability density curve is plotted. The curve begins at 0.6 on the vertical axis at a horizontal value near 0.1 multiplied by 10 to the 5th power. It rises steadily to a peak slightly above 2 multiplied by 10 to the negative 5th power at a horizontal value around 0.35 multiplied by 10 to the 5th power. After reaching the peak, the curve declines sharply, approaching 0 on the vertical axis by the time the horizontal value reaches 0.9 multiplied by 10 to the 5th power. From that point onward up to 1.5 multiplied by 10 to the 5th power, the curve remains close to 0. Note: All numerical data values are approximated.

Statistical distribution of TBF for the asset

Figure 6

A probability density curve shows time between failures decreasing rapidly from an initial peak.

The vertical axis labeled “Probability Density” ranges from 0.0 to 3.5 multiplied by 10 to the negative 5 power, with increments of 0.5 multiplied by 10 to the negative 5 power. The horizontal axis labeled “Time Between Failures” ranges from 0.0 to 1.5 multiplied by 10 to the 5th power, with increments of 0.5 multiplied by 10 to 5th power. A smooth probability density curve starts at its maximum value near 3.5 multiplied by 10 to the negative 5 power at time 0. The curve drops sharply as time increases, reaching 1.5 multiplied by 10 to the negative 5 power near 0.1 multiplied by 10 to the 5 power. Note: All numerical data values are approximated.

Statistical distribution of TBF for the asset obtained from Synthetic Maintenance Dataset $X_{1}$

Figure 7

Probability density curve of the time between failures with a single peak. The horizontal axis, “Time Between Failures”, ranges from 0 to 1.5 × 10^5; the vertical axis, “Probability Density”, ranges from 0 to 8 × 10^-6. The curve starts slightly above 8 × 10^-6 at time 0, rises slightly to its highest point at about 0.2 × 10^5, and then declines steadily, approaching 0 at about 1.7 × 10^5 on the horizontal axis. All values are approximate.

Statistical distribution of TBF for the asset obtained from Synthetic Maintenance Dataset $X_{2}$

Figure 5 depicts the gamma distribution of TBF derived from OREDA data, which serves as the reference for synthetic data generation. Figure 6 shows the TBF distribution for dataset $X_{1}$, which exhibits a higher mean failure rate and a different spread than OREDA. Figure 7 presents the TBF distribution for dataset $X_{2}$, whose mean failure rate aligns closely with OREDA's.
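To make the link between a failure rate and a TBF sample concrete, the following minimal Python sketch draws times between failures with a pseudo-random exponential variate and converts the sample mean back into a failure rate per 10⁶ h. It assumes a constant failure rate equal to the OREDA mean in Table 6 (28.08 per 10⁶ h). The function name `sample_tbf`, the seed, and the sample size are illustrative choices, and the sketch is a simplification rather than the authors' generation algorithm.

```python
import random
import statistics

HOURS_PER_MILLION = 1e6


def sample_tbf(rate_per_million_hours: float, n: int, seed: int = 0) -> list[float]:
    """Draw n times between failures (in hours) from an exponential distribution.

    Assumption: the failure rate is constant, so TBF is exponentially
    distributed with parameter lambda = rate / 10^6 failures per hour.
    """
    rng = random.Random(seed)  # seeded so the pseudo-random sequence is reproducible
    lam = rate_per_million_hours / HOURS_PER_MILLION
    return [rng.expovariate(lam) for _ in range(n)]


# OREDA mean global failure rate from Table 6 (per 10^6 h).
tbf = sample_tbf(28.08, n=10_000)

# Mean time between failures and the failure rate it implies.
mtbf = statistics.fmean(tbf)
empirical_rate = HOURS_PER_MILLION / mtbf

print(f"Sample MTBF: {mtbf:.1f} h")
print(f"Implied failure rate: {empirical_rate:.2f} per 10^6 h")
```

With a sample of this size the implied failure rate should fall close to the input rate; smaller samples, such as the 26 records of dataset $X_{2}$, would be expected to scatter more widely around it.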

As observed in Table 6, dataset $X_{2}$ demonstrates better alignment with OREDA's mean failure rate. The percentage errors for the mean (%_μ), standard deviation (%_σ), and maximum value (%_max) are calculated as follows (Montgomery, 2004):

$$\%_{\mu} = \left| \frac{28.08 - 27.53}{28.08} \right| \cdot 100\,\% = 1.96\,\% \tag{10}$$

$$\%_{\sigma} = \left| \frac{56.95 - 13.72}{56.95} \right| \cdot 100\,\% = 75.91\,\% \tag{11}$$

$$\%_{\max} = \left| \frac{136.71 - 536.80}{136.71} \right| \cdot 100\,\% = 292.66\,\% \tag{12}$$
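The three percentage errors can be reproduced with a few lines of Python. This is a minimal sketch that takes the OREDA and $X_{2}$ values directly from Table 6; the helper name `percentage_error` and the dictionary layout are illustrative assumptions, not part of the original pseudo-code.

```python
def percentage_error(reference: float, estimate: float) -> float:
    """Absolute relative error of `estimate` with respect to `reference`, in percent."""
    return abs((reference - estimate) / reference) * 100.0


# Global failure rate metrics copied from Table 6 (per 10^6 h).
oreda = {"mean": 28.08, "std": 56.95, "max": 136.71}
x2 = {"mean": 27.53, "std": 13.72, "max": 536.80}

# One percentage error per metric, rounded to two decimals as in Equations (10)-(12).
errors = {key: round(percentage_error(oreda[key], x2[key]), 2) for key in oreda}

for key, value in errors.items():
    print(f"{key}: {value:.2f} %")
```

The same helper can be applied to the $X_{1}$ row of Table 6 to compare both synthetic datasets against the OREDA benchmark.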

4. Conclusion

This paper established a methodology and developed pseudo-code algorithms to generate synthetic maintenance data for industrial assets. By implementing pseudo-random functions within statistical, probabilistic, and exponential-variate distributions based on the OREDA, the approach successfully produces realistic maintenance records.

The results demonstrate the method's practical value. As shown in Table 6, the mean global failure rate of the $X_{2}$ dataset differs from the OREDA mean by only 1.96% (Equation 10), which indicates that the model reproduces the central tendency well. Larger errors were observed for the standard deviation (75.91%) and the maximum global failure rate (292.66%), but these are not inherently unfavorable. They reflect the stochastic nature of the pseudo-random generation across multiple failure modes, which gives each synthetic dataset its own variability from which valuable maintenance engineering insights can be extracted.

Ultimately, this work provides a robust, low-risk tool for simulation and decision-making in asset management. The generated datasets, exemplified by the centrifugal pump category in Table 5, possess statistical and behavioral fidelity to real-world data. This enables organizations to test maintenance strategies, train models, and plan asset lifecycles efficiently and safely, without jeopardizing operational integrity, production, safety, or the environment. The research thus bridges a critical gap between theory and practice: synthetic maintenance data generation serves as a foundational seed for advanced applications such as digital twins and machine learning models, and it facilitates the incorporation of these and other data-driven tools into decision-making processes that continuously improve maintenance management.

This work is supported by the University of Pamplona and the University of the Andes, Colombia.

Note

1. Although four failure modes were selected, this does not mean that more failure modes cannot be considered.

The supplementary material for this article can be found online.

References

Abbate, R., Caterino, M., Fera, M. and Caputo, F. (2022), “Maintenance digital twin using vibration data”, Procedia Computer Science, Vol. 200, pp. 546-555, ISSN: 1877-0509, doi: https://doi.org/10.1016/j.procs.2022.01.252.

Bekar, E.T., Nyqvist, P. and Skoogh, A. (2020), “An intelligent approach for data pre-processing and analysis in predictive maintenance with an industrial case study”, Advances in Mechanical Engineering, Vol. 12 No. 5, 168781402091920, ISSN: 1687-8140, doi: https://doi.org/10.1177/1687814020919207.

Bousdekis, A., Lepenioti, K., Apostolou, D. and Mentzas, G. (2021), “A review of data-driven decision making methods for industry 4.0 maintenance applications”, Electronics, Vol. 10 No. 7, 828, doi: https://doi.org/10.3390/electronics10070828.

Ciliberti, V.A., Østebø, R., Selvik, J.T. and Alhanati, F.J.S. (2019), “Optimize safety and profitability by use of the ISO 14224 standard and big data analytics”, D041S055R003, doi: https://doi.org/10.4043/29634-MS.

Cole, D., Nelson, J. and McDaniel, B. (2015), “Benefits and risks of big data”, SAIS 2015.

Cui, P.-H., Wang, J.-Q. and Yang, Li (2022), “Data-driven modelling, analysis and improvement of multistage production systems with predictive maintenance and product quality”, International Journal of Production Research, Vol. 60 No. 22, pp. 6848-6865, ISSN: 0020-7543, doi: https://doi.org/10.1080/00207543.2021.1962558.

Dankar, F.K. and Ibrahim, M. (2021), “Fake it till you make it: guidelines for effective synthetic data generation”, Applied Sciences, Vol. 11 No. 5, 2158, ISSN: 2076-3417, doi: https://doi.org/10.3390/app11052158.

Díaz, S. (2023), Metodología de Análisis de Datos de Mantenimiento en la Industria Aplicando la Norma ISO 14224:2016 mediante el Uso de Ciencia de Datos y Machine Learning en el Contexto del Metamantenimiento, Tesis de Pregrado, Universidad de Pamplona.

Diaz, S., Tarantino, R. and Aranguren, S. (2023), “Metodología de Análisis de Datos aplicado al Metamantenimiento Industrial”, Congreso Internacional de Electrónica y Tecnologías de Avanzada 16, Universidad de Pamplona.

El Emam, K., Mosquera, L. and Hoptroff, R. (2020), in Hassell, J., Collins, C. and Faucher, C. (Eds), Practical Synthetic Data Generation, 1st ed., O’Reilly Media.

Filz, M.-A., Langner, J.E.B., Herrmann, C. and Thiede, S. (2021), “Data-driven failure mode and effect analysis (FMEA) to enhance maintenance planning”, Computers in Industry, Vol. 129, 103451, ISSN: 0166-3615, doi: https://doi.org/10.1016/j.compind.2021.103451.

Giuffrè, M. and Shung, D.L. (2023), “Harnessing the power of synthetic data in healthcare: innovation, application, and privacy”, Npj Digital Medicine, Vol. 6 No. 1, p. 186, doi: https://doi.org/10.1038/s41746-023-00927-3.

Hannam, R. (1997), Computer Integrated Manufacturing: From Concepts to Realisation, 1st ed., Addison Wesley Longman, Harlow.

International Organization for Standardization (2016), ISO 14224:2016.

Jones, R.B. (1995), Risk-Based Management: A Reliability-Centered Approach, Gulf Publishing, Houston, TX.

Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. and Weller, A. (2022), Synthetic Data - What, Why and How?, The Alan Turing Institute.

Lakshmanan, K., Tessicini, F., Gil, A.J. and Auricchio, F. (2023), “A fault prognosis strategy for an external gear pump using machine learning algorithms and synthetic data generation methods”, Applied Mathematical Modelling, Vol. 123, pp. 348-372, ISSN: 0307-904X, doi: https://doi.org/10.1016/j.apm.2023.07.001.

Lautrup, A.D., Hyrup, T., Zimek, A. and Schneider-Kamp, P. (2024), “Systematic review of generative modelling tools and utility metrics for fully synthetic tabular data”, ACM Computing Surveys, Vol. 57 No. 4, pp. 1-38, doi: https://doi.org/10.1145/3704437.

Li, Z., Yue, Z. and Fu, J. (2020), “SynC: a copula based framework for generating synthetic data from aggregated sources”, IEEE, pp. 571-578, doi: https://doi.org/10.1109/ICDMW51313.2020.00082.

Liu, Y., Feng, J., Lu, J. and Zhou, S. (2024), “A review of digital twin capabilities, technologies, and applications based on the maturity model”, Advanced Engineering Informatics, Vol. 62, 102592, doi: https://doi.org/10.1016/j.aei.2024.102592.

Mannino, M. and Abouzied, A. (2019), “Is this real? Generating synthetic data that looks real”, ACM, pp. 549-561, doi: https://doi.org/10.1145/3332165.3347866.

Martínez-Heredia, A.M. and Ventura, S. (2025), “Weak supervision: a survey on predictive maintenance”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 15 No. 2, e70022, doi: https://doi.org/10.1002/widm.70022.

Merkt, O. (2019), “On the use of predictive models for improving the quality of industrial maintenance: an analytical literature review of maintenance strategies”, Vol. 18, pp. 693-704, doi: https://doi.org/10.15439/2019F101.

Montgomery, D. (2004), Diseño y Análisis de Experimentos, 2nd ed., Limusa Wiley, Mexico D.F.

Mora, L. (2009), Mantenimiento. Planeación, Ejecución Y Control, 1st ed., Alfaomega Grupo Editor, Mexico D.F.

Okagbue, H., Adamu, M.O. and Anake, T.A. (2020), “Approximations for the inverse cumulative distribution function of the gamma distribution used in wireless communication”, Heliyon, Vol. 6 No. 11, e05523, ISSN: 2405-8440, doi: https://doi.org/10.1016/j.heliyon.2020.e05523.

Pan, J., Sun, B., Wu, Z., Yi, Z., Feng, Q., Ren, Y. and Wang, Z. (2024), “Probabilistic remaining useful life prediction without lifetime labels: a Bayesian deep learning and stochastic process fusion method”, Reliability Engineering and System Safety, Vol. 250, 110313, doi: https://doi.org/10.1016/j.ress.2024.110313.

SAE International (2009), SAE International JA1011: Evaluation Criteria for Reliability-Centered Maintenance (RCM), Standard, SAE International.

Sajid, S., Haleem, A., Bahl, S., Javaid, M., Goyal, T. and Mittal, M. (2021), “Data science applications for predictive maintenance and materials science in context to Industry 4.0”, Materials Today: Proceedings, Vol. 45, pp. 4898-4905, ISSN: 2214-7853, doi: https://doi.org/10.1016/j.matpr.2021.01.357.

SINTEF (2009), OREDA: Offshore Reliability Data Handbook, 5th ed., Vol. 1, OREDA Participants, Trondheim.

Tarantino, R. (2021), Metamantenimiento: Una Propuesta para Incrementar la Confiabilidad en los Procesos Industriales.

Zheng, X., Yao, W., Xu, Y. and Wang, N. (2024), “Algorithms for Bayesian network modeling and reliability inference of complex multistate systems with common cause failure”, Reliability Engineering and System Safety, Vol. 241, 109663, doi: https://doi.org/10.1016/j.ress.2023.109663.

Zio, E. and Miqueles, L. (2024), “Digital Twins in safety analysis, risk assessment and emergency management”, Reliability Engineering and System Safety, Vol. 246, 110040, doi: https://doi.org/10.1016/j.ress.2024.110040.

© 2025 Sebastian Diaz Vivas, Rocco Tarantino Alvarado, Sandra Aranguren Zambrano and Alejandra Tabares Pozos. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms are set out in the CC BY 4.0 licence.
