The article addresses the partial or complete absence of maintenance data records for industrial assets by generating synthetic maintenance data under the high-quality maintenance data structure established in International Organization for Standardization (ISO) 14224:2016. This contributes to maintenance engineering a strategy for obtaining meaningful synthetic data for maintenance management analysis without exposing industrial assets to failures that may lead to undesired consequences.
The research was conducted as an experimental study aimed at generating synthetic maintenance data from historical statistical distributions of industrial assets. Based on the criticality of the studied process context, the experiment was carried out on a centrifugal pump, with the Offshore Reliability Data Handbook (OREDA) as the primary data source. From OREDA, the four failure modes with the highest failure rate were selected, together with the non-maintainable components linked to each failure mode by occurrence probability. The data were processed in Python 3.10.12 following a methodology that standardizes the data structure, for which a pseudo-code was established.
The article addresses the generation of synthetic maintenance data using historical statistical distributions from the OREDA. Two sets of synthetic data were obtained for a centrifugal pump, with the second set maintaining originality by defining the maximum failure rate as the mean of the global failure rate based on accurate data, demonstrated with an error of 1.96%. This approach allows for objective decision-making when forecasting different scenarios, as the synthetic data set acquires its dynamics dependent on the statistical distribution of the failure rate by failure modes, evidenced by the error in the standard deviation.
The article focuses on generating synthetic maintenance data by developing an algorithm based on internationally recognized statistical distributions aligned with the international standards of ISO 14224:2016. This approach aims to create a synthetic maintenance dataset with maintenance records from which maintenance variables and indicators can be derived. These derived insights enable maintenance optimization through data-driven decision-making feedback loops.
1. Introduction
Technological evolution has transformed maintenance management into a critical function for industrial competitiveness, leading companies to prioritize strategies that maximize asset performance and operational profitability (Mora, 2009). In this context, data has emerged as a pivotal industrial asset, enabling valuable insights and informed decision-making through diverse analytical strategies (Merkt, 2019; Bekar et al., 2020; Bousdekis et al., 2021; Filz et al., 2021; Sajid et al., 2021; Abbate et al., 2022; Cui et al., 2022). These data-driven approaches facilitate the iterative optimization of maintenance management via continuous feedback mechanisms (Ciliberti et al., 2019; Tarantino, 2021; Diaz et al., 2023).
However, a significant obstacle to implementing these advanced analytics is the frequent scarcity or complete absence of high-quality maintenance data records. This often forces a reliance on established, yet inherently limited, maintenance techniques like TPM, RCM, and FMEA (Mora, 2009; International Organization for Standardization, 2016). While foundational, such approaches fall short of the comprehensive, data-driven engineering analysis required for holistic asset management – which is necessary to optimize strategies without compromising assets, production, or safety (Jones, 1995; SAE, International, 2009). This challenge is critical in high-consequence environments where fault impacts and data integration are vital (Hannam, 1997). A further complication involves ensuring data quality and reliability, which demands a well-planned acquisition process that prioritizes essential assets and protects their integrity (Díaz, 2023; Diaz et al., 2023; International Organization for Standardization, 2016).
To address this, ISO 14224:2016 outlines a structured framework for data acquisition. Figure 1 illustrates the planning procedure, which involves steps such as determining data acquisition processes, verifying and researching data sources, defining the maintenance data to acquire, setting time, population, and operation parameters, ensuring uniformity in failure definition and classification, and training staff. Subsequently, Figure 2 depicts the acquisition procedure, which includes accessing consequential data, interpreting data, transferring data to a database, and evaluating and analyzing gained insights.
The figure is a vertical flowchart of six boxes connected by downward arrows: “Determining Data Acquisition Processes”, “Verifying and Researching Data Sources”, “Determining Maintenance Data to Acquire”, “Determining Time, Population, and Operation Parameters”, “Determining Uniformity in Failure Definition and Classification”, and “Training Staff”.
Figure 1. Planning procedure for acquiring quality data according to ISO 14224:2016
The figure is a vertical flowchart of four boxes connected by downward arrows: “Accessing Consequential Data”, “Interpreting Data”, “Data Transfer to Database”, and “Evaluating and Analyzing Gained Insights”.
Figure 2. Procedure for acquiring quality data according to ISO 14224:2016
The escalating value of data as a convertible asset for business intelligence underscores the need to overcome inherent data issues, which can be categorized into concerns regarding privacy release and sensitivity, data bias and variance, the need for increased data robustness, and limited or zero data availability (Cole et al., 2015; Mannino and Abouzied, 2019; Dankar and Ibrahim, 2021; Jordon et al., 2022). Synthetic data emerges as a viable solution to these challenges, defined as artificially generated data that acquires the statistical and phenomenological properties of a real dataset (Dankar and Ibrahim, 2021; Jordon et al., 2022). Methodological advancements highlight that synthetic datasets vary in fitness for purpose, necessitating tailored evaluation protocols and practical metrics to enhance reproducibility (Lautrup et al., 2024; Giuffrè and Shung, 2023).
Synthetic data offers significant benefits, including privacy protection through the absence of personally identifiable information, improved accessibility by creating larger datasets, flexibility to replicate specific statistical characteristics, and the potential to enhance original data quality (El Emam et al., 2020). Various generation techniques exist, such as those based on Bayesian networks, copulas, parametric fitting (e.g. Monte Carlo methods), and non-parametric trees (Mannino and Abouzied, 2019; Dankar and Ibrahim, 2021; El Emam et al., 2020; Li et al., 2020; Jordon et al., 2022; Okagbue et al., 2020).
Within reliable maintenance engineering, synthetic data generation employs diverse strategies, including machine learning methods such as generative adversarial networks (GANs) and mathematical functions that maintain statistical fidelity (Lakshmanan et al., 2023; Martínez-Heredia and Ventura, 2025). Contemporary research trends encompass digital twin frameworks for physics-informed data streams and Bayesian methods for uncertainty-aware reliability estimates, providing a context where a parameterized pseudo-random generator offers a reproducible baseline for transparent benchmarking (Zio and Miqueles, 2024; Liu et al., 2024; Pan et al., 2024; Zheng et al., 2024). To address the challenge of absent maintenance records, this work develops a pseudo-random algorithm grounded in the high-quality data attributes prescribed by ISO 14224:2016 (Diaz et al., 2023; Díaz, 2023) and leveraging statistical distributions from the OREDA database (SINTEF, 2009), where failure rates follow a gamma distribution. This methodology is intended to generate valid and practically useful synthetic maintenance data for industrial assets.
Applied to centrifugal pumps in an operational chemical engineering setting, specifically the ethanol-water separation process at the University of Pamplona, our approach generated two distinct datasets. The second dataset showed strong alignment with OREDA's mean failure rate (an error of only 1.96%), supporting its statistical fidelity. Despite higher deviations in standard deviation and maximum values, both datasets yield realistic maintenance records suitable for engineering analysis and decision-making without jeopardizing actual assets or operations.
2. Methodology
The generation of synthetic maintenance data requires a robust statistical foundation and a structured methodology. This study utilizes the OREDA database as its primary source for failure mode distributions, providing historically validated reliability data from major petrochemical companies (SINTEF, 2009). OREDA's structure organizes failure data by asset taxonomy, with statistical distributions categorized by failure mode severity (critical, degraded and incipient) and including key metrics such as failure rates, active repair hours, and man-hours.
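For this research, the critical risk state was selected, with failure rates following a gamma distribution as confirmed by Kolmogorov-Smirnov and χ² goodness-of-fit tests (ibid.). The probability density function is given by:

$$f(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \lambda \ge 0 \tag{1}$$

and its cumulative distribution function by:

$$F(\lambda) = \int_{0}^{\lambda} f(t)\,dt = \frac{\gamma(\alpha, \beta\lambda)}{\Gamma(\alpha)} \tag{2}$$

where $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function, $\alpha = \beta\,\theta^{*}$ and $\beta = \theta^{*}/\sigma^{2}$, with θ* representing the mean failure rate and σ the standard deviation.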
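As a minimal sketch of how the gamma parameters follow from a reported mean and standard deviation: for a gamma distribution with shape α and rate β, the mean is α/β and the variance is α/β², so β = mean/variance and α = β × mean. The example uses the BRD values reported later in Table 1; the function name `gamma_params`, the seed and the sample size are illustrative assumptions, not part of the published algorithm.

```python
import random
import statistics


def gamma_params(mean: float, std: float) -> tuple[float, float]:
    """Return (alpha, beta): shape and rate of a gamma distribution
    with the given mean and standard deviation."""
    beta = mean / std**2   # rate = mean / variance
    alpha = beta * mean    # shape = rate * mean
    return alpha, beta


# BRD failure mode (Table 1): mean 2.45, sigma 1.63 failures per 10^6 h.
alpha, beta = gamma_params(2.45, 1.63)

# random.gammavariate expects shape and SCALE, so the rate is inverted.
rng = random.Random(42)  # arbitrary seed, used only for repeatability
draws = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(100_000)]

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"sample mean = {statistics.fmean(draws):.3f}")
print(f"sample std  = {statistics.stdev(draws):.3f}")
```

The sample mean and standard deviation should land close to the inputs, which is a quick way to confirm that the shape/rate convention has not been mixed up with the shape/scale convention used by some libraries.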
The methodology also incorporates probability distributions for failure modes relative to non-maintainable components, ensuring the total probability sums to 100% as expressed by:
where j denotes failure modes, i non-maintainable components, and Xji the occurrence probability.
The synthetic data generation workflow, depicted in Figure 3, is implemented through Algorithm 1. This algorithm details the modular steps from statistical initialization to dataset cleaning, ensuring reproducibility and transparency.
The vertical flowchart begins with an oval labeled “Start”, followed by three parallelograms: “Data Structure Based on Dictionaries”, “Static Operational Data”, and “Iterate equals 0, Maximum Failure Rate Area, Number of Desired Records”. Eight rectangular process boxes follow in sequence: “Generate Pseudo-randomly Failure Rate by Failure Mode”, “Pseudo-random Selection of Failure Mode”, “Pseudo-random Selection of Maintenance Required Occurrence”, “Generate Pseudo-randomly of Maintenance Required Data”, “Pseudo-random Selection of Maintenance Type (Unplanned)”, “Pseudo-random generation of Active Maintenance Hours”, “Maintenance Cost Generation”, and “Add Maintenance Record to the Storage Dataset”. A decision diamond labeled “Iterate less than Number of Desired Records” then branches: “Yes” loops back to “Generate Pseudo-randomly Failure Rate by Failure Mode”, while “No” continues to three further process boxes, “Maintenance Window Calculation”, “Filter Dataset”, and “Asset Failure Rate Calculation”. A second decision diamond labeled “λ < λmax” closes the flow: “No” loops back to “Iterate equals 0, Maximum Failure Rate Area, Number of Desired Records”, while “Yes” leads to an oval labeled “End”.
Figure 3. Methodology for synthetic maintenance data generation
Algorithm 1. Synthetic Data Generation for Maintenance Asset Records
Function SyntheticDataGeneration()
Input: Failure rate distribution parameters, component costs, maintenance parameters
Output: Synthetic maintenance dataset including Failure Mode, Component, TBF, Maintenance Cost, etc.
# Step 1: Initialization
Define constants and distributions (e.g. mean and standard deviation for failure rates)
Initialize data structures for storing generated records
# Step 2: Data Sampling
Generate random values [r1, r2, r3, r4, r5] from N(0, 1)
# Step 3: Failure Mode and Component Selection
Sample Failure Mode and Component using probabilistic sampling with r1
Sample Failure Description using probabilistic sampling with r2
# Step 4: Failure Rate and TBF
Compute parameters β = μ / σ² and α = β ⋅ μ based on failure rate distributions
Calculate Failure Rate λ = F⁻¹(r3; α, β)
Calculate TBF = 1 / λ
# Step 5: Dataset Filtering and Adjustments
while any record violates the TBF constraints do
    Filter dataset records based on TBF constraints
end while
Select subset n2 from n1 records
# Step 6: Calculate Maintenance Costs
Calculate Failure Date = Start Operation Date + TBF
Sort the dataset by Failure Date in ascending order
Calculate ManHours based on the man-hour distribution
Compute Maintenance Cost = Component Cost[Component] + (ManHours × ManHour Cost)
Update TBF = Current Failure Date - Last Failure Date (in hours)
Calculate Downtime = ManHours × (1 + Average Administrative Time Percentage)
# Step 7: Repeat and Clean Dataset
while Additional Filtering Needed do
    Reapply filters to the dataset
end while
Clean dataset to finalize output
Return Synthetic maintenance dataset
Initialization: The process begins with defining the key statistical distributions and parameters, such as the mean and variance for failure rates and component costs. These parameters will shape the random variables generated later in the process. Next, data structures are initialized to store records that will eventually contain attributes such as failure modes, components, Time Between Failures (TBF), and maintenance costs.
Sampling Random Values: Once initialization is complete, random values are generated to simulate various aspects of the maintenance process. Values such as r1, r2, r3, r4, and r5 are drawn from a standard normal distribution, N(0, 1), which will later be used in probabilistic sampling.
Failure Mode and Component Selection: Using the generated random values, failure modes and components are selected for each record. The value r1 is applied within cumulative probability distributions for failure modes and components, allowing for probabilistic sampling. Similarly, r2 is used to choose a failure description based on the cumulative probabilities associated with each failure mode.
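The selection step can be sketched as a lookup in a cumulative probability table. The sketch assumes the random value lies in [0, 1), which is what a cumulative lookup requires, so it draws from a uniform generator. Only the Casing probability (0.41 for BRD) comes from Table 2; the other two entries, the function name `sample_from_cumulative` and the seed are hypothetical placeholders.

```python
import bisect
import itertools
import random
from collections import Counter


def sample_from_cumulative(probabilities: dict[str, float], r: float) -> str:
    """Return the first label whose cumulative probability exceeds r (0 <= r < 1)."""
    labels = list(probabilities)
    cumulative = list(itertools.accumulate(probabilities.values()))
    # Scale r by the total so rounding in the inputs cannot push it out of range.
    index = bisect.bisect_right(cumulative, r * cumulative[-1])
    return labels[min(index, len(labels) - 1)]


# Component probabilities for one failure mode. "Casing" is taken from Table 2;
# the remaining entries are HYPOTHETICAL and only make the example sum to 1.
component_probs = {
    "Casing": 0.41,
    "Other component A": 0.35,
    "Other component B": 0.24,
}

rng = random.Random(7)  # arbitrary seed, used only for repeatability
picks = Counter(
    sample_from_cumulative(component_probs, rng.random()) for _ in range(20_000)
)
for label, count in picks.items():
    print(f"{label}: {count / 20_000:.3f}")
```

Over many draws, the observed frequencies should approach the input probabilities, which gives a simple sanity check on the cumulative table.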
Compute Failure Rate and TBF: After selecting the failure mode and component, the next step is to compute the failure rate (λ) and TBF for each record. Parameters β and α are calculated from the mean and variance of the failure rate distributions. Using these parameters along with r3, the failure rate is determined by inverting the cumulative distribution function $F$, that is, $\lambda = F^{-1}(r_3; \alpha, \beta)$. The TBF is then calculated as $\mathrm{TBF} = 1/\lambda$.
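A standard-library sketch of this inversion is given below, under stated assumptions. The gamma CDF is evaluated with the power series of the regularized lower incomplete gamma function and inverted by bisection; this is one possible numerical route, not necessarily the one used by the authors. The helper names, the value of `r3`, and the conversion from a rate per 10⁶ h to hours are assumptions for illustration.

```python
import math


def gamma_cdf(lam: float, alpha: float, beta: float) -> float:
    """CDF of a gamma(shape=alpha, rate=beta) variable, computed with the
    power series of the regularized lower incomplete gamma function."""
    x = beta * lam
    if x <= 0.0:
        return 0.0
    term = total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= x / (alpha + n)
        total += term
    log_prefactor = -x + alpha * math.log(x) - math.lgamma(alpha + 1.0)
    return total * math.exp(log_prefactor)


def gamma_ppf(r: float, alpha: float, beta: float) -> float:
    """Invert the CDF by bisection: return lam with gamma_cdf(lam) close to r."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    lo, hi = 0.0, alpha / beta  # start the bracket at the mean
    while gamma_cdf(hi, alpha, beta) < r and beta * hi < 300.0:
        hi *= 2.0  # widen the bracket until it contains the target
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha, beta) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tbf_hours(rate_per_million_hours: float) -> float:
    """TBF = 1 / lambda, with lambda expressed in failures per 10^6 h."""
    return 1e6 / rate_per_million_hours


# BRD failure mode (Table 1): mean 2.45, sigma 1.63 failures per 10^6 h.
mean, std = 2.45, 1.63
beta = mean / std**2
alpha = beta * mean

r3 = 0.5  # illustrative draw in (0, 1); 0.5 yields the median failure rate
lam = gamma_ppf(r3, alpha, beta)
print(f"lambda = {lam:.4f} per 10^6 h, TBF = {tbf_hours(lam):.1f} h")
```

Bisection is slower than a Newton iteration but needs no derivative and cannot leave the bracket, which keeps the sketch short and predictable.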
Dataset Filtering and Adjustments: With preliminary data generated, the dataset is filtered on TBF so that records stay within realistic operational limits. Only records whose TBF falls within the range set by the failure-rate bounds λmax and λmin are kept. This step keeps the synthetic data aligned with practical operational expectations.
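One possible reading of this filter is sketched below. It assumes the bounds are given as failure rates per 10⁶ h and converted to hour limits through TBF = 10⁶/λ; the bound values, the record contents and the function name `filter_by_tbf` are hypothetical.

```python
def filter_by_tbf(records: list[dict], lam_min: float, lam_max: float) -> list[dict]:
    """Keep records whose TBF (hours) lies between the limits implied by the
    failure-rate bounds, both expressed in failures per 10^6 h."""
    tbf_low = 1e6 / lam_max   # highest rate -> shortest admissible TBF
    tbf_high = 1e6 / lam_min  # lowest rate  -> longest admissible TBF
    return [rec for rec in records if tbf_low <= rec["TBF"] <= tbf_high]


# HYPOTHETICAL records and bounds, chosen only to exercise both limits.
records = [
    {"failure_mode": "BRD", "TBF": 5_000.0},      # too short: rate above lam_max
    {"failure_mode": "VIB", "TBF": 40_000.0},     # inside the admissible range
    {"failure_mode": "ELU", "TBF": 2_000_000.0},  # too long: rate below lam_min
]

kept = filter_by_tbf(records, lam_min=1.0, lam_max=100.0)
print([rec["failure_mode"] for rec in kept])
```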
Calculate Maintenance Costs and Downtime: The maintenance cost and downtime associated with each failure are then calculated. The failure date is determined by adding TBF to the start operation date. Man-hours are computed using distribution-based calculations. Maintenance cost is calculated by combining the component cost with the product of man-hours and the man-hour cost rate. Downtime is calculated by adding man-hours and an additional administrative time percentage.
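The cost and downtime arithmetic can be sketched as follows. Component prices are taken from Table 3; the man-hour cost, the administrative time fraction, the start date and the function name `maintenance_record` are hypothetical. The sketch also assumes that the administrative percentage scales the man-hours multiplicatively.

```python
from datetime import datetime, timedelta

# Unit prices in COP for three components listed in Table 3.
COMPONENT_COST_COP = {
    "Bearing": 150_000.0,
    "Casing": 50_000.0,
    "Control unit": 300_000.0,
}
MAN_HOUR_COST_COP = 8_000.0  # HYPOTHETICAL hourly labor rate
ADMIN_TIME_FRACTION = 0.10   # HYPOTHETICAL administrative overhead (10%)


def maintenance_record(
    start: datetime, tbf_hours: float, component: str, man_hours: float
) -> dict:
    """Build one record: failure date, maintenance cost and downtime."""
    failure_date = start + timedelta(hours=tbf_hours)
    cost = COMPONENT_COST_COP[component] + man_hours * MAN_HOUR_COST_COP
    downtime = man_hours * (1.0 + ADMIN_TIME_FRACTION)  # assumed multiplicative
    return {
        "failure_date": failure_date,
        "component": component,
        "cost_cop": cost,
        "downtime_h": downtime,
    }


record = maintenance_record(datetime(2016, 1, 1), 48.0, "Bearing", 44.0)
print(record)
```

With these inputs the cost is the bearing price plus 44 man-hours at the assumed rate, and the failure date falls 48 hours after the start date.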
Final Filtering and Dataset Cleaning: After calculating costs and downtime, the dataset undergoes a final filtering and cleaning stage. Additional filters are reapplied if necessary to ensure that all records meet the predefined conditions. The dataset is then organized and prepared for output.
This structured process yields a synthetic maintenance dataset with realistic attributes, including varied failure modes, components, times between failures, costs, and downtimes, based on statistically grounded parameters and distributions.
3. Case study and results
3.1 Operational context
The study was carried out in the chemical engineering laboratory at the University of Pamplona, where the ethanol-water mixture separation subprocess was selected. The subprocess centers on a plate column with liquid feed, direct steam injection, and the option of taking the top product as vapor or liquid through a condenser that can operate as a total or partial condenser. The column feed can come from a feed tank or from the rectification column (see Figure 4).
The diagram shows a detailed process piping and instrumentation layout arranged horizontally, with multiple process units, pipelines, valves, instruments, and control loops interconnected. On the left side, a heat exchanger labeled “INTERCAMBIADORE-300” is shown at the top. An inlet line labeled “INVINODEFERMENTACIÓN” enters the system and passes through control and manual valves, including valves labeled “G V-202” and “G V-201”, before entering a vertical storage tank labeled “T K-400, 250 L”. The tank includes level instrumentation with a level transmitter labeled “L T 400” and a level controller labeled “L C 400”, connected by dashed signal lines to a set point labeled “S P”. Multiple inlet and outlet nozzles are shown on the tank, each fitted with valves labeled “B V-101”, “B V-109”, and “B V-104”. Below the left section, a pump labeled “P-400” is connected via pipelines and valves, including “G V-207” and “G V-205”. A speed controller labeled “S C 400” is shown connected to the pump with dashed control lines and a set point indicator. Further down, a second pump labeled “P-405” is shown with a corresponding speed controller labeled “S C 405”, also connected by dashed signal lines and a set point. In the central section, a large header labeled “VAPORDECALDERA” runs horizontally across the diagram, representing the boiler steam line. Along this line are several control and measurement instruments, including a flow controller labeled “F C-300”, a flow transmitter labeled “F T-300”, and temperature instruments labeled “T C-101” and “T E-101”. A control valve assembly highlighted in the diagram includes valves labeled “B V-109”, “F C V-100”, “B V-110”, and “G V-204”, connected in series on the steam line. On the right side, a tall vertical column labeled “C-100” is shown. The column includes multiple side connections with valves and temperature elements labeled “T E-103”, as well as a level transmitter labeled “L T-100” near the lower section. 
A level controller labeled “L C-100” is connected to the column by dashed control lines and a set point labeled “S P”. At the lower right, a storage tank labeled “T K-405, 250 L” is shown. The tank includes a level transmitter labeled “L T-405” and an outlet valve labeled “B V-111”. The tank is connected to upstream process lines via valves and piping.
Figure 4. Process P&ID
Based on an analysis conducted during the research, the assets with high criticality involved in the subprocess are the centrifugal pumps P-400 and P-405, located in the P&ID of Figure 4.
Similarly, four failure modes were determined for the study that are common in the mentioned assets: External Leak (ELU), Abnormal Instrument Reading (AIR), Vibration (VIB), and Breakdown (BRD) [1].
For this reason, a dataset of synthetic maintenance data for centrifugal pumps with four failure modes was generated based on the methodology for generating synthetic maintenance data and the constructed programming algorithms.
3.2 Statistical information characterization
The OREDA provided statistical information based on statistical and percentage distributions per failure mode and non-maintainable components (SINTEF, 2009, pp.138-145).
To satisfy (2) so that the data resemble actual behavior, (4) is determined from the statistical distribution of the overall failure rate reported in OREDA for the critical phases of the asset. Note that the values are failure rates per million hours (10⁶ h).
These distributions were characterized in tabular form, which allowed the data structures used by the algorithms presented in this article to be defined.
Table 1 shows the statistical distribution of failure rates for the selected failure modes of the asset.
Table 1. Statistical distribution of failure rate by failure mode (per 10⁶ h)
| Failure mode | min | μ | max | σ |
|---|---|---|---|---|
| BRD | 0.51 | 2.45 | 5.59 | 1.63 |
| VIB | 0.51 | 3.82 | 9.68 | 3.0 |
| ELU | 0.0 | 3.24 | 11.62 | 15.60 |
| AIR | 0.0 | 0.31 | 1.72 | 0.89 |
Table 2 shows the probability of a failure mode caused by a non-maintainable asset component.
Table 2. Occurrence probability of failure mode per non-maintainable component
| Failure mode | Non-maintainable component | Failure probability [0; 1] |
|---|---|---|
| BRD | Bearing | 0 |
| BRD | Cabling and junction boxes | 0 |
| BRD | Casing | 0.41 |
| BRD | Control unit | 0 |
| … | … | … |
| AIR | Wiring | 0.89 |
| AIR | Valves | 0 |
Based on a market analysis, the replacement cost of each non-maintainable component of the asset was defined in Colombian pesos (COP) (see Table 3).
Table 3. Associated costs per non-maintainable component
| Component | Unit price (COP) |
|---|---|
| Bearing | 150,000.00 |
| Cabling and junction boxes | 100,000.00 |
| Casing | 50,000.00 |
| Control unit | 300,000.00 |
| … | … |
| Wiring | 30,000.00 |
| Valves | 40,000.00 |
Finally, Table 4 defines statistical information regarding the active maintenance hours required per failure mode of the asset.
3.3 Synthetic maintenance dataset obtained
Two synthetic maintenance datasets were generated for the centrifugal pump under study based on the four identified failure modes and the provided statistical information. The datasets include maintenance records with relevant and non-redundant maintenance data, as required by ISO 14224:2016. They are referred to below as the first and second datasets, and both are included in the online Supplementary Material.
The two datasets were generated with different statistical properties. The first dataset satisfies the ranges of the observed set in (4), while the second follows the behavior described by (5). Even so, the second dataset retains statistical characteristics that can be considered realistic for synthetic data, since it also complies with (4).
During the code execution, technological limitations were encountered that made it impossible to increase the expected number of maintenance records (K) without affecting the expected statistical properties, thus obtaining the data sets described by (6) and (7).
When the variable filtering process is applied to the maintenance data sets as seen in (6) and (7), the synthetic maintenance data sets with similar statistical and business characteristics are obtained (see 2, 4, 8, and 9).
Table 5 presents a representative subset of the generated records. This subset illustrates different maintenance records, whether corrective or preventive, with all the corresponding data for each record.
Table 5. Subset of synthetic maintenance data generated for centrifugal pump
| | Registration date | Maintenance Type | Replaced component | Failure mode | Active maintenance hours | Costs (COP) | UT | DT | TTR | TBF | OT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-08-27 15:30:11.141296 | Corrective | Instrument, pressure | AIR | 44.0 | 405752.0 | 3273.451392 | 44.0 | 44.0 | 3273.451392 | 3317.451392 |
| 1 | 2029-10-23 15:34:54.101196 | Preventive | Instrument, flow | AIR | 44.0 | 395752.0 | 1536.894218 | 44.0 | 44.0 | 1536.894218 | 1580.894218 |
| 2 | 2078-10-21 19:13:21.852802 | Corrective | Instrument, flow | AIR | 44.0 | 395752.0 | 8748.577526 | 44.0 | 44.0 | 8748.577526 | 8792.577526 |
| 3 | 2085-02-13 11:29:20.306991 | Preventive | Instrument, pressure | AIR | 44.0 | 405752.0 | 168077.471786 | 44.0 | 44.0 | 168077.471786 | 168121.471786 |
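The record layout of Table 5 can be mirrored in a small data structure. In the rows shown, OT equals UT + DT and TBF equals UT, so the sketch encodes those two relations as a consistency check. Reading UT, DT and OT as up time, down time and their total is an assumption, and the class and method names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceRecord:
    registration_date: str
    maintenance_type: str
    replaced_component: str
    failure_mode: str
    active_maintenance_hours: float
    cost_cop: float
    ut: float   # assumed to mean up time, in hours
    dt: float   # assumed to mean down time, in hours
    ttr: float  # time to repair, in hours
    tbf: float  # time between failures, in hours
    ot: float   # assumed to be UT + DT, in hours

    def is_consistent(self) -> bool:
        """Check the relations observed in the rows of Table 5."""
        return math.isclose(self.ot, self.ut + self.dt) and math.isclose(
            self.tbf, self.ut
        )


# Row 0 of Table 5.
row0 = MaintenanceRecord(
    registration_date="2016-08-27 15:30:11.141296",
    maintenance_type="Corrective",
    replaced_component="Instrument, pressure",
    failure_mode="AIR",
    active_maintenance_hours=44.0,
    cost_cop=405752.0,
    ut=3273.451392,
    dt=44.0,
    ttr=44.0,
    tbf=3273.451392,
    ot=3317.451392,
)
print(row0.is_consistent())
```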
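The records in Table 5 can be used to derive maintenance indicators. The following is a minimal sketch, not the authors' implementation: the class name `MaintenanceRecord` is hypothetical, UT and DT are read here as up time and down time, and the indicator definitions (MTBF as the mean TBF, MTTR as the mean TTR, availability as total UT over total OT) are common conventions assumed for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MaintenanceRecord:
    """One row of Table 5 (hypothetical container); all times in hours."""
    maintenance_type: str
    ut: float   # UT column (read here as up time)
    dt: float   # DT column (read here as down time)
    ttr: float  # TTR column (time to repair)
    tbf: float  # TBF column (time between failures)
    ot: float   # OT column, as reported in the table


# Values copied from the four rows of Table 5.
records = [
    MaintenanceRecord("Corrective", 3273.451392, 44.0, 44.0, 3273.451392, 3317.451392),
    MaintenanceRecord("Preventive", 1536.894218, 44.0, 44.0, 1536.894218, 1580.894218),
    MaintenanceRecord("Corrective", 8748.577526, 44.0, 44.0, 8748.577526, 8792.577526),
    MaintenanceRecord("Preventive", 168077.471786, 44.0, 44.0, 168077.471786, 168121.471786),
]

# Relationship visible in the table: OT = UT + DT for every record.
consistent = all(abs(r.ut + r.dt - r.ot) < 1e-6 for r in records)

# Illustrative indicators under the assumed definitions stated above.
mtbf = mean(r.tbf for r in records)
mttr = mean(r.ttr for r in records)
availability = sum(r.ut for r in records) / sum(r.ot for r in records)

print(f"OT = UT + DT holds for all records: {consistent}")
print(f"MTBF = {mtbf:.2f} h, MTTR = {mttr:.2f} h, availability = {availability:.4f}")
```

With only four records the indicators are not statistically meaningful; the sketch only shows how such quantities could be computed from the record structure.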
One hundred and seventeen (117) filtered synthetic maintenance records were obtained for the first dataset, and twenty-six (26) for the second.
The statistical comparison between the synthetic datasets and the OREDA benchmark is summarized in Table 6, which presents key failure rate metrics. Subsequently, the Time Between Failures (TBF) distributions are visualized in Figures 5–7.
Table 6. Global failure rate (per 10⁶ h)
| Dataset | Min | μ | σ | Max |
|---|---|---|---|---|
| OREDA | 3 × 10⁻⁴ | 28.08 | 56.95 | 136.71 |
| First synthetic dataset | 5.70 | 129.20 | 38.08 | 84388.18 |
| Second synthetic dataset | 2.74 | 27.53 | 13.72 | 536.80 |
Figure 5. Statistical distribution of TBF for the asset (horizontal axis: Time Between Failures; vertical axis: Probability Density)
Figure 6. Statistical distribution of TBF for the asset obtained from the synthetic maintenance dataset (horizontal axis: Time Between Failures; vertical axis: Probability Density)
Figure 7. Statistical distribution of TBF for the asset obtained from the synthetic maintenance dataset (horizontal axis: Time Between Failures; vertical axis: Probability Density)
Figure 5 depicts the gamma distribution of TBF derived from OREDA data, which serves as the reference for synthetic data generation. Figure 6 shows the TBF distribution for the first synthetic dataset, which exhibits a higher mean failure rate and a distinct spread compared to OREDA. Figure 7 presents the TBF distribution for the second synthetic dataset, which aligns closely with OREDA's mean.
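To illustrate how TBF values of this kind can be drawn with pseudo-random functions, the sketch below uses Python's standard `random` module. It is not the article's algorithm. It assumes that the failure rate follows a gamma distribution fitted by the method of moments to the OREDA mean and standard deviation in Table 6, that the rate is clamped to the OREDA minimum and maximum, and that each TBF is exponential with the sampled rate. The function name `sample_records` and the seed are hypothetical.

```python
import random

# OREDA row of Table 6: global failure rate per 10^6 hours.
RATE_MIN, RATE_MEAN, RATE_STD, RATE_MAX = 3e-4, 28.08, 56.95, 136.71

# Method-of-moments gamma parameters (assumption: rate ~ Gamma(SHAPE, SCALE)).
SHAPE = (RATE_MEAN / RATE_STD) ** 2
SCALE = RATE_STD ** 2 / RATE_MEAN


def sample_records(k, seed=0):
    """Return k pairs (failure rate per 10^6 h, TBF in hours)."""
    rng = random.Random(seed)  # local generator, reproducible for a given seed
    samples = []
    for _ in range(k):
        rate = rng.gammavariate(SHAPE, SCALE)
        # Illustrative clamping: keep the rate inside the OREDA range.
        rate = min(max(rate, RATE_MIN), RATE_MAX)
        # Exponential TBF; the rate is converted from per 10^6 h to per hour.
        tbf_hours = rng.expovariate(rate * 1e-6)
        samples.append((rate, tbf_hours))
    return samples


samples = sample_records(k=1000)
mean_rate = sum(rate for rate, _ in samples) / len(samples)
print(f"{len(samples)} records, mean sampled failure rate = {mean_rate:.2f} per 10^6 h")
```

Because clamping truncates the upper tail, the sampled mean rate will generally differ from the OREDA mean; this is the kind of deviation that the percentage errors discussed next are meant to quantify.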
As observed in Table 6, the second synthetic dataset demonstrates better alignment with OREDA's mean failure rate. The percentage errors for the mean (%μ), standard deviation (%σ), and maximum value (%max) are calculated as follows (Montgomery, 2004):
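As an illustration, the calculation can be sketched in Python under the assumption that each percentage error is the absolute relative difference with respect to the OREDA value. The function name `percentage_error` is hypothetical, and the inputs are the OREDA row and the last row of Table 6.

```python
def percentage_error(reference, value):
    """Absolute relative difference with respect to the reference, in percent."""
    return abs(reference - value) / abs(reference) * 100.0


# Table 6 values (global failure rate per 10^6 hours).
oreda = {"mean": 28.08, "std": 56.95, "max": 136.71}
synthetic = {"mean": 27.53, "std": 13.72, "max": 536.80}  # last row of Table 6

errors = {key: percentage_error(oreda[key], synthetic[key]) for key in oreda}

for key, err in errors.items():
    print(f"%{key}: {err:.2f}")
```

Under this assumed definition, the three values come out at 1.96%, 75.91%, and 292.66%, which matches the errors reported in the Conclusion.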
4. Conclusion
This paper established a methodology and developed pseudo-code algorithms to generate synthetic maintenance data for industrial assets. By implementing pseudo-random functions within statistical, probabilistic, and exponential-variate distributions based on the OREDA, the approach successfully produces realistic maintenance records.
The results demonstrate the method's practical value. As evidenced in Table 6, the second synthetic dataset performs notably well: its mean global failure rate differs from the OREDA value by only 1.96% (10). This accuracy in replicating the central tendency underscores the model's effectiveness. While larger errors were observed for the standard deviation (75.91%) and the maximum global failure rate (292.66%), these are not inherently unfavorable. Instead, they reflect the stochastic nature of the pseudo-random generation across multiple failure modes, which gives each synthetic dataset its own variability, from which valuable maintenance engineering insights can be extracted.
Ultimately, this work provides a robust, low-risk tool for simulation and decision-making in asset management. The generated datasets, exemplified by the centrifugal pump records in Table 5, possess statistical and behavioral fidelity to real-world data. This enables organizations to test maintenance strategies, train models, and plan asset lifecycles efficiently and safely, without jeopardizing operational integrity, production, safety, or the environment. Thus, the research bridges a critical gap between theory and practice by using synthetic maintenance data generation as a foundational seed for advanced applications such as digital twins and machine learning models. This facilitates the incorporation of these and other data-driven tools into decision-making processes to continuously improve maintenance management.
This work is supported by the University of Pamplona and the University of the Andes, Colombia.
Note
Although four failure modes were determined, this does not mean that more failure modes cannot be considered.
The supplementary material for this article can be found online.

