Elevator door system failures are a leading cause of elevator malfunctions, impacting safety and operational efficiency. Existing anomaly detection methods often overlook the relative positional relationships among time-series data sequences. This study proposes a novel spatial-temporal information fusion approach to accurately identify abnormal states in elevator door systems.
This paper develops an integrated spatial-temporal fusion framework for anomaly detection. First, operational time-series data are structured into a node graph using a K-Nearest Neighbor (KNN) based method to model complex inter-sequence interactions. Subsequently, a Graph Convolutional Network (GCN) is employed to extract local dependencies and global spatial information, while a Bidirectional Long Short-Term Memory network (BiLSTM) captures temporal evolutionary characteristics. A multi-source information-driven feature fusion mechanism is then designed to enhance model robustness. The proposed method is experimentally validated using real-world elevator door operating data, including vibration and video-derived signals.
The experimental results demonstrate that the proposed method effectively identifies abnormal elevator door states, achieving a high detection accuracy of 96.27%. This confirms the framework's effectiveness and reliability in practical scenarios.
This research presents a novel KNN-based graph construction method that captures dynamic dependencies between time-series sequences based on their relative positions. Furthermore, it develops an integrated framework that concurrently models spatial structural relationships and temporal dynamics, overcoming the limitations of methods that treat these dimensions separately. Finally, it introduces a multi-source feature fusion mechanism that leverages the complementarity of information from different sources and dimensions, significantly enhancing the model's representation capability and robustness under complex operating conditions.
1. Introduction
As a critical subsystem for passenger access, elevator door performance significantly impacts both safety and operational efficiency. Statistics indicate that malfunctions in elevator door systems account for over 50% of all elevator failures, constituting a leading cause of passenger entrapments, service disruptions, and even fatal incidents such as shearing injuries and falls (Lan et al., 2021; Wang et al., 2018). Given the structural complexity and variable operating conditions of elevator door systems, performance degradation often manifests as gradual, latent processes that can culminate in sudden failures. Consequently, reliance solely on traditional periodic inspections or reactive repairs proves insufficient for achieving effective fault prediction and precise performance assessment (An et al., 2021). Therefore, the development of intelligent, real-time anomaly detection methods for elevator door systems is crucial for enhancing reliability, ensuring passenger safety, and optimizing maintenance strategies.
In recent years, elevator operation condition monitoring and fault diagnosis techniques have made significant progress (Esteban et al., 2016; Wang et al., 2024). Numerous researchers have evaluated and predicted the overall health state of elevators utilizing vibration analysis, operational data mining, artificial intelligence algorithms, and related approaches. For instance, Niu monitored elevator conditions through time-domain and frequency-domain analysis of vibration signals (Niu et al., 2025). Pan proposed an elevator risk assessment method based on an integration of fuzzy comprehensive evaluation and machine learning (Pan et al., 2023). Guo presented a semi-supervised two-stage deep learning network, comprising a feature selector and a classifier, for elevator condition assessment (Guo et al., 2024a). Recent advances in deep learning for fault diagnosis in mechanical and electromechanical systems, such as multi-scale capsule networks (Lu et al., 2025), vision transformers with time-frequency fusion (Xiao et al., 2025a), and novel convolutional attention mechanisms (Wang et al., 2025), have demonstrated strong feature extraction capabilities, but their direct application to the specific spatio-temporal dependency problem in elevator door systems remains unexplored. Moreover, research on anomaly detection and performance evaluation for door systems, which are a high-incidence area for elevator failures and directly affect passenger safety, remains relatively scarce. Existing methods assessing the overall elevator state often struggle to capture the subtle performance degradation characteristics unique to door systems, and a real-time evaluation system capable of assessing the dynamic performance of door systems under complex interaction conditions is notably lacking. Under such conditions, failures often arise from chain reactions and spatio-temporal coupling among multiple component degradations during operation. For example, belt wear may affect door speed, which in turn exacerbates vibration induced by worn sliders. Therefore, research focused on anomaly detection methods for door systems is of significant importance for advancing predictive maintenance and enabling early fault warning in elevators.
Traditional signal processing and shallow learning methods usually rely on handcrafted features and assume fixed temporal or statistical patterns. However, when applied to elevator door systems, such assumptions often become restrictive. Elevator door anomalies frequently exhibit strong nonlinearity, temporal dependency, and structural coupling among signal segments, which are difficult to capture using conventional approaches. Graph convolutional networks provide a natural way to model such non-Euclidean structural relationships, while deep sequential models enable automatic extraction of hierarchical temporal features. As a result, combining graph convolutional networks with deep temporal models provides a suitable framework for modeling elevator door anomalies.
To address the challenge of capturing the performance characteristics of the door system and the lack of a dynamic evaluation framework, this paper proposes a method that integrates temporal and spatial features from multi-source information to enable accurate detection of elevator door anomalies. This research aims to construct a framework capable of simultaneously capturing spatial and temporal dynamic features and achieving high-precision anomaly state identification. The main contributions of this work are summarized as follows:
A KNN-based method is proposed for constructing a graph structure from time-series data. This structure characterizes spatial correlations between sensor nodes while effectively capturing dynamic dependencies among sequences based on their relative positions.
An integrated spatiotemporal information processing framework is developed. Within this framework, graph convolutional networks extract spatial features encompassing both local node dependencies and global spatial topology. Concurrently, bidirectional long short-term memory networks are employed to capture long-term evolutionary patterns and short-term fluctuation characteristics within the temporal dimension.
To overcome the limitations of single-source signals, a multi-source feature fusion mechanism is proposed. It exploits the complementarity of information from different sources and dimensions, improving the model's representation capability, detection accuracy, and reliability under complex operating conditions.
The remainder of the paper is organized as follows: Section 2 introduces related work on spatial-temporal and multi-source information fusion. Section 3 details the proposed methodology for multi-source spatiotemporal fusion. Section 4 describes the data acquisition process and presents experimental validation of the proposed method. Finally, Section 5 concludes the paper.
2. Related work
In recent years, with the spread of industrial Internet of Things (IoT) technology, multi-dimensional data generated during the operation of elevator door systems, such as vibration, sound, and video of door opening and closing, can be collected in real time (Xie et al., 2024; Hsu et al., 2020; Pan et al., 2024). These data contain rich operational state and safety information, reflecting both the dynamic characteristics of the system as it evolves over time and the interrelationships among system states at different locations. Owing to their advantage in modeling non-Euclidean spatial relationships in sensor networks, graph convolutional networks have found widespread application in areas such as traffic flow prediction and social behavior analysis since Kipf and Welling introduced the semi-supervised framework (Kipf and Welling, 2016). Yang combined capsule networks with GCN for harmonic drive compound fault diagnosis (Yang et al., 2023a). Seo proposed a method combining graph autoencoders with graph convolutional networks to model dynamic relationships among subsystems in hydrogen extraction systems, enabling effective anomaly detection and fault diagnosis (Seo et al., 2024). Chen proposed the SA-GCN framework, which integrates structural analysis prior knowledge with measurement data (Chen et al., 2022). Xiao proposed MV-TriGCN, a semi-supervised multi-view learning framework that enhances GCN stability and generalization through an improved triplet loss, diverse view graph construction, and stepwise training (Xiao et al., 2025b). Most of the above studies focus on modeling the spatial structural features of sensors. However, in practice, sensor data often contain both temporal evolution patterns and spatial dependencies, and these two dimensions are distinct yet tightly intertwined. Therefore, a comprehensive consideration of both temporal and spatial dimensions is needed to more fully explore the data's intrinsic structure and potential semantics.
Furthermore, in the field of intelligent operation and maintenance of industrial equipment, the integration of multi-source sensor data for spatial-temporal modeling has emerged as a core technical approach for condition monitoring and performance evaluation (Wu et al., 2024; Han et al., 2025; Lv et al., 2025). Fathizadan proposed a spatio-temporal anomaly detection framework using convolutional LSTM autoencoders and control charts to effectively identify anomalies in additive manufacturing processes (Fathizadan et al., 2024). Sun performed fault diagnosis through temporal node connection modeling (Sun and Yin, 2025). Guo proposed a time–frequency domain interpretation method for CNNs in bearing fault diagnosis, enhancing model transparency through Grad-CAM and gradient-ascent-based kernel visualization (Guo et al., 2024b). Allen proposed the KESA framework, which combines deep learning with domain knowledge for spatiotemporal fault detection and interpretation in complex industrial processes (Allen et al., 2024). Addressing the limitations of multi-sensor long-term sequence modeling in structural health monitoring, Yang utilized a two-layer attention mechanism to enhance LSTM's ability to capture spatial-temporal dynamic features (Yang et al., 2023b). Zhou realized spatial-temporal joint modeling through the coupling of multi-view graph attention and temporal convolution (Zhou and Wang, 2024). However, the above methods generally define the graph structure based on the static correlations of sensors, while ignoring the dynamic dependencies between time series data points based on their relative temporal positions. This omission of temporal association patterns hinders the model's ability to fully capture the complex spatiotemporal interactions during equipment operation, consequently limiting diagnostic accuracy and robustness.
Although deep learning has significantly promoted the development of intelligent fault diagnosis techniques, methods relying on a single signal source often face a generalization bottleneck due to insufficient feature characterization capability under complex and severe working conditions. For this reason, multi-source heterogeneous data fusion technology has become an important research direction. Currently, fault diagnosis research based on information fusion is conducted at three levels, i.e., the data level, feature level, and decision level. Zhang employed a residual pyramid algorithm to separately fuse acoustic and vibration signals from multiple spatial locations, generating two fused acoustic-vibration signals, and subsequently constructed a multi-source fault feature set (Zhang et al., 2022a). Feng proposed a multisource state space-based method for tool RUL prediction that models multistage degradation using a Wiener process, achieving improved prediction accuracy by integrating historical and real-time monitoring data (Feng et al., 2025a). Xu proposed a dynamic feature selection matrix optimization information integration method to transform a multi-source information system into a single-source information system (Xu and Li, 2025). Feng proposed a diagnostic framework based on vibration, electrostatic, and infrared multi-source heterogeneous data fusion, and achieved online diagnosis of gearbox faults through data conversion and feature fusion (Feng et al., 2025b). While these fusion methods have shown promise in fault diagnosis, they also raise concerns regarding data transformation and feature redundancy, as well as decision ambiguity.
In summary, current research has made significant progress in graph structure construction and multi-source data fusion, especially in modeling spatial structure using GCN and capturing temporal dependence with the help of models such as LSTM. However, most of the existing methods deal with the temporal and spatial dimensions separately, ignoring the interactive characteristics of temporal location and structural relationships among sequences, which makes it difficult to comprehensively portray the dynamic evolution mechanism in equipment operation. To this end, this paper proposes a multi-source information anomaly detection method that fuses temporal and spatial features. This method accurately identifies abnormal states of the elevator door system by deeply extracting latent temporal and spatial dependencies from the data and effectively incorporating them within a multi-source information fusion framework.
3. Methodology
3.1 Problem formulation and overall architecture
Elevator door anomaly detection focuses on identifying abnormal operating states during door opening and closing processes based on multi-source time series data. In practical operation, door system faults usually do not appear as isolated signal fluctuations, but rather manifest as localized deviations in temporal evolution and abnormal interactions among different signal segments. Such patterns are typically associated with interactions among multiple degraded components during door operation.
Therefore, elevator door anomaly detection is treated as a spatio temporal pattern recognition task, in which both the temporal evolution of individual signals and the dependency relationships among time series segments are considered. Effective detection requires capturing long-term trends and short-term fluctuations in the temporal domain, as well as modeling the structural relationships that reflect similarity and interaction among segments under different operating conditions. Based on this consideration, a spatio temporal modeling framework is developed to support accurate identification of abnormal door operation states.
To address this problem, this study proposes a multi-source spatio temporal information fusion framework for elevator door anomaly detection, as illustrated in Figure 1. The overall framework integrates temporal feature extraction, spatial dependency modeling, and multi-source feature fusion to achieve accurate identification of abnormal door operation states.
Specifically, the proposed framework consists of a temporal modeling module, a spatial modeling module, and a feature fusion and classification module. The input of the framework includes multi-source signals collected during elevator door operation, such as video-derived speed sequences and vibration signals. These signals are first preprocessed and segmented into time series samples, which serve as the basic units for subsequent spatio temporal modeling.
In the temporal modeling module, a CNN is employed to extract local temporal features, while a BiLSTM is used to capture long-term temporal dependencies. In addition, a multi-head attention pooling mechanism is introduced to emphasize time steps that are more sensitive to anomaly characteristics, thereby enhancing the discriminative ability of temporal features.
In parallel, the spatial modeling module aims to characterize the dependency relationships among time series segments. Each segment is treated as a node in the graph, and a K-nearest-neighbor-based graph construction strategy is used to establish similarity connections among nodes. Graph convolutional networks are then applied to the constructed graph to extract spatial features that reflect localized and sample-specific interactions associated with abnormal door behavior.
Finally, the temporal features and spatial features extracted from different sources are fused to form a unified spatio temporal representation. This fused representation is fed into a fully connected classifier to output the anomaly category of the elevator door system.
The structure of the fusion module is expressed as:
$$F = \mathrm{Concat}(F_T, F_S)$$
$$\hat{y} = \mathrm{softmax}(W F + b)$$
where $F$ represents the fused spatial-temporal feature set, $F_T$ represents the multi-source temporal feature set, $F_S$ represents the multi-source spatial feature set, $W$ represents the weight, $b$ represents the bias, and $\hat{y}$ represents the output probability distribution over the state categories.
Through this integrated spatio temporal modeling framework, the proposed method enables effective anomaly detection under complex operating conditions.
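As a rough illustration of the fusion step, the sketch below concatenates per-source temporal and spatial feature vectors and applies one linear layer followed by softmax. The function name `fuse_and_classify`, the feature sizes, and the choice of plain concatenation with a single linear layer are assumptions made for illustration; the layer sizes used in the paper are not stated in this section.

```python
import numpy as np

def fuse_and_classify(temporal_feats, spatial_feats, weight, bias):
    """Concatenate feature vectors and map them to class probabilities.

    temporal_feats / spatial_feats: lists of 1-D arrays, one per source
    (e.g. video and vibration). weight: (n_classes, fused_dim). bias: (n_classes,).
    """
    fused = np.concatenate(list(temporal_feats) + list(spatial_feats))
    logits = weight @ fused + bias
    logits = logits - logits.max()          # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return fused, probs

# Hypothetical sizes: two temporal vectors of length 3, two spatial of length 2,
# and four door states (normal, slowdown, jamming, abnormal door closing).
temporal = [np.ones(3), np.ones(3)]
spatial = [np.zeros(2), np.zeros(2)]
fused, probs = fuse_and_classify(temporal, spatial, np.zeros((4, 10)), np.zeros(4))
```

With an all-zero weight matrix the classifier has no preference, so the four class probabilities come out equal; trained weights would replace the zeros.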
3.2 KNN graph construction
Selecting an appropriate graph construction strategy is crucial for modeling the dynamic interactions in elevator door operation. Existing approaches include fully connected graphs, correlation-based graphs, and physical topology-based graphs. Fully connected graphs introduce dense and redundant connections, which may lead to over-smoothing and reduced discriminative ability in graph convolutional networks (Li et al., 2018). Correlation-based graph construction methods rely on global statistical dependency and typically produce static graphs, which may be insufficient to characterize localized and transient interactions in time-series segments (Wu et al., 2020). Physical topology-based graphs require explicit prior knowledge of sensor layouts or mechanical structures; however, in elevator door systems, the interactions among temporal segments are primarily driven by motion consistency and feature similarity rather than fixed spatial adjacency.
To address these limitations, we employ a KNN based approach to construct a dynamic, sample-specific graph. This method connects each time series segment to its k most similar neighbors in the feature space, effectively capturing localized relationships that are indicative of anomalous patterns. The KNN graph is adaptive to each sample and preserves locality, making it particularly suitable for detecting the subtle, localized deviations characteristic of elevator door faults (Qi et al., 2021; Seo et al., 2024).
The detailed construction process is as follows.
A time series sample of length $T$ can be represented as:
$$S = \{s_1, s_2, \ldots, s_T\}$$
Before constructing the node graph, global normalization of the sequences is required to eliminate numerical scale differences between different samples. The standardization formula is as follows:
$$\tilde{s}_t = \frac{s_t - \mu}{\sigma}$$
where $\tilde{s}_t$ represents the standardized sample value at time $t$, $s_t$ denotes the original time series sample value at time $t$, and $\mu$ and $\sigma$ stand for the global mean and standard deviation across all time points, respectively.
Subsequently, the sliding window method is used to extract node features. At time $i$, a sliding window consisting of the time point itself and the subsequent $w$ time points serves as the node input. The features of the $i$th graph node are then constructed as follows:
$$x_i = [\tilde{s}_i, \tilde{s}_{i+1}, \ldots, \tilde{s}_{i+w}]$$
This approach effectively captures the dynamic changes of local time slices and enhances the temporal awareness of nodes.
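The normalization and sliding-window steps can be sketched as follows. This is a minimal illustration, assuming the mean and standard deviation are computed over the single series passed in and that each node takes one time point plus the `w` following points; the function name `build_node_features` and the window length are hypothetical.

```python
import numpy as np

def build_node_features(series, w):
    """Z-score normalize a 1-D series and cut it into overlapping windows.

    Node i holds the value at step i plus the w following values, so each
    node has w + 1 features and the series yields len(series) - w nodes.
    """
    s = np.asarray(series, dtype=float)
    std = s.std()
    if std == 0:                      # constant signal: avoid dividing by zero
        std = 1.0
    z = (s - s.mean()) / std
    n_nodes = len(z) - w
    if n_nodes <= 0:
        raise ValueError("series is too short for the requested window")
    return np.stack([z[i:i + w + 1] for i in range(n_nodes)])

# Toy series of six points with w = 2 gives four nodes of three features each.
nodes = build_node_features([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], w=2)
```

Consecutive windows overlap by `w` points, which is what lets neighboring nodes share local context.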
The graph structure data is further constructed based on the node features and the topology of the graph. It can be represented as follows:
$$G = (X, E, A)$$
where $X \in \mathbb{R}^{N \times d}$ is the node feature matrix, with $N$ being the number of nodes and $d$ being the feature dimension of a single node, $E$ represents the edge set, and $A \in \mathbb{R}^{N \times N}$ represents the adjacency matrix of the graph.
To establish structural connections between nodes, the edge set is constructed using the KNN method with Euclidean distance (Qi et al., 2021). For each node $i$, the $k$ nearest non-self nodes in the feature space are selected to form a neighbor set $\mathcal{N}_i$, which takes the following form:
$$\mathcal{N}_i = \underset{j \neq i}{\operatorname{arg\,min}_k}\, \lVert x_i - x_j \rVert_2$$
where $x_i$ and $x_j$ are the normalized node features, $\lVert \cdot \rVert_2$ denotes the Euclidean distance, and $\operatorname{arg\,min}_k$ denotes the selection of the $k$ neighboring nodes with the smallest distance.
While constructing edges, a Gaussian kernel function is used to determine the edge weights, so that closer neighbors receive stronger connections:
$$A_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\tau^2}\right), \quad j \in \mathcal{N}_i$$
where $\tau$ is the kernel function bandwidth coefficient, which controls the attenuation of adjacency strength. The construction of the time series graph structure using a sliding window and K-nearest neighbors is illustrated in Figure 2.
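A small sketch of KNN graph construction with Gaussian edge weights is given below. It assumes the common kernel form exp(-d^2 / (2 * bandwidth^2)) and symmetrizes the adjacency matrix so that an edge exists if either endpoint selects the other; both choices, along with the name `knn_gaussian_graph`, are illustrative assumptions rather than details confirmed by the text.

```python
import numpy as np

def knn_gaussian_graph(nodes, k, bandwidth):
    """Return a weighted adjacency matrix linking each node to its k nearest neighbors."""
    x = np.asarray(nodes, dtype=float)
    n = x.shape[0]
    if not 0 < k < n:
        raise ValueError("k must be between 1 and n - 1")
    # Pairwise Euclidean distances between node feature vectors.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a node is never its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    adj = np.zeros((n, n))
    adj[rows, cols] = np.exp(-dist[rows, cols] ** 2 / (2.0 * bandwidth ** 2))
    return np.maximum(adj, adj.T)             # assumed symmetrization

# Four one-dimensional nodes; with k = 1 each node links to its closest neighbor.
toy_nodes = np.array([[0.0], [1.0], [3.0], [10.0]])
adj = knn_gaussian_graph(toy_nodes, k=1, bandwidth=1.0)
```

A smaller bandwidth makes weights decay faster with distance, so distant neighbors contribute little even when they are selected by the KNN step.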
3.3 Spatio-temporal feature extraction based on TSGCN
Based on the KNN constructed dynamic graph described in Section 3.2, the spatio temporal feature extraction stage aims to jointly model temporal evolution patterns and dependency relationships among time series segments during elevator door operation. In this work, each time series segment is treated as a graph node, while edges represent similarity based interactions captured by the KNN graph. Under this formulation, elevator door anomaly detection requires simultaneous characterization of temporal dynamics within each node and spatial dependencies across nodes.
To meet this requirement, a temporal spatial graph convolutional network (TSGCN) is employed as the core feature extraction model. The TSGCN integrates temporal modeling modules with graph convolution operations, enabling effective extraction of both time-domain characteristics and structural information embedded in the constructed graph. This design allows the model to capture localized abnormal patterns and their propagation across related segments, which are critical for accurate detection of elevator door anomalies.
3.3.1 Time dimension model
To extract short-term fluctuations and long-term structural features in time series data, a temporal modeling module consisting of a CNN, a BiLSTM, and multi-head attention pooling is proposed. The module sequentially extracts local dynamic patterns by convolution, captures global temporal dependencies with the BiLSTM, and aggregates key time steps with weights learned by the multi-head attention mechanism, finally generating discriminative sequence-level representations.
The convolutional stage captures abrupt changes, local patterns, and instantaneous variations in signals within a short period by sliding the local receptive field of the convolutional neural network along the sequence. It is well suited to signals with pronounced local structure, such as vibration signals and video-derived speed curves. Two layers of 1-D convolution and pooling operations are used to compress the sequence length and enhance the extraction of local features.
Subsequently, the convolved sequence features are fed into a bidirectional LSTM network to integrate contextual information and establish long-term dependencies in the temporal dimension. Compared to a unidirectional LSTM, BiLSTM employs parallel computation with both forward and backward LSTM units, effectively merging preceding and succeeding information at each time step to capture the signal's global temporal structure more comprehensively.
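In BiLSTM, the forward layer computes sequentially from $t = 1$ to $t = T$, storing the forward hidden state $\overrightarrow{h}_t$ at each time step. Conversely, the backward layer computes in reverse order from $t = T$ to $t = 1$, recording the backward hidden state $\overleftarrow{h}_t$ (Schuster and Paliwal, 1997). Finally, the hidden states from both directions are concatenated to form the output representation for the current time step, which is then used for subsequent decision-making or feature fusion.
The computational procedure can be formalized as:
$$\overrightarrow{h}_t = f\left(W_1 x_t + W_2 \overrightarrow{h}_{t-1}\right)$$
$$\overleftarrow{h}_t = f\left(W_3 x_t + W_4 \overleftarrow{h}_{t+1}\right)$$
$$y_t = g\left(W_5 \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right]\right)$$
where $x_t$ represents the input at time step $t$, $f$ is the activation function, $g$ denotes the output transformation, and $W_1, \ldots, W_5$ are the weight parameters.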
To obtain a global representation of the entire sequence, a multi-head attention pooling mechanism is introduced to replace the traditional average pooling or maximum pooling. This mechanism learns different attention weights, enabling the selection of the most discriminative time step information for the classification task. The computational process is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ represents the feature dimension of the key/query in each attention head.
Multiple attention heads compute attention responses in parallel across different subspaces. These responses are then concatenated and projected to obtain a unified representation.
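One way to realize multi-head attention pooling over a sequence of hidden states is sketched below. It assumes each head owns a learnable query vector and its own key and value projections, which is a common design but not necessarily the one used in the paper; all parameter names and sizes are hypothetical, and random values stand in for trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(h, queries, w_k, w_v, w_o):
    """Pool a (T, d) sequence into one vector using several attention heads.

    queries: (heads, d_k); w_k, w_v: (heads, d, d_k); w_o: (heads * d_k, d_out).
    Returns the pooled vector and the per-head attention weights over time.
    """
    d_k = queries.shape[1]
    pooled, weights = [], []
    for q, wk, wv in zip(queries, w_k, w_v):
        keys = h @ wk                               # (T, d_k)
        values = h @ wv                             # (T, d_k)
        alpha = softmax(keys @ q / np.sqrt(d_k))    # (T,) weights over time steps
        pooled.append(alpha @ values)               # weighted sum, shape (d_k,)
        weights.append(alpha)
    return np.concatenate(pooled) @ w_o, np.stack(weights)

rng = np.random.default_rng(0)
T, d, heads, d_k, d_out = 12, 8, 2, 4, 6
h = rng.normal(size=(T, d))
pooled, alpha = attention_pool(
    h,
    queries=rng.normal(size=(heads, d_k)),
    w_k=rng.normal(size=(heads, d, d_k)),
    w_v=rng.normal(size=(heads, d, d_k)),
    w_o=rng.normal(size=(heads * d_k, d_out)),
)
```

Each row of `alpha` sums to one, so a head that concentrates its weight on a few time steps is effectively selecting them as the most informative for classification.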
The overall time dimension model is shown in Figure 3.
3.3.2 Spatial dimension model
The spatial dimension model aims to characterize the structural relationships between different nodes within a signal. Given the low dimensionality and limited expressive power of the original node features, a linear mapping is employed to increase their dimensionality, as expressed below:
$$H^{(0)} = X W_0 + b_0$$
where $X$ is the original node feature matrix, $H^{(0)}$ is the dimension-raised feature matrix, $W_0$ is the dimension-raising weight matrix, and $b_0$ is the bias vector.
After that, the dimension-raised features are fed into the GCN to model the structural relationships between nodes. The adjacency matrix of the graph is $A \in \mathbb{R}^{N \times N}$, where $A_{ij} \neq 0$ indicates that node $i$ and node $j$ are connected. A symmetric normalization method is used to construct the graph propagation matrix, expressed as follows (Seo et al., 2024):
$$\hat{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$$
where $I$ is the identity matrix and $\tilde{D}$ is the degree matrix of $A + I$.
The propagation and updating process of the $l$th layer graph convolution is expressed as:
$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right)$$
where $H^{(l)}$ denotes the node features of the $l$th layer, $W^{(l)}$ is the learnable parameter of the layer, and $\sigma$ is the nonlinear activation function.
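The symmetric normalization and a single graph convolution step can be sketched in a few lines. The example assumes self-loops are added before normalization and that the activation is ReLU; the activation choice and the function names are assumptions for illustration.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize an adjacency matrix after adding self-loops."""
    a_tilde = adj + np.eye(adj.shape[0])
    deg = a_tilde.sum(axis=1)                 # at least 1 for non-negative weights
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, h, weight):
    """One propagation step: aggregate neighbor features, project, apply ReLU."""
    return np.maximum(a_hat @ h @ weight, 0.0)

# Two connected nodes with one-hot features and an identity projection.
toy_adj = np.array([[0.0, 1.0], [1.0, 0.0]])
a_hat = normalize_adjacency(toy_adj)
node_out = gcn_layer(a_hat, np.eye(2), np.eye(2))
graph_feature = node_out.mean(axis=0)         # global average pooling over nodes
```

In this toy case every entry of `a_hat` is 0.5, so each node ends up with an equal mix of its own and its neighbor's features, which is the smoothing behavior the propagation rule is designed to produce.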
The overall spatial dimensional model is shown in Figure 4.
The pseudocode of TSGCN for anomaly detection is presented in Algorithm 1. During each epoch, the training process is executed, and subsequently, validation is performed. The system saves the model that achieves the best performance on the validation dataset. Upon reaching the maximum epoch, the optimal model is deployed to identify anomaly types within the test set.
Algorithm 1. Spatial-temporal information fusion for anomaly detection
Input: Multi-modal time series data (video sequence, vibration sequence) and the corresponding graph structure data
Output: Anomaly category prediction results
1. Preprocessing stage:
2. Time series are extracted from video and sensors
3. KNN graph structures for the video and vibration signals are constructed based on signal features
4. The dataset is divided into training, validation, and test sets
5. Training and validation stage:
6. Network parameters are randomly initialized
7. Inputs: hidden layer dimension, learning rate, batch size, number of training epochs
8. for each training epoch do
9. A batch of training data is sampled
10. Temporal features are extracted using BiLSTM
11. Application of CNN to extract local temporal patterns
12. Encoding time series using BiLSTM
13. Implementation of multi-head attention pooling
14. BiLSTM features of video and vibration are concatenated
15. Spatial features are extracted using GCN
16. GCN modules are applied to the video graph and the vibration graph, respectively
17. Graph features are extracted through global average pooling
18. Graph convolutional features of video and vibration are concatenated
19. Temporal and spatial features are fused
20. BiLSTM and GCN features are concatenated
21. The result is input to a fully connected layer classifier
22. Calculate the cross-entropy loss
23. Backpropagate and update the parameters
24. Evaluate model performance on a validation set
25. end for
26. Testing stage:
27. Input: test set
28. Output: Anomaly type prediction results
Source(s): Algorithm created by authors
4. Experimental verification
To validate the effectiveness of the TSGCN-based anomaly detection method, experiments were conducted on the elevator door system dataset. To further evaluate its performance, several classical anomaly detection methods were selected as comparison baselines, and the influence of key TSGCN parameters on model performance was analyzed.
4.1 Experimental data setup
4.1.1 Data acquisition
The experimental data were collected from elevator doors in normal operation within an apartment building. Faults were artificially introduced to obtain data representing different states, including Normal, Slowdown, Jamming, and Abnormal Door Closing. The elevator doors have a center-opening structure. Vibration data were collected by installing an attitude measurement sensor in the gap between the landing door and the car door on one side of the car. This sensor integrates a three-axis accelerometer, a three-axis gyroscope, a three-axis angle sensor, and a three-axis magnetometer. The sensor supports data storage and export, with an output frequency ranging from 0.2 Hz to 200 Hz, and its sampling frequency is independently configurable. Video data, capturing the opening and closing process, were collected by deploying a camera in the corner of the car. The video frame rate was 30 fps, while the vibration sensor's sampling frequency was set to 50 Hz. Figure 5 shows the installation location of the acquisition equipment.
We simulated various states on the actual elevator door to obtain normal and abnormal data. Specifically, slowdown was simulated by inserting objects between the door and the column to increase running friction. Abnormal door closing was simulated by placing obstacles between the two doors so that they could not close. Jamming data were collected directly from an actual faulty elevator door.
4.1.2 Data processing
Each collected sample contains a complete door opening and closing operation. Vibration data were taken from the X-, Y-, and Z-axis acceleration signals. The X-axis represents the lateral vibration of the elevator door, the Y-axis represents the vertical vibration, and the Z-axis represents the fore-and-aft vibration. The three-dimensional vibration signals are shown in Figure 6.
The process of video data processing is as follows. First, each video sample is framed, and the edge position of the door panel in each frame is extracted using edge detection algorithms. Then, the pixel displacement between consecutive frames is calculated and converted to physical speed using a calibrated scale factor. An illustrative example of this process is shown in Figure 7, where raw video frames are transformed into a door displacement curve and subsequently into a speed curve.
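The conversion from per-frame edge positions to a speed curve can be sketched as follows. The edge positions are assumed to be already extracted by an edge detector, and the scale factor `meters_per_pixel` is a hypothetical calibration value; only the 30 fps frame rate comes from the data acquisition description.

```python
import numpy as np

def door_speed(edge_px, meters_per_pixel, fps=30.0):
    """Convert door-edge pixel positions (one per frame) to speed in m/s.

    Speed between two consecutive frames is the physical displacement
    divided by the frame interval 1 / fps.
    """
    position_m = np.asarray(edge_px, dtype=float) * meters_per_pixel
    return np.diff(position_m) * fps

# Hypothetical edge positions: the door moves 3 px per frame, then stops.
speed = door_speed([100, 103, 106, 106], meters_per_pixel=0.002)
```

A sequence of N frames yields N - 1 speed values, so the speed curve is one sample shorter than the displacement curve.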
Based on the extracted door displacement information, the opening and closing running curves of the elevator door are obtained, as shown in Figure 8.
Due to the varying lengths of the collected sample sequences, sequence alignment and padding mechanisms are introduced into the batch processing stage to ensure uniform input dimensions. This leverages LSTM's capability to handle variable-length sequences. Specifically, the samples within each batch are first sorted in descending order according to their sequence lengths. Then, a uniform padding strategy extends all sequences to the length of the longest sequence in the current batch. The padding value is set to zero to avoid interfering with valid features. To preserve the original temporal structure, the actual length information is transmitted to the network as input during sequence packing, eliminating the impact of padding values on model training.
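The sort, pad, and record-lengths procedure described above can be sketched in plain Python. The function name `pad_batch` and the returned `order` list are illustrative additions, not part of the original description.

```python
def pad_batch(sequences, pad_value=0.0):
    """Sort sequences by length (longest first) and zero-pad to a common length.

    Returns the padded batch, the true lengths, and the original indices in
    sorted order so predictions can be mapped back to the input samples.
    """
    if not sequences:
        raise ValueError("batch is empty")
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]), reverse=True)
    lengths = [len(sequences[i]) for i in order]
    max_len = lengths[0]
    padded = [
        list(sequences[i]) + [pad_value] * (max_len - len(sequences[i]))
        for i in order
    ]
    return padded, lengths, order

padded, lengths, order = pad_batch([[1, 2], [3, 4, 5, 6], [7]])
```

In a PyTorch implementation, lengths sorted in this way are the form `torch.nn.utils.rnn.pack_padded_sequence` expects by default; whether the authors used that utility is not stated.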
4.2 Experimental results
4.2.1 Evaluation indicators
The combined vibration and video dataset comprises 802 samples. In accordance with common practice, 60% of the data was randomly selected to form the training set, with 20% allocated to the validation set and 20% to the test set. The specific numbers of training, validation, and test samples are given in Table 1.
Accuracy and F1 score are used as core metrics to evaluate the model performance. Accuracy is the proportion of samples correctly predicted by the model out of all samples, reflecting the overall classification performance. The F1 score, which integrates precision and recall, is suitable for evaluating performance on imbalanced data or when misclassifications occur. The F1 score ranges from 0 to 1, with higher values indicating a better balance between precision and recall. The evaluation metrics are formulated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $FP$, $TN$, and $FN$ represent the numbers of true positive, false positive, true negative, and false negative samples, respectively.
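The standard definitions of these metrics in terms of true/false positive and negative counts can be written as a short sketch. The function names and the example counts are hypothetical and only illustrate the arithmetic; for a multi-class problem such as this one, the counts would be taken per class, and the averaging scheme is not specified in this section.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, chosen only to illustrate the formulas.
acc = accuracy(tp=8, fp=2, tn=9, fn=1)
f1 = f1_score(tp=8, fp=2, fn=1)

# Overall accuracy as a percentage when 155 of 161 samples are correct.
overall_pct = round(100 * 155 / 161, 2)
```

The last line reproduces the arithmetic behind the reported test accuracy of 96.27% (155 correct out of 161 test samples).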
4.2.2 Network structure parameterization
4.2.3 Experimental results
To validate the proposed method, we conducted experiments on a dataset collected from real elevator operations, containing both video and vibration signals. The classification results are shown in the confusion matrix in Figure 9. On the test set, 155 of the 161 samples were correctly classified, giving a detection accuracy of 96.27%. Notably, the recognition rate for jamming faults reached 100%. However, misclassifications occurred among the normal state (label 0), slowdown (label 1), and abnormal door closing (label 3). Further analysis revealed that, although these three states are distinct, they were all artificially simulated on the same elevator. This may have led to overlapping patterns in the sensor signals, hindering the model's ability to discriminate between them. Subsequent research will focus on introducing more types of sensor data to improve the differentiation of similar faults.
The features are further visualized using the UMAP method, and the results are shown in Figure 10. As the figure shows, the feature distributions for the “Normal” and “Slowdown” labels are highly similar and difficult to distinguish accurately, consistent with the confusion matrix results.
4.3 Further discussion
4.3.1 Hyperparameter sensitivity analysis
To explore the influence of hyperparameters on model performance, we conducted a hyperparameter sensitivity analysis to determine appropriate values. During KNN graph construction, the number of node neighbors (the K value) and the distance metric significantly influence the classification results. A K value that is too small results in a sparse graph structure which, while beneficial for capturing dependencies among local key points, may overlook information from long-range related nodes, leading to insufficient information propagation. Conversely, a K value that is too large creates a dense graph structure with more information propagation paths. This can alleviate the locality limitation of the adjacency structure but may introduce redundant or noisy connections, resulting in over-smoothing or overfitting.
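The effect of K on graph density can be illustrated with a small KNN graph construction sketch. The assumptions here are ours rather than the paper's: each node is represented by a feature vector, Euclidean distance is used, and the adjacency matrix is symmetrised so that an edge exists if either node selects the other as a neighbor. The function names and toy node values are hypothetical.

```python
import math
from typing import List


def knn_adjacency(nodes: List[List[float]], k: int) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix linking each node to its k nearest nodes."""
    n = len(nodes)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances from node i to every other node, nearest first.
        ranked = sorted((math.dist(nodes[i], nodes[j]), j) for j in range(n) if j != i)
        for _, j in ranked[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # symmetrise (an assumption of this sketch)
    return adj


def edge_count(adj: List[List[int]]) -> int:
    """Number of undirected edges in a symmetric adjacency matrix."""
    return sum(sum(row) for row in adj) // 2


# Toy one-dimensional nodes; the last one is far from the others.
nodes = [[0.0], [1.0], [2.0], [10.0]]
sparse_edges = edge_count(knn_adjacency(nodes, k=1))
dense_edges = edge_count(knn_adjacency(nodes, k=3))
```

With `k=1` the toy graph is a sparse chain, while `k=3` connects every pair of nodes, including the distant outlier, which mirrors the trade-off between insufficient propagation and redundant connections.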
In this study, we consider several commonly used distance metrics to explore their impact on temporal-spatial graph construction, including Euclidean, Manhattan, Chebyshev, and Minkowski distances. Euclidean distance measures the straight-line distance between two points in a multidimensional space and aligns well with the natural continuity of time series data. Manhattan distance, defined as the sum of absolute axial differences, emphasizes edge features and is more sensitive to small perturbations. Chebyshev distance focuses on the maximum coordinate difference, making it suitable for detecting extreme change points. Minkowski distance introduces a tunable parameter $p$ and reduces to Manhattan distance when $p = 1$, Euclidean distance when $p = 2$, and Chebyshev distance as $p \to \infty$. Therefore, we specifically analyze the effects of Euclidean, Manhattan, and Chebyshev distances on the experimental results. Based on comparative evaluation, Euclidean distance is selected as the primary metric for subsequent graph construction due to its superior performance and consistency with temporal smoothness.
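The relationship between the Minkowski distance and the three metrics analyzed here can be checked with a short sketch. The function name and the order parameter name `p` are our notation, and the two sample points are arbitrary.

```python
import math
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Minkowski distance of order p; p = inf gives the Chebyshev distance."""
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)  # limiting case: largest coordinate difference
    return sum(d ** p for d in diffs) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, 1)         # sum of absolute differences
euclidean = minkowski(a, b, 2)         # straight-line distance
chebyshev = minkowski(a, b, math.inf)  # maximum coordinate difference
large_p = minkowski(a, b, 50)          # approaches the Chebyshev value
```

Because the three special cases span the family, comparing Euclidean, Manhattan, and Chebyshev distances covers the behaviour of the Minkowski distance at its two ends and at its most common setting.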
In addition, a single-factor iterative experiment was conducted on the learning rate and dropout using the controlled variable method to analyze the model's sensitivity to key hyperparameters. Only the target hyperparameters are adjusted in each round of experiments, and the rest of the parameters are kept unchanged. Specifically, the settings were as follows: learning rate , and .
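The controlled-variable (one-factor-at-a-time) procedure can be sketched as a generator of configurations in which exactly one hyperparameter departs from a fixed baseline in each run. All names and numeric values below are placeholders chosen for illustration; they are not the settings used in the experiments.

```python
from typing import Any, Dict, List


def one_factor_configs(
    baseline: Dict[str, Any], grid: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Return configurations that vary one hyperparameter at a time."""
    configs = []
    for name, values in grid.items():
        for value in values:
            cfg = dict(baseline)  # keep all other hyperparameters unchanged
            cfg[name] = value     # adjust only the target hyperparameter
            configs.append(cfg)
    return configs


# Placeholder baseline and candidate values (hypothetical, not from the paper).
baseline = {"k": 5, "metric": "euclidean", "lr": 1e-3, "dropout": 0.3}
grid = {"lr": [1e-4, 1e-3, 1e-2], "dropout": [0.1, 0.3, 0.5]}
configs = one_factor_configs(baseline, grid)
```

Each configuration would then be trained and scored on the validation set, so that any change in performance can be attributed to the single hyperparameter that was varied.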
The model's performance on the validation set was recorded for each set of parameter configurations, and the results are shown in Figure 11. The optimal hyperparameters selected in the experiments are: , , , and Euclidean distance for the distance metric. As shown in Figure 11(a), when , the graph structure is too sparse due to the insufficient number of neighbors, failing to fully capture the long-range dependencies between nodes. Consequently, information propagation is insufficient, leading to low accuracy. When , the graph structure becomes dense due to the excessive , introducing a large number of redundant and noisy connections, which results in the over-averaging of node features and the destruction of local key information. When , the model partially alleviates the locality limitation by increasing the number of neighbors, and the accuracy improves. For the distance metric in Figure 11(b), Manhattan distance is more sensitive to noise, resulting in a slight decrease in accuracy, while Chebyshev distance performs poorly because it ignores temporal continuity. Regarding the learning rate parameter in Figure 11(c), values that are too small or too large will hinder model performance. A smaller learning rate leads to slow convergence, while a larger one causes the parameters to update excessively, making the model skip the optimal solution and resulting in unstable performance. As for the dropout in Figure 11(d), an excessively high dropout rate causes the model to underfit, leading to a significant loss of information during the entire training process, while a rate that is too low may not be sufficient to prevent overfitting.
4.3.2 Comparison with common single-sensor anomaly detection methods
Currently, most anomaly detection methods rely on single-type sensor signals, especially vibration signals, to determine the operating status of a system. To verify the effectiveness of the proposed multi-source fusion anomaly detection method, we selected single-sensor modeling methods such as LiConvFormer (Yan et al., 2024), TCN (Zhang et al., 2022b), CNN (Liu et al., 2019), BiLSTM (Abebe et al., 2024), and GCN (Chen et al., 2022) as comparison baselines. Each model was trained and evaluated under the same data conditions, and the performance comparison results are shown in Figure 12.
The numerical results are shown in Table 4. The proposed model outperforms the other methods on all four metrics. It is worth noting that the proposed scheme itself contains BiLSTM and GCN components. As shown in Table 4, the standalone BiLSTM and GCN models yield lower detection results on both video and vibration data than the proposed TSGCN method. Since the BiLSTM model only analyzes temporal dependencies, it ignores the spatial structure information contained in the time series data. Conversely, the GCN focuses solely on spatial structure information and lacks the ability to capture sequential dependencies between data points. In contrast to these traditional schemes, the proposed TSGCN leverages a wider range of information and can selectively prioritize different data sources.
4.3.3 Comparison with non-temporal modeling approaches
To further validate the effectiveness of the spatial-temporal fusion model, we compared the proposed multi-source spatiotemporal graph method with non-spatiotemporal methods that disregard temporal dependencies or spatial structures. These methods typically rely on statistical feature extraction or simple network structures and perform holistic encoding of input signals. They often fail to consider the dynamic evolution of time series or the structural relationships between sequences, which makes it difficult for them to capture key features of fault evolution. The comparative methods include a Multilayer Perceptron (MLP), which directly classifies flattened input signals (Rawat et al., 2018); xLSTM, which enhances temporal modeling of long sequences with exponential gating and a modified memory structure (Beck et al., 2024); mixCNN, which extracts richer spatial features through a hybrid convolution design with residual connections (Zhao and Jiao, 2023); and ResCISTA-Net, which extends CISTA by adding residual blocks for better feature extraction (Rao et al., 2024). Table 5 presents their performance comparison.
As shown in Table 5, the TSGCN model achieves the best performance on all four metrics and far outperforms the other compared methods. The MLP model performs worst, as it completely disregards the signal's time-series structure and spatial correlations. While ResCISTA-Net leverages residual blocks to improve low-level feature extraction, it also neglects temporal and spatial structures, hindering its ability to identify complex fault patterns. xLSTM focuses solely on enhancing the capture of long-sequence time dependencies, without considering the spatial arrangement of data points. Conversely, mixCNN only extracts spatial features and does not model the dynamic evolution of the time series. Neither of these two methods effectively fuses the spatial-temporal synergy information of time series signals, which leads to poor extraction of low-resolution anomaly categories.
4.3.4 Comparison with state-of-the-art spatio-temporal models
To provide a comprehensive comparison with existing spatio-temporal modeling approaches, this paper conducts experiments against several representative advanced methods, including MTGNN (Wu et al., 2020), ASTGNN (Guo et al., 2022), STFGNN (Li and Zhu, 2020) and STSGCN (Sofianos et al., 2021). The above models are all representative graph neural network methods in the fields of multivariate time series modeling and spatio-temporal feature learning in recent years, capable of modeling temporal dependencies and structural correlation information from different perspectives.
Among them, MTGNN adaptively learns the graph structure through a graph learning layer and combines it with temporal convolution for spatio-temporal modeling, making it an advanced model for multivariate time series prediction. ASTGNN introduces an attention mechanism into spatio-temporal graph convolution to dynamically learn the importance weights between different time steps and spatial nodes. STFGNN designs a graph structure that integrates spatio-temporal information and uses parallel GCN modules to extract spatio-temporal features. STSGCN constructs local spatio-temporal graphs and simultaneously captures local spatio-temporal correlations using dedicated convolutional modules. All models adopted the same data preprocessing procedures and input features as in this study, and the hyperparameters of each model were tuned to achieve its best performance. The comparison results are shown in Table 6.
As can be seen from Table 6, the proposed TSGCN model significantly outperforms the other four advanced spatio-temporal graph models on all evaluation indicators. This performance difference indicates that models relying on global or fixed graph structures have limitations in capturing the local and transient abnormal patterns that occur during elevator door operation. By constructing k-nearest neighbor graphs and jointly modeling spatio-temporal dynamic dependencies, TSGCN can represent local spatio-temporal patterns more effectively, thereby significantly improving anomaly detection performance.
5. Conclusion
This study proposes a novel multi-source spatial-temporal information fusion model for the accurate recognition of elevator door operation states and anomaly detection. Advanced feature engineering techniques are employed across both temporal and spatial domains to comprehensively capture the temporal dynamics of sensor signals as well as their latent structural correlations. Using a dataset gathered through our own data acquisition of elevator door operations, we conducted experiments to systematically analyze the effects of hyperparameters, including the K value and distance metric used in graph construction, on model performance. Furthermore, the proposed method was compared with traditional single-sensor models, methods lacking spatiotemporal modeling, and several representative state-of-the-art spatiotemporal graph based models. The experimental results demonstrate that the proposed multi-source spatial-temporal fusion model outperforms the comparative methods in accuracy and F1 score, validating the effectiveness and advantages of fusing spatial-temporal structures for complex state recognition. In summary, the spatial-temporal graph neural network-based anomaly detection model for elevator door systems, as developed in this paper, exhibits promising performance and application potential. This is achieved through the integration of multi-source information, spatiotemporal dependencies, and graph structure modeling, which enables more effective characterization of localized abnormal patterns compared with existing spatiotemporal approaches. It offers both theoretical support and a methodological framework for multi-modal abnormal state recognition within elevator systems. Future research will focus on further optimizing the model architecture, enhancing its ability to identify abnormal states in scenarios with limited samples, and exploring the incorporation of a wider range of sensor data to improve the model's generalization and robustness.













